
NAME: Anand Singh        SCHOLAR NO.: 22U02035

Indian Institute of Information Technology, Bhopal

DATA WAREHOUSE AND DATA MINING
(CSE-322)

Submitted by:                 Submitted to:
Anand Singh                   Dr. Yatendra Sahu
22U02035

INDEX

S.No. Assignments Submission Date

Laboratory Assignments
1 Lab Experiment 1 30-1-2025

2 Lab Experiment 2 4-2-2025

3 Lab Experiment 3 6-4-2025

4 Lab Experiment 4 6-4-2025

5 Lab Experiment 5 6-4-2025

6 Lab Experiment 6 9-4-2025

7 Lab Experiment 7 9-4-2025

8 Lab Experiment 8 9-4-2025

9 Lab Experiment 9 27-4-2025

10 Lab Experiment 10 27-4-2025



ASSIGNMENT: 1
Dataset Description:

· Dataset Name: Phone Usage Dataset

· Source: Kaggle

· Domain/Context: The dataset is related to mobile phone usage, containing details about users' demographics, phone brand, operating system, and usage patterns (screen time, data consumption, call duration, etc.).

· Number of Data Objects: 15,000 rows (records)

· Number of Features: 10 attributes (columns)

Feature Descriptions:

Feature Name               | Description                                  | Data Type
User ID                    | Unique identifier for each user              | Categorical (Nominal)
Age                        | Age of the user in years                     | Real-valued (Continuous)
Gender                     | Gender of the user (Male/Female/Other)       | Categorical (Nominal)
Location                   | City of the user                             | Categorical (Nominal)
Phone Brand                | Brand of the mobile phone used by the user   | Categorical (Nominal)
OS                         | Operating System (Android/iOS)               | Categorical (Nominal)
Screen Time (hrs/day)      | Average screen time in hours per day         | Real-valued (Continuous)
Data Usage (GB/month)      | Internet data consumed per month (GB)        | Real-valued (Continuous)
Calls Duration (mins/day)  | Average duration of calls per day (minutes)  | Real-valued (Continuous)
Number of Apps Installed   | Number of apps installed on the phone        | Real-valued (Continuous)

Classification of Attributes:

Identifiers:
· User ID: Unique for each individual, used solely for reference and not for analysis.

Categorical Attributes:
1. Gender: A limited set of values (Male, Female, Other), making it a categorical variable useful for grouping.
2. Location: Represents a discrete set of cities, making it a categorical feature for regional analysis.
3. Phone Brand: Limited to specific brands (Nokia, OnePlus, etc.), useful for segmentation analysis.
4. OS: Contains only two values (Android, iOS), making it a categorical variable useful for user behavior segmentation.

Real-Valued Attributes:
1. Age: A continuous numerical variable, as users can have any age value within a range.
2. Screen Time (hrs/day): A continuous variable indicating how much time users spend on their devices daily.
3. Data Usage (GB/month): A continuous numeric variable representing the amount of mobile data consumed per month.
4. Calls Duration (mins/day): A continuous numeric variable measuring call duration in minutes.
5. Number of Apps Installed: A numeric variable reflecting user app preferences and installation habits.

Justification for Classification:

The classification into categorical and real-valued attributes is based on:

· Discrete vs. Continuous Values: If the attribute has predefined categories (e.g., OS, Gender), it is categorical. If it represents measurable numeric data with a broad range, it is real-valued.
· Analytical Usage: Categorical attributes are useful for segmentation and comparison, whereas real-valued attributes are key for trend analysis and regression modeling.
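
As a quick cross-check of this classification, the attribute types can also be inspected programmatically. The following is a minimal pandas sketch, assuming the dataset has been downloaded from Kaggle and saved locally as phone_usage.csv (the file name is an assumption for illustration):

import pandas as pd

# Assumed local file name; adjust to wherever the Kaggle CSV was saved
df = pd.read_csv("phone_usage.csv")

# Columns pandas reads as 'object' correspond to the identifier/categorical attributes
categorical_cols = df.select_dtypes(include="object").columns.tolist()

# Numeric columns correspond to the real-valued attributes described above
numeric_cols = df.select_dtypes(include="number").columns.tolist()

print("Categorical attributes:", categorical_cols)
print("Real-valued attributes:", numeric_cols)
print(df.describe())  # summary statistics for the numeric attributes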

ASSIGNMENT: 2

Dataset Pre-processing and Exploration


1: Load the Dataset. In the WEKA Explorer, click on the "Preprocess" tab. Click the "Open file" button to load your dataset. Browse to the location of the dataset (used in your previous assignment) and load it into WEKA.

2: Inspect the Dataset. Once the dataset is loaded, the main display will show the attributes (columns), their types (numeric, nominal, etc.), and a summary of the dataset. Look for any missing values or anomalies in the data. You can use the "Statistics" button to get more detailed information about each attribute.

3: Handle Missing Values. If your dataset contains missing values, we will apply a filter to handle them. Click on the "Filter" button in the Preprocess tab. From the "Choose" list, select the following filter: Unsupervised -> Attribute -> ReplaceMissingValues. Apply the filter by clicking "Apply". This will replace the missing values with either the mean (for numeric attributes) or the mode (for nominal attributes).
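
For reference, the same mean/mode replacement can be sketched in Python. This is an illustrative approximation of the filter's effect, not WEKA's internal implementation, and again assumes the file phone_usage.csv:

import pandas as pd

df = pd.read_csv("phone_usage.csv")  # assumed file name

# Numeric attributes: replace missing values with the column mean
for col in df.select_dtypes(include="number").columns:
    df[col] = df[col].fillna(df[col].mean())

# Nominal attributes: replace missing values with the most frequent value (mode)
for col in df.select_dtypes(include="object").columns:
    if df[col].isnull().any():
        df[col] = df[col].fillna(df[col].mode()[0])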

5: Remove Irrelevant or Redundant Attributes. Evaluate the dataset and identify any irrelevant or redundant attributes. You can manually remove attributes from the "Attributes" panel by selecting the attribute and pressing the "Remove" button. Alternatively, use the "Filter" menu to automate attribute removal if you want to remove attributes with low variance or those that do not contribute meaningfully to the analysis.
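
The same step can also be scripted. The sketch below drops the User ID identifier and flags near-constant numeric attributes with a variance threshold; the chosen column and cut-off value are assumptions for illustration:

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

df = pd.read_csv("phone_usage.csv")   # assumed file name
df = df.drop(columns=["User ID"])     # identifier, kept out of the analysis

numeric = df.select_dtypes(include="number").dropna()
selector = VarianceThreshold(threshold=0.01)  # assumed cut-off for "low variance"
selector.fit(numeric)
kept = numeric.columns[selector.get_support()]
print("Numeric attributes retained:", list(kept))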

6: Normalize Data. Normalization is the process of scaling numeric attributes to a specific range, usually between 0 and 1. To normalize the data, go to the "Filter" section again. Choose the filter: Unsupervised -> Attribute -> Normalize. Click "Apply" to scale the numeric values in the dataset. This is especially useful if you are planning to use algorithms sensitive to the scale of data (such as k-nearest neighbors or neural networks).
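
In Python, the equivalent 0 to 1 scaling is provided by MinMaxScaler. A minimal sketch, assuming the same phone_usage.csv file:

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("phone_usage.csv")  # assumed file name
df = df.dropna()                     # assume missing values were already handled; drop any that remain

numeric_cols = df.select_dtypes(include="number").columns
scaler = MinMaxScaler()              # rescales each numeric attribute to the [0, 1] range
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

print(df[numeric_cols].min())        # should now be 0 for every scaled column
print(df[numeric_cols].max())        # should now be 1 for every scaled column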

7: Convert Data Types. If any of your numeric attributes need to be converted to nominal (categorical) or vice versa, you can use WEKA's filter options. For example, to convert a numeric attribute to nominal, use the filter: Unsupervised -> Attribute -> NumericToNominal. To convert nominal attributes to binary, use the filter: Supervised -> Attribute -> NominalToBinary.
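
The corresponding conversions can be sketched in pandas as well: binning a numeric attribute into nominal categories and one-hot encoding a nominal attribute into binary columns. The chosen columns and bin edges below are assumptions for illustration:

import pandas as pd

df = pd.read_csv("phone_usage.csv")  # assumed file name

# Numeric -> nominal: bin screen time into three labelled categories (assumed bin edges)
df["Screen Time Level"] = pd.cut(
    df["Screen Time (hrs/day)"], bins=[0, 3, 6, 24], labels=["low", "medium", "high"]
)

# Nominal -> binary: one-hot encode the OS attribute (Android/iOS)
df = pd.get_dummies(df, columns=["OS"])

print(df.head())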

8: Save the Pre-processed Dataset. After completing the pre-processing tasks, it's important to save the modified dataset for future use. Click on the "Save" button to save the dataset. Choose the file format (e.g., .arff, .csv) and give the file a name. Save it in a location on your computer.
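
In Python, the equivalent save step is a single call; the output file name below is an assumption:

import pandas as pd

df = pd.read_csv("phone_usage.csv")                     # assumed input file name
# ... pre-processing steps from the sketches above would be applied here ...
df.to_csv("phone_usage_preprocessed.csv", index=False)  # assumed output file name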

ASSIGNMENT: 3

PYTHON CODE :

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load dataset
df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

# Drop irrelevant columns
df = df.drop(columns=[
    'VIN (1-10)', 'Vehicle Location', 'Electric Utility', 'DOL Vehicle ID',
    '2020 Census Tract', 'City', 'State'
])

# Drop rows with missing values
df = df.dropna()

# Encode categorical features
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Set features and target
X = df.drop(columns=['Electric Vehicle Type'])
y = df['Electric Vehicle Type']

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Naive Bayes model
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)

# Predictions and evaluation
y_pred = nb_model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))



ASSIGNMENT: 4

PYTHON CODE :

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, accuracy_score
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt

# Load dataset
df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

# Drop non-informative or unique identifiers
df = df.drop(columns=[
    'VIN (1-10)', 'Vehicle Location', 'Electric Utility', 'DOL Vehicle ID',
    '2020 Census Tract', 'City', 'State'
])

# Drop rows with missing values
df = df.dropna()

# Encode categorical variables
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Set target and features
X = df.drop(columns=['Electric Vehicle Type'])  # Features
y = df['Electric Vehicle Type']                 # Target

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Visualize the Decision Tree
plt.figure(figsize=(20, 10))
plot_tree(model, filled=True, feature_names=X.columns,
          class_names=label_encoders['Electric Vehicle Type'].classes_)
plt.show()

ASSIGNMENT: 5

PYTHON CODE :

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load dataset
df = pd.read_csv("Electric_Vehicle_Population_Data.csv")

# Drop irrelevant columns
df = df.drop(columns=[
    'VIN (1-10)', 'Vehicle Location', 'Electric Utility', 'DOL Vehicle ID',
    '2020 Census Tract', 'City', 'State'
])

# Drop rows with missing values
df = df.dropna()

# Encode categorical variables
label_encoders = {}
for col in df.select_dtypes(include='object').columns:
    le = LabelEncoder()
    df[col] = le.fit_transform(df[col])
    label_encoders[col] = le

# Set features and target
X = df.drop(columns=['Electric Vehicle Type'])
y = df['Electric Vehicle Type']

# Feature scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

# Initialize and train k-NN classifier
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate model
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))



ASSIGNMENT: 6

PYTHON CODE :

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns

# Step 1: Load dataset from CSV file
df = pd.read_csv("iris_dataset.csv")

# Step 2: Pre-processing
# - Removing the 'target' column for unsupervised learning
X = df.drop('target', axis=1)

# - Handling missing values (if any)
X.fillna(X.mean(), inplace=True)

# - Standardizing the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Apply K-Means Clustering
k = 3  # Number of clusters
kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
kmeans.fit(X_scaled)

# Step 4: Evaluate Clustering
labels = kmeans.labels_
sil_score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil_score:.3f}")

# Step 5: Visualization (first two standardized features)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels, palette="viridis")
plt.title('K-Means Clustering (First two features)')
plt.xlabel('Feature 1 (standardized)')
plt.ylabel('Feature 2 (standardized)')
plt.legend(title='Cluster')
plt.grid(True)
plt.show()

ASSIGNMENT: 7

PYTHON CODE :

import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.cluster.hierarchy import dendrogram, linkage

# Step 1: Load dataset
iris = load_iris()
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target  # Optional

# Step 2: Pre-processing
X = df.drop('target', axis=1)
X.fillna(X.mean(), inplace=True)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: Agglomerative Clustering (using 'metric' instead of 'affinity')
model = AgglomerativeClustering(n_clusters=3, metric='euclidean', linkage='ward')
labels = model.fit_predict(X_scaled)

# Step 4: Evaluation
sil_score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil_score:.3f}")

# Step 5: Dendrogram
linked = linkage(X_scaled, method='ward')
plt.figure(figsize=(10, 6))
dendrogram(linked, orientation='top', distance_sort='descending', show_leaf_counts=True)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Data Points")
plt.ylabel("Distance")
plt.grid(True)
plt.show()

# Step 6: Visualize Clusters (first 2 features)
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels, palette='Set2')
plt.title('Agglomerative Clustering (First Two Features)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.grid(True)
plt.show()

ASSIGNMENT: 8

PYTHON CODE :

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# Step 1: Load Dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Step 2: Preprocessing (Standardization)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 3: DBSCAN Clustering
# eps = max distance between two samples for them to be considered as neighbors
# min_samples = minimum number of points required to form a dense region
dbscan = DBSCAN(eps=0.6, min_samples=4, metric='euclidean')
labels = dbscan.fit_predict(X_scaled)

# Step 4: Evaluation
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"Number of clusters (excluding noise): {n_clusters}")
print("Cluster labels:", labels)

# Check if silhouette score can be calculated
if len(set(labels)) > 1 and -1 not in labels:
    sil_score = silhouette_score(X_scaled, labels)
    print(f"Silhouette Score: {sil_score:.3f}")
else:
    print("Silhouette Score cannot be calculated due to noise or only one cluster.")

# Step 5: Visualization
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_scaled[:, 0], y=X_scaled[:, 1], hue=labels, palette='Set2')
plt.title('DBSCAN Clustering Result (First 2 Features)')
plt.xlabel('Feature 1 (Standardized)')
plt.ylabel('Feature 2 (Standardized)')
plt.grid(True)
plt.show()

ASSIGNMENT: 9
Lab Experiment 9: Apriori Algorithm for Association Rule Mining Using Weka/Python Libraries

Objective: To implement and evaluate the Apriori Algorithm for mining association rules on a transactional dataset (a sample dataset) using Weka and Python libraries. The experiment involves data pre-processing steps and the application of the Apriori Algorithm to generate meaningful association rules with a minimum support of 75% and a minimum confidence of 85%.

PYTHON CODE :

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample transactional data
dataset = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'bread', 'butter', 'jam'],
    ['bread', 'butter', 'jam'],
]

# Convert the dataset into a dataframe of 1s and 0s
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Display the DataFrame
print("Transaction DataFrame:")
print(df)

# Apply Apriori Algorithm
frequent_itemsets = apriori(df, min_support=0.75, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets)

# Generate Association Rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.85)
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

OUTPUT :

· Bread appears in 100% of transactions.

· If someone buys butter, they always buy bread too (100% confidence).

· Lift = 1.0 → it is a strong but "expected" association (no huge surprise).
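
The lift value can be verified by hand from the five sample transactions (a quick sanity check, not part of the generated output):

support(bread) = 5/5 = 1.0
confidence(butter → bread) = 4/4 = 1.0
lift(butter → bread) = confidence / support(bread) = 1.0 / 1.0 = 1.0

Since lift equals 1, buying butter does not make bread any more likely than it already is, which is why the rule counts as "expected".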

ASSIGNMENT: 10

Lab Experiment 10: FP-Growth Algorithm for Association Rule Mining Using Weka/Python Libraries

Objective:
To implement and evaluate the FP-Growth (Frequent Pattern Growth) Algorithm for mining association rules on a transactional dataset (a sample dataset) using Weka and Python libraries. The experiment involves data pre-processing steps and the application of the FP-Growth Algorithm to generate frequent itemsets and association rules efficiently.

PYTHON CODE :

# Install necessary libraries
# pip install pandas mlxtend

import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Sample transaction dataset
dataset = [
    ['milk', 'bread', 'butter'],
    ['bread', 'butter'],
    ['milk', 'bread'],
    ['milk', 'bread', 'butter', 'jam'],
    ['bread', 'butter', 'jam'],
]

# Convert dataset to a dataframe of 1s and 0s
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

print("Transaction DataFrame:")
print(df)

# Generate frequent itemsets using FP-Growth
frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
print("\nFrequent Itemsets:")
print(frequent_itemsets)

# Generate association rules
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print("\nAssociation Rules:")
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])

OUTPUT :