
Aryabhatta College
University of Delhi

DATA MINING I PRACTICAL

SEMESTER IV

Submitted by: Priyanshu Kumar
Roll No: CSC/23/54 45
University Roll No: 23059570007
23059570039
Submitted to: Dr. Sonal Linda
Course: B.Sc (H) Computer Science
INDEX

S.No.  Practical

1. Apply data cleaning techniques on any dataset (e.g. wine dataset). Techniques may include handling missing values, outliers, inconsistent values. A set of validation rules can be prepared based on the dataset and validations can be performed.

2. Apply data pre-processing techniques such as standardization/normalization, transformation, aggregation, discretization/binarization, sampling etc. on any dataset.

3. Run Apriori algorithm to find frequent item sets and association rules on 2 real datasets and use appropriate evaluation measures to compute correctness of obtained patterns:
   a) Use minimum support as 50% and minimum confidence as 75%
   b) Use minimum support as 60% and minimum confidence as 60%

4. Use Naive Bayes, K-nearest, and Decision tree classification algorithms and build classifiers on any two datasets. Divide the data set into training and test set. Compare the accuracy of the different classifiers under the following situations:
   I.  a) Training set = 75%, Test set = 25%
       b) Training set = 66.6% (2/3rd of total), Test set = 33.3%
   II. Training set is chosen by
       a) hold-out method
       b) random subsampling
       c) cross-validation.
   Compare the accuracy of the classifiers obtained. Data needs to be scaled to standard format.

5. Use Simple K-means algorithm for clustering on any dataset. Compare the performance of clusters by changing the parameters involved in the algorithm. Plot MSE computed after each iteration using a line plot for any set of parameters.
Practical 1
Question :
Apply data cleaning techniques on any dataset
(e.g. wine dataset). Techniques may include
handling missing values, outliers, inconsistent
values. A set of validation rules can be prepared
based on the dataset and validations can be
performed.

CODE :

# Step 1: Import Libraries
import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
import seaborn as sns
import matplotlib.pyplot as plt

# Step 2: Load Dataset
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
print("Original Dataset:")
print(df.head())

# Step 3: Introduce Some Missing Values for Practice
df.loc[5:10, 'ash'] = np.nan
df.loc[15:18, 'alcalinity_of_ash'] = np.nan

# Step 4: Handling Missing Values
# Fill 'ash' with the mean and 'alcalinity_of_ash' with the median
df['ash'].fillna(df['ash'].mean(), inplace=True)
df['alcalinity_of_ash'].fillna(df['alcalinity_of_ash'].median(), inplace=True)

# Step 5: Handling Outliers (Using IQR)
def remove_outliers(col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR
    df[col] = np.where((df[col] < lower) | (df[col] > upper), df[col].median(), df[col])

# Apply on a few numeric columns
for col in ['alcohol', 'malic_acid', 'color_intensity']:
    remove_outliers(col)

# Step 6: Inconsistent Values
# Suppose we mistakenly type a wrong value in 'target'
df.loc[0, 'target'] = 5  # Invalid, valid target = 0, 1, 2

# Fix: Replace with mode or set a rule
df['target'] = df['target'].apply(lambda x: x if x in [0, 1, 2] else df['target'].mode()[0])

# Step 7: Validation Rules
validation_results = {
    'no_negative_values': (df.select_dtypes(include=[np.number]) >= 0).all().all(),
    'target_in_range': df['target'].isin([0, 1, 2]).all(),
    'ash_not_null': df['ash'].isnull().sum() == 0,
}

print("Validation Results:")
print(validation_results)

# Optional Visualization
sns.boxplot(x=df['alcohol'])
plt.title("Boxplot for Alcohol after Outlier Handling")
plt.show()
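
An optional extension (not part of the original task) is to express the validation rules as named checks so they can be re-applied after every cleaning step; a minimal sketch, assuming df and np from the code above are in scope:

# Optional sketch: validation rules as named checks (assumes df and np from the script above)
validation_rules = {
    'no_negative_values': lambda d: bool((d.select_dtypes(include=[np.number]) >= 0).all().all()),
    'target_in_range': lambda d: bool(d['target'].isin([0, 1, 2]).all()),
    'no_missing_values': lambda d: int(d.isnull().sum().sum()) == 0,
}

for rule_name, rule in validation_rules.items():
    print(f"{rule_name}: {'PASS' if rule(df) else 'FAIL'}")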

Screenshot :
(i)

(ii)
Practical 2
Question :
Apply data pre-processing techniques such as
standardization/normalization, transformation, aggregation,
discretization/binarization, sampling etc. on any dataset

CODE :

import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler, MinMaxScaler, KBinsDiscretizer, Binarizer
from sklearn.model_selection import train_test_split

# Load dataset
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
print("Original Dataset:")
print(df.head())

# 1. Standardization (mean=0, std=1)
scaler = StandardScaler()
standardized = scaler.fit_transform(df.iloc[:, :-1])  # without target
df_standardized = pd.DataFrame(standardized, columns=wine.feature_names)
print("\nStandardized Data:")
print(df_standardized.head())

# 2. Normalization (min-max scaling)
minmax = MinMaxScaler()
normalized = minmax.fit_transform(df.iloc[:, :-1])
df_normalized = pd.DataFrame(normalized, columns=wine.feature_names)
print("\nNormalized Data:")
print(df_normalized.head())

# 3. Transformation (log transformation on skewed column)
df['log_proline'] = np.log(df['proline'] + 1)  # +1 to avoid log(0)
print("\nLog Transformed 'proline':")
print(df[['proline', 'log_proline']].head())

# 4. Aggregation (mean values by class)
aggregated = df.groupby('target').mean()
print("\nAggregated Mean by Target Class:")
print(aggregated)

# 5. Discretization (binning alcohol into 3 categories)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df['alcohol_bin'] = discretizer.fit_transform(df[['alcohol']])
print("\nDiscretized 'alcohol':")
print(df[['alcohol', 'alcohol_bin']].head())

# 6. Binarization (proline high/low based on threshold)
binarizer = Binarizer(threshold=750)
df['proline_bin'] = binarizer.fit_transform(df[['proline']])
print("\nBinarized 'proline':")
print(df[['proline', 'proline_bin']].head())

# 7. Sampling (random 20% of data)
sampled_df = df.sample(frac=0.2, random_state=42)
print("\nSampled 20% of Dataset:")
print(sampled_df.head())
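
An optional addition (its output is not part of the OUTPUT section below) is class-aware sampling, which keeps the class proportions of target instead of drawing rows purely at random; a minimal sketch, assuming df and train_test_split from the code above:

# Optional sketch: stratified 20% sample (keeps the class proportions of 'target')
_, stratified_sample = train_test_split(df, test_size=0.2, stratify=df['target'], random_state=42)
print("\nStratified 20% Sample (class proportions):")
print(stratified_sample['target'].value_counts(normalize=True))
print("\nOriginal class proportions:")
print(df['target'].value_counts(normalize=True))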
OUTPUT :-

Original Dataset:

alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \

0 14.23 1.71 2.43 15.6 127.0 2.80

1 13.20 1.78 2.14 11.2 100.0 2.65

2 13.16 2.36 2.67 18.6 101.0 2.80

3 14.37 1.95 2.50 16.8 113.0 3.85

4 13.24 2.59 2.87 21.0 118.0 2.80

flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \

0 3.06 0.28 2.29 5.64 1.04

1 2.76 0.26 1.28 4.38 1.05

2 3.24 0.30 2.81 5.68 1.03

3 3.49 0.24 2.18 7.80 0.86

4 2.69 0.39 1.82 4.32 1.04

od280/od315_of_diluted_wines proline target

0 3.92 1065.0 0

1 3.40 1050.0 0

2 3.17 1185.0 0

3 3.45 1480.0 0

4 2.93 735.0 0

Standardized Data:

alcohol malic_acid ash alcalinity_of_ash magnesium \

0 1.518613 -0.562250 0.232053 -1.169593 1.913905

1 0.246290 -0.499413 -0.827996 -2.490847 0.018145

2 0.196879 0.021231 1.109334 -0.268738 0.088358

3 1.691550 -0.346811 0.487926 -0.809251 0.930918

4 0.295700 0.227694 1.840403 0.451946 1.281985

total_phenols flavanoids nonflavanoid_phenols proanthocyanins \

0 0.808997 1.034819 -0.659563 1.224884


1 0.568648 0.733629 -0.820719 -0.544721

2 0.808997 1.215533 -0.498407 2.135968

3 2.491446 1.466525 -0.981875 1.032155

4 0.808997 0.663351 0.226796 0.401404

color_intensity hue od280/od315_of_diluted_wines proline

0 0.251717 0.362177 1.847920 1.013009

1 -0.293321 0.406051 1.113449 0.965242

2 0.269020 0.318304 0.788587 1.395148

3 1.186068 -0.427544 1.184071 2.334574

4 -0.319276 0.362177 0.449601 -0.037874

Normalized Data:

alcohol malic_acid ash alcalinity_of_ash magnesium \

0 0.842105 0.191700 0.572193 0.257732 0.619565

1 0.571053 0.205534 0.417112 0.030928 0.326087

2 0.560526 0.320158 0.700535 0.412371 0.336957

3 0.878947 0.239130 0.609626 0.319588 0.467391

4 0.581579 0.365613 0.807487 0.536082 0.521739

total_phenols flavanoids nonflavanoid_phenols proanthocyanins \

0 0.627586 0.573840 0.283019 0.593060

1 0.575862 0.510549 0.245283 0.274448

2 0.627586 0.611814 0.320755 0.757098

3 0.989655 0.664557 0.207547 0.558360

4 0.627586 0.495781 0.490566 0.444795

color_intensity hue od280/od315_of_diluted_wines proline

0 0.372014 0.455285 0.970696 0.561341

1 0.264505 0.463415 0.780220 0.550642

2 0.375427 0.447154 0.695971 0.646933

3 0.556314 0.308943 0.798535 0.857347

4 0.259386 0.455285 0.608059 0.325963


Log Transformed 'proline':

proline log_proline

0 1065.0 6.971669

1 1050.0 6.957497

2 1185.0 7.078342

3 1480.0 7.300473

4 735.0 6.601230

Aggregated Mean by Target Class:

alcohol malic_acid ash alcalinity_of_ash magnesium \

target

0 13.744746 2.010678 2.455593 17.037288 106.338983

1 12.278732 1.932676 2.244789 20.238028 94.549296

2 13.153750 3.333750 2.437083 21.416667 99.312500

total_phenols flavanoids nonflavanoid_phenols proanthocyanins \

target

0 2.840169 2.982373 0.290000 1.899322

1 2.258873 2.080845 0.363662 1.630282

2 1.678750 0.781458 0.447500 1.153542

color_intensity hue od280/od315_of_diluted_wines proline \

target

0 5.528305 1.062034 3.157797 1115.711864

1 3.086620 1.056282 2.785352 519.507042

2 7.396250 0.682708 1.683542 629.895833

log_proline

target

0 6.998383

1 6.212565

2 6.430818

Discretized 'alcohol':
alcohol alcohol_bin

0 14.23 2.0

1 13.20 1.0

2 13.16 1.0

3 14.37 2.0

4 13.24 1.0

Binarized 'proline':

proline proline_bin

0 1065.0 1.0

1 1050.0 1.0

2 1185.0 1.0

3 1480.0 1.0

4 735.0 0.0

Sampled 20% of Dataset:

alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols \

19 13.64 3.10 2.56 15.2 116.0 2.70

45 14.21 4.04 2.44 18.9 111.0 2.85

140 12.93 2.81 2.70 21.0 96.0 1.54

30 13.73 1.50 2.70 22.5 101.0 3.00

67 12.37 1.17 1.92 19.6 78.0 2.11

flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue \

19 3.03 0.17 1.66 5.10 0.96

45 2.65 0.30 1.25 5.24 0.87

140 0.50 0.53 0.75 4.60 0.77

30 3.25 0.29 2.38 5.70 1.19

67 2.00 0.27 1.04 4.68 1.12

od280/od315_of_diluted_wines proline target log_proline alcohol_bin \

19 3.36 845.0 0 6.740519 2.0

45 3.33 1080.0 0 6.985642 2.0

140 2.31 600.0 2 6.398595 1.0


30 2.71 1285.0 0 7.159292 2.0

67 3.48 510.0 1 6.236370 1.0

proline_bin

19 1.0

45 1.0

140 0.0

30 1.0

67 0.0

Screenshot :
(i)
(ii)

(iii)
Practical 3
QUESTION :
Run Apriori algorithm to find frequent item sets and
association rules on 2 real datasets and use appropriate
evaluation measures to compute correctness of obtained patterns
a) Use minimum support as 50% and minimum confidence as 75%
b) Use minimum support as 60% and minimum confidence as 60%

CODE :

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

# Dataset 1: Groceries (example style)
data = [['milk', 'bread', 'eggs'],
        ['milk', 'bread'],
        ['milk', 'eggs'],
        ['bread', 'eggs'],
        ['milk', 'bread', 'eggs', 'butter'],
        ['bread', 'butter']]

# Convert to dataframe (one-hot encoded format)
te = TransactionEncoder()
te_data = te.fit(data).transform(data)
df = pd.DataFrame(te_data, columns=te.columns_)

def run_apriori(df, min_support, min_confidence):
    print(f"\nRunning Apriori with min_support={min_support}, min_confidence={min_confidence}")
    freq_items = apriori(df, min_support=min_support, use_colnames=True)
    rules = association_rules(freq_items, metric="confidence", min_threshold=min_confidence)

    # Evaluation: Show Support, Confidence, Lift
    rules = rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']]
    print("\nFrequent Itemsets:\n", freq_items)
    print("\nAssociation Rules:\n", rules)
    return rules

# a) Support=50%, Confidence=75%
run_apriori(df, min_support=0.5, min_confidence=0.75)

# b) Support=60%, Confidence=60%
run_apriori(df, min_support=0.6, min_confidence=0.6)
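
As a quick correctness check on the mined patterns, the evaluation measures can also be computed by hand for one rule directly from the transactions; a minimal sketch, assuming the one-hot encoded df from the code above and using the rule {milk} → {bread}:

# Optional sketch: manual support / confidence / lift for the rule {milk} -> {bread}
n = len(df)
support_milk = df['milk'].sum() / n
support_bread = df['bread'].sum() / n
support_both = (df['milk'] & df['bread']).sum() / n

confidence = support_both / support_milk   # P(bread | milk)
lift = confidence / support_bread          # > 1 indicates positive association

print(f"support(milk, bread) = {support_both:.2f}")
print(f"confidence(milk -> bread) = {confidence:.2f}")
print(f"lift(milk -> bread) = {lift:.2f}")

These hand-computed values can be compared against the support, confidence and lift columns printed by run_apriori for the same rule.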

Screenshot :
(i)
Practical 4
QUESTION :
Use Naive Bayes, K-nearest, and Decision tree classification algorithms
and build classifiers on any two datasets. Divide the data set into
training and test set. Compare the accuracy of the different classifiers
under the following situations:

I.  a) Training set = 75%, Test set = 25%
    b) Training set = 66.6% (2/3rd of total), Test set = 33.3%

II. Training set is chosen by
    i)   hold-out method
    ii)  random subsampling
    iii) cross-validation.

Compare the accuracy of the classifiers obtained. Data needs to be scaled
to standard format.

CODE :

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Load Dataset 1: Iris
iris = datasets.load_iris()
X1 = pd.DataFrame(iris.data, columns=iris.feature_names)
y1 = pd.Series(iris.target)

# Load Dataset 2: Wine
wine = datasets.load_wine()
X2 = pd.DataFrame(wine.data, columns=wine.feature_names)
y2 = pd.Series(wine.target)

# Standardize the data
scaler = StandardScaler()
X1_scaled = scaler.fit_transform(X1)
X2_scaled = scaler.fit_transform(X2)

# Classifiers
models = {
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=0)
}

# Function to evaluate models
def evaluate_models(X, y, dataset_name):
    print(f"\nResults for {dataset_name} Dataset:\n")

    # I. a) Train 75% / Test 25%
    print("I. a) 75% Train / 25% Test:")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print(f" {name}: {acc:.2f}")

    # I. b) Train 66.6% / Test 33.3%
    print("\nI. b) 66.6% Train / 33.3% Test:")
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.333, random_state=42)
    for name, model in models.items():
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        acc = accuracy_score(y_test, y_pred)
        print(f" {name}: {acc:.2f}")

    # II. i) Hold-out method (same as the single splits above)

    # II. ii) Random Subsampling (repeat hold-out multiple times)
    print("\nII. Random Subsampling (10 runs):")
    for name, model in models.items():
        scores = []
        for _ in range(10):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            scores.append(accuracy_score(y_test, y_pred))
        print(f" {name}: Avg Accuracy: {sum(scores)/len(scores):.2f}")

    # II. iii) Cross-validation
    print("\nII. Cross Validation (5-fold):")
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5)
        print(f" {name}: Avg Accuracy: {scores.mean():.2f}")

# Run evaluation on both datasets
evaluate_models(X1_scaled, y1, "Iris")
evaluate_models(X2_scaled, y2, "Wine")
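
Optionally, the hold-out accuracies can be collected into one small table so the classifiers are easier to compare side by side; a minimal sketch, assuming models, X1_scaled, y1 and the imports from the code above:

# Optional sketch: summarise hold-out accuracies for the Iris data in one table
results = {}
for split_name, test_size in [('75/25', 0.25), ('66.6/33.3', 0.333)]:
    X_train, X_test, y_train, y_test = train_test_split(X1_scaled, y1, test_size=test_size, random_state=42)
    row = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        row[name] = accuracy_score(y_test, model.predict(X_test))
    results[split_name] = row

print(pd.DataFrame(results).round(2))  # rows = classifiers, columns = train/test splits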


Screenshot :

(i)
Practical 5
Question :
Use Simple K-means algorithm for clustering
on any dataset. Compare the performance of clusters by
changing the parameters involved in the algorithm. Plot
MSE computed after each iteration using a line plot for
any set of parameters.

CODE :

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Load Iris dataset
iris = datasets.load_iris()
X = iris.data

# Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Try different values of k (clusters)
cluster_range = [2, 3, 4, 5, 6]

print("Cluster Performance (Silhouette Score):\n")

for k in cluster_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, random_state=42)
    kmeans.fit(X_scaled)
    labels = kmeans.labels_
    silhouette = silhouette_score(X_scaled, labels)
    print(f"k = {k} → Silhouette Score = {silhouette:.4f}, Inertia (MSE) = {kmeans.inertia_:.2f}")

# --------------------------
# MSE vs Iteration Plot for k=3
print("\nPlotting MSE per iteration for k=3")

kmeans = KMeans(n_clusters=3, init='random', max_iter=10, n_init=1, verbose=1, random_state=42)
kmeans.fit(X_scaled)

# Plot inertia over iterations (verbose=1 also prints it during fitting)
# Note: sklearn doesn't expose inertia per iteration, so we track it manually
# by refitting with an increasing max_iter and recording the final inertia.
inertias = []
X_input = X_scaled
for i in range(1, 11):
    kmeans = KMeans(n_clusters=3, init='random', max_iter=i, n_init=1, random_state=42)
    kmeans.fit(X_input)
    inertias.append(kmeans.inertia_)

# Plotting the inertia (MSE) after each iteration
plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), inertias, marker='o', linestyle='-')
plt.title("MSE (Inertia) after each iteration (k=3)")
plt.xlabel("Iteration")
plt.ylabel("MSE (Inertia)")
plt.grid(True)
plt.show()
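
Since scikit-learn's inertia_ is the total within-cluster sum of squared distances rather than a mean, it can be divided by the number of samples to plot a true per-sample MSE; a minimal sketch, reusing inertias and X_scaled from the code above:

# Optional sketch: convert inertia (total within-cluster SSE) into per-sample MSE
mse_per_iteration = [inertia / X_scaled.shape[0] for inertia in inertias]

plt.figure(figsize=(8, 5))
plt.plot(range(1, 11), mse_per_iteration, marker='o', linestyle='-')
plt.title("Per-sample MSE after each iteration (k=3)")
plt.xlabel("Iteration")
plt.ylabel("MSE = inertia / n_samples")
plt.grid(True)
plt.show()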

Screenshot:
(i)

(ii)
(iii)

Thank You
