Data Mining 1 Practical File-1
University of Delhi
CODE :
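import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

# Load the wine dataset into a DataFrame
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
print("Original Dataset:")
print(df.head())

# Impute missing values: mean for 'ash', median for 'alcalinity_of_ash'
df['ash'] = df['ash'].fillna(df['ash'].mean())
df['alcalinity_of_ash'] = df['alcalinity_of_ash'].fillna(df['alcalinity_of_ash'].median())

def remove_outliers(col):
    # Drop rows of the global df whose value in col lies outside the IQR fences
    global df
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    # Assumption: the usual 1.5 * IQR rule (the filtering step is missing from the source)
    df = df[(df[col] >= Q1 - 1.5 * IQR) & (df[col] <= Q3 + 1.5 * IQR)]

# Assumption: the source does not show which columns were processed
for col in ['ash', 'alcalinity_of_ash']:
    remove_outliers(col)

validation_results = {
    'ash_not_null': df['ash'].isnull().sum() == 0,
    # the remaining checks are missing from the source
}
print("Validation Results:")
print(validation_results)

# Optional Visualization
sns.boxplot(x=df['alcohol'])
plt.show()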
Screenshot :
(i)
(ii)
Practical 2
Question :
Apply data pre-processing techniques such as
standardization/normalization, transformation, aggregation,
discretization/binarization, sampling, etc. on any dataset.
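As a rough illustration of what these techniques compute, the sketch below applies z-score standardization, min-max normalization, a log(1 + x) transformation, and threshold binarization to the five 'proline' values shown in the output of this practical. The helper names (standardize, min_max, binarize) are hypothetical and written only for this illustration; the threshold of 750 is the one used in the practical's code.

```python
import math

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def binarize(values, threshold):
    """Map values strictly above the threshold to 1 and all others to 0."""
    return [1 if v > threshold else 0 for v in values]

# The first five 'proline' values from the output of this practical
sample = [1065.0, 1050.0, 1185.0, 1480.0, 735.0]

z = standardize(sample)
scaled = min_max(sample)
logs = [math.log1p(v) for v in sample]  # log(1 + x) transformation
bits = binarize(sample, 750)

print("standardized:", [round(v, 3) for v in z])
print("normalized:  ", [round(v, 3) for v in scaled])
print("log1p:       ", [round(v, 6) for v in logs])
print("binarized:   ", bits)
```

The statistics here come from only five values, so the standardized and normalized numbers will differ from those obtained on the full dataset; the log and binarized values depend on each value alone and do not.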
CODE :
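import pandas as pd
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler, MinMaxScaler, KBinsDiscretizer, Binarizer

# Load dataset
wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
df['target'] = wine.target
print("Original Dataset:")
print(df.head())

# Standardization: zero mean, unit variance per feature
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(wine.data), columns=wine.feature_names)
print("\nStandardized Data:")
print(df_standardized.head())

# Normalization: rescale each feature to [0, 1]
minmax = MinMaxScaler()
df_normalized = pd.DataFrame(minmax.fit_transform(wine.data), columns=wine.feature_names)
print("\nNormalized Data:")
print(df_normalized.head())

# Transformation: log(1 + x), which reproduces the log_proline values in the output
df['log_proline'] = np.log1p(df['proline'])
print(df[['proline', 'log_proline']].head())

# Aggregation: per-class feature means
aggregated = df.groupby('target').mean()
print(aggregated)

# Discretization; n_bins and strategy are assumptions (the definition is missing from the source)
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
df['alcohol_bin'] = discretizer.fit_transform(df[['alcohol']]).ravel()
print("\nDiscretized 'alcohol':")
print(df[['alcohol', 'alcohol_bin']].head())

# Binarization: 1 if proline > 750, else 0
binarizer = Binarizer(threshold=750)
df['proline_bin'] = binarizer.fit_transform(df[['proline']]).ravel()
print("\nBinarized 'proline':")
print(df[['proline', 'proline_bin']].head())

# Sampling; sample size and seed are assumptions (the call is missing from the source)
sampled_df = df.sample(n=5, random_state=42)
print(sampled_df.head())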
OUTPUT :-
Original Dataset:
0 3.92 1065.0 0
1 3.40 1050.0 0
2 3.17 1185.0 0
3 3.45 1480.0 0
4 2.93 735.0 0
Standardized Data:
Normalized Data:
proline log_proline
0 1065.0 6.971669
1 1050.0 6.957497
2 1185.0 7.078342
3 1480.0 7.300473
4 735.0 6.601230
target log_proline
0 6.998383
1 6.212565
2 6.430818
Discretized 'alcohol':
alcohol alcohol_bin
0 14.23 2.0
1 13.20 1.0
2 13.16 1.0
3 14.37 2.0
4 13.24 1.0
Binarized 'proline':
proline proline_bin
0 1065.0 1.0
1 1050.0 1.0
2 1185.0 1.0
3 1480.0 1.0
4 735.0 0.0
proline_bin
19 1.0
45 1.0
140 0.0
30 1.0
67 0.0
Screenshot :
(i)
(ii)
(iii)
Practical 3
QUESTION :
Run the Apriori algorithm to find frequent itemsets and
association rules on 2 real datasets, and use appropriate
evaluation measures to compute the correctness of the obtained patterns.
a) Use minimum support as 50% and minimum confidence as 75%
b) Use minimum support as 60% and minimum confidence as 60%
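To make the two thresholds concrete, here is a minimal pure-Python sketch of how support and confidence are computed. It uses only the four transactions that survive in the code listing of this practical, and it is a brute-force level-wise search rather than the optimized candidate generation of a real Apriori implementation. The function names are hypothetical.

```python
from itertools import combinations

# The four transactions visible in the practical's code listing
transactions = [
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "eggs"},
    {"bread", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

def frequent_itemsets(transactions, min_support):
    """Level-wise search: stop once no itemset of the current size is frequent."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, size):
            s = support(combo, transactions)
            if s >= min_support:
                result[frozenset(combo)] = s
                found = True
        if not found:
            break  # Apriori property: supersets of infrequent itemsets are infrequent
    return result

freq_a = frequent_itemsets(transactions, 0.5)  # a) minimum support 50%
freq_b = frequent_itemsets(transactions, 0.6)  # b) minimum support 60%
print(freq_a)
print(freq_b)
print(confidence({"milk"}, {"bread"}, transactions))
```

On these four transactions no pair of items reaches 50% support, so neither setting yields an association rule; a rule such as milk → bread has confidence 0.5, below both confidence thresholds.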
CODE :
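import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules
# Transactions (the beginning of this list is missing from the source)
data = [['milk', 'bread'],
        ['milk', 'eggs'],
        ['bread', 'eggs'],
        ['bread', 'butter']]
te = TransactionEncoder()
te_data = te.fit(data).transform(data)
df = pd.DataFrame(te_data, columns=te.columns_)

def run_apriori(df, min_support, min_confidence):  # name and signature assumed; only 'return rules' survives
    frequent_itemsets = apriori(df, min_support=min_support, use_colnames=True)
    rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=min_confidence)
    return rules

# a) Support=50%, Confidence=75%
print(run_apriori(df, 0.5, 0.75))
# b) Support=60%, Confidence=60%
print(run_apriori(df, 0.6, 0.6))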
Screenshot :
(i)
Practical 4
QUESTION :
Use Naive Bayes, K-nearest neighbours, and Decision tree
classification algorithms and build classifiers on any two
datasets. Divide each dataset into a training and a test set.
Compare the accuracy of the different classifiers under
the following situations:
I.
a) Training set = 75%, Test set = 25%
b) Training set = 66.6% (2/3rd of total), Test set = 33.3%
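The splitting schemes in this question can be sketched without any library. The snippet below is an illustration only: the helper names (holdout_split, k_fold_indices) are hypothetical, the sample count of 150 corresponds to the Iris dataset used in the code, and the fold count of 10 is an assumption. It shows how many samples land in each part for the two hold-out ratios and how k-fold cross-validation partitions the indices.

```python
import random

def holdout_split(n_samples, test_fraction, seed=0):
    """Shuffle the sample indices and cut them into (train, test) lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]

def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# I. a) 75% / 25% and b) 66.6% / 33.3% on 150 samples
train_a, test_a = holdout_split(150, 0.25)
train_b, test_b = holdout_split(150, 1 / 3)
print(len(train_a), len(test_a))
print(len(train_b), len(test_b))

# Cross-validation: every sample is used for testing exactly once
folds = list(k_fold_indices(150, 10))
print([len(test) for _, test in folds])
```

A hold-out estimate depends on a single random split, which is why repeating the split or using cross-validation usually gives a more stable accuracy figure.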
CODE :
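import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X1 = pd.DataFrame(iris.data, columns=iris.feature_names)
y1 = pd.Series(iris.target)
wine = datasets.load_wine()
X2 = pd.DataFrame(wine.data, columns=wine.feature_names)
y2 = pd.Series(wine.target)
scaler = StandardScaler()
X1_scaled = scaler.fit_transform(X1)
X2_scaled = scaler.fit_transform(X2)  # fit_transform refits the scaler on the wine data
# Classifiers
models = {
    # only the KNN entry survives in the source; the other two follow the question
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
}
# Assumption: the helper name, seeds and fold count are not shown in the source
def evaluate(name, X, y):
    for model_name, model in models.items():
        # I. a) 75/25 and b) 66.6/33.3 hold-out splits
        for test_size in (0.25, 1 / 3):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=42)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            acc = accuracy_score(y_test, y_pred)
            print(f"{name} | {model_name} | test size {test_size:.3f}: accuracy = {acc:.4f}")
        # Random subsampling: mean accuracy over 10 random splits
        scores = []
        for _ in range(10):
            X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
            model.fit(X_train, y_train)
            y_pred = model.predict(X_test)
            scores.append(accuracy_score(y_test, y_pred))
        print(f"{name} | {model_name} | random subsampling: mean accuracy = {sum(scores) / len(scores):.4f}")
        # II. Cross-validation
        cv_scores = cross_val_score(model, X, y, cv=10)
        print(f"{name} | {model_name} | 10-fold CV: mean accuracy = {cv_scores.mean():.4f}")
evaluate('Iris', X1_scaled, y1)
evaluate('Wine', X2_scaled, y2)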
Screenshot :
(i)
Practical 5
Question :
Use the Simple K-means algorithm for clustering
on any dataset. Compare the performance of the clusters by
changing the parameters involved in the algorithm. Plot the
MSE computed after each iteration using a line plot for
any set of parameters.
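To show what "MSE computed after each iteration" refers to, here is a minimal from-scratch sketch of simple K-means (Lloyd's algorithm) that records the mean squared distance of points to their assigned centroid at every iteration. It is an illustration under stated assumptions, not the code of this practical: the toy 2-D points, the function names, and the random initialisation are all made up for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=20, seed=0):
    """Lloyd's algorithm; returns (centroids, labels, mse_history)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    mse_history = []
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # MSE of this iteration: mean squared distance to the assigned centroid
        mse = sum(squared_distance(p, centroids[l])
                  for p, l in zip(points, labels)) / len(points)
        mse_history.append(mse)
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append([sum(dim) / len(members) for dim in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, labels, mse_history

# Hypothetical toy data: three loose groups of 2-D points
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.9)]

centroids, labels, mse_history = k_means(points, k=3)
for i, mse in enumerate(mse_history, start=1):
    print(f"iteration {i}: MSE = {mse:.4f}")
```

The recorded values never increase from one iteration to the next, which is the behaviour the line plot is meant to show; passing mse_history to matplotlib's plt.plot would produce that plot.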
CODE :
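import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumption: the dataset and cluster range are not shown in the source
X_scaled = StandardScaler().fit_transform(datasets.load_iris().data)
cluster_range = range(2, 7)

for k in cluster_range:
    kmeans = KMeans(n_clusters=k, init='k-means++', max_iter=300, random_state=42)
    kmeans.fit(X_scaled)
    labels = kmeans.labels_
    silhouette = silhouette_score(X_scaled, labels)
    # inertia_ is the sum of squared distances (SSE); divide by the sample count for MSE
    mse = kmeans.inertia_ / len(X_scaled)
    print(f"k = {k} → Silhouette Score = {silhouette:.4f}, Inertia (SSE) = {kmeans.inertia_:.2f}, MSE = {mse:.4f}")
# --------------------------
# MSE vs Iteration Plot for k=3
# Assumption: KMeans does not report per-iteration error, so refit from the same
# random start with max_iter = 1, 2, ... and record the MSE each time
iterations = range(1, 11)
mse_per_iter = [KMeans(n_clusters=3, init='random', n_init=1, max_iter=i, random_state=42)
                .fit(X_scaled).inertia_ / len(X_scaled) for i in iterations]
plt.plot(iterations, mse_per_iter, marker='o')
plt.xlabel('Iteration')
plt.ylabel('MSE')
plt.show()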
Screenshot:
(i)
(ii)
(iii)
Thank You