Data Mining Lab Manual
Data Mining Lab Manual
Tools/Technologies to be used:
• Python (pandas, scikit-learn, matplotlib, seaborn)
• R / RStudio
Experiment 1: Create a Star and Snowflake schema for a sample sales dataset
using SQL
Objective:
To create a star and snowflake schema for a sample sales dataset using sql.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
-- Star Schema
CREATE TABLE sales (...);
CREATE TABLE product (...);
CREATE TABLE store (...);
-- Snowflake Schema
CREATE TABLE region (...);
CREATE TABLE store (...); -- referencing region
Output:
Star schema has denormalized dimensions. Snowflake schema normalizes them.
Conclusion:
Successfully created both schemas demonstrating differences in structure.
Experiment 2: Perform OLAP operations (Roll-up, Drill-down, Slice, Dice, Pivot)
using SQL
Objective:
To perform olap operations (roll-up, drill-down, slice, dice, pivot) using sql.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
-- Roll-up
SELECT region, SUM(amount) FROM sales GROUP BY region;
-- Slice
SELECT * FROM sales WHERE year = 2024;
Output:
Output shows aggregated and filtered sales data as per OLAP operations.
Conclusion:
OLAP queries allow multidimensional analysis through SQL operations.
Experiment 3: Import a CSV dataset and perform data cleaning, missing value
handling, and normalization
Objective:
To import a csv dataset and perform data cleaning, missing value handling, and normalization.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(df.mean(), inplace=True)
from sklearn.preprocessing import MinMaxScaler
df[['col']] = MinMaxScaler().fit_transform(df[['col']])
Output:
Missing values filled and numeric columns normalized to [0,1].
Conclusion:
Cleaned and prepared dataset for further analysis.
Objective:
To implement apriori algorithm to find frequent itemsets and generate association rules.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from mlxtend.frequent_patterns import apriori, association_rules
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric='lift', min_threshold=1.0)
Output:
Generated rules showing strong associations like {bread} → {milk}.
Conclusion:
Apriori helps identify frequent patterns and strong item associations.
Objective:
To use fp-growth algorithm for mining frequent patterns from a retail dataset.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from mlxtend.frequent_patterns import fpgrowth
fpgrowth(df, min_support=0.5, use_colnames=True)
Output:
Identified frequent itemsets without candidate generation.
Conclusion:
FP-Growth is efficient for large datasets with less computation.
Objective:
To implement naïve bayes classifier and evaluate it using accuracy, precision, and recall.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score
Output:
Accuracy: 85%, Precision: 80%, Recall: 82%
Conclusion:
Naïve Bayes gives good performance on text/classification data.
Experiment 7: Build a Decision Tree using ID3 or C4.5 algorithm and visualize
the result
Objective:
To build a decision tree using id3 or c4.5 algorithm and visualize the result.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from sklearn.tree import DecisionTreeClassifier, plot_tree
clf = DecisionTreeClassifier(criterion='entropy')
plot_tree(clf.fit(X, y))
Output:
Tree structure showing decision splits based on information gain.
Conclusion:
Decision Trees provide interpretable and accurate classification models.
Objective:
To perform classification using k-nearest neighbors (knn) and analyze the results.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
Output:
KNN accuracy: 88%
Conclusion:
KNN is simple and effective for small to medium datasets.
Objective:
To apply k-means clustering on a dataset and visualize cluster separation.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
plt.scatter(..., c=kmeans.labels_)
Output:
Visual clusters with clear boundaries among 3 groups.
Conclusion:
K-Means effectively groups similar data points.
Objective:
To use hierarchical clustering and dendrogram visualization.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from scipy.cluster.hierarchy import dendrogram, linkage
dendrogram(linkage(data, method='ward'))
Output:
Dendrogram showing hierarchy of merged clusters.
Conclusion:
Hierarchical clustering reveals natural data structure.
Objective:
To perform principal component analysis (pca) on a high-dimensional dataset.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
Output:
Variance explained: PC1 - 70%, PC2 - 20%
Conclusion:
PCA reduces dimensions while retaining variance.
Objective:
To mini-project: apply classification/clustering/association on a real-world dataset and present
findings.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
# Example: Titanic dataset - classification using Decision Tree
Output:
Achieved 82% accuracy with insights on key features like age and class.
Conclusion:
Applied end-to-end workflow from data cleaning to model evaluation.