0% found this document useful (0 votes)
20 views8 pages

Data Mining Lab Manual

The document is a Data Mining Lab Manual detailing various experiments using tools like Python, R, and SQL. It covers creating schemas, performing OLAP operations, data cleaning, and implementing algorithms such as Apriori, FP-Growth, Naïve Bayes, and K-Means. Each experiment includes objectives, procedures, code snippets, outputs, and conclusions demonstrating the effectiveness of different data mining techniques.

Uploaded by

ranga raj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
20 views8 pages

Data Mining Lab Manual

The document is a Data Mining Lab Manual detailing various experiments using tools like Python, R, and SQL. It covers creating schemas, performing OLAP operations, data cleaning, and implementing algorithms such as Apriori, FP-Growth, Naïve Bayes, and K-Means. Each experiment includes objectives, procedures, code snippets, outputs, and conclusions demonstrating the effectiveness of different data mining techniques.

Uploaded by

ranga raj
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 8

Data Mining Lab Manual with Answers

Tools/Technologies to be used:
• Python (pandas, scikit-learn, matplotlib, seaborn)

• R / RStudio

• WEKA / Orange Data Mining Tool

• SQL for OLAP operations

• Jupyter Notebook / Google Colab

Experiment 1: Create a Star and Snowflake schema for a sample sales dataset
using SQL

Objective:
To create a star and snowflake schema for a sample sales dataset using sql.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
-- Star Schema
CREATE TABLE sales (...);
CREATE TABLE product (...);
CREATE TABLE store (...);
-- Snowflake Schema
CREATE TABLE region (...);
CREATE TABLE store (...); -- referencing region

Output:
Star schema has denormalized dimensions. Snowflake schema normalizes them.

Conclusion:
Successfully created both schemas demonstrating differences in structure.
Experiment 2: Perform OLAP operations (Roll-up, Drill-down, Slice, Dice, Pivot)
using SQL

Objective:
To perform olap operations (roll-up, drill-down, slice, dice, pivot) using sql.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
-- Roll-up
SELECT region, SUM(amount) FROM sales GROUP BY region;

-- Slice
SELECT * FROM sales WHERE year = 2024;

Output:
Output shows aggregated and filtered sales data as per OLAP operations.

Conclusion:
OLAP queries allow multidimensional analysis through SQL operations.

Experiment 3: Import a CSV dataset and perform data cleaning, missing value
handling, and normalization

Objective:
To import a csv dataset and perform data cleaning, missing value handling, and normalization.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
import pandas as pd
df = pd.read_csv('data.csv')
df.fillna(df.mean(), inplace=True)
from sklearn.preprocessing import MinMaxScaler
df[['col']] = MinMaxScaler().fit_transform(df[['col']])

Output:
Missing values filled and numeric columns normalized to [0,1].

Conclusion:
Cleaned and prepared dataset for further analysis.

Experiment 4: Implement Apriori algorithm to find frequent itemsets and


generate association rules

Objective:
To implement apriori algorithm to find frequent itemsets and generate association rules.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
from mlxtend.frequent_patterns import apriori, association_rules
frequent = apriori(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric='lift', min_threshold=1.0)

Output:
Generated rules showing strong associations like {bread} → {milk}.

Conclusion:
Apriori helps identify frequent patterns and strong item associations.

Experiment 5: Use FP-Growth algorithm for mining frequent patterns from a


retail dataset

Objective:
To use fp-growth algorithm for mining frequent patterns from a retail dataset.
Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
from mlxtend.frequent_patterns import fpgrowth
fpgrowth(df, min_support=0.5, use_colnames=True)

Output:
Identified frequent itemsets without candidate generation.

Conclusion:
FP-Growth is efficient for large datasets with less computation.

Experiment 6: Implement Naïve Bayes classifier and evaluate it using accuracy,


precision, and recall

Objective:
To implement naïve bayes classifier and evaluate it using accuracy, precision, and recall.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score

Output:
Accuracy: 85%, Precision: 80%, Recall: 82%

Conclusion:
Naïve Bayes gives good performance on text/classification data.
Experiment 7: Build a Decision Tree using ID3 or C4.5 algorithm and visualize
the result

Objective:
To build a decision tree using id3 or c4.5 algorithm and visualize the result.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
from sklearn.tree import DecisionTreeClassifier, plot_tree
clf = DecisionTreeClassifier(criterion='entropy')
plot_tree(clf.fit(X, y))

Output:
Tree structure showing decision splits based on information gain.

Conclusion:
Decision Trees provide interpretable and accurate classification models.

Experiment 8: Perform classification using K-Nearest Neighbors (KNN) and


analyze the results

Objective:
To perform classification using k-nearest neighbors (knn) and analyze the results.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
Output:
KNN accuracy: 88%

Conclusion:
KNN is simple and effective for small to medium datasets.

Experiment 9: Apply K-Means clustering on a dataset and visualize cluster


separation

Objective:
To apply k-means clustering on a dataset and visualize cluster separation.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
plt.scatter(..., c=kmeans.labels_)

Output:
Visual clusters with clear boundaries among 3 groups.

Conclusion:
K-Means effectively groups similar data points.

Experiment 10: Use Hierarchical clustering and dendrogram visualization

Objective:
To use hierarchical clustering and dendrogram visualization.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results
Code:
from scipy.cluster.hierarchy import dendrogram, linkage
dendrogram(linkage(data, method='ward'))

Output:
Dendrogram showing hierarchy of merged clusters.

Conclusion:
Hierarchical clustering reveals natural data structure.

Experiment 11: Perform Principal Component Analysis (PCA) on a high-


dimensional dataset

Objective:
To perform principal component analysis (pca) on a high-dimensional dataset.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.

Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

Output:
Variance explained: PC1 - 70%, PC2 - 20%

Conclusion:
PCA reduces dimensions while retaining variance.

Experiment 12: Mini-project: Apply classification/clustering/association on a


real-world dataset and present findings

Objective:
To mini-project: apply classification/clustering/association on a real-world dataset and present
findings.

Tools:
Python / SQL / R / WEKA / Orange, as applicable.
Procedure:
1. Load data
2. Preprocess/transform
3. Apply algorithm
4. Analyze results

Code:
# Example: Titanic dataset - classification using Decision Tree

Output:
Achieved 82% accuracy with insights on key features like age and class.

Conclusion:
Applied end-to-end workflow from data cleaning to model evaluation.

You might also like