Madhav Institute of Technology and Science, Gwalior

(Deemed to be University)

NAAC accredited with A++ Grade

Centre for Artificial Intelligence

A
Practical File
On
“Data Mining And Warehousing”
(270601)

Session: Jan-May 2024

SUBMITTED BY :
PRINCE SINGH
0901AM211041

SUBMITTED TO:
Prof. Shubha Mishra
INDEX

S.No.  Name of Experiment                                                        Date of Experiment | Submitted On | Sign

1.  To perform basic operations for mining data (Preprocessing, Regression, Classification, Association, Clustering and Visualization) using WEKA simulator/Python.
2.  Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48 (WEKA's C4.5 implementation).
3.  Draw multiple ROC curves in the same plot window for J48 and RandomForest as classifiers using Knowledge Flow in WEKA.
4.  Training and testing of naive Bayes classifiers incrementally using Knowledge Flow in WEKA.
5.  Write a program to count the occurrence frequency of items in the given data set.
6.  Write a program to generate frequent itemsets from a given data set.
7.  Write a program to generate association rules from the generated frequent itemsets.
8.  Write a program to implement various Association Rule Mining algorithms such as Apriori, Eclat, FP-Growth and FP-Tree.
9.  Write a program to implement different types of classification algorithms such as SVM, Decision Tree, Random Forest and KNN.
10. Write a program to implement different types of clustering algorithms such as K-Means, Hierarchical, DBSCAN and EM Clustering.
PROGRAM - 1
AIM : To perform basic operations for mining data (Preprocessing, Regression, Classification, Association, Clustering and Visualization) using WEKA simulator/Python.

THEORY :

Weka contains a collection of visualization tools and algorithms for data analysis and
predictive modelling, together with graphical user interfaces for easy access to these
functions. Weka supports several standard data mining tasks, specifically data preprocessing,
clustering, classification, regression, visualization, and feature selection. Input to Weka is
expected to be formatted according to the Attribute-Relation File Format (ARFF) and stored
in a file with the .arff extension.

 Preprocessing:

The preprocessing of data is a crucial task in data mining. Most raw data contains empty or
duplicate values, garbage values, outliers, extra columns, or inconsistent naming conventions,
all of which degrade the results.

To make the data cleaner and more consistent, WEKA provides a comprehensive set of
options under the Filter category.
 Classification :

Classification is one of the essential functions in machine learning, where we assign classes
or categories to items. The classic examples of classification are: declaring a brain tumour as
"malignant" or "benign" or assigning an email to a "spam" or "not_spam" class.
 Clustering :

In clustering, a dataset is arranged into different groups/clusters based on some similarities.
In this case, the items within the same cluster are similar to one another but different from
the items in other clusters. Examples of clustering include identifying customers with similar
behaviours and organizing regions according to homogeneous land use.

 Association :

Association rules highlight all the associations and correlations between items of a dataset. In
short, an association rule is an if-then statement that depicts the probability of relationships
between data items. A classic example of association is the connection between the sale of
milk and bread. In this category, the tool provides the Apriori, FilteredAssociator, and
FPGrowth algorithms for association rule mining.
 Visualisation :

In the visualize tab, different plot matrices and graphs are available to show the trends and
errors identified by the model.
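
The same basic operations can also be sketched directly in Python with scikit-learn. The snippet below is only a minimal, illustrative example on the built-in Iris dataset (it is not tied to any particular WEKA workflow; association rule mining is covered separately in Programs 6-8):

import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score

# Load the Iris dataset (stands in for an ARFF file loaded into WEKA)
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Preprocessing: standardize the features (analogous to a WEKA filter)
X_scaled = StandardScaler().fit_transform(X)

# Classification: decision tree on a train/test split
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
print("Classification accuracy:", accuracy_score(y_test, clf.predict(X_test)))

# Regression: predict petal width from the other three attributes
reg = LinearRegression().fit(X_scaled[:, :3], X_scaled[:, 3])
print("Regression R^2:", reg.score(X_scaled[:, :3], X_scaled[:, 3]))

# Clustering: group the samples into three clusters
labels = KMeans(n_clusters=3, random_state=42, n_init=10).fit_predict(X_scaled)

# Visualization: scatter plot of the first two attributes coloured by cluster
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=labels, cmap='viridis')
plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])
plt.title('K-Means clusters on the Iris data')
plt.show()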
PROGRAM - 2
AIM : Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation
using J48 (WEKA's C4.5 implementation).

THEORY :

 ARFF (Attribute-Relation File Format):

ARFF is a file format commonly used to describe datasets for WEKA. It includes information
about the dataset's attributes, their types, and the data values.

 Cross-validation:

A technique used to assess the performance and generalizability of a machine learning model.
It involves partitioning the dataset into subsets, training the model on some subsets, and
evaluating it on others.

 J48 (C4.5):

J48 is an implementation of the C4.5 algorithm in WEKA. It's a decision tree algorithm used
for classification.
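
Outside the Knowledge Flow, an equivalent batch experiment can be sketched in Python: scipy reads the ARFF file and a CART decision tree from scikit-learn stands in for J48. The file path, the class attribute name and the fold count below are assumptions, not part of the WEKA flow itself:

import pandas as pd
from scipy.io import arff
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Load the ARFF file in batch mode (path is illustrative)
data, meta = arff.loadarff("iris.arff")
df = pd.DataFrame(data)

# Nominal attributes come back as bytes; decode the class column
df["class"] = df["class"].str.decode("utf-8")

X = df.drop(columns=["class"]).values
y = df["class"].values

# 10-fold cross-validation with a decision tree (a CART stand-in for J48/C4.5)
scores = cross_val_score(DecisionTreeClassifier(random_state=42), X, y, cv=10)
print("Per-fold accuracy:", scores)
print("Mean accuracy: {:.2f}%".format(scores.mean() * 100))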

OUTPUT –
PROGRAM - 3
AIM - Draw multiple ROC curves in the same plot window for J48 and Random Forest as
classifiers using Knowledge Flow in Weka.

THEORY –

 ROC Curves:

Receiver Operating Characteristic (ROC) curves are graphical representations commonly
used to evaluate the performance of binary classification algorithms. They illustrate the
trade-off between the true positive rate (sensitivity) and the false positive rate
(1 - specificity) across various decision thresholds.

 Key Concepts:

 True Positive Rate (TPR):

Also known as sensitivity, TPR measures the proportion of actual positive instances that
are correctly identified by the classifier.

TPR = TP / (TP + FN), where TP denotes true positives and FN denotes false negatives.

 False Positive Rate (FPR):

FPR measures the proportion of actual negative instances that are incorrectly classified
as positive by the classifier.

FPR = FP / (FP + TN), where FP denotes false positives and TN denotes true negatives.

The ROC curve is created by plotting the TPR against the FPR for different threshold values.
Each point on the curve represents a sensitivity-specificity pair corresponding to a particular
decision threshold. A diagonal line (the line of no-discrimination) represents the
performance of a random classifier.

 Interpretation of ROC Curves:

 Area Under the Curve (AUC):


• AUC quantifies the overall performance of the classifier.
• AUC ranges from 0 to 1, where 1 indicates a perfect classifier, and 0.5 indicates
a random classifier.
• Higher AUC values indicate better classifier performance.

 Shape of the ROC Curve:


 ROC curves with higher elevations and closer to the upper-left corner indicate
superior classifier performance.
 The closer the ROC curve to the upper-left corner, the better the classifier
discriminates between positive and negative instances.
 Use Cases and Significance:

 Comparative Analysis:
ROC curves enable the comparison of multiple classifiers to determine which one
performs better across various decision thresholds.
It helps in selecting the most suitable classifier for a given task based on its AUC value.

 Model Selection and Tuning:


ROC curves aid in tuning classifier parameters to optimize performance.
They provide insights into the sensitivity-specificity trade-offs, helping to select an
appropriate operating point for the classifier.
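
For comparison, the same kind of experiment can be approximated outside WEKA. The Python sketch below is only an illustrative analogue of the Knowledge Flow setup: it overlays ROC curves for a decision tree (a J48 analogue) and a random forest on a binary dataset.

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Binary classification data (stands in for an ARFF file with a two-class target)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

classifiers = {
    "Decision Tree (J48 analogue)": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

plt.figure()
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # Probability of the positive class drives the threshold sweep
    scores = clf.predict_proba(X_test)[:, 1]
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr, label="{} (AUC = {:.3f})".format(name, auc(fpr, tpr)))

# Diagonal line of no-discrimination (random classifier)
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC curves: Decision Tree vs Random Forest")
plt.legend()
plt.show()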

WEKA KNOWLEDGE FLOW ENVIRONMENT VISUALIZATION :


RESULT :
PROGRAM - 4
AIM : Training and testing of naive Bayes classifiers incrementally using Knowledge Flow in
Weka.

THEORY :

Naive Bayes classifiers are simple probabilistic classifiers based on applying Bayes' theorem
with strong (naive) independence assumptions between the features. They are often used in text
classification, spam filtering, and other applications where the assumption of independence
between features holds reasonably well.

In WEKA, the Knowledge Flow interface allows for the creation of workflows for data mining
tasks, including incremental learning. The IncrementalClassifierUpdate operator in WEKA's
Knowledge Flow allows us to train and test Naive Bayes classifiers incrementally, updating the
model as new data arrives.

Components :

 ARFF (Attribute-Relation File Format) : ARFF is a file format commonly used to


describe datasets for WEKA. It includes information about the dataset's attributes, their
types, and the data values.

 ClassAssigner : Sets the chosen column (first or last) of the incoming data as the class attribute.

 NaiveBayesUpdateable : This is a class that implements the Naive Bayes algorithm for
classification and is designed to handle data streams or situations where data arrives
sequentially and cannot be stored in memory all at once. It allows for incremental updating of
the model as new data arrives, which is useful for scenarios where training data is constantly
changing or evolving.

 IncrementalClassifierEvaluator : This is a class that allows you to evaluate the
performance of a classifier on a data stream or in a situation where data arrives sequentially
and cannot be stored in memory all at once. It is useful for assessing the performance of
classifiers in online learning scenarios or when dealing with continuously evolving data.

 TextViewer : Used to show the results of the model in text format.
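
A rough Python analogue of this incremental flow is sketched below: scikit-learn's GaussianNB with partial_fit plays the role of NaiveBayesUpdateable, and the running accuracy printout mimics the IncrementalClassifierEvaluator. The dataset and batch size are purely illustrative:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.utils import shuffle

# Load and shuffle the data, then hold out a test set for evaluation
X, y = load_iris(return_X_y=True)
X, y = shuffle(X, y, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

nb = GaussianNB()
classes = np.unique(y)

# Feed the training data in small chunks, updating the model after each chunk
batch_size = 15
for start in range(0, len(X_train), batch_size):
    X_batch = X_train[start:start + batch_size]
    y_batch = y_train[start:start + batch_size]
    nb.partial_fit(X_batch, y_batch, classes=classes)  # incremental update
    acc = accuracy_score(y_test, nb.predict(X_test))
    print("Seen {:3d} instances -> test accuracy {:.2f}%".format(start + len(X_batch), acc * 100))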


OUTPUT :
PROGRAM - 5
AIM : Write a program to count the occurrence frequency of items in the given data set.

THEORY :

Counting the occurrence frequency of items in a dataset is a fundamental task in data analysis. It
provides valuable insights into the distribution of data and helps in understanding the importance
or prevalence of different categories or classes within the dataset. This information can be useful
in various applications such as classification, anomaly detection, and clustering.

CODE :

import pandas as pd

iris_df = pd.read_csv("/content/iris_data.csv", header=None,
                      names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class'])

# Count the occurrence frequency of items in the 'class' column
frequency = iris_df['class'].value_counts()

print("Occurrence frequency of items in the IRIS dataset:")
print(frequency)

OUTPUT :
PROGRAM - 6
AIM : Write a program to generate frequent itemsets from a given data set.

THEORY :

In this experiment, we applied the Apriori algorithm to a given dataset to generate frequent
itemsets. The algorithm identified sets of items that frequently appear together in transactions,
with the minimum support threshold set to 0.2.

The generated frequent itemsets can be used to derive association rules, which can provide
valuable insights into the relationships between different items in the dataset. These rules can be
used for various purposes, such as market basket analysis, where they can help identify patterns
in customer purchasing behavior.

CODE :

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori

df = pd.read_csv("/new_dataset.csv")

te = TransactionEncoder()
te_ary = te.fit(df.values).transform(df.values)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)


print("Frequent Itemsets:")
print(frequent_itemsets)

OUTPUT :
PROGRAM - 7
AIM : Write a program to generate Association rules from the generated frequent itemsets.

THEORY :

Association rule mining is a technique used to discover interesting relationships, or associations,
between items in large datasets. It is often used in market basket analysis to uncover patterns in
consumer behavior. The process involves finding frequent itemsets, which are sets of items that
frequently occur together in transactions, and then deriving association rules from these itemsets.

In this experiment, we first generate frequent itemsets from the dataset using the Apriori
algorithm. Frequent itemsets are sets of items that have a support value greater than a specified
threshold. We then use these frequent itemsets to generate association rules. Association rules
are rules that indicate a strong relationship between the presence of certain items (antecedent)
and the presence of another item (consequent) in a transaction, based on the support and
confidence values of the rule.

CODE :

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

df = pd.read_csv("/new_dataset.csv")

te = TransactionEncoder()
te_ary = te.fit(df.values).transform(df.values)
df = pd.DataFrame(te_ary, columns=te.columns_)

frequent_itemsets = apriori(df, min_support=0.2, use_colnames=True)

association_rules_df = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)

print("Association Rules:")
print(association_rules_df)

OUTPUT :
Program – 8
Aim: Write a program to implement various Association Rule Mining algorithms such as
Apriori, Eclat, FP-Growth and FP-Tree.

Theory:
1. Apriori Algorithm
Theory:

The Apriori algorithm is based on the principle of "apriori property," which states that any subset
of a frequent itemset must also be frequent. The algorithm employs a level-wise approach to
discover frequent itemsets. It starts by identifying frequent individual items, then iteratively
generates larger itemsets by joining frequent itemsets found in the previous step.

Implementation:

The Python implementation of the Apriori algorithm involves generating candidate itemsets,
pruning infrequent itemsets, and iterating until no new frequent itemsets are found. It then
derives association rules based on the discovered frequent itemsets and evaluates them using
support and confidence measures.
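
As an illustration, a minimal from-scratch sketch of this join-and-prune loop on a small hand-made transaction list is shown below (the items and the 0.4 support threshold are only examples):

from itertools import combinations

# A small transaction database (illustrative)
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "jam"},
]
min_support = 0.4
n = len(transactions)

def support(itemset):
    # Fraction of transactions that contain every item of the itemset
    return sum(itemset <= t for t in transactions) / n

# Level 1: frequent individual items
current = [frozenset([i]) for i in set().union(*transactions)
           if support(frozenset([i])) >= min_support]
frequent = {c: support(c) for c in current}

k = 2
while current:
    # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    # Prune step: keep candidates whose every (k-1)-subset is frequent
    candidates = [c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, k - 1))]
    current = [c for c in candidates if support(c) >= min_support]
    frequent.update({c: support(c) for c in current})
    k += 1

for itemset, sup in sorted(frequent.items(), key=lambda kv: (-kv[1], len(kv[0]))):
    print(set(itemset), "support = {:.2f}".format(sup))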

2. Eclat Algorithm

Theory:

Eclat (Equivalence Class Clustering and bottom-up Lattice Traversal) is a depth-first search
algorithm that avoids candidate generation. It uses a vertical database representation to
efficiently mine frequent itemsets by intersecting transactions containing each item.

Implementation:

The Python implementation of the Eclat algorithm involves constructing a vertical database
representation, recursively exploring itemsets, and counting their support. It efficiently generates
frequent itemsets without the need for candidate generation, making it suitable for memory-
constrained environments.
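
A compact, illustrative sketch of the tid-list intersection idea, reusing the same style of toy transactions (the absolute minimum count of 2 is an assumption):

from collections import defaultdict

transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "jam"},
]
min_count = 2  # absolute minimum support count

# Vertical representation: each item maps to the set of transaction ids containing it
tidsets = defaultdict(set)
for tid, t in enumerate(transactions):
    for item in t:
        tidsets[item].add(tid)

frequent = {}

def eclat(prefix, items):
    # Depth-first search: extend 'prefix' with each item whose tidset stays frequent
    while items:
        item, tids = items.pop()
        if len(tids) >= min_count:
            itemset = prefix | {item}
            frequent[frozenset(itemset)] = len(tids)
            # Intersect tid-lists to form the candidate extensions for deeper recursion
            suffix = [(other, tids & other_tids)
                      for other, other_tids in items
                      if len(tids & other_tids) >= min_count]
            eclat(itemset, suffix)

eclat(set(), sorted(tidsets.items()))
for itemset, count in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(set(itemset), "count =", count)
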
3. FP-Growth Algorithm

Theory:

The FP-Growth (Frequent Pattern Growth) algorithm is a tree-based method that constructs a
compact data structure called FP-tree to represent the dataset. It then recursively mines frequent
itemsets from the FP-tree by exploiting the properties of prefix paths.

Implementation:

The Python implementation of the FP-Growth algorithm involves constructing the FP-tree,
mining frequent itemsets using the FP-tree structure, and deriving association rules. FP-Growth
eliminates the need for candidate generation and multiple scans of the dataset, making it highly
efficient for large datasets.
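
Implementing FP-Growth from scratch is considerably longer, so the illustrative sketch below leans on mlxtend's ready-made fpgrowth function (the transactions and the 0.4 threshold are again only examples):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Toy transactions, one-hot encoded as mlxtend expects
transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "jam"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets directly via the FP-tree, with no candidate generation
frequent_itemsets = fpgrowth(onehot, min_support=0.4, use_colnames=True)
print(frequent_itemsets)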

4. FP-Tree Algorithm

Theory:

FP-Tree is a variation of the FP-Growth algorithm that focuses on constructing the FP-tree data
structure efficiently. It uses a frequent itemset ordering technique to optimize the construction
process and reduce memory consumption.

Implementation:

The Python implementation of the FP-Tree algorithm involves constructing the FP-tree data
structure, mining frequent itemsets, and deriving association rules. It shares similarities with FP-
Growth but may offer better performance in certain scenarios due to its optimized tree
construction.
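
To make the tree structure itself concrete, the following is a bare-bones, illustrative FP-tree construction (the header table and node links are omitted for brevity; items are ordered by descending global frequency as described above):

from collections import Counter

class FPNode:
    # One node of an FP-tree: an item, its count, and links to children
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count=2):
    # 1. Count items and keep only the frequent ones
    counts = Counter(item for t in transactions for item in t)
    frequent = {i for i, c in counts.items() if c >= min_count}

    # 2. Insert each transaction, items sorted by descending global frequency
    root = FPNode(None)
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-counts[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
            node = node.children[item]
            node.count += 1  # shared prefixes just increment the count
    return root

def show(node, depth=0):
    label = node.item if node.item is not None else "(root)"
    print("  " * depth + "{} ({})".format(label, node.count))
    for child in node.children.values():
        show(child, depth + 1)

transactions = [
    ["milk", "bread", "butter"],
    ["bread", "butter"],
    ["milk", "bread"],
    ["milk", "butter"],
    ["bread", "jam"],
]
show(build_fp_tree(transactions))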

Code:
Output:
Program – 9
Aim: Write a program to implement different types of classification algorithms such as SVM, Decision Tree, Random Forest and KNN.

Theory:

Classification algorithms are essential tools in machine learning and data mining, facilitating the
categorization of data into distinct classes or categories based on input features. Various
classification algorithms employ different approaches to learn patterns and make predictions.

Decision Trees: Decision trees partition the feature space into regions, creating a tree-like
structure where each internal node represents a decision based on a feature value, and each leaf
node represents a class label. Decision trees are interpretable and can handle both numerical and
categorical data, making them suitable for understanding complex decision-making processes.

Random Forest: Random Forest is an ensemble learning method that constructs multiple decision
trees during training. It aggregates the predictions of individual trees to determine the final class
label. By reducing overfitting and improving generalization, Random Forest achieves higher
accuracy than individual decision trees. It also provides estimates of feature importance, aiding
in feature selection.

Support Vector Machines (SVM): SVM aims to find the optimal hyperplane that separates data
points of different classes with the maximum margin. SVM can handle high-dimensional data
efficiently and is effective in cases where the data is not linearly separable by transforming the
feature space using kernel functions. However, SVM's performance may degrade with large
datasets.

K-Nearest Neighbors (KNN): KNN is a non-parametric algorithm that classifies data points
based on the majority class among their 'k' nearest neighbors in the feature space. KNN is simple
and intuitive, making no assumptions about the underlying data distribution. However, its
performance can be sensitive to the choice of the distance metric and the value of 'k'.

Code:

import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

def train_and_evaluate(classifier, X_train, X_test, y_train, y_test):
    # Train the classifier
    classifier.fit(X_train, y_train)
    # Make predictions
    y_pred = classifier.predict(X_test)
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

# Load sample dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Feature scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Define classifiers
svm_classifier = SVC(kernel='linear', random_state=42)
dt_classifier = DecisionTreeClassifier(random_state=42)
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
knn_classifier = KNeighborsClassifier(n_neighbors=3)

# Train and evaluate classifiers
classifiers = {
    "SVM": svm_classifier,
    "Decision Tree": dt_classifier,
    "Random Forest": rf_classifier,
    "K-Nearest Neighbors": knn_classifier
}
for name, classifier in classifiers.items():
    accuracy = train_and_evaluate(classifier, X_train, X_test, y_train, y_test)
    print("Accuracy of {} classifier: {:.2f}%".format(name, accuracy * 100))

Output:
Program - 10
Aim: Write a program to implement different types of clustering algorithms such as K-Means,
Hierarchical, DBSCAN and EM clustering.

Theory:
Clustering algorithms are unsupervised learning techniques used to group similar data points
together. One commonly used algorithm is K-Means, which partitions the data into 'k' clusters by
iteratively updating cluster centroids and assigning data points to the nearest centroid. K-Means
is efficient and easy to implement but requires the number of clusters to be specified beforehand.

Hierarchical clustering builds a hierarchy of clusters by merging or splitting them based on the
similarity between data points. It can be agglomerative (starting with individual data points and
merging them into clusters) or divisive (starting with all data points in one cluster and
recursively splitting them).

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters
based on the density of data points. It groups together points in high-density regions while
classifying points in low-density regions as outliers. DBSCAN is effective in discovering
clusters of arbitrary shapes and sizes and does not require the number of clusters as input.

EM (Expectation-Maximization) clustering models the data as a mixture of Gaussian
distributions. It iteratively maximizes the likelihood of the data under the Gaussian mixture
model, estimating the parameters of each component distribution. EM clustering is flexible and
can capture complex cluster structures, making it suitable for datasets with overlapping clusters
or non-spherical shapes.

Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Generate sample data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=0)

# K-Means clustering
kmeans = KMeans(n_clusters=4)
kmeans_labels = kmeans.fit_predict(X)

# Hierarchical clustering
hierarchical = AgglomerativeClustering(n_clusters=4)
hierarchical_labels = hierarchical.fit_predict(X)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# EM (Expectation-Maximization) clustering
em = GaussianMixture(n_components=4)
em_labels = em.fit_predict(X)

# Plotting
plt.figure(figsize=(12, 10))

plt.subplot(2, 2, 1)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis')
plt.title('K-Means Clustering')

plt.subplot(2, 2, 2)
plt.scatter(X[:, 0], X[:, 1], c=hierarchical_labels, cmap='viridis')
plt.title('Hierarchical Clustering')

plt.subplot(2, 2, 3)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis')
plt.title('DBSCAN Clustering')

plt.subplot(2, 2, 4)
plt.scatter(X[:, 0], X[:, 1], c=em_labels, cmap='viridis')
plt.title('EM (Expectation-Maximization) Clustering')

plt.tight_layout()
plt.show()
Output:
