Data Mining Lab File
Lab Report
of
Data Mining & Warehousing (280601)
SUBMITTED BY
Aashutosh Savita
0901AM211001
3rd Year (Sixth Semester)
Artificial Intelligence and Machine Learning
SUBMITTED TO
Dr. Shubha Mishra, Prof. Gaurisha Sisodia
Session 2023-2024
INDEX
S.No.    List of Experiments    Date of Submission    Sign.
Experiment – 1
Aim: To perform basic operations for mining data (Preprocessing, Regression, Classification, Association, Clustering and Visualization) using the WEKA simulator/Python.
Theory:
• Dataset - The Iris dataset includes three species of iris flowers: Setosa, Versicolor, and
Virginica. These are the target classes in classification tasks.
• Features – Each flower is described by four numeric measurements (in centimetres): sepal length, sepal width, petal length, and petal width.
• Tasks – The dataset supports the following mining tasks:
1. Classification:
Given the features of an iris flower, predict its species (Setosa, Versicolor, or
Virginica).
2. Regression:
Predict numerical values, such as the length or width of a specific feature based on the
others.
3. Clustering:
Group similar iris flowers based on their feature similarities without using predefined
species labels.
4. Association:
Discover associations between different features or characteristics of iris flowers.
• Preprocessing: Cleaning and preparing the data before mining, e.g., handling missing values and normalizing or discretizing the numeric attributes.
• Classification: Building a model (such as a decision tree) that predicts the species of a flower from its measurements.
• Association: Mining rules that describe which attribute values frequently occur together (after discretizing the numeric features).
• Clustering: Grouping similar flowers by their measurements, without using the species labels, e.g., with k-means.
• Visualization: Plotting attribute distributions and scatter plots to inspect how well the species separate.
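Since the aim allows Python as an alternative to the WEKA simulator, a minimal scikit-learn sketch of these operations on the Iris data is given below; the specific choices (a decision tree for classification, k-means for clustering, which two features to plot) are illustrative assumptions rather than the only options.
# Minimal sketch: preprocessing, classification, clustering and visualization on Iris
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Preprocessing: standardize the four numeric features
X_scaled = StandardScaler().fit_transform(X)

# Classification: predict the species from the measurements
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Clustering: group the flowers without using the species labels
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Visualization: petal length vs. petal width, coloured by cluster
plt.scatter(X[:, 2], X[:, 3], c=labels)
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.title('K-Means clusters on the Iris data')
plt.show()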
Experiment – 2
Aim: To establish a workflow for loading an ARFF file in batch mode and executing cross-validation using J48, WEKA's implementation of C4.5.
Theory:
Cross-validation:
A technique used to assess the performance and generalizability of a machine learning model.
It involves partitioning the dataset into subsets, training the model on some subsets, and
evaluating it on others.
J48 (C4.5):
J48 is an implementation of the C4.5 algorithm in WEKA. It's a decision tree algorithm used
for classification.
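The Knowledge Flow workflow itself is built graphically in WEKA; as a rough Python analogue, scikit-learn's DecisionTreeClassifier (not identical to J48, but a C4.5-style learner) can be cross-validated on an ARFF file loaded with scipy, as sketched below. The file path and the name of the class attribute ('class') are assumptions.
# Load an ARFF file and run 10-fold cross-validation with a decision tree
import pandas as pd
from scipy.io import arff
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

data, meta = arff.loadarff('iris.arff')            # ARFF file path assumed
df = pd.DataFrame(data)
df['class'] = df['class'].str.decode('utf-8')      # nominal attributes are read as bytes

X = df.drop(columns='class').values
y = df['class'].values

# 10-fold cross-validation, as in the WEKA workflow
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())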
Output:
Experiment – 3
Aim: To draw multiple ROC curves in the same plot window for the J48 and RandomForest classifiers using Knowledge Flow in WEKA.
Theory:
1. ROC Curve: The ROC curve is a graphical representation of a classifier's performance
across various threshold settings. It plots the true positive rate (sensitivity) against the false
positive rate (1-specificity). ROC curves are commonly used to assess and compare the
performance of different classifiers.
2. J48 Classifier: J48 is Weka's implementation of the C4.5 decision tree algorithm. It
constructs a decision tree from the training data and uses it to classify new instances.
Decision trees are interpretable and can handle both numerical and categorical data.
3. RandomForest Classifier: RandomForest is an ensemble learning method that constructs
multiple decision trees during training and outputs the class that is the mode of the classes
of the individual trees. It improves accuracy and reduces overfitting by aggregating the
results of multiple decision trees.
4. AUC (Area Under the Curve): AUC is a metric often used to summarize the performance
of a ROC curve. AUC closer to 1 indicates better performance, while an AUC of 0.5
suggests a random classifier. Comparing the AUC values of different classifiers helps in
determining which classifier performs better for the given task.
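Knowledge Flow overlays the curves through its model performance chart component; a rough Python equivalent that draws both ROC curves in one plot window is sketched below. A scikit-learn decision tree stands in for J48, and the binary dataset used here is an illustrative choice.
# Sketch: ROC curves for a decision tree (J48 analogue) and a random forest in one plot
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [('J48-style tree', DecisionTreeClassifier(random_state=0)),
                    ('RandomForest', RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=name + ' (AUC = %.2f)' % auc(fpr, tpr))

plt.plot([0, 1], [0, 1], linestyle='--', label='Random classifier (AUC = 0.5)')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (sensitivity)')
plt.title('ROC curves: J48-style tree vs. RandomForest')
plt.legend()
plt.show()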
Output:
Experiment – 4
Aim: To perform incremental training and testing of naive Bayes classifiers using Knowledge Flow in WEKA for efficient classification tasks.
Theory:
1. Incremental Training:
Unlike batch training where the entire dataset is used at once, incremental training updates
the classifier's parameters incrementally as new data becomes available. This approach
allows classifiers to adapt to changes in the data distribution over time without the need to
retrain the model from scratch.
2. Naive Bayes Classifier:
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive"
assumption of independence between features. It calculates the probability of a class label
given the observed features using conditional probability distributions.
3. Updating Probability Distributions:
When new data is introduced, the probability distributions of the features and class labels
need to be updated. This involves adjusting the probabilities of different feature values
and class labels based on the new observations.
4. Maintaining Previous Parameters:
While updating the model, it's essential to retain the previously learned parameters to
preserve the knowledge gained from past data. This ensures that the classifier continues
to make informed predictions based on both old and new data.
5. Model Adaptation:
Incremental training allows the classifier to adapt to changes in the underlying data
distribution, such as concept drift or the emergence of new patterns. By continuously
updating the model, it remains relevant and accurate over time.
6. Knowledge Flow in WEKA:
WEKA's Knowledge Flow interface provides a visual environment for building and
executing data processing pipelines. It allows users to create workflows consisting of
various data preprocessing, modeling, and evaluation steps. With Knowledge Flow,
incremental training and testing procedures can be seamlessly integrated into classification
workflows, simplifying the implementation of incremental learning algorithms like naive
Bayes.
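In WEKA this is typically done by connecting an ArffLoader (emitting one instance at a time) to the NaiveBayesUpdateable classifier in Knowledge Flow. A rough Python counterpart is sketched below using scikit-learn's partial_fit, which updates a naive Bayes model one batch at a time; the way the training data is split into batches is an assumption made purely for illustration.
# Sketch of incremental naive Bayes: the model is updated batch by batch with partial_fit
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB()
classes = np.unique(y)          # every possible class label must be declared up front

# Feed the training data to the classifier in small batches, as if it arrived over time
for X_batch, y_batch in zip(np.array_split(X_train, 5), np.array_split(y_train, 5)):
    nb.partial_fit(X_batch, y_batch, classes=classes)
    print("Accuracy after this batch:", nb.score(X_test, y_test))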
Output:
Experiment – 5
Aim: To create a program that determines the frequency of item occurrences within a given dataset, as a practical exercise in computational analysis for data mining tasks.
Theory:
1. Import Necessary Libraries: We start by importing the pandas library, which is a
powerful tool for data manipulation and analysis in Python. We'll use it to load and handle
the Iris dataset.
2. Load the Iris Dataset: We use the read_csv function from pandas to load the Iris dataset
from a URL. This dataset contains information about iris flowers, including measurements
of sepal and petal lengths and widths, as well as the species of iris. We specify the column
names explicitly using the names parameter to ensure clarity.
3. Initialize a Dictionary: We create an empty dictionary named frequency to store the
frequency of each species in the dataset. The keys of this dictionary will be the unique
species, and the values will be their corresponding frequencies.
4. Count Occurrences: We iterate through the species data extracted from the dataset. For
each species encountered, we update its frequency in the frequency dictionary. If the
species is already in the dictionary, we increment its count by 1. If the species is not yet in
the dictionary, we initialize its count to 1.
5. Output the Results: After counting the occurrences of all species in the dataset, we print
the results. We iterate through the items in the frequency dictionary and print each species
along with its count.
Code:
import pandas as pd

column = 'Species'

def main():
    # Dataset URL assumed (the classic UCI copy of the Iris data); any local CSV copy also works
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                       names=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', column])
    # Count the occurrences of each species
    frequency = {}
    for value in data[column]:
        frequency[value] = frequency.get(value, 0) + 1
    # Output the results
    for value, count in frequency.items():
        print(value, ':', count)

if __name__ == "__main__":
    main()
Output:
Experiment – 6
Aim: The aim of this experiment is to implement the Apriori algorithm to generate frequent
itemsets from a given dataset. Frequent itemsets represent sets of items that frequently appear
together in transactions, which can provide insights into patterns and associations within the
data.
Theory: The Apriori algorithm is a classical algorithm used in data mining to discover
frequent itemsets in a transaction database. It employs an iterative approach to discover
frequent itemsets by generating candidate itemsets and pruning those that do not satisfy the
minimum support threshold.
1. Support: Support is a measure used to identify the frequency of occurrence of an
itemset in the dataset. It is calculated as the ratio of the number of transactions
containing the itemset to the total number of transactions.
2. Apriori Principle: The key idea behind the Apriori algorithm is that if an itemset is
frequent, then all of its subsets must also be frequent. This principle is used to reduce
the search space by eliminating candidate itemsets that contain infrequent subsets.
3. Apriori Algorithm Steps:
• Step 1: Generate frequent itemsets of size 1 (singleton itemsets).
• Step 2: Generate candidate itemsets of size k+1 from frequent itemsets of size
k, by joining pairs of frequent itemsets.
• Step 3: Prune candidate itemsets that contain infrequent subsets.
• Step 4: Calculate the support of remaining candidate itemsets.
• Step 5: Repeat steps 2-4 until no new frequent itemsets can be generated.
Code:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset (assumed to be a headerless CSV with one transaction per row)
dataset = pd.read_csv('Market_basket_data.csv', header=None)
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

# Training the Apriori model on the transactions
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2,
                min_lift=3, min_length=2, max_length=2)
results = list(rules)
results
Output:
Experiment – 7
Aim: The aim of this experiment is to implement a program to generate association rules from the generated frequent itemsets. Association rules reveal relationships between items in transactional data, enabling the discovery of interesting patterns and insights.
Theory: Association rules are typically represented as "if-then" statements of the form X -> Y,
where X and Y are itemsets. These rules are derived from frequent itemsets and are characterized by
two metrics: support and confidence.
1. Support: Support measures the frequency of occurrence of an itemset in the dataset.
It is calculated as the ratio of the number of transactions containing the itemset to the
total number of transactions.
2. Confidence: Confidence measures the reliability of the association rule. It is
calculated as the ratio of the number of transactions containing both X and Y to the
number of transactions containing X.
3. Association Rule Generation: Association rules are generated from frequent itemsets
by exploring all possible combinations of items within each frequent itemset. For each
frequent itemset, association rules are generated by considering all possible non-
empty subsets as antecedents (X) and the remaining items as consequents (Y).
Code:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset (assumed to be a headerless CSV with one transaction per row)
dataset = pd.read_csv('Market_basket_data.csv', header=None)
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

# Generating the association rules
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2,
                min_lift=3, min_length=2, max_length=2)
results = list(rules)

# Displaying each rule together with its support, confidence, and lift
for item in results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
Experiment – 8
Aim: The aim of this experiment is to implement various Association Rule Mining algorithms, including Apriori, Eclat, and FP-Growth. These algorithms are widely used for discovering frequent itemsets and generating association rules from transaction datasets.
Theory:
1) Apriori Algorithm:
a) The Apriori algorithm is a classical method for Association Rule Mining.
b) Support: Support is a measure of the frequency of occurrence of an itemset in the
dataset.
c) Apriori Principle: If an itemset is frequent, then all of its subsets must also be
frequent.
d) Algorithm Steps:
i) Initialize frequent itemsets of size 1.
ii) Generate candidate itemsets of size k+1 from frequent itemsets of size k.
iii) Prune candidate itemsets containing infrequent subsets.
iv) Calculate the support of remaining candidate itemsets.
v) Repeat the process until no new frequent itemsets can be generated.
2) Eclat Algorithm:
a) Eclat is an efficient method for Association Rule Mining.
b) Vertical Data Structure: Eclat uses a vertical data structure, associating each item with
the list of transactions in which it appears.
c) Depth-First Search: Eclat employs a depth-first search strategy to explore the lattice
of itemsets.
d) Recursive Algorithm: Eclat is a recursive algorithm that combines itemsets based on
their support counts.
3) FP-Growth Algorithm:
a) FP-Growth is a tree-based method for Association Rule Mining.
b) FP-Tree Construction: FP-Growth constructs a special data structure called the FP-
Tree to encode itemset frequencies.
c) Header Table: FP-Growth uses a header table to link identical items in the FP-Tree.
d) Conditional FP-Tree: FP-Growth recursively constructs conditional FP-Trees for each
frequent item in the dataset.
Code:
Apriori:
# Reuses the `transactions` list and the pandas import (pd) from the earlier experiments
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2,
                min_lift=3, min_length=2, max_length=2)
results = list(rules)

# Collect each rule and its metrics into a DataFrame for easier inspection
def inspect(results):
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

resultsinDataFrame = pd.DataFrame(inspect(results),
                                  columns=['Left Hand Side', 'Right Hand Side',
                                           'Support', 'Confidence', 'Lift'])
resultsinDataFrame.nlargest(n=10, columns='Lift')
Eclat:
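Eclat has no single standard implementation in the libraries used above, so the sketch below implements the idea from the theory directly: each item is mapped to the set of transaction IDs in which it occurs (a vertical layout), and itemsets are extended depth-first by intersecting those TID sets. The toy transaction list and the support threshold are illustrative assumptions.
# Minimal Eclat sketch using vertical TID sets and a depth-first search
def eclat(transactions, min_support):
    n = len(transactions)
    # Vertical representation: item -> set of transaction indices containing it
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def recurse(prefix, items):
        # items: (item, tidset) pairs that may extend the current prefix
        for i, (item, tids) in enumerate(items):
            support = len(tids) / n
            if support >= min_support:
                itemset = prefix + (item,)
                frequent[itemset] = support
                # Intersect TID sets with the remaining items and recurse
                suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
                recurse(itemset, suffix)

    recurse(tuple(), sorted(tidsets.items()))
    return frequent

# Illustrative usage on a toy transaction list
toy = [['Milk', 'Bread'], ['Milk', 'Diaper', 'Beer'],
       ['Bread', 'Diaper'], ['Milk', 'Bread', 'Diaper']]
for itemset, support in eclat(toy, min_support=0.5).items():
    print(itemset, round(support, 2))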
FP-Growth:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
from mlxtend.frequent_patterns import fpgrowth
fpgrowth(df, min_support=0.6, use_colnames=True)
Output:
Apriori:
Eclat:
FP-Growth:
Experiment – 9
Aim: The aim of this experiment is to implement the Support Vector Machine (SVM)
algorithm for binary classification tasks. SVM is a powerful supervised learning algorithm
used for classification, regression, and outlier detection.
Theory:
Support Vector Machine (SVM) is a popular supervised machine learning algorithm that is
commonly used for classification tasks. The basic idea behind SVM is to find the hyperplane
that best separates the classes in the feature space. This hyperplane is chosen such that it
maximizes the margin, which is the distance between the hyperplane and the nearest data
points (support vectors) from each class.
1. Linear SVM: In linear SVM, the hyperplane is linear, meaning it is a straight line in
two dimensions, a plane in three dimensions, and a hyperplane in higher dimensions.
Linear SVM works well when the data is linearly separable.
2. Kernel SVM: Kernel SVM extends linear SVM to handle non-linearly separable data
by mapping the input features into a higher-dimensional space using a kernel
function. This allows SVM to find a non-linear decision boundary in the original
feature space.
3. Optimization: The optimization problem in SVM involves maximizing the margin
while minimizing the classification error. This is typically formulated as a convex
optimization problem and solved using techniques like gradient descent or quadratic
programming.
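The code below uses a linear kernel; for the non-linearly separable case described in point 2, the same pipeline can simply swap in a kernel SVM, for example (the hyperparameter values shown are scikit-learn's defaults and purely illustrative):
# Kernel SVM variant: replace the linear classifier with an RBF-kernel SVC
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', gamma='scale', C=1.0, random_state=0)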
Code:
# Importing the libraries and the dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(),
X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(),
X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
Experiment – 10
Aim: The aim of this experiment is to implement and compare different types of clustering
algorithms, including K-Means, Hierarchical, DBSCAN, and EM (Expectation-
Maximization) clustering. The objective is to understand the principles and performance of
each algorithm in clustering datasets.
Theory:
1. K-Means Clustering:
• K-Means is a partitioning-based clustering algorithm that partitions the dataset
into K clusters.
• Algorithm:
• Initialize K cluster centroids randomly.
• Assign each data point to the nearest centroid.
• Update the centroids as the mean of the data points assigned to each
cluster.
• Repeat the above steps until convergence.
• K-Means aims to minimize the within-cluster sum of squares.
2. Hierarchical Clustering:
• Hierarchical clustering builds a tree of clusters, known as a dendrogram, by
iteratively merging or splitting clusters.
• Algorithm:
• Start with each data point as a singleton cluster.
• Merge the closest pair of clusters until only one cluster remains.
• Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-
down).
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
• DBSCAN is a density-based clustering algorithm that clusters together points
that are closely packed, while marking points in low-density regions as
outliers.
• Algorithm:
• Core points: Points with a minimum number of neighbors within a
specified radius.
• Border points: Points that are within the neighborhood of a core point
but do not satisfy the minimum neighbor criterion.
• Noise points: Points that are neither core nor border points.
• DBSCAN does not require the number of clusters to be specified in advance.
4. EM (Expectation-Maximization) Clustering:
• EM clustering assumes that the dataset is generated from a mixture of several
Gaussian distributions and aims to estimate the parameters of these
distributions.
• Algorithm:
• Expectation step: Estimate the probability of each data point belonging
to each cluster.
• Maximization step: Update the parameters (mean, covariance) of each
Gaussian distribution based on the expected responsibilities.
• EM clustering aims to maximize the likelihood of the observed data.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Generate a synthetic dataset with four blob-shaped clusters (parameters are illustrative)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# K-Means clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
kmeans_labels = kmeans.labels_
kmeans_centers = kmeans.cluster_centers_
# Hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=4)
agg_labels = agg_clustering.fit_predict(X)
# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
# EM (Expectation-Maximization) clustering
em = GaussianMixture(n_components=4)
em.fit(X)
em_labels = em.predict(X)
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(kmeans_centers[:, 0], kmeans_centers[:, 1], c='red', marker='x', s=200)
plt.title('K-Means Clustering')
plt.subplot(222)
plt.scatter(X[:, 0], X[:, 1], c=agg_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('Hierarchical Clustering')
plt.subplot(223)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('DBSCAN Clustering')
plt.subplot(224)
plt.scatter(X[:, 0], X[:, 1], c=em_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('EM Clustering')
plt.show()
Output: