Data Mining Lab File
Lab Report
of
Data Mining & Warehousing (280601)
SUBMITTED BY
Aashutosh Savita
0901AM211001
3rd Year (Sixth Semester)
Artificial Intelligence and Machine Learning
SUBMITTED TO
Dr. Shubha Mishra, Prof. Gaurisha Sisodia
Session 2023-2024
INDEX
S.No.    List of Experiments    Date of Submission    Sign.
Experiment – 1
Aim: To perform basic operations for mining data (Preprocessing, Regression, Classification, Association, Clustering and Visualization) using the WEKA simulator/Python.
Theory:
• Dataset - The Iris dataset includes three species of iris flowers: Setosa, Versicolor, and
Virginica. These are the target classes in classification tasks.
• Features – Each flower is described by four numeric measurements (in centimetres): sepal length, sepal width, petal length, and petal width.
• Tasks – The dataset supports the following mining tasks:
1. Classification:
Given the features of an iris flower, predict its species (Setosa, Versicolor, or
Virginica).
2. Regression:
Predict numerical values, such as the length or width of a specific feature based on the
others.
3. Clustering:
Group similar iris flowers based on their feature similarities without using predefined
species labels.
4. Association:
Discover associations between different features or characteristics of iris flowers.
• Preprocessing: Cleaning and preparing the data before mining, e.g., handling missing values and normalizing or discretizing the numeric attributes.
• Classification: Building a model (such as a decision tree) that predicts the species of a flower from its measurements.
• Association: Mining rules that describe which attribute values frequently occur together (after discretizing the numeric features).
• Clustering: Grouping similar flowers by their measurements, without using the species labels, e.g., with k-means.
• Visualization: Plotting attribute distributions and scatter plots to inspect how well the species separate.
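Since the aim allows Python as an alternative to the WEKA simulator, a minimal scikit-learn sketch of these operations on the Iris data is given below; the specific choices (a decision tree for classification, k-means for clustering, which two features to plot) are illustrative assumptions rather than the only options.
# Minimal sketch: preprocessing, classification, clustering and visualization on Iris
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

iris = load_iris()
X, y = iris.data, iris.target

# Preprocessing: standardize the four numeric features
X_scaled = StandardScaler().fit_transform(X)

# Classification: predict the species from the measurements
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Classification accuracy:", clf.score(X_test, y_test))

# Clustering: group the flowers without using the species labels
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Visualization: petal length vs. petal width, coloured by cluster
plt.scatter(X[:, 2], X[:, 3], c=labels)
plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
plt.title('K-Means clusters on the Iris data')
plt.show()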
Experiment – 2
Aim: To establish a workflow for loading an ARFF file in batch mode and executing cross-validation using J48, WEKA's implementation of C4.5.
Theory:
Cross-validation:
A technique used to assess the performance and generalizability of a machine learning model.
It involves partitioning the dataset into subsets, training the model on some subsets, and
evaluating it on others.
J48 (C4.5):
J48 is an implementation of the C4.5 algorithm in WEKA. It's a decision tree algorithm used
for classification.
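The Knowledge Flow workflow itself is built graphically in WEKA; as a rough Python analogue, scikit-learn's DecisionTreeClassifier (not identical to J48, but a C4.5-style learner) can be cross-validated on an ARFF file loaded with scipy, as sketched below. The file path and the name of the class attribute ('class') are assumptions.
# Load an ARFF file and run 10-fold cross-validation with a decision tree
import pandas as pd
from scipy.io import arff
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

data, meta = arff.loadarff('iris.arff')            # ARFF file path assumed
df = pd.DataFrame(data)
df['class'] = df['class'].str.decode('utf-8')      # nominal attributes are read as bytes

X = df.drop(columns='class').values
y = df['class'].values

# 10-fold cross-validation, as in the WEKA workflow
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())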
Output:
Experiment – 3
Aim: To draw multiple ROC curves in the same plot window for the J48 and RandomForest classifiers using Knowledge Flow in WEKA.
Theory:
1. ROC Curve: The ROC curve is a graphical representation of a classifier's performance
across various threshold settings. It plots the true positive rate (sensitivity) against the false
positive rate (1-specificity). ROC curves are commonly used to assess and compare the
performance of different classifiers.
2. J48 Classifier: J48 is Weka's implementation of the C4.5 decision tree algorithm. It
constructs a decision tree from the training data and uses it to classify new instances.
Decision trees are interpretable and can handle both numerical and categorical data.
3. RandomForest Classifier: RandomForest is an ensemble learning method that constructs
multiple decision trees during training and outputs the class that is the mode of the classes
of the individual trees. It improves accuracy and reduces overfitting by aggregating the
results of multiple decision trees.
4. AUC (Area Under the Curve): AUC is a metric often used to summarize the performance
of a ROC curve. AUC closer to 1 indicates better performance, while an AUC of 0.5
suggests a random classifier. Comparing the AUC values of different classifiers helps in
determining which classifier performs better for the given task.
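Knowledge Flow overlays the curves through its model performance chart component; a rough Python equivalent that draws both ROC curves in one plot window is sketched below. A scikit-learn decision tree stands in for J48, and the binary dataset used here is an illustrative choice.
# Sketch: ROC curves for a decision tree (J48 analogue) and a random forest in one plot
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [('J48-style tree', DecisionTreeClassifier(random_state=0)),
                    ('RandomForest', RandomForestClassifier(random_state=0))]:
    model.fit(X_train, y_train)
    probs = model.predict_proba(X_test)[:, 1]        # probability of the positive class
    fpr, tpr, _ = roc_curve(y_test, probs)
    plt.plot(fpr, tpr, label=name + ' (AUC = %.2f)' % auc(fpr, tpr))

plt.plot([0, 1], [0, 1], linestyle='--', label='Random classifier (AUC = 0.5)')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate (sensitivity)')
plt.title('ROC curves: J48-style tree vs. RandomForest')
plt.legend()
plt.show()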
Output:
Experiment – 4
Aim: To perform incremental training and testing of naive Bayes classifiers using Knowledge Flow in WEKA for efficient classification tasks.
Theory:
1. Incremental Training:
Unlike batch training where the entire dataset is used at once, incremental training updates
the classifier's parameters incrementally as new data becomes available. This approach
allows classifiers to adapt to changes in the data distribution over time without the need to
retrain the model from scratch.
2. Naive Bayes Classifier:
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive"
assumption of independence between features. It calculates the probability of a class label
given the observed features using conditional probability distributions.
3. Updating Probability Distributions:
When new data is introduced, the probability distributions of the features and class labels
need to be updated. This involves adjusting the probabilities of different feature values
and class labels based on the new observations.
4. Maintaining Previous Parameters:
While updating the model, it's essential to retain the previously learned parameters to
preserve the knowledge gained from past data. This ensures that the classifier continues
to make informed predictions based on both old and new data.
5. Model Adaptation:
Incremental training allows the classifier to adapt to changes in the underlying data
distribution, such as concept drift or the emergence of new patterns. By continuously
updating the model, it remains relevant and accurate over time.
6. Knowledge Flow in WEKA:
WEKA's Knowledge Flow interface provides a visual environment for building and
executing data processing pipelines. It allows users to create workflows consisting of
various data preprocessing, modeling, and evaluation steps. With Knowledge Flow,
incremental training and testing procedures can be seamlessly integrated into classification
workflows, simplifying the implementation of incremental learning algorithms like naive
Bayes.
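In WEKA this is typically done by connecting an ArffLoader (emitting one instance at a time) to the NaiveBayesUpdateable classifier in Knowledge Flow. A rough Python counterpart is sketched below using scikit-learn's partial_fit, which updates a naive Bayes model one batch at a time; the way the training data is split into batches is an assumption made purely for illustration.
# Sketch of incremental naive Bayes: the model is updated batch by batch with partial_fit
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

nb = GaussianNB()
classes = np.unique(y)          # every possible class label must be declared up front

# Feed the training data to the classifier in small batches, as if it arrived over time
for X_batch, y_batch in zip(np.array_split(X_train, 5), np.array_split(y_train, 5)):
    nb.partial_fit(X_batch, y_batch, classes=classes)
    print("Accuracy after this batch:", nb.score(X_test, y_test))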
Output:
Experiment – 5
Aim: To create a program that determines the frequency of item occurrences within a given dataset, as a practical exercise in computational analysis for data mining tasks.
Theory:
1. Import Necessary Libraries: We start by importing the pandas library, which is a
powerful tool for data manipulation and analysis in Python. We'll use it to load and handle
the Iris dataset.
2. Load the Iris Dataset: We use the read_csv function from pandas to load the Iris dataset
from a URL. This dataset contains information about iris flowers, including measurements
of sepal and petal lengths and widths, as well as the species of iris. We specify the column
names explicitly using the names parameter to ensure clarity.
3. Initialize a Dictionary: We create an empty dictionary named frequency to store the
frequency of each species in the dataset. The keys of this dictionary will be the unique
species, and the values will be their corresponding frequencies.
4. Count Occurrences: We iterate through the species data extracted from the dataset. For
each species encountered, we update its frequency in the frequency dictionary. If the
species is already in the dictionary, we increment its count by 1. If the species is not yet in
the dictionary, we initialize its count to 1.
5. Output the Results: After counting the occurrences of all species in the dataset, we print
the results. We iterate through the items in the frequency dictionary and print each species
along with its count.
Code:
import pandas as pd

column = 'Species'

def main():
    # Dataset URL assumed (the classic UCI copy of the Iris data); any local CSV copy also works
    data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                       names=['SepalLength', 'SepalWidth', 'PetalLength', 'PetalWidth', column])
    # Count the occurrences of each species
    frequency = {}
    for value in data[column]:
        frequency[value] = frequency.get(value, 0) + 1
    # Output the results
    for value, count in frequency.items():
        print(value, ':', count)

if __name__ == "__main__":
    main()
Output:
Experiment – 6
Aim: The aim of this experiment is to implement the Apriori algorithm to generate frequent
itemsets from a given dataset. Frequent itemsets represent sets of items that frequently appear
together in transactions, which can provide insights into patterns and associations within the
data.
Theory: The Apriori algorithm is a classical algorithm used in data mining to discover
frequent itemsets in a transaction database. It employs an iterative approach to discover
frequent itemsets by generating candidate itemsets and pruning those that do not satisfy the
minimum support threshold.
1. Support: Support is a measure used to identify the frequency of occurrence of an
itemset in the dataset. It is calculated as the ratio of the number of transactions
containing the itemset to the total number of transactions.
2. Apriori Principle: The key idea behind the Apriori algorithm is that if an itemset is
frequent, then all of its subsets must also be frequent. This principle is used to reduce
the search space by eliminating candidate itemsets that contain infrequent subsets.
3. Apriori Algorithm Steps:
• Step 1: Generate frequent itemsets of size 1 (singleton itemsets).
• Step 2: Generate candidate itemsets of size k+1 from frequent itemsets of size
k, by joining pairs of frequent itemsets.
• Step 3: Prune candidate itemsets that contain infrequent subsets.
• Step 4: Calculate the support of remaining candidate itemsets.
• Step 5: Repeat steps 2-4 until no new frequent itemsets can be generated.
Code:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset (assumed to be a headerless CSV with one transaction per row)
dataset = pd.read_csv('Market_basket_data.csv', header=None)
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

# Training the Apriori model on the transactions
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2,
                min_lift=3, min_length=2, max_length=2)
results = list(rules)
results
Output:
Experiment – 7
Aim: The aim of this experiment is to implement a program to generate association rules from the generated frequent itemsets. Association rules reveal relationships between items in transactional data, enabling the discovery of interesting patterns and insights.
Theory: Association rules are typically represented as "if-then" statements of the form X -> Y,
where X and Y are itemsets. These rules are derived from frequent itemsets and are characterized by
two metrics: support and confidence.
1. Support: Support measures the frequency of occurrence of an itemset in the dataset.
It is calculated as the ratio of the number of transactions containing the itemset to the
total number of transactions.
2. Confidence: Confidence measures the reliability of the association rule. It is
calculated as the ratio of the number of transactions containing both X and Y to the
number of transactions containing X.
3. Association Rule Generation: Association rules are generated from frequent itemsets
by exploring all possible combinations of items within each frequent itemset. For each
frequent itemset, association rules are generated by considering all possible non-
empty subsets as antecedents (X) and the remaining items as consequents (Y).
Code:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset (assumed to be a headerless CSV with one transaction per row)
dataset = pd.read_csv('Market_basket_data.csv', header=None)
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

# Generating the association rules
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2,
                min_lift=3, min_length=2, max_length=2)
results = list(rules)

# Displaying each rule together with its support, confidence, and lift
for item in results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
Experiment – 8
Aim: The aim of this experiment is to implement various Association Rule Mining algorithms, including Apriori, Eclat, and FP-Growth. These algorithms are widely used for discovering frequent itemsets and generating association rules from transaction datasets.
Theory:
1) Apriori Algorithm:
a) The Apriori algorithm is a classical method for Association Rule Mining.
b) Support: Support is a measure of the frequency of occurrence of an itemset in the
dataset.
c) Apriori Principle: If an itemset is frequent, then all of its subsets must also be
frequent.
d) Algorithm Steps:
i) Initialize frequent itemsets of size 1.
ii) Generate candidate itemsets of size k+1 from frequent itemsets of size k.
iii) Prune candidate itemsets containing infrequent subsets.
iv) Calculate the support of remaining candidate itemsets.
v) Repeat the process until no new frequent itemsets can be generated.
2) Eclat Algorithm:
a) Eclat is an efficient method for Association Rule Mining.
b) Vertical Data Structure: Eclat uses a vertical data structure, associating each item with
the list of transactions in which it appears.
c) Depth-First Search: Eclat employs a depth-first search strategy to explore the lattice
of itemsets.
d) Recursive Algorithm: Eclat is a recursive algorithm that combines itemsets based on
their support counts.
3) FP-Growth Algorithm:
a) FP-Growth is a tree-based method for Association Rule Mining.
b) FP-Tree Construction: FP-Growth constructs a special data structure called the FP-
Tree to encode itemset frequencies.
c) Header Table: FP-Growth uses a header table to link identical items in the FP-Tree.
d) Conditional FP-Tree: FP-Growth recursively constructs conditional FP-Trees for each
frequent item in the dataset.
Code:
Apriori:
# Reuses the `transactions` list and the pandas import (pd) from the earlier experiments
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2,
                min_lift=3, min_length=2, max_length=2)
results = list(rules)

# Collect each rule and its metrics into a DataFrame for easier inspection
def inspect(results):
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

resultsinDataFrame = pd.DataFrame(inspect(results),
                                  columns=['Left Hand Side', 'Right Hand Side',
                                           'Support', 'Confidence', 'Lift'])
resultsinDataFrame.nlargest(n=10, columns='Lift')
Eclat:
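Eclat has no single standard implementation in the libraries used above, so the sketch below implements the idea from the theory directly: each item is mapped to the set of transaction IDs in which it occurs (a vertical layout), and itemsets are extended depth-first by intersecting those TID sets. The toy transaction list and the support threshold are illustrative assumptions.
# Minimal Eclat sketch using vertical TID sets and a depth-first search
def eclat(transactions, min_support):
    n = len(transactions)
    # Vertical representation: item -> set of transaction indices containing it
    tidsets = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)

    frequent = {}

    def recurse(prefix, items):
        # items: (item, tidset) pairs that may extend the current prefix
        for i, (item, tids) in enumerate(items):
            support = len(tids) / n
            if support >= min_support:
                itemset = prefix + (item,)
                frequent[itemset] = support
                # Intersect TID sets with the remaining items and recurse
                suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
                recurse(itemset, suffix)

    recurse(tuple(), sorted(tidsets.items()))
    return frequent

# Illustrative usage on a toy transaction list
toy = [['Milk', 'Bread'], ['Milk', 'Diaper', 'Beer'],
       ['Bread', 'Diaper'], ['Milk', 'Bread', 'Diaper']]
for itemset, support in eclat(toy, min_support=0.5).items():
    print(itemset, round(support, 2))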
FP-Growth:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
           ['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
           ['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
           ['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
from mlxtend.frequent_patterns import fpgrowth
fpgrowth(df, min_support=0.6, use_colnames=True)
Output:
Apriori:
Eclat:
FP-Growth:
Experiment – 9
Aim: The aim of this experiment is to implement the Support Vector Machine (SVM)
algorithm for binary classification tasks. SVM is a powerful supervised learning algorithm
used for classification, regression, and outlier detection.
Theory:
Support Vector Machine (SVM) is a popular supervised machine learning algorithm that is
commonly used for classification tasks. The basic idea behind SVM is to find the hyperplane
that best separates the classes in the feature space. This hyperplane is chosen such that it
maximizes the margin, which is the distance between the hyperplane and the nearest data
points (support vectors) from each class.
1. Linear SVM: In linear SVM, the hyperplane is linear, meaning it is a straight line in
two dimensions, a plane in three dimensions, and a hyperplane in higher dimensions.
Linear SVM works well when the data is linearly separable.
2. Kernel SVM: Kernel SVM extends linear SVM to handle non-linearly separable data
by mapping the input features into a higher-dimensional space using a kernel
function. This allows SVM to find a non-linear decision boundary in the original
feature space.
3. Optimization: The optimization problem in SVM involves maximizing the margin
while minimizing the classification error. This is typically formulated as a convex
optimization problem and solved using techniques like gradient descent or quadratic
programming.
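The code below uses a linear kernel; for the non-linearly separable case described in point 2, the same pipeline can simply swap in a kernel SVM, for example (the hyperparameter values shown are scikit-learn's defaults and purely illustrative):
# Kernel SVM variant: replace the linear classifier with an RBF-kernel SVC
from sklearn.svm import SVC
classifier = SVC(kernel='rbf', gamma='scale', C=1.0, random_state=0)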
Code:
# Importing the libraries and the dataset
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(),
X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(),
X2.ravel()]).T)).reshape(X1.shape),
alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                c=ListedColormap(('red', 'green'))(i), label=j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
Experiment – 10
Aim: The aim of this experiment is to implement and compare different types of clustering
algorithms, including K-Means, Hierarchical, DBSCAN, and EM (Expectation-
Maximization) clustering. The objective is to understand the principles and performance of
each algorithm in clustering datasets.
Theory:
1. K-Means Clustering:
• K-Means is a partitioning-based clustering algorithm that partitions the dataset
into K clusters.
• Algorithm:
• Initialize K cluster centroids randomly.
• Assign each data point to the nearest centroid.
• Update the centroids as the mean of the data points assigned to each
cluster.
• Repeat the above steps until convergence.
• K-Means aims to minimize the within-cluster sum of squares.
2. Hierarchical Clustering:
• Hierarchical clustering builds a tree of clusters, known as a dendrogram, by
iteratively merging or splitting clusters.
• Algorithm:
• Start with each data point as a singleton cluster.
• Merge the closest pair of clusters until only one cluster remains.
• Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-
down).
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
• DBSCAN is a density-based clustering algorithm that clusters together points
that are closely packed, while marking points in low-density regions as
outliers.
• Algorithm:
• Core points: Points with a minimum number of neighbors within a
specified radius.
• Border points: Points that are within the neighborhood of a core point
but do not satisfy the minimum neighbor criterion.
• Noise points: Points that are neither core nor border points.
• DBSCAN does not require the number of clusters to be specified in advance.
4. EM (Expectation-Maximization) Clustering:
• EM clustering assumes that the dataset is generated from a mixture of several
Gaussian distributions and aims to estimate the parameters of these
distributions.
• Algorithm:
• Expectation step: Estimate the probability of each data point belonging
to each cluster.
• Maximization step: Update the parameters (mean, covariance) of each
Gaussian distribution based on the expected responsibilities.
• EM clustering aims to maximize the likelihood of the observed data.
Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Generate a synthetic dataset with four blob-shaped clusters (parameters are illustrative)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# K-Means clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
kmeans_labels = kmeans.labels_
kmeans_centers = kmeans.cluster_centers_
# Hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=4)
agg_labels = agg_clustering.fit_predict(X)
# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)
# EM (Expectation-Maximization) clustering
em = GaussianMixture(n_components=4)
em.fit(X)
em_labels = em.predict(X)
plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(kmeans_centers[:, 0], kmeans_centers[:, 1], c='red', marker='x', s=200)
plt.title('K-Means Clustering')
plt.subplot(222)
plt.scatter(X[:, 0], X[:, 1], c=agg_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('Hierarchical Clustering')
plt.subplot(223)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('DBSCAN Clustering')
plt.subplot(224)
plt.scatter(X[:, 0], X[:, 1], c=em_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('EM Clustering')
plt.show()
Output: