
Madhav Institute of Technology & Science, Gwalior

(A Govt. Aided UGC Autonomous Institute Affiliated to RGPV, Bhopal)


NAAC Accredited with A++ Grade.

Lab Report
of
Data Mining & Warehousing (280601)

SUBMITTED BY
Aashutosh Savita
0901AM211001
3rd Year (Sixth Semester)
Artificial Intelligence and Machine Learning

SUBMITTED TO
Dr. Shubha Mishra, Assistant Professor
Prof. Gaurisha Sisodia, Assistant Professor

Centre For Artificial Intelligence

Session 2023-2024
INDEX
S.No.  List of Experiment  (Date of Submission / Sign.)

1. To perform basic operations for mining data (Preprocessing, Regression, Classification, Association, Clustering and Visualization) using WEKA simulator/Python.
2. Setting up a flow to load an ARFF file (batch mode) and perform a cross-validation using J48 (WEKA's C4.5 implementation).
3. Draw multiple ROC curves in the same plot window for J48 and RandomForest as classifiers using Knowledge Flow in WEKA.
4. Training and testing of naive Bayes classifiers incrementally using Knowledge Flow in WEKA.
5. Write a program to count the occurrence frequency of items in the given data set.
6. Write a program to generate frequent itemsets from a given data set.
7. Write a program to generate association rules from the generated frequent itemsets.
8. Write a program to implement various Association Rule Mining algorithms such as Apriori, Eclat, FP-Growth and FP-Tree.
9. Write a program to implement different types of classification algorithms such as SVM.
10. Write a program to implement different types of clustering algorithms such as K-Means, Hierarchical, DBSCAN and EM clustering.
Experiment – 1

Aim: To perform basic operations for mining data (Preprocessing, Regression, Classification,
Association, Clustering and Visualization) using WEKA simulator/Python.

Theory:
• Dataset - The Iris dataset includes three species of iris flowers: Setosa, Versicolor, and
Virginica. These are the target classes in classification tasks.

• Features –

1. Sepal Length and Width:


Sepals are the outermost whorl of a flower. The dataset includes measurements of
sepal length and sepal width.
2. Petal Length and Width:
Petals form the whorl just inside the sepals. The dataset includes measurements of petal
length and petal width.

• Machine Learning Tasks –

1. Classification:
Given the features of an iris flower, predict its species (Setosa, Versicolor, or
Virginica).

2. Regression:
Predict numerical values, such as the length or width of a specific feature based on the
others.

3. Clustering:
Group similar iris flowers based on their feature similarities without using predefined
species labels.

4. Association:
Discover associations between different features or characteristics of iris flowers.

• Preprocessing:
• Classification:
• Association:

• Clustering:

• Visualization:
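
The same pipeline can be sketched in Python instead of the WEKA GUI. A minimal illustration using scikit-learn is given below; the model choices (decision tree, linear regression, K-Means) are illustrative assumptions, and association mining is deferred to Experiments 6-8 because it needs discretized, transaction-style data.

# Minimal Python sketch of Experiment 1 (assumes scikit-learn and matplotlib are installed).
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

iris = load_iris()
X, y = iris.data, iris.target

# Preprocessing: standardize the four numeric features.
X_scaled = StandardScaler().fit_transform(X)

# Classification: predict the species with a decision tree.
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("Classification accuracy:", clf.score(X_te, y_te))

# Regression: predict petal width (last column) from the other three measurements.
reg = LinearRegression().fit(X[:, :3], X[:, 3])
print("Regression R^2:", reg.score(X[:, :3], X[:, 3]))

# Clustering: group the flowers into 3 clusters without using the species labels.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# Visualization: petal length vs. petal width, coloured by cluster assignment.
plt.scatter(X[:, 2], X[:, 3], c=labels)
plt.xlabel("petal length (cm)")
plt.ylabel("petal width (cm)")
plt.title("K-Means clusters on the Iris data")
plt.show()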
Experiment -2
Aim: To establish a workflow for loading an ARFF file in batch mode and executing cross-validation using J48, WEKA's implementation of C4.5.

Theory:
Cross-validation:
A technique used to assess the performance and generalizability of a machine learning model.
It involves partitioning the dataset into subsets, training the model on some subsets, and
evaluating it on others.

J48 (C4.5):
J48 is an implementation of the C4.5 algorithm in WEKA. It's a decision tree algorithm used
for classification.

ARFF (Attribute-Relation File Format):


ARFF is a file format commonly used to describe datasets for WEKA. It includes information
about the dataset's attributes, their types, and the data values.
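
In Knowledge Flow this experiment is wired graphically (roughly ArffLoader -> ClassAssigner -> CrossValidationFoldMaker -> J48 -> ClassifierPerformanceEvaluator -> TextViewer). A rough Python analogue is sketched below, assuming scikit-learn is available; its decision tree is CART with an entropy criterion, standing in for J48/C4.5, which exists only inside WEKA.

# Hedged Python analogue: 10-fold cross-validation with a decision tree on the Iris data.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# In WEKA the data would come from iris.arff, whose header declares the attributes, e.g.
#   @relation iris
#   @attribute sepallength numeric
#   ...
#   @attribute class {Iris-setosa, Iris-versicolor, Iris-virginica}
#   @data
X, y = load_iris(return_X_y=True)

tree = DecisionTreeClassifier(criterion="entropy", random_state=0)  # stand-in for J48
scores = cross_val_score(tree, X, y, cv=10)                         # 10-fold cross-validation
print("Per-fold accuracy:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))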

Output:
Experiment – 3

Aim: Draw multiple ROC curves in the same plot window for J48 and RandomForest as
classifiers using Knowledge Flow in WEKA.

Theory:
1. ROC Curve: The ROC curve is a graphical representation of a classifier's performance
across various threshold settings. It plots the true positive rate (sensitivity) against the false
positive rate (1-specificity). ROC curves are commonly used to assess and compare the
performance of different classifiers.
2. J48 Classifier: J48 is Weka's implementation of the C4.5 decision tree algorithm. It
constructs a decision tree from the training data and uses it to classify new instances.
Decision trees are interpretable and can handle both numerical and categorical data.
3. RandomForest Classifier: RandomForest is an ensemble learning method that constructs
multiple decision trees during training and outputs the class that is the mode of the classes
of the individual trees. It improves accuracy and reduces overfitting by aggregating the
results of multiple decision trees.
4. AUC (Area Under the Curve): AUC is a metric often used to summarize the performance
of a ROC curve. AUC closer to 1 indicates better performance, while an AUC of 0.5
suggests a random classifier. Comparing the AUC values of different classifiers helps in
determining which classifier performs better for the given task.
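
In Knowledge Flow the two classifiers feed ClassifierPerformanceEvaluator nodes that connect to a single ModelPerformanceChart, which overlays the curves. A hedged scikit-learn sketch of the same comparison is shown below; a decision tree stands in for J48, and the Iris problem is binarized so that a single ROC curve per classifier is defined.

# Hedged sketch: two ROC curves in one plot (scikit-learn stand-ins for J48 and RandomForest).
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
y = (y == 2).astype(int)  # binarize: Iris-virginica vs. the rest
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0, stratify=y)

plt.figure()
for name, model in [("Decision tree (J48 stand-in)", DecisionTreeClassifier(random_state=0)),
                    ("RandomForest", RandomForestClassifier(random_state=0))]:
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]          # probability of the positive class
    fpr, tpr, _ = roc_curve(y_te, scores)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")

plt.plot([0, 1], [0, 1], linestyle="--", label="Random classifier (AUC = 0.5)")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC curves for two classifiers")
plt.legend()
plt.show()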
Output:
Experiment – 4

Aim: To perform incremental training and testing of naive Bayes classifiers using Knowledge Flow
in WEKA for efficient classification tasks.

Theory:
1. Incremental Training:
Unlike batch training where the entire dataset is used at once, incremental training updates
the classifier's parameters incrementally as new data becomes available. This approach
allows classifiers to adapt to changes in the data distribution over time without the need to
retrain the model from scratch.
2. Naive Bayes Classifier:
Naive Bayes is a probabilistic classifier based on Bayes' theorem with the "naive"
assumption of independence between features. It calculates the probability of a class label
given the observed features using conditional probability distributions.
3. Updating Probability Distributions:
When new data is introduced, the probability distributions of the features and class labels
need to be updated. This involves adjusting the probabilities of different feature values
and class labels based on the new observations.
4. Maintaining Previous Parameters:
While updating the model, it's essential to retain the previously learned parameters to
preserve the knowledge gained from past data. This ensures that the classifier continues
to make informed predictions based on both old and new data.
5. Model Adaptation:
Incremental training allows the classifier to adapt to changes in the underlying data
distribution, such as concept drift or the emergence of new patterns. By continuously
updating the model, it remains relevant and accurate over time.
6. Knowledge Flow in WEKA:
WEKA's Knowledge Flow interface provides a visual environment for building and
executing data processing pipelines. It allows users to create workflows consisting of
various data preprocessing, modeling, and evaluation steps. With Knowledge Flow,
incremental training and testing procedures can be seamlessly integrated into classification
workflows, simplifying the implementation of incremental learning algorithms like naive
Bayes.
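
In WEKA the updateable learner is NaiveBayesUpdateable, fed one instance at a time by the ArffLoader. A hedged Python analogue uses scikit-learn's GaussianNB.partial_fit, which likewise updates the class priors and per-feature distributions batch by batch (the batch size of 20 below is an arbitrary illustrative choice).

# Hedged Python analogue of incremental naive Bayes training and testing.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0, stratify=y)

nb = GaussianNB()
classes = np.unique(y)            # all class labels must be declared up front

# Feed the training data in small batches, updating the model after each one.
for start in range(0, len(X_tr), 20):
    X_batch = X_tr[start:start + 20]
    y_batch = y_tr[start:start + 20]
    nb.partial_fit(X_batch, y_batch, classes=classes)
    print(f"Seen {min(start + 20, len(X_tr))} instances, "
          f"test accuracy = {nb.score(X_te, y_te):.3f}")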
Output:
Experiment – 5

Aim: To write a program that determines the frequency of item occurrences within a given
dataset, serving as a practical exercise in computational data analysis.

Theory:
1. Import Necessary Libraries: We start by importing Python's built-in csv module, which
provides simple readers for comma-separated files and is sufficient for counting
occurrences without any external dependencies.
2. Load the Iris Dataset: We use csv.DictReader to load the Iris dataset from a local
iris.csv file. This dataset contains information about iris flowers, including measurements
of sepal and petal lengths and widths, as well as the species of iris. DictReader takes the
column names from the header row, so each value can be accessed by its column name.
3. Initialize a Dictionary: We create an empty dictionary named frequency to store the
frequency of each species in the dataset. The keys of this dictionary will be the unique
species, and the values will be their corresponding frequencies.
4. Count Occurrences: We iterate through the species data extracted from the dataset. For
each species encountered, we update its frequency in the frequency dictionary. If the
species is already in the dictionary, we increment its count by 1. If the species is not yet in
the dictionary, we initialize its count to 1.
5. Output the Results: After counting the occurrences of all species in the dataset, we print
the results. We iterate through the items in the frequency dictionary and print each species
along with its count.

Code:
import csv

def count_frequency(data, column):
    frequency = {}
    for row in data:
        item = row[column]
        if item in frequency:
            frequency[item] += 1
        else:
            frequency[item] = 1
    return frequency

def main():
    with open('iris.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        data = list(reader)

    column = 'Species'
    frequency = count_frequency(data, column)

    for item, count in frequency.items():
        print(f"{item}: {count}")

if __name__ == "__main__":
    main()

Output:
Experiment – 6

Aim: The aim of this experiment is to implement the Apriori algorithm to generate frequent
itemsets from a given dataset. Frequent itemsets represent sets of items that frequently appear
together in transactions, which can provide insights into patterns and associations within the
data.

Theory: The Apriori algorithm is a classical algorithm used in data mining to discover
frequent itemsets in a transaction database. It employs an iterative approach to discover
frequent itemsets by generating candidate itemsets and pruning those that do not satisfy the
minimum support threshold.
1. Support: Support is a measure used to identify the frequency of occurrence of an
itemset in the dataset. It is calculated as the ratio of the number of transactions
containing the itemset to the total number of transactions.
2. Apriori Principle: The key idea behind the Apriori algorithm is that if an itemset is
frequent, then all of its subsets must also be frequent. This principle is used to reduce
the search space by eliminating candidate itemsets that contain infrequent subsets.
3. Apriori Algorithm Steps:
• Step 1: Generate frequent itemsets of size 1 (singleton itemsets).
• Step 2: Generate candidate itemsets of size k+1 from frequent itemsets of size
k, by joining pairs of frequent itemsets.
• Step 3: Prune candidate itemsets that contain infrequent subsets.
• Step 4: Calculate the support of remaining candidate itemsets.
• Step 5: Repeat steps 2-4 until no new frequent itemsets can be generated.
Code:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Market_basket_data.csv')
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3,
                min_length=2, max_length=2)
results = list(rules)
results
Output:

[RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistic


s=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.290
59829059829057, lift=4.84395061728395)]),
RelationRecord(items=frozenset({'mushroom cream sauce', 'escalope'}), support=0.005732568990801226, orde
red_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escal
ope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[Or
deredStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.372881355932
2034, lift=4.700811850163794)]),
RelationRecord(items=frozenset({'fromage blanc', 'honey'}), support=0.003332888948140248, ordered_statisti
cs=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0.2
450980392156863, lift=5.164270764485569)]),
RelationRecord(items=frozenset({'ground beef', 'herb & pepper'}), support=0.015997866951073192, ordered_s
tatistics=[OrderedStatistic(items_base=frozenset({'herb & pepper'}), items_add=frozenset({'ground beef'}), conf
idence=0.3234501347708895, lift=3.2919938411349285)]),
RelationRecord(items=frozenset({'ground beef', 'tomato sauce'}), support=0.005332622317024397, ordered_st
atistics=[OrderedStatistic(items_base=frozenset({'tomato sauce'}), items_add=frozenset({'ground beef'}), confid
ence=0.3773584905660377, lift=3.840659481324083)]),
RelationRecord(items=frozenset({'light cream', 'olive oil'}), support=0.003199573390214638, ordered_statistic
s=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'olive oil'}), confidence=0.20
512820512820515, lift=3.1147098515519573)]),
RelationRecord(items=frozenset({'olive oil', 'whole wheat pasta'}), support=0.007998933475536596, ordered_s
tatistics=[OrderedStatistic(items_base=frozenset({'whole wheat pasta'}), items_add=frozenset({'olive oil'}), con
fidence=0.2714932126696833, lift=4.122410097642296)]),
RelationRecord(items=frozenset({'shrimp', 'pasta'}), support=0.005065991201173177, ordered_statistics=[Ord
eredStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'shrimp'}), confidence=0.32203389830508
47, lift=4.506672147735896)])]
Experiment – 7

Aim: The aim of this experiment is to implement a program to generate association rules
from the generated frequent itemsets. Association rules reveal relationships between items in
transactional data, enabling the discovery of interesting patterns and insights.

Theory: Association rules are typically represented as "if-then" statements of the form X -> Y,
where X and Y are itemsets. These rules are derived from frequent itemsets and are characterized by
two metrics: support and confidence.
1. Support: Support measures the frequency of occurrence of an itemset in the dataset.
It is calculated as the ratio of the number of transactions containing the itemset to the
total number of transactions.
2. Confidence: Confidence measures the reliability of the association rule. It is
calculated as the ratio of the number of transactions containing both X and Y to the
number of transactions containing X.
3. Association Rule Generation: Association rules are generated from frequent itemsets
by exploring all possible combinations of items within each frequent itemset. For each
frequent itemset, association rules are generated by considering all possible non-
empty subsets as antecedents (X) and the remaining items as consequents (Y).
Code:
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Market_basket_data.csv')
transactions = []
for i in range(0, 7501):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 20)])

from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3,
                min_length=2, max_length=2)
results = list(rules)

for item in results:
    pair = item[0]
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])
    print("Support: " + str(item[1]))
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")
Output:

Rule: light cream -> chicken


Support: 0.004532728969470737
Confidence: 0.29059829059829057
Lift: 4.84395061728395
=====================================
Rule: mushroom cream sauce -> escalope
Support: 0.005732568990801226
Confidence: 0.3006993006993007
Lift: 3.790832696715049
=====================================
Rule: pasta -> escalope
Support: 0.005865884548726837
Confidence: 0.3728813559322034
Lift: 4.700811850163794
=====================================
Rule: fromage blanc -> honey
Support: 0.003332888948140248
Confidence: 0.2450980392156863
Lift: 5.164270764485569
=====================================
Rule: ground beef -> herb & pepper
Support: 0.015997866951073192
Confidence: 0.3234501347708895
Lift: 3.2919938411349285
=====================================
Rule: ground beef -> tomato sauce
Support: 0.005332622317024397
Confidence: 0.3773584905660377
Lift: 3.840659481324083
=====================================
Rule: light cream -> olive oil
Support: 0.003199573390214638
Confidence: 0.20512820512820515
Lift: 3.1147098515519573
=====================================
Rule: olive oil -> whole wheat pasta
Support: 0.007998933475536596
Confidence: 0.2714932126696833
Lift: 4.122410097642296
=====================================
Rule: shrimp -> pasta
Support: 0.005065991201173177
Confidence: 0.3220338983050847
Lift: 4.506672147735896
=====================================
Experiment – 8

Aim: The aim of this experiment is to implement various Association Rule Mining
algorithms, including Apriori, Eclat, FP-Growth. These algorithms are widely used for
discovering frequent itemsets and generating association rules from transaction datasets.

Theory:
1) Apriori Algorithm:
a) The Apriori algorithm is a classical method for Association Rule Mining.
b) Support: Support is a measure of the frequency of occurrence of an itemset in the
dataset.
c) Apriori Principle: If an itemset is frequent, then all of its subsets must also be
frequent.
d) Algorithm Steps:
i) Initialize frequent itemsets of size 1.
ii) Generate candidate itemsets of size k+1 from frequent itemsets of size k.
iii) Prune candidate itemsets containing infrequent subsets.
iv) Calculate the support of remaining candidate itemsets.
v) Repeat the process until no new frequent itemsets can be generated.
2) Eclat Algorithm:
a) Eclat is an efficient method for Association Rule Mining.
b) Vertical Data Structure: Eclat uses a vertical data structure, associating each item with
the list of transactions in which it appears.
c) Depth-First Search: Eclat employs a depth-first search strategy to explore the lattice
of itemsets.
d) Recursive Algorithm: Eclat is a recursive algorithm that combines itemsets based on
their support counts.
3) FP-Growth Algorithm:
a) FP-Growth is a tree-based method for Association Rule Mining.
b) FP-Tree Construction: FP-Growth constructs a special data structure called the FP-
Tree to encode itemset frequencies.
c) Header Table: FP-Growth uses a header table to link identical items in the FP-Tree.
d) Conditional FP-Tree: FP-Growth recursively constructs conditional FP-Trees for each
frequent item in the dataset.
Code:
Apriori:
from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3,
                min_length=2, max_length=2)
results = list(rules)

def inspect(results):
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))

resultsinDataFrame = pd.DataFrame(inspect(results), columns=['Left Hand Side', 'Right Hand Side',
                                                             'Support', 'Confidence', 'Lift'])
resultsinDataFrame.nlargest(n=10, columns='Lift')

Eclat:

from apyori import apriori
rules = apriori(transactions=transactions, min_support=0.003, min_confidence=0.2, min_lift=3,
                min_length=2, max_length=2)
results = list(rules)

def inspect(results):
    lhs = [tuple(result[2][0][0])[0] for result in results]
    rhs = [tuple(result[2][0][1])[0] for result in results]
    supports = [result[1] for result in results]
    return list(zip(lhs, rhs, supports))

resultsinDataFrame = pd.DataFrame(inspect(results), columns=['Product 1', 'Product 2', 'Support'])
resultsinDataFrame.nlargest(n=10, columns='Support')
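
The snippet above emulates Eclat by simply ranking apyori's output by support. The vertical TID-list idea described in the theory can also be sketched directly in plain Python; the following is a minimal illustration on a hypothetical toy transaction list, not a tuned implementation.

# Minimal Eclat sketch: vertical TID lists and depth-first TID-list intersection.
def eclat(transactions, min_support_count=2):
    # Vertical representation: each single item -> set of transaction ids containing it.
    tidlists = {}
    for tid, transaction in enumerate(transactions):
        for item in transaction:
            tidlists.setdefault(frozenset([item]), set()).add(tid)

    frequent = {k: v for k, v in tidlists.items() if len(v) >= min_support_count}
    results = dict(frequent)

    def recurse(prefix_class):
        items = list(prefix_class.items())
        for i, (itemset_a, tids_a) in enumerate(items):
            suffix = {}
            for itemset_b, tids_b in items[i + 1:]:
                candidate = itemset_a | itemset_b
                tids = tids_a & tids_b          # support count = size of the intersection
                if len(tids) >= min_support_count:
                    suffix[candidate] = tids
            if suffix:
                results.update(suffix)
                recurse(suffix)

    recurse(frequent)
    return {tuple(sorted(k)): len(v) for k, v in results.items()}

# Hypothetical toy data, just to show the call.
toy_transactions = [['milk', 'bread', 'eggs'],
                    ['milk', 'bread'],
                    ['bread', 'eggs'],
                    ['milk', 'eggs']]
print(eclat(toy_transactions, min_support_count=2))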

FP-Growth:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
dataset = [['Milk', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Dill', 'Onion', 'Nutmeg', 'Kidney Beans', 'Eggs', 'Yogurt'],
['Milk', 'Apple', 'Kidney Beans', 'Eggs'],
['Milk', 'Unicorn', 'Corn', 'Kidney Beans', 'Yogurt'],
['Corn', 'Onion', 'Onion', 'Kidney Beans', 'Ice cream', 'Eggs']]
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)
from mlxtend.frequent_patterns import fpgrowth
fpgrowth(df, min_support=0.6, use_colnames=True)
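
fpgrowth returns only the frequent itemsets; if association rules are also required, mlxtend can derive them from that result. A short sketch, assuming the same df and a recent mlxtend version (argument names may vary slightly between releases), is:

# Derive association rules from the FP-Growth itemsets (assumes the df built above).
from mlxtend.frequent_patterns import association_rules, fpgrowth

frequent_itemsets = fpgrowth(df, min_support=0.6, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])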
Output:
Apriori:

Eclat:

FP-Growth:
Experiment – 9

Aim: The aim of this experiment is to implement the Support Vector Machine (SVM)
algorithm for binary classification tasks. SVM is a powerful supervised learning algorithm
used for classification, regression, and outlier detection.

Theory:
Support Vector Machine (SVM) is a popular supervised machine learning algorithm that is
commonly used for classification tasks. The basic idea behind SVM is to find the hyperplane
that best separates the classes in the feature space. This hyperplane is chosen such that it
maximizes the margin, which is the distance between the hyperplane and the nearest data
points (support vectors) from each class.
1. Linear SVM: In linear SVM, the hyperplane is linear, meaning it is a straight line in
two dimensions, a plane in three dimensions, and a hyperplane in higher dimensions.
Linear SVM works well when the data is linearly separable.
2. Kernel SVM: Kernel SVM extends linear SVM to handle non-linearly separable data
by mapping the input features into a higher-dimensional space using a kernel
function. This allows SVM to find a non-linear decision boundary in the original
feature space.
3. Optimization: The optimization problem in SVM involves maximizing the margin
while minimizing the classification error. This is typically formulated as a convex
optimization problem and solved using techniques like gradient descent or quadratic
programming.
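
For intuition about point 2, a small hedged comparison of a linear and an RBF kernel on the Iris data is given below (the dataset and parameter values are illustrative; the main experiment that follows uses a linear kernel on Social_Network_Ads.csv).

# Quick kernel comparison with 5-fold cross-validation (illustrative parameters).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
for kernel in ("linear", "rbf"):
    scores = cross_val_score(SVC(kernel=kernel, C=1.0, gamma="scale"), X, y, cv=5)
    print(f"{kernel} kernel: mean CV accuracy = {scores.mean():.3f}")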

Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))

from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()
Output:
Experiment – 10

Aim: The aim of this experiment is to implement and compare different types of clustering
algorithms, including K-Means, Hierarchical, DBSCAN, and EM (Expectation-
Maximization) clustering. The objective is to understand the principles and performance of
each algorithm in clustering datasets.

Theory:
1. K-Means Clustering:
• K-Means is a partitioning-based clustering algorithm that partitions the dataset
into K clusters.
• Algorithm:
• Initialize K cluster centroids randomly.
• Assign each data point to the nearest centroid.
• Update the centroids as the mean of the data points assigned to each
cluster.
• Repeat the above steps until convergence.
• K-Means aims to minimize the within-cluster sum of squares.
2. Hierarchical Clustering:
• Hierarchical clustering builds a tree of clusters, known as a dendrogram, by
iteratively merging or splitting clusters.
• Algorithm:
• Start with each data point as a singleton cluster.
• Merge the closest pair of clusters until only one cluster remains.
• Hierarchical clustering can be agglomerative (bottom-up) or divisive (top-
down).
3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
• DBSCAN is a density-based clustering algorithm that clusters together points
that are closely packed, while marking points in low-density regions as
outliers.
• Algorithm:
• Core points: Points with a minimum number of neighbors within a
specified radius.
• Border points: Points that are within the neighborhood of a core point
but do not satisfy the minimum neighbor criterion.
• Noise points: Points that are neither core nor border points.
• DBSCAN does not require the number of clusters to be specified in advance.
4. EM (Expectation-Maximization) Clustering:
• EM clustering assumes that the dataset is generated from a mixture of several
Gaussian distributions and aims to estimate the parameters of these
distributions.
• Algorithm:
• Expectation step: Estimate the probability of each data point belonging
to each cluster.
• Maximization step: Update the parameters (mean, covariance) of each
Gaussian distribution based on the expected responsibilities.
• EM clustering aims to maximize the likelihood of the observed data.

Code:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

# Generate sample data


X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# K-Means clustering
kmeans = KMeans(n_clusters=4)
kmeans.fit(X)
kmeans_labels = kmeans.labels_
kmeans_centers = kmeans.cluster_centers_

# Hierarchical clustering
agg_clustering = AgglomerativeClustering(n_clusters=4)
agg_labels = agg_clustering.fit_predict(X)

# DBSCAN clustering
dbscan = DBSCAN(eps=0.5, min_samples=5)
dbscan_labels = dbscan.fit_predict(X)

# EM (Expectation-Maximization) clustering
em = GaussianMixture(n_components=4)
em.fit(X)
em_labels = em.predict(X)

# Plotting the results


plt.figure(figsize=(12, 12))

plt.subplot(221)
plt.scatter(X[:, 0], X[:, 1], c=kmeans_labels, cmap='viridis', s=50, alpha=0.5)
plt.scatter(kmeans_centers[:, 0], kmeans_centers[:, 1], c='red', marker='x', s=200)
plt.title('K-Means Clustering')

plt.subplot(222)
plt.scatter(X[:, 0], X[:, 1], c=agg_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('Hierarchical Clustering')

plt.subplot(223)
plt.scatter(X[:, 0], X[:, 1], c=dbscan_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('DBSCAN Clustering')

plt.subplot(224)
plt.scatter(X[:, 0], X[:, 1], c=em_labels, cmap='viridis', s=50, alpha=0.5)
plt.title('EM Clustering')

plt.show()
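
In addition to the visual comparison, the four labelings can be compared numerically. One hedged option is the silhouette score, appended after the script above.

# Optional numeric comparison of the four clusterings (assumes the script above has run).
from sklearn.metrics import silhouette_score

for name, labels in [('K-Means', kmeans_labels),
                     ('Hierarchical', agg_labels),
                     ('DBSCAN', dbscan_labels),
                     ('EM', em_labels)]:
    # silhouette_score needs at least two distinct labels; DBSCAN noise points (-1)
    # are treated here as one extra "cluster", which tends to lower its score.
    if len(set(labels)) > 1:
        print(f"{name}: silhouette = {silhouette_score(X, labels):.3f}")
    else:
        print(f"{name}: silhouette undefined (only one cluster found)")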
Output:
