Python DM Lab Manual Part 2
Apriori:
The Apriori algorithm is used for mining frequent itemsets from a transactional database. It
uses the “support” parameter to put a constraint on the quality of itemsets. Support refers to
the fraction of transactions in which an itemset occurs; an itemset is “frequent” if its support
meets a minimum threshold. This threshold, “min_sup”, is denoted by “𝜎”.
It relies on a key property of itemsets, the “Apriori property”, which follows from the
anti-monotonicity of the support measure:
1. All subsets of a frequent itemset must be frequent – Apriori property.
2. For any infrequent itemset, all its supersets must be infrequent – anti-monotone
property.
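For example, if the itemset {milk, bread} appears in 3 out of 10 transactions, then
support({milk, bread}) = 3/10 = 0.3. With min_sup 𝜎 = 0.4 this itemset is not frequent, and by
the anti-monotone property no superset such as {milk, bread, butter} needs to be counted,
since it cannot be frequent either.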
Problem Statement: Generate frequent itemsets and association rules from a
transactional dataset.
Apriori Algorithm:
Following are the main steps of the algorithm:
1. Calculate the support of itemsets of size k (= 1) in the transactional database. This is
called the set of candidate 1-itemsets.
2. Prune candidates by eliminating itemsets with a support less than the given threshold 𝜎
by scanning the dataset D.
3. Join the frequent itemsets to form candidate sets of size k + 1, e.g. join frequent
1-itemsets to form 2-itemsets.
4. Repeat steps 2 and 3 until no more itemsets can be formed. This happens when all newly
formed sets have a support less than the given support threshold. A minimal sketch of
this loop is given below.
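An illustrative sketch of this generate-and-prune loop in plain Python (the toy transactions
and the min_sup value are assumptions for illustration only; the assignments below use the
mlxtend library instead):

from itertools import combinations

# Toy transactional database (assumed for illustration)
transactions = [
    {"milk", "bread"}, {"milk", "butter"},
    {"milk", "bread", "butter"}, {"bread"}, {"milk", "bread"},
]
min_sup = 0.4  # support threshold sigma

def support(itemset):
    # Fraction of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Steps 1 & 2: candidate 1-itemsets, pruned by min_sup
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
k = 1
while frequent:
    print(k, [set(f) for f in frequent])
    # Step 3: join frequent k-itemsets into candidate (k+1)-itemsets
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
    # Step 2 again: prune candidates below the support threshold
    frequent = [c for c in candidates if support(c) >= min_sup]
    k += 1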
Required libraries:
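Assuming the mlxtend implementation is used (installed in step 1 below), a typical set of
imports would be:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules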
Working in Jupyter notebook:
1. Install the library for the Apriori algorithm using: !pip install mlxtend
2. Load the “basket” dataset using the required command.
3. Perform pre-processing (if required). Select only the products bought by a customer; do
not include customer details.
4. Apply the Apriori algorithm with the given values for support and confidence using the
required library (a sketch of these steps follows below).
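A minimal sketch of steps 2-4 with mlxtend (the file name basket.csv, the customer_id
column, and the threshold values are assumptions; adjust them to the dataset provided):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load the basket dataset; each row is assumed to be one transaction
df = pd.read_csv("basket.csv")

# Pre-processing: drop customer details, keep only products bought,
# and turn each row into a list of items
products = df.drop(columns=["customer_id"], errors="ignore")
transactions = products.apply(lambda row: list(row.dropna()), axis=1).tolist()

# One-hot encode the transactions as mlxtend requires
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets and association rules
itemsets = apriori(onehot, min_support=0.1, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.1)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())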
Assignment Questions
1. Find frequent itemsets in the dataset using Apriori with min_sup 0.1.
2. Find the association rules in the dataset having min_conf 10%.
3. Find association rules having minimum antecedent_len 2 and confidence greater than 0.75.
4. Load the "zoo" data
5. Perform pre-processing (if required)
Hint: All the attributes will be transformed using one-hot encoding.
6. Find frequent itemsets in the zoo dataset having min support 0.5.
7. Find association rules having min confidence 0.5.
8. Convert the dataset into two classes "Mammal" and "others"
9. Partition the dataset into training and testing part (70:30)
10. Generate association rules for the "mammal" class (training data) with min_sup 0.4 and
confidence as 1.
11. Test the rules generated on the testing dataset and find precision and recall for the rule-
based classifier.
12. Apply a decision tree to the dataset and calculate the performance evaluation measures.
13. Which of the two performs better?
Clustering using K-Means
K-Means:
K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct,
non-overlapping subgroups (clusters), where each data point belongs to exactly one group. It
tries to make the intra-cluster similarities as high as possible while also keeping inter-cluster
similarities as low as possible. Each data point is assigned to the cluster whose centroid is at
minimum distance.
Algorithm:
Following are the main steps of the algorithm:
1. Specify the number of clusters ‘K’.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data
points without replacement.
3. Assign each data point to the closest cluster (centroid).
4. Compute the centroids of the clusters by taking the average of all data points
that belong to each cluster.
5. Keep iterating until there is no change to the centroids, i.e. the assignment of data points
to clusters is not changing. A scikit-learn sketch is given below.
Required libraries:
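A minimal sketch using scikit-learn (the toy feature matrix X and the value of K are
assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D feature matrix (assumed for illustration)
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Step 1: specify K; steps 2-5 run inside fit()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids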
DBSCAN:
The main concept of DBSCAN algorithm is to locate regions of high density that are separated
from one another by regions of low density.
• Density at a point p: Number of points within a hypersphere of Radius Eps (ϵ) from point p.
• Dense Region: For each point in the cluster, the hypersphere with radius ϵ contains at least
minimum number of points (MinPts).
𝑁𝜖(p) = { q ∈ D | dist(p, q) ≤ 𝜖 }
1. Core Point: a point p with |𝑁𝜖(p)| ≥ MinPts (including p itself). Core points, as the name
suggests, usually lie within the interior of a cluster.
2. Border Point: a point that has fewer than MinPts within its 𝑁𝜖, but lies in the
neighborhood of a core point.
3. Noise: any data point that is neither a core point nor a border point.
Advantages of DBSCAN:
1. DBSCAN is good at separating clusters of high density versus clusters of low density
within a given dataset.
2. It handles outliers well.
3. It can discover clusters of arbitrary shape.
4. Efficient for large databases.
Disadvantages of DBSCAN:
1. Does not work well when dealing with clusters of varying densities. This is
because a single choice of 𝜖 and MinPts will either find only the clusters of high
density or merge many clusters into one.
2. DBSCAN is sensitive to its parameters 𝜖 and MinPts.
Steps of Algorithm:
Following are the main steps of the algorithm:
1. All data points are initially tagged as unprocessed. The algorithm starts with an arbitrary
unprocessed point and retrieves its neighborhood information from the dataset.
2. If this point has at least MinPts within its 𝜖-neighborhood, cluster formation starts.
Otherwise the point is labelled as processed. Such a point may later be found within the
𝜖-neighborhood of a different point and thus be made part of that cluster, so we do not
label it as noise yet.
3. If a point is found to be a core point, then the points within its 𝜖-neighborhood are also
part of the cluster, so all points found within the 𝜖-neighborhood are added to a queue Q.
4. Repeat this process for every point in Q until the queue is empty. Each core point is
tagged as ‘core’ and processed; a point from Q that is not core is tagged as a border point
and processed. In the last step, all points that are neither core nor border are tagged
as ‘noise’.
5. This completes the evolution of one cluster. Repeat the process until all clusters are
identified.
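A minimal sketch of DBSCAN with scikit-learn (the two-moons data and parameter values
here are illustrative assumptions; the assignment below uses 𝜖 = 0.3 and MinPts = 50 on the
s1_modified dataset):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative data (assumed for demonstration)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the radius epsilon, min_samples is MinPts
db = DBSCAN(eps=0.3, min_samples=5, metric="euclidean").fit(X)

labels = db.labels_  # cluster label per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", np.sum(labels == -1))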
Required libraries:
import pandas as pd
from sklearn import preprocessing
import numpy as np
from sklearn import metrics # for evaluations
import matplotlib.pyplot as plt
%matplotlib inline
KMeans
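A minimal sketch of fitting K-means and plotting the resulting clusters, assuming a 2-D
feature matrix X and K = 3 (both assumptions for illustration):

from sklearn.cluster import KMeans

# Fit K-means on a 2-D feature matrix X (K assumed to be 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Scatter plot of the clusters, coloured by label, with centroids marked
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x')
plt.title("K-Means clusters")
plt.show()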
Hierarchical Clustering
import scipy.cluster.hierarchy as shc  # needed for linkage and dendrogram

plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(X, method='single'))
Assignment Questions
1. Load the “s1_modified” dataset and perform standardization as a pre-processing task.
2. Apply the DBSCAN algorithm with Scikit-Learn using epsilon 0.3 and min points 50. Use
Euclidean distance for calculating the distance between points.
3. Plot the clusters.
4. Detect the outliers using DBSCAN.
5. Import the required libraries for agglomerative clustering.
Hint: from sklearn.cluster import AgglomerativeClustering
6. Load the dataset into a data frame and perform pre-processing if required.
7. Apply agglomerative clustering with single link.
8. Plot the clusters.
9. Plot the dendrograms.
10. Apply agglomerative clustering using complete link and Ward's method and plot their
dendrograms.