Python DM Lab Manual Part 2
Apriori:
The Apriori algorithm is used for mining frequent itemsets from a transactional database. It
uses the “support” parameter to put a constraint on the quality of itemsets. Support refers to
the fraction of transactions in which an itemset occurs; an itemset is “frequent” if its support
meets a minimum threshold. This threshold, “min_sup”, is denoted by “𝜎”.
It relies on a key property of itemsets, the “Apriori property”, which follows from the
anti-monotonicity of the support measure:
1. All subsets of a frequent itemset must be frequent – Apriori property.
2. For any infrequent itemset, all its supersets must be infrequent – anti-monotone
property.
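For example, if the itemset {milk, bread} appears in 3 out of 10 transactions, then
support({milk, bread}) = 3/10 = 0.3. With min_sup 𝜎 = 0.4 this itemset is not frequent, and by
the anti-monotone property no superset such as {milk, bread, butter} needs to be counted,
since it cannot be frequent either.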
Problem Statement: Generate frequent itemsets and association rules from a
transactional dataset.
Apriori Algorithm:
Following are the main steps of the algorithm:
1. Calculate the support of itemsets of size k (= 1) in the transactional database. This is
called the set of candidate 1-itemsets.
2. Prune candidates by eliminating itemsets with a support less than the given threshold 𝜎
by scanning the dataset D.
3. Join the frequent itemsets to form candidate sets of size k + 1, e.g. join frequent
1-itemsets to form 2-itemsets.
4. Repeat steps 2 and 3 until no more itemsets can be formed. This happens when all newly
formed sets have a support less than the given support threshold. A minimal sketch of
this loop is given below.
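An illustrative sketch of this generate-and-prune loop in plain Python (the toy transactions
and the min_sup value are assumptions for illustration only; the assignments below use the
mlxtend library instead):

from itertools import combinations

# Toy transactional database (assumed for illustration)
transactions = [
    {"milk", "bread"}, {"milk", "butter"},
    {"milk", "bread", "butter"}, {"bread"}, {"milk", "bread"},
]
min_sup = 0.4  # support threshold sigma

def support(itemset):
    # Fraction of transactions that contain the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Steps 1 & 2: candidate 1-itemsets, pruned by min_sup
items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_sup]
k = 1
while frequent:
    print(k, [set(f) for f in frequent])
    # Step 3: join frequent k-itemsets into candidate (k+1)-itemsets
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == k + 1}
    # Step 2 again: prune candidates below the support threshold
    frequent = [c for c in candidates if support(c) >= min_sup]
    k += 1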
Required libraries:
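Assuming the mlxtend implementation is used (installed in step 1 below), a typical set of
imports would be:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules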
Working in Jupyter notebook:
1. Install the library for the Apriori algorithm using: !pip install mlxtend
2. Load the “basket” dataset using the required command.
3. Perform pre-processing (if required). Select only the products bought by a customer; do
not include customer details.
4. Apply the Apriori algorithm with the given values for support and confidence using the
required library (a sketch of these steps follows below).
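A minimal sketch of steps 2-4 with mlxtend (the file name basket.csv, the customer_id
column, and the threshold values are assumptions; adjust them to the dataset provided):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Load the basket dataset; each row is assumed to be one transaction
df = pd.read_csv("basket.csv")

# Pre-processing: drop customer details, keep only products bought,
# and turn each row into a list of items
products = df.drop(columns=["customer_id"], errors="ignore")
transactions = products.apply(lambda row: list(row.dropna()), axis=1).tolist()

# One-hot encode the transactions as mlxtend requires
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit_transform(transactions), columns=te.columns_)

# Frequent itemsets and association rules
itemsets = apriori(onehot, min_support=0.1, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.1)
print(rules[["antecedents", "consequents", "support", "confidence"]].head())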
Assignment Questions
1. Find frequent itemsets in the dataset using Apriori with min_sup 0.1.
2. Find the association rules in the dataset having min_conf 10%.
3. Find association rules having minimum antecedent_len 2 and confidence greater than 0.75.
4. Load the "zoo" data
5. Perform pre-processing (if required)
Hint: All the attributes will be transformed using one-hot encoding.
6. Find frequent itemsets in the zoo dataset having min support 0.5.
7. Find association rules having min confidence 0.5.
8. Convert the dataset into two classes "Mammal" and "others"
9. Partition the dataset into training and testing part (70:30)
10. Generate association rules for the "mammal" class (training data) with min_sup 0.4 and
confidence as 1.
11. Test the rules generated on the testing dataset and find precision and recall for the rule-
based classifier.
12. Apply a decision tree to the dataset and calculate the performance evaluation measures.
13. Which of the two performs better?
Clustering using K-Means
K-Means:
K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct,
non-overlapping subgroups (clusters), where each data point belongs to exactly one group. It
tries to make the intra-cluster similarities as high as possible while also keeping inter-cluster
similarities as low as possible. Each data point is assigned to the cluster whose centroid is at
minimum distance.
Algorithm:
Following are the main steps of the algorithm:
1. Specify the number of clusters ‘K’.
2. Initialize centroids by first shuffling the dataset and then randomly selecting K data
points without replacement.
3. Assign each data point to the closest cluster (centroid).
4. Compute the centroids of the clusters by taking the average of all data points
that belong to each cluster.
5. Keep iterating until there is no change to the centroids, i.e. the assignment of data points
to clusters is not changing. A scikit-learn sketch is given below.
Required libraries:
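A minimal sketch using scikit-learn (the toy feature matrix X and the value of K are
assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D feature matrix (assumed for illustration)
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

# Step 1: specify K; steps 2-5 run inside fit()
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each point
print(kmeans.cluster_centers_)  # final centroids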
DBSCAN:
The main concept of DBSCAN algorithm is to locate regions of high density that are separated
from one another by regions of low density.
• Density at a point p: Number of points within a hypersphere of Radius Eps (ϵ) from point p.
• Dense Region: For each point in the cluster, the hypersphere with radius ϵ contains at least
minimum number of points (MinPts).
𝑁𝜖(p) = { q ∈ D | dist(p, q) ≤ 𝜖 }
1. Core Point: a point p with |𝑁𝜖(p)| ≥ MinPts (including p itself). Core points, as the name
suggests, usually lie within the interior of a cluster.
2. Border Point: a point that has fewer than MinPts within its 𝑁𝜖, but lies in the
neighborhood of a core point.
3. Noise: any data point that is neither a core point nor a border point.
Advantages of DBSCAN:
1. DBSCAN is good at separating clusters of high density versus clusters of low density
within a given dataset.
2. It handles outliers well.
3. It can discover clusters of arbitrary shape.
4. Efficient for large databases.
Disadvantages of DBSCAN:
1. Does not work well when dealing with clusters of varying densities. This is
because a single choice of 𝜖 and MinPts will either find only the clusters of high
density or merge many clusters into one.
2. DBSCAN is sensitive to its parameters 𝜖 and MinPts.
Steps of Algorithm:
Following are the main steps of the algorithm:
1. All data points are initially tagged as unprocessed. The algorithm starts with an arbitrary
unprocessed point and retrieves its neighborhood information from the dataset.
2. If this point has at least MinPts within its 𝜖-neighborhood, cluster formation starts.
Otherwise the point is labelled as processed. Such a point may later be found within the
𝜖-neighborhood of a different point and thus be made part of that cluster, so we do not
label it as noise yet.
3. If a point is found to be a core point, then the points within its 𝜖-neighborhood are also
part of the cluster, so all points found within the 𝜖-neighborhood are added to a queue Q.
4. Repeat this process for every point in Q until the queue is empty. Each core point is
tagged as ‘core’ and processed; a point from Q that is not core is tagged as a border point
and processed. In the last step, all points that are neither core nor border are tagged
as ‘noise’.
5. This completes the evolution of one cluster. Repeat the process until all clusters are
identified.
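A minimal sketch of DBSCAN with scikit-learn (the two-moons data and parameter values
here are illustrative assumptions; the assignment below uses 𝜖 = 0.3 and MinPts = 50 on the
s1_modified dataset):

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative data (assumed for demonstration)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the radius epsilon, min_samples is MinPts
db = DBSCAN(eps=0.3, min_samples=5, metric="euclidean").fit(X)

labels = db.labels_  # cluster label per point; -1 marks noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", np.sum(labels == -1))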
Required libraries:
import pandas as pd
from sklearn import preprocessing
import numpy as np
from sklearn import metrics # for evaluations
import matplotlib.pyplot as plt
%matplotlib inline
KMeans
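A minimal sketch of fitting K-means and plotting the resulting clusters, assuming a 2-D
feature matrix X and K = 3 (both assumptions for illustration):

from sklearn.cluster import KMeans

# Fit K-means on a 2-D feature matrix X (K assumed to be 3)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Scatter plot of the clusters, coloured by label, with centroids marked
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, cmap='viridis')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x')
plt.title("K-Means clusters")
plt.show()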
Hierarchical Clustering
import scipy.cluster.hierarchy as shc  # needed for linkage and dendrogram

plt.figure(figsize=(10, 7))
plt.title("Dendrograms")
dend = shc.dendrogram(shc.linkage(X, method='single'))
Assignment Questions
1. Load the “s1_modified” dataset and perform standardization as a pre-processing task.
2. Apply the DBSCAN algorithm with Scikit-Learn using epsilon 0.3 and min points 50. Use
Euclidean distance for calculating the distance between points.
3. Plot the clusters.
4. Detect the outliers using DBSCAN.
5. Import the required libraries for agglomerative clustering.
Hint: from sklearn.cluster import AgglomerativeClustering
6. Load the dataset into a data frame and perform pre-processing if required.
7. Apply agglomerative clustering with single link.
8. Plot the clusters.
9. Plot the dendrograms.
10. Apply agglomerative clustering using complete link and Ward's method and plot their
dendrograms.