Data Mining Lab Manual
R18 B.Tech. CSE (Computer Networks) III & IV Year, JNTU Hyderabad (Jawaharlal Nehru
Technological University, Hyderabad)
College Code: 7Q
DATA MINING
LAB MANUAL
Regulation:18/JNTUH
Academic year: 2023-2024
B.TECH III YEAR I SEM(CSD)
BRILLIANT GRAMMAR SCHOOL EDUCATIONAL SOCIETY’S
GROUP OF INSTITUTIONS – INTEGRATED CAMPUS
DEPARTMENT
OF
COMPUTER SCIENCE AND DESIGN
Brilliant Grammar School Educational Society's Group of Institutions - Integrated Campus
Abdullapur (V), Hayathnagar (M), Hyderabad, R.R. Dist - 501505
LIST OF EXPERIMENTS:
1. Data Processing Techniques: (i) Data cleaning (ii) Data transformation – Normalization
Data Cleaning :
Data in the real world is frequently incomplete, noisy, and inconsistent. Many parts of the data may be
irrelevant or missing. Data cleaning is carried out to handle this aspect. Data cleaning methods aim to fill
in missing values, smooth out noise while identifying outliers, and fix data discrepancies. Unclean data
can confuse both the analyst and the model, so the data should be run through various data
cleaning/cleansing steps before it is used.
It is fairly common for a dataset to contain missing values. They may have arisen during data
collection or as a result of a data validation rule, but missing values must be handled anyway.
Common ways of handling them are:
1. Dropping rows/columns: If a complete row (or column) contains only NaN values, it carries no
information, so such rows/columns can be dropped immediately. Likewise, if a large share of a
row/column is missing, say more than 65%, one can choose to drop it.
2. Checking for duplicates: If the same row or column is repeated, you can drop it and keep only the
first instance, so that the machine learning algorithm is not offered the same information twice.
3. Estimate missing values: If only a small percentage of the values are missing, basic interpolation
methods can be used to fill in the gaps. However, the most typical approach to dealing with missing
data is to fill the gaps with the feature's mean, median, or mode value (see the sketch below).
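A minimal pandas sketch of the three techniques above (the DataFrame and its columns are hypothetical):
import pandas as pd
import numpy as np

# Hypothetical dataset with missing values and a duplicated row
df = pd.DataFrame({
    "age":    [25, np.nan, 31, 31, np.nan],
    "income": [50000, 60000, np.nan, np.nan, np.nan],
})
df = df.dropna(how="all")                      # 1. drop rows that are entirely NaN
df = df.loc[:, df.isna().mean() <= 0.65]       # 1. drop columns with more than 65% missing values
df = df.drop_duplicates(keep="first")          # 2. keep only the first instance of duplicate rows
df = df.fillna(df.mean(numeric_only=True))     # 3. impute the remaining gaps with the column mean
print(df)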
Noisy data is meaningless data that machines cannot interpret. It can be caused by poor data collection,
data entry problems, and so on. It can be dealt with in the following ways:
1. Binning Method: This method smooths data that has been sorted. The data is divided into equal-
sized segments, and each segment is dealt with independently. All values in a segment can be replaced
by the segment's mean, or the segment's boundary values can be used to complete the smoothing.
2. Clustering: In this method, related data is grouped into clusters. Outliers either go unnoticed or
fall outside the clusters.
3. Regression: Data can be smoothed by fitting it to a regression function. The regression
model employed may be linear (with only one independent variable) or multiple (with several
independent variables). A sketch of the binning method follows.
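A minimal sketch of smoothing by bin means (the sorted data values and the bin size are made up for illustration):
import numpy as np

data = np.sort(np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]))
bins = data.reshape(4, 3)                   # 4 equal-sized bins of 3 values each
smoothed = np.repeat(bins.mean(axis=1), 3)  # replace every value by its bin mean
print(smoothed)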
Data Integration
Data integration is involved in data analysis tasks that combine data from multiple sources into a coherent data store.
These sources may include multiple databases. How can the data be matched up? A data
analyst may find Customer_ID in one database and cust_id in another; how can he be sure that
these two attributes refer to the same entity? Databases and data warehouses have metadata (data
about the data) that can be used to help avoid such errors in schema integration.
Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or [0.0,
1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and n observed values for it: V1, V2, V3, ..., Vn.
o Min-max normalization: This method implements a linear transformation on the original data.
Let minA and maxA be the minimum and maximum values observed for attribute A, and let Vi be
the value of attribute A that has to be normalized. Min-max normalization maps Vi to V'i in a new,
smaller range [new_minA, new_maxA] using the formula:
V'i = ((Vi - minA) / (maxA - minA)) × (new_maxA - new_minA) + new_minA
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the attribute
income, and [0.0, 1.0] is the range to which we have to map the value $73,600. Then
V'i = (73,600 - 12,000) / (98,000 - 12,000) × (1.0 - 0.0) + 0.0 = 0.716.
o Z-score normalization: This method normalizes the value of attribute A using
the mean and standard deviation. The following formula is used for Z-score normalization:
V'i = (Vi - Ā) / σA
Here Ā and σA are the mean and standard deviation of attribute A, respectively.
For example, if the mean and standard deviation of attribute A are $54,000 and $16,000,
the value $73,600 is normalized to (73,600 - 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal point
in the value. The movement of the decimal point depends on the maximum absolute value of A.
The formula for decimal scaling is:
V'i = Vi / 10^j
where j is the smallest integer such that max(|V'i|) < 1. For example, if the values of A range
from -986 to 917, the maximum absolute value is 986, so every value is divided by 1,000 (j = 3)
and -986 normalizes to -0.986.
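A minimal sketch of the three normalization methods in Python (the income values are hypothetical):
import numpy as np

income = np.array([12000.0, 35000.0, 54000.0, 73600.0, 98000.0])

# Min-max normalization to the range [0.0, 1.0]
min_max = (income - income.min()) / (income.max() - income.min())

# Z-score normalization
z_score = (income - income.mean()) / income.std()

# Decimal scaling: divide by 10^j so that the largest absolute value is below 1
j = int(np.ceil(np.log10(np.abs(income).max())))
decimal_scaled = income / 10 ** j

print(min_max, z_score, decimal_scaled, sep="\n")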
A fact table can be partitioned in various ways. In horizontal partitioning,
we have to keep in mind the requirements for manageability of the data warehouse.
Vertical Partitioning
Vertical partitioning splits the data vertically. The following image depicts how vertical
partitioning is done.
Vertical partitioning can be performed in the following two ways:
o Normalization
o Row Splitting
Normalization
Normalization is the standard relational method of database organization. In this method,
rows are collapsed into a single row, which reduces space. Take a look at the following
tables, which show how normalization is performed.
Row Splitting
Row splitting tends to leave a one-to-one map between partitions. The motive of row
splitting is to speed up access to a large table by reducing its size.
Hash Partitioning
Hash partitioning maps data to partitions based on a hashing algorithm that Oracle
applies to a partitioning key that you identify. The hashing algorithm evenly distributes
rows among partitions, giving partitions approximately the same size. Hash partitioning
is the ideal method for distributing data evenly across devices. Hash partitioning is also
an easy-to-use alternative to range partitioning, especially when the data to be partitioned
is not historical.
Oracle Database uses a linear hashing algorithm. To prevent data from clustering
within specific partitions, you should define the number of partitions as a power of two
(for example, 2, 4, 8).
Round-robin partitioning: the simplest strategy, it ensures uniform data distribution. With n partitions,
the ith tuple in insertion order is assigned to partition (i mod n). This strategy enables sequential
access to a relation to be done in parallel. However, direct access to individual tuples, based on a
predicate, requires accessing the entire relation.
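A minimal sketch contrasting round-robin and hash partitioning of tuples in Python (the keys, values, and partition count are made up for illustration; this is not Oracle's internal hashing algorithm):
n = 4                                                  # number of partitions (a power of two)
tuples = [("cust-%d" % i, i * 10) for i in range(10)]  # hypothetical (key, value) tuples

# Round-robin: the i-th tuple in insertion order goes to partition (i mod n)
round_robin = {p: [] for p in range(n)}
for i, t in enumerate(tuples):
    round_robin[i % n].append(t)

# Hash partitioning: a hash of the partitioning key decides the partition
hash_partitions = {p: [] for p in range(n)}
for key, value in tuples:
    hash_partitions[hash(key) % n].append((key, value))

print(round_robin)
print(hash_partitions)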
Star Schema
Each dimension in a star schema is represented with only one dimension table.
This dimension table contains the set of attributes.
The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
There is a fact table at the center. It contains the keys to each of four dimensions.
The fact table also contains the attributes, namely dollars sold and units sold.
Star Schema Definition
The star schema that we have discussed can be defined using Data Mining Query
Language (DMQL) as follows −
define cube sales star [time, item, branch, location]:
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Snowflake Schema
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the star schema, the dimension tables in a snowflake schema are normalized. For
example, the item dimension table in the star schema is normalized and split into two
dimension tables, namely the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Snowflake Schema Definition
Snowflake schema can be defined using DMQL as follows −
define cube sales snowflake [time, item, branch, location]:
define dimension time as (time key, day, day of week, month, quarter, year)
define dimension item as (item key, item name, brand, type, supplier type)
define dimension branch as (branch key, branch name, branch type)
define dimension location as (location key, street, city, province or state, country)
Operations:
1. Slice is the act of picking a rectangular subset of a cube by choosing a single value
for one of its dimensions, creating a new cube with one fewer dimension.[4] The
picture shows a slicing operation: the sales figures of all sales regions and all product
categories of the company in the years 2005 and 2006 are "sliced" out of the data cube.
2. Dice: The dice operation produces a subcube by allowing the analyst to pick specific
values of multiple dimensions.[5] The picture shows a dicing operation: the new cube
shows the sales figures of a limited number of product categories, while the time and region
dimensions cover the same range as before.
3. Drill Down/Up allows the user to navigate among levels of data ranging from the
most summarized (up) to the most detailed (down).[4] The picture shows a drill-down
operation: The analyst moves from the summary category "Outdoor-Schutzausrüstung"
to see the sales figures for the individual products.
4. Pivot allows an analyst to rotate the cube in space to see its various faces. For
example, cities could be arranged vertically and products horizontally while viewing
data for a particular quarter. Pivoting could replace products with time periods to see
data across time for a single product.
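A minimal pandas sketch of slice, dice, drill-up, and pivot on a toy sales table (the data and column names are made up):
import pandas as pd

sales = pd.DataFrame({
    "year":     [2005, 2005, 2006, 2006, 2006],
    "region":   ["North", "South", "North", "South", "North"],
    "category": ["Outdoor", "Golf", "Outdoor", "Golf", "Outdoor"],
    "amount":   [100, 150, 120, 130, 160],
})

slice_2006 = sales[sales["year"] == 2006]                          # slice: fix a single value of one dimension
dice = sales[(sales["year"].isin([2005, 2006])) &
             (sales["category"] == "Outdoor")]                     # dice: pick specific values of several dimensions
drill_up = sales.groupby("year")["amount"].sum()                   # drill-up: summarize to the year level
pivot = sales.pivot_table(index="region", columns="year",
                          values="amount", aggfunc="sum")          # pivot: rotate the view
print(slice_2006, dice, drill_up, pivot, sep="\n\n")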
The data cube approach can be considered a data-warehouse-based, precomputation-oriented approach:
it performs off-line aggregation before an OLAP or data mining query is
submitted for processing. In contrast, the attribute-oriented induction approach
is a query-oriented, generalization-based, on-line data analysis method. The
general idea of attribute-oriented induction is to first collect the task-relevant data using
a database query and then perform generalization based on the examination of the
number of distinct values of each attribute in the relevant collection of data.
The generalization is implemented by attribute removal or attribute generalization.
Aggregation is implemented by combining identical generalized tuples and
accumulating their specific counts. This decreases the size of the generalized data set.
The resulting generalized relation can be mapped into several forms for
presentation to the user, including charts or rules.
Algorithm: The process of attribute-oriented induction is as follows −
• First, data focusing must be implemented before attribute-oriented induction. This
step corresponds to the description of the task-relevant records (i.e., data for analysis).
The data are collected based on the information provided in the data mining query.
• Because a data mining query is usually relevant to only a portion of the database,
selecting the relevant set of data not only makes mining more efficient, but also
yields more significant results than mining the whole database.
• Specifying the set of relevant attributes (i.e., the attributes for mining, as
indicated in DMQL with the in relevance to clause) may be difficult for the user. A user
may choose only a few attributes that he or she thinks are important, while missing others that could
also play a role in the description.
• For example, suppose that the dimension birth_place is defined by the attributes city,
province_or_state, and country. To allow generalization on the birth_place
dimension, the other attributes defining this dimension should also be included.
• In other words, having the system automatically include province_or_state and
country as relevant attributes enables city to be generalized to these higher conceptual
levels during the induction process.
• At the other extreme, the user may have introduced too many attributes
by specifying all of the possible attributes with the clause "in relevance to *". In this
case, all of the attributes in the relation specified by the from clause would be included in
the analysis.
• Some attributes are unlikely to contribute to an interesting representation. A
correlation-based or entropy-based analysis method can be used to perform attribute
relevance analysis and filter out statistically irrelevant or weakly relevant attributes
from the descriptive mining process.
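A minimal sketch of attribute-oriented induction with pandas (the task-relevant data, the concept hierarchy, and the generalization choices are hypothetical):
import pandas as pd

data = pd.DataFrame({
    "name": ["Ann", "Bob", "Carl", "Dina"],               # too many distinct values -> remove
    "city": ["Delhi", "Mumbai", "Paris", "Lyon"],
    "gpa":  [3.7, 3.2, 3.9, 2.8],
})
city_to_country = {"Delhi": "India", "Mumbai": "India", "Paris": "France", "Lyon": "France"}

generalized = data.drop(columns=["name"])                             # attribute removal
generalized["city"] = generalized["city"].map(city_to_country)        # attribute generalization (climb the concept hierarchy)
generalized = generalized.rename(columns={"city": "country"})
generalized["gpa"] = pd.cut(generalized["gpa"], bins=[0, 3.0, 3.5, 4.0],
                            labels=["average", "good", "excellent"])  # generalize the numeric attribute

# Merge identical generalized tuples and accumulate their counts
prime_relation = (generalized.groupby(["country", "gpa"], observed=True)
                  .size().reset_index(name="count"))
print(prime_relation)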
Step 1: Open the data file in Weka Explorer. It is presumed that the required data fields have
been discretized. In this example it is the age attribute.
Step 2: Clicking on the Associate tab will bring up the interface for the association rule algorithms.
Step 4: In order to change the parameters for the run (e.g., support, confidence, etc.) we
click on the text box immediately to the right of the Choose button.
Dataset contactlenses.arff
The following screenshot shows the association rules that were generated when the Apriori
algorithm is applied on the given dataset.
Step 1: Run the Weka Explorer and load the data file iris.arff in the preprocessing interface.
Step 2: In order to perform clustering, select the Cluster tab in the Explorer and click on the
Choose button. This step results in a dropdown list of available clustering algorithms.
Step 4: Next, click on the text box to the right of the Choose button to get the popup window shown
in the screenshots. In this window we enter six as the number of clusters and we leave the
value of the seed as it is. The seed value is used in generating a random number, which in turn is
used for making the initial assignment of instances to clusters.
Step 5: Once the options have been specified, we run the clustering algorithm. In the
'Cluster mode' panel we make sure that the 'Use training set' option is
selected, and then we click the Start button. This process and the resulting window are shown in the
following screenshots.
Step 6: The result window shows the centroid of each cluster as well as statistics on the
number and the percentage of instances assigned to the different clusters. Here the cluster centroids are
mean vectors for each cluster, and they can be used to characterize the clusters. For example,
the centroid of cluster 1 shows that for the class Iris-versicolor the mean value of the sepal length is
5.4706, the sepal width 2.4765, the petal width 1.1294, and the petal length 3.7941.
Step 7: Another way of understanding the characteristics of each cluster is through visualization.
To do this, right-click the result set in the Result list panel and select
'Visualize cluster assignments'.
The following screenshot shows the clusters that were generated when the SimpleKMeans
algorithm is applied on the given dataset.
The Apriori algorithm is an influential algorithm for mining frequent item sets for Boolean
association rules. It uses a "bottom-up" approach, where frequent subsets are extended one item at a
time (a step known as candidate generation), and groups of candidates are tested against the data.
Problem:
TID   ITEMS
100   1,3,4
200   2,3,5
300   1,2,3,5
400   2,5
Find the frequent item sets for the above transactions with a minimum support of 2 and a
confidence measure of 70% (i.e., 0.7).
Procedure:
Step 1:
Count the number of transactions in which each item occurs.
ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
4      1
5      3
Step 2:
Eliminate all items whose number of transactions is less than the minimum support
(2 in this case).
ITEM   NO. OF TRANSACTIONS
1      2
2      3
3      3
5      3
These are the single items that are bought frequently. Now let's say we want to find pairs of
items that are bought frequently. We continue from the above table (the table in Step 2).
Step 3:
We start making pairs from the first item, like 1,2; 1,3; 1,5, and then from the second item, like
2,3; 2,5. We do not consider 2,1 because we already covered 1,2 when making pairs with 1,
and buying 1 and 2 together is the same as buying 2 and 1 together. After making all the pairs we
get:
ITEM PAIRS
1,2
1,3
1,5
2,3
2,5
3,5
Step 4:
Now, we count how many times each pair is bought together.
ITEM PAIRS   NO. OF TRANSACTIONS
1,2          1
1,3          2
1,5          1
2,3          2
2,5          3
3,5          2
Step 5:
Again remove all item pairs having a number of transactions less than 2.
ITEM PAIRS   NO. OF TRANSACTIONS
1,3          2
2,3          2
2,5          3
3,5          2
These pairs of items are bought frequently together. Now, let's say we want to find a set of
three items that are bought together. We use the above table (of Step 5) and make a set of three
items.
Step 6:
To make the set of three items we need one more rule (termed self-join). It simply
means that, from the item pairs in the above table, we find two pairs with the same first item; we get
(2,3) and (2,5), which give (2,3,5). Then we find how many times (2,3,5) is bought together
in the original table and we get the following:
ITEM SET   NO. OF TRANSACTIONS
(2,3,5)    2
Thus, the set of three items that are bought together from this data is (2, 3, 5).
Confidence:
We can take our frequent item set knowledge even further by finding association rules using the
frequent item sets. In simple words, we know (2, 3, 5) are bought together frequently, but what is
the association between them? To do this, we create a list of all subsets of the frequently bought
items (2, 3, 5); in our case we get the following subsets:
{2}
{3}
{5}
{2,3}
{3,5}
{2,5}
So first create a CSV file for the above problem; the CSV file for the above problem will
look like the rows and columns in the above figure. This file is created in an Excel sheet.
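A brute-force Python sketch of the counting carried out above (it enumerates candidate itemsets directly instead of using Apriori's level-wise pruning; the data and thresholds are the ones given in the problem):
from itertools import combinations

transactions = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
min_support, min_confidence = 2, 0.7

def support(itemset):
    # Number of transactions that contain every item of the itemset
    return sum(1 for items in transactions.values() if itemset <= items)

all_items = sorted(set().union(*transactions.values()))
frequent = [frozenset(c) for k in (1, 2, 3)
            for c in combinations(all_items, k) if support(frozenset(c)) >= min_support]
print("Frequent itemsets:", [tuple(sorted(f)) for f in frequent])

# Association rules A -> B with confidence = support(A union B) / support(A)
for itemset in (f for f in frequent if len(f) > 1):
    for k in range(1, len(itemset)):
        for antecedent in map(frozenset, combinations(itemset, k)):
            confidence = support(itemset) / support(antecedent)
            if confidence >= min_confidence:
                print(sorted(antecedent), "->", sorted(itemset - antecedent), round(confidence, 2))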
TID   ITEMS
100   1,3,4
200   2,3,5
300   1,2,3,5
400   2,5
Solution:
As in the Apriori algorithm, find the frequency of occurrence of each item in the dataset and
then prioritize the items in descending order of their frequency of occurrence.
Eliminating the items whose count is less than the minimum support and assigning the
priorities, we obtain the following table.
ITEM   FREQUENCY   PRIORITY
1      2           4
2      3           1
3      3           2
5      3           3
TID   ITEMS (ordered by priority)
100   3,1
200   2,3,5
300   2,3,5,1
400   2,5
Construction of the tree:
Note that every FP-tree has a 'null' node as the root node. So, draw the root node first and attach
the items of row 1 one by one in order, writing their occurrence counts next to them. The
tree is further expanded by adding nodes according to the prefixes formed, incrementing the
counts every time the same prefix occurs, and hence the tree is built.
Prefixes (conditional pattern bases):
1 -> {3:1}, {2,3,5:1}
5 -> {2,3:2}, {2:1}
3 -> {2:2}
Conditional FP-trees:
1 -> 3:2   /* 2 and 5 are eliminated because their counts are less than the minimum support,
and the count of 3 is obtained by adding its occurrences in both instances */
5 -> 2:3, 3:2
3 -> 2:2
Therefore, the frequent item sets are {3,1}, {2,3,5}, {2,5}, {2,3}, and {3,5}.
The tree is constructed as below:
[FP-tree diagram: the root (null) has two children: 3:1 with child 1:1, and 2:3 with children
3:2 (whose child 5:2 has child 1:1) and 5:1.]
Generating the association rules for the above tree and calculating
the confidence measures, we get:
{3}=>{1}=2/3=67%
{1}=>{3}=2/2=100%
{2}=>{3,5}=2/3=67%
{2,5}=>{3}=2/3=67%
{3,5}=>{2}=2/2=100%
{2,3}=>{5}=2/2=100%
{3}=>{2,5}=2/3=67%
{5}=>{2,3}=2/3=67%
{2}=>{5}=3/3=100%
{5}=>{2}=3/3=100%
{2}=>{3}=2/3=67%
{3}=>{2}=2/3=67%
Thus, eliminating all the rules having confidence less than 70%, we obtain the following
conclusions:
{1}=>{3} , {3,5}=>{2} , {2,3}=>{5} , {2}=>{5}, {5}=>{2}.
As we see, there are 5 rules generated manually, and these are to be checked against
the results in WEKA. In order to check the results in the tool we need to follow a procedure
similar to Apriori.
So first create a CSV file for the above problem; the CSV file for the above problem will look like the rows
and columns in the above figure. This file is created in an Excel sheet.
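A minimal sketch of the same computation with the mlxtend library (this assumes mlxtend is installed; it is an alternative check alongside the WEKA run, not part of the manual's procedure):
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [[1, 3, 4], [2, 3, 5], [1, 2, 3, 5], [2, 5]]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# A minimum support count of 2 out of 4 transactions corresponds to min_support = 0.5
frequent_itemsets = fpgrowth(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.7)
print(frequent_itemsets)
print(rules[["antecedents", "consequents", "confidence"]])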
Decision tree learning is one of the most widely used and practical methods for inductive
inference over supervised data. It represents a procedure for classifying categorical data based on
attributes. This representation of acquired knowledge in tree form is intuitive and easy for
humans to assimilate.
ILLUSTRATION:
Build a decision tree for the following data
The entropy is a measure of the uncertainty associated with a random variable. As uncertainty
increases, so does entropy; for a two-class problem its value ranges from 0 to 1.
Entropy(D) = − Σ pi log2(pi), where pi is the proportion of tuples in D belonging to class i.
Information gain is used as the attribute selection measure; we pick the attribute having the highest
information gain. The gain is calculated by:
Gain(D, A) = Entropy(D) − Σ (|Dj| / |D|) × Entropy(Dj)
where D is a given data partition, A is an attribute, and v is the number of distinct values of A.
If we partition the tuples in D on attribute A having v distinct values, D is
split into v partitions or subsets (D1, D2, ..., Dv), where Dj contains those tuples in D that have
outcome aj of A; the sum runs over j = 1, ..., v.
Class P: buys_computer = "yes" (9 tuples)
Class N: buys_computer = "no" (5 tuples)
Entropy(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Gain(age, D) = Entropy(D) − [5/14 × Entropy(Syouth) + 4/14 × Entropy(Smiddle-aged) + 5/14 × Entropy(Ssenior)]
             = 0.940 − 0.694
             = 0.246
[Partial rows of the training data table (income, student, credit_rating, buys_computer):
(high, no, fair, no), (high, no, excellent, no), (medium, no, fair, no).]
Similarly, Gain(income) = 0.029, Gain(student) = 0.151,
and Gain(credit_rating) = 0.048.
Gain(age) is the highest, so age is selected as the splitting attribute at the root.
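A minimal sketch that recomputes these gains from the class counts of the buys_computer data (the per-partition counts below are taken from that standard 14-tuple example):
import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain(parent_counts, partitions):
    total = sum(parent_counts)
    remainder = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - remainder

D = [9, 5]                                   # 9 "yes" and 5 "no" tuples
age_partitions = [[2, 3], [4, 0], [3, 2]]    # youth, middle-aged, senior (yes, no)
student_partitions = [[6, 1], [3, 4]]        # student = yes, student = no
print(round(gain(D, age_partitions), 3))     # 0.247 (0.246 with the rounded intermediate values above)
print(round(gain(D, student_partitions), 3)) # 0.152 (0.151 with the textbook's rounding)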
A decision tree for the concept buys_computer, indicating whether a customer at AllElectronics
is likely to purchase a computer. Each internal (non-leaf) node represents a test on an attribute.
Each leaf node represents a class (either buys_computer = "yes" or buys_computer = "no").
First create a CSV file for the above problem; the CSV file for the above problem will look like the
rows and columns in the above figure. This file is created in an Excel sheet.
Step 2:
Now select the Classify tab in the tool and click on the Start button; then we can see the result of
the problem as below.
Step 3:
Check the result which we got manually against the result in Weka by right-clicking on
the result and visualizing the tree.
Information gain (IG) measures how much "information" a feature gives us about the class.
Features that perfectly partition the data should give maximal information, while unrelated features
should give no information; IG measures the reduction in entropy. CfsSubsetEval aims to identify a
subset of attributes that are highly correlated with the target while not being strongly correlated
with one another. It searches through the space of possible attribute subsets for the "best" one
using the BestFirst search method by default, although other methods can be chosen. To use a
wrapper method rather than a filter method such as CfsSubsetEval, first select
WrapperSubsetEval and then configure it by choosing a learning algorithm to apply and setting
the number of cross-validation folds to use when evaluating it on each attribute subset.
Steps:
Example:
AGE   INCOME   STUDENT   CREDIT_RATING   BUYS_COMPUTER
[The 14 training tuples of the buys_computer data set; the table rows did not survive extraction.]
CLASS:
C1: buys_computer = 'yes'
C2: buys_computer = 'no'
DATA TO BE CLASSIFIED:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P (buys_computer="yes") = 9/14 = 0.643
P (buys_computer="no") = 5/14 = 0.357
1. P( age=”<=30” |buys_computer=”yes”)=2/9
2. P(age=”<=30”|buys_computer=”no”)=3/5
3. P(income=”medium”|buys_computer=”yes”)=4/9
4. P(income=”medium”|buys_computer=”no”)=2/5
5. P(student=”yes”|buys_computer=”yes”)=6/9
6. P(student=”yes” |buys_computer=”no”)=1/5=0.2
7. P(credit_rating=”fair ”|buys_computer=”yes”)=6/9
8. P(credit_rating=”fair” |buys_computer=”no”)=2/5
P(X/C1): P(X/buys_computer="yes") = 2/9 × 4/9 × 6/9 × 6/9 = 32/729 ≈ 0.044
P(X/C2): P(X/buys_computer="no") = 3/5 × 2/5 × 1/5 × 2/5 = 12/625 ≈ 0.019
P(C1/X) = P(X/C1) × P(C1):
P(X/buys_computer="yes") × P(buys_computer="yes") = (32/729) × (9/14) = 32/1134 ≈ 0.028
P(C2/X) = P(X/C2) × P(C2):
P(X/buys_computer="no") × P(buys_computer="no") = (12/625) × (5/14) ≈ 0.007
Since P(C1/X) > P(C2/X), X is classified as buys_computer = "yes".
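A minimal sketch that reproduces this hand calculation (the priors and likelihoods are the values listed above):
import math

priors = {"yes": 9 / 14, "no": 5 / 14}
likelihoods = {   # P(attribute value of X | class) for age<=30, income=medium, student=yes, credit=fair
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}
scores = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}
print(scores)                        # roughly {'yes': 0.028, 'no': 0.007}
print(max(scores, key=scores.get))   # 'yes' -> X is classified as buys_computer = "yes"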
Step 1:
Create a CSV file with the table considered in the example. The file will look as shown below:
Step 2:
Now open the Weka Explorer and then select all the attributes in the table.
Step 3:
Select the Classify tab in the tool, choose the bayes folder and then the NaiveBayes classifier
to see the result as shown below.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Data preprocessing will be done with the help of the following script lines −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
dataset = pd.read_csv(path, names=headernames)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into train and test splits. The following code will split
the dataset into 60% training data and 40% testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
Next, data scaling will be done as follows −
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Next, train the model with the help of the KNeighborsClassifier class of sklearn as
follows −
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
At last we need to make predictions. It can be done with the help of the following
script −
y_pred = classifier.predict(X_test)
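The evaluation code that produces the output below is not reproduced in the manual; a minimal sketch, assuming the standard sklearn.metrics helpers:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))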
Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60
Accuracy: 0.8833333333333333
ILLUSTRATION:
As a simple illustration of the k-means algorithm, consider the following data set consisting of
the scores of two variables for each of five individuals.
I X1 X2
A 1 1
B 1 0
C 0 2
D 2 4
E 3 5
This data set is to be grouped into two clusters. As a first step in finding a sensible partition,
let the values of two individuals, A and C, define the initial cluster means (using the Euclidean
distance measure), giving:
Cluster 1: A (1,1)
Cluster 2: C (0,2)
The remaining individuals are now examined in sequence and allocated to the cluster to which
they are closest, in terms of Euclidean distance to the cluster mean. The mean vector is
recalculated each time a new member is added. This leads to the following series of steps:
Individual   Distance to Cluster 1 mean (1,1)   Distance to Cluster 2 mean (0,2)
A            0                                  1.4
B            1                                  2.2
C            1.4                                0
D            3.2                                2.8
E            4.5                                4.2
The initial partition has changed, and the two clusters at this stage have the
following characteristics:
Cluster 1: {A, B}, mean vector (1.0, 0.5)
Cluster 2: {C, D, E}, mean vector (1.7, 3.7)
But we cannot yet be sure that each individual has been assigned to the right cluster. So, we
compare each individual's distance to its own cluster mean and to that of the opposite cluster,
and we find:
Individual   Distance to Cluster 1 mean   Distance to Cluster 2 mean
A            0.5                          2.7
B            0.5                          3.7
C            1.8                          2.4
D            3.6                          0.5
E            4.9                          1.9
Individual C is now relocated to Cluster 1 because its distance to the Cluster 1 mean (1.8) is
smaller than its distance to the Cluster 2 mean (2.4).
The iterative relocation would now continue from this new partition until no more relocations
occur. In this example each individual is now nearer its own cluster mean than that of
the other cluster, so the iteration stops, and the latest partitioning is taken as the final cluster
solution.
Also, it is possible that the k-means algorithm does not settle on a final solution. In this case, it is
a better idea to stop the algorithm after a pre-chosen maximum number of
iterations.
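A minimal sketch that runs k-means on the same five points with scikit-learn (the random_state value is arbitrary):
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 0], [0, 2], [2, 4], [3, 5]])   # A, B, C, D, E
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster assignment of A..E
print(kmeans.cluster_centers_)   # final cluster means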
Checking the solution in Weka:
In order to check the result in the tool we need to follow a procedure.
Step 1:
Create a CSV file with the table considered in the example. The CSV file will look as shown
below:
Step 2:
Now open the Weka Explorer and then select all the attributes in the table.
Step 3:
Select the Cluster tab in the tool and choose the SimpleKMeans technique to
see the result as shown below.
Algorithm
The BIRCH algorithm builds a tree structure over the given data called the Clustering
Feature tree (CF tree). The algorithm is based on the CF (clustering feature) tree. In
addition, it uses a tree-structured summary to create clusters.
In the context of the CF tree, the algorithm compresses the data into sets of CF nodes. Nodes
that have several sub-clusters can be called CF subclusters, and these CF subclusters are
situated in non-terminal CF nodes.
The CF tree is a height-balanced tree that gathers and manages clustering features and holds
the necessary information about the given data for further hierarchical clustering. This prevents the
need to work with the whole data given as input. A cluster of data points is represented as a CF
by three numbers (N, LS, SS), where N is the number of points, LS is their linear sum, and SS
is the sum of their squared values.
There are mainly four phases in the BIRCH algorithm:
o Scanning the data and building the CF tree.
o Condensing (resizing) the data.
o Global clustering.
o Refining clusters.
Two of them (condensing the data and refining the clusters) are optional; they come
into the process when more clarity is required. Scanning the data is just like loading data into
a model: after loading the data, the algorithm scans the whole data set and fits it into the CF
tree.
In condensing, it resets and resizes the data for better fitting into the CF tree. In global
clustering, it sends the CF tree for clustering using an existing clustering algorithm. Finally,
refining fixes the problem of CF trees where the same valued points are assigned to
different leaf nodes.
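A minimal sketch of BIRCH with scikit-learn (the synthetic data and parameter values are arbitrary illustrations, not part of the manual's procedure):
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Three hypothetical Gaussian blobs in two dimensions
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])

model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))   # number of points assigned to each of the 3 clusters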
Algorithm:
1. Choose k number of random points from the data and assign these k points to k number of clusters. These are the
initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign it to the cluster with the
nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the medoids)
4. Select a random non-medoid point as the new medoid and swap it with one of the previous medoids. Repeat steps 2 and 3.
5. If the total cost with the new medoid is less than that with the previous medoid, make the new medoid permanent and
repeat step 4.
6. If the total cost with the new medoid is greater than the cost with the previous medoid, undo the swap and repeat step 4.
7. The repetitions continue until no swap reduces the cost, i.e., the assignment of data points no longer changes.
Data set:
x y
0 5 4
1 7 7
2 1 3
3 8 6
4 4 9
Scatter plot: [scatter plot of the five data points]
Iteration 1: the initial medoids are M1 = (1, 3) (point 2) and M2 = (4, 9) (point 4). The distances
below are Manhattan distances to each medoid.
i   x   y   Dist. to M1 (1,3)   Dist. to M2 (4,9)
0   5   4   5                   6
1   7   7   10                  5
2   1   3   -                   -
3   8   6   10                  7
4   4   9   -                   -
Cluster 1: 0
Cluster 2: 1, 3
Total cost = 5 + 5 + 7 = 17
Iteration 2: swap medoid M1 with point 0, giving medoids M1 = (5, 4) and M2 = (4, 9).
i   x   y   Dist. to M1 (5,4)   Dist. to M2 (4,9)
0   5   4   -                   -
1   7   7   5                   5
2   1   3   5                   9
3   8   6   5                   7
4   4   9   -                   -
Cluster 1: 2, 3
Cluster 2: 1
Total cost = 5 + 5 + 5 = 15 < 17, so the swap is kept.
Iteration 3: swap medoid M2 with point 1, giving medoids M1 = (5, 4) and M2 = (7, 7).
i   x   y   Dist. to M1 (5,4)   Dist. to M2 (7,7)
0   5   4   -                   -
1   7   7   -                   -
2   1   3   5                   10
3   8   6   5                   2
4   4   9   6                   5
Cluster 1: 2
Cluster 2: 3, 4
Total cost = 5 + 2 + 5 = 12 < 15, so the swap is kept.
A further trial swap (with the remaining points as medoids) gives the assignment below; its total
cost (5 + 10 + 5 = 20) is higher than 12, so the swap is undone and the clustering of the previous
step is the final result.
Cluster 1: 4
Cluster 2: 0, 2
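A brute-force Python check of the hand computation above: instead of iterative swaps it simply scores every candidate pair of medoids with the same Manhattan-distance cost (meant only to verify the result, not as an efficient k-medoids implementation):
from itertools import combinations

points = [(5, 4), (7, 7), (1, 3), (8, 6), (4, 9)]

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def total_cost(medoids):
    # Sum of each point's distance to its nearest medoid
    return sum(min(manhattan(p, points[m]) for m in medoids) for p in points)

best = min(combinations(range(len(points)), 2), key=total_cost)
clusters = {m: [] for m in best}
for i, p in enumerate(points):
    clusters[min(best, key=lambda m: manhattan(p, points[m]))].append(i)
print("Medoids:", [points[m] for m in best], "with cost", total_cost(best))
print("Clusters:", clusters)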
The implementation of the DBSCAN algorithm in Python does not get very complex, and we can
understand it easily. We have to follow these steps in order to implement the DBSCAN
algorithm and its logic inside a Python program:
First and foremost, we have to import all the required libraries which we have installed in
the prerequisites part so that we can use their functions while implementing the DBSCAN
algorithm.
Here, we first import all the required libraries or modules of libraries inside
the program (the imports below are the ones required by the code that follows):
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
In this step, we have to load the data, and we can do this by importing or loading the
dataset (that is required for the DBSCAN algorithm to work on) inside the program. To
load the dataset inside the program, we will use the read_csv() function of the pandas
library and print information from the dataset as we have done below:
# Dropping the CUST_ID column from the dataset with drop() function
M = M.drop('CUST_ID', axis = 1)
# Using fillna() function to handle missing values
M.fillna(method ='ffill', inplace = True)
# Printing dataset head in output
print(M.head())
Output:
BALANCE BALANCE_FREQUENCY ... PRC_FULL_PAYMENT TENURE
0 40.900749 0.818182 ... 0.000000 12
1 3202.467416 0.909091 ... 0.222222 12
2 2495.148862 1.000000 ... 0.000000 12
3 1666.670542 0.636364 ... 0.000000 12
4 817.714335 1.000000 ... 0.000000 12
[5 rows x 17 columns]
The data shown in the output above will be printed when we run the program, and we will
work on this data from the dataset file we loaded.
Now, we will start preprocessing the data of the dataset in this step by using the functions of the
preprocessing module of the Sklearn library. We have to use the following technique while
preprocessing the data with Sklearn library functions:
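The scaling and normalization code itself is not reproduced in the manual; a minimal sketch, assuming standard scaling followed by row normalization (which is what the later PCA step operates on):
# Scaling the data so that every feature has zero mean and unit variance
scaler = StandardScaler()
M_scaled = scaler.fit_transform(M)

# Normalizing the scaled data so that every row has unit norm
M_normalized = normalize(M_scaled)
M_normalized = pd.DataFrame(M_normalized, columns=M.columns)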
In this step, we will reduce the dimensionality of the scaled and normalized data so
that the data can be visualized easily inside the program. We have to use the PCA function
in the following way in order to transform the data and reduce its dimensionality:
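The PCA code is likewise not shown; a minimal sketch that yields a two-column frame named C1 and C2, matching the output below:
# Reducing the data to two principal components for visualization
pca = PCA(n_components=2)
M_pca = pca.fit_transform(M_normalized)
M_pca = pd.DataFrame(M_pca, columns=['C1', 'C2'])
print(M_pca.head())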
Output:
C1 C2
0 -0.489949 -0.679976
1 -0.519099 0.544828
2 0.330633 0.268877
3 -0.481656 -0.097610
4 -0.563512 -0.482506
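The manual stops before the clustering step itself; a minimal sketch of how the DBSCAN model could then be fitted and visualized (the eps and min_samples values are arbitrary choices, not values from the manual):
# Fitting DBSCAN on the two-dimensional data
db = DBSCAN(eps=0.05, min_samples=5).fit(M_pca)
labels = db.labels_   # cluster label of each point; -1 marks noise

# Visualizing the resulting clusters
plt.scatter(M_pca['C1'], M_pca['C2'], c=labels, cmap='viridis', s=10)
plt.xlabel('C1')
plt.ylabel('C2')
plt.show()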