A fast DBSCAN algorithm for big data based on efficient density calculation
Nooshin Hanafi, Hamid Saadatfar *
Computer Engineering Department, University of Birjand, Birjand, Iran
ARTICLE INFO

Keywords: Data Mining; Clustering; Big Data; DBSCAN Algorithm

ABSTRACT

Today, data is being generated at high speed, and managing such large volumes of data has become a challenge in the current age. Clustering is a method for analyzing the data that is generated on the Internet. Various approaches have been presented for data clustering so far. Among them, DBSCAN is one of the most well-known density-based clustering algorithms. This algorithm can detect clusters of different shapes and does not require prior knowledge about the number of clusters. A major part of the DBSCAN run-time is spent on calculating the distances between data points in order to find the neighbors of each sample in the dataset. The time complexity of this algorithm is O(n²); therefore, it is not suitable for processing big datasets.
In this paper, DBSCAN is improved so that it can be applied to big datasets. The proposed method accurately calculates the density of each sample based on a reduced set of data. This reduced set is called the operational set, and it is updated periodically. The use of local samples to calculate the density greatly reduces the computational cost of clustering. The empirical results on various datasets of different sizes and dimensions show that the proposed algorithm increases the clustering speed compared to recent related works while having similar accuracy to the original DBSCAN algorithm.
* Corresponding author.
E-mail addresses: [email protected] (N. Hanafi), [email protected] (H. Saadatfar).
https://doi.org/10.1016/j.eswa.2022.117501
Received 27 December 2021; Received in revised form 10 April 2022; Accepted 1 May 2022
Available online 6 May 2022
0957-4174/© 2022 Elsevier Ltd. All rights reserved.
The method proposed in this paper is classified as a computation reduction method and can be applied to big datasets.
This paper is structured as follows: Section 2 reviews the previous studies in the context of big data clustering; the pros and cons of the previous methods are also examined in this section. Section 3 presents the proposed method for big data clustering; the algorithm is described and the related definitions are given. Section 4 presents the experimental setup, including the datasets used in the experiments, the evaluation conditions, and the qualitative evaluation metrics. Section 5 presents the evaluation results of the proposed method; at the end of this section, the proposed method is compared with other recent studies. Finally, the paper is concluded and some suggestions are given for future studies.

2. Literature review

With the development of the web, social networks, and mobile phones, more data exists than ever before, and it is growing every day (Zerhari, Lahcen, & Mouline, 2015). Clustering is a tool used for big data analysis. The traditional clustering techniques cannot handle this large volume of data due to their high complexity and computational costs (Mahesh, 2020). Therefore, the main purpose of the studies reviewed in recent years has been to increase the clustering speed. The DBSCAN algorithm (Ester, Kriegel, Sander, & Xu, 1996) is a pioneer technique in the context of density-based clustering. DBSCAN has several advantages over other classical clustering algorithms. Unlike supervised approaches (e.g., classification algorithms), clustering is an unsupervised technique that does not rely on any prior knowledge. DBSCAN is a traditional density-based clustering method that makes it possible to identify clusters of different shapes, is able to manage noise patterns in the data, and usually offers good results. However, the high time complexity of the original DBSCAN algorithm makes it inefficient for large, high-dimensional databases. Various methods have been presented in recent years in order to improve the performance of DBSCAN in handling big data.
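To make this bottleneck concrete, the sketch below shows a minimal, textbook-style DBSCAN in Python (an illustration written for this review, not any of the cited implementations): the region query scans the whole dataset for every point, which is exactly the O(n²) distance computation that the methods reviewed below try to avoid. The array X and the parameters eps and min_pts are placeholders.

```python
import numpy as np

def region_query(X, i, eps):
    # O(n) scan: distance from point i to every point in the dataset.
    return np.flatnonzero(np.linalg.norm(X - X[i], axis=1) <= eps)

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -1)           # -1 means noise / not yet assigned
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        neighbors = region_query(X, i, eps)
        if len(neighbors) < min_pts:  # not a core point (may still join a cluster as a border point)
            continue
        cluster_id += 1
        labels[i] = cluster_id
        stack = list(neighbors)
        while stack:                  # expand the cluster from the core point
            j = stack.pop()
            if labels[j] == -1:
                labels[j] = cluster_id
                j_neighbors = region_query(X, j, eps)   # another O(n) scan per point
                if len(j_neighbors) >= min_pts:
                    stack.extend(j_neighbors)
    return labels
```

With n samples this performs n region queries of O(n) cost each, i.e., O(n²) distance computations in total, which motivates the grid-, tree-, and partition-based variants surveyed next.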
In general, the fast clustering techniques presented for big data can be divided into two main groups (Shirkhorshidi, Aghabozorgi, Wah, & Herawan, 2014): single machine clustering techniques and multiple machine clustering techniques (parallel clustering algorithms). In the following, the papers and studies in the context of these two groups are reviewed. Considering these classes, the proposed method is a single machine clustering method, so this class of methods is studied in more detail.

2.1. Single machine clustering

This class of algorithms is implemented on a single machine and employs the computational power of one machine (Shirkhorshidi, Aghabozorgi, Wah, & Herawan, 2014). These algorithms are based on two main approaches: data reduction techniques (reducing the number of samples or dimensions), and techniques that reduce the computations by approximating them or by optimizing the algorithm itself. Some of these algorithms are described below.
The AnyDBC algorithm was presented in 2016 (Mai, Assent, & Storgaard, 2016). In this study, a novel approach, called anytime, has been presented to address the run-time of the DBSCAN algorithm; it reduces the number of range queries and decreases the label propagation time of DBSCAN. The AnyDBC algorithm compresses the data into smaller density-connected subsets, called primitive clusters, and labels the objects based on the connected components of the primitive clusters in order to reduce the label propagation time. Also, AnyDBC learns the current cluster structure of the data iteratively and actively instead of making range queries for all objects; it selects some of the best samples to refine the clusters in each iteration. Therefore, the number of queries is decreased significantly compared to the DBSCAN algorithm, and the clustering quality of the DBSCAN algorithm is preserved.
Brown et al. (Brown, Japa, & Shi, 2019) presented a method aiming to increase the processing speed for big datasets. This method reduces the number of computations by using the grid concept and comprises three phases. In the first phase of this algorithm, the feature space of the dataset is divided into a grid structure such that every data point is located in the grid; the grid size is considered as an input parameter. Then, the grid cell that each data point belongs to is determined and the density of each cell is calculated separately. In the second phase, the densest neighbor of each cell is specified. Finally, in the third phase, a chain of densest neighbors is formed to constitute a cluster. In this method, a large amount of time is spent on finding the densest neighbor of each cell, and the clustering quality on some datasets is reduced.
Hahsler et al. (Hahsler & Bolaños, 2016) proposed a method, called DBSTREAM, to cluster data streams. A data stream is an ordered and infinite sequence of data points. Since permanent storage of all data in the stream and frequent access to them are impossible, and the shape and position of the clusters in the stream change over time, clustering algorithms specific to data streams are required. Most data stream clustering algorithms have an online and an offline phase. In the online phase, the data stream is summarized into a large number of micro-clusters in real time. Micro-clusters represent sets of similar data points, and they are usually represented by a cluster center together with information such as data density and dispersion. Each new data point that enters the system is allocated to the nearest micro-cluster according to a similarity function, and if it cannot be assigned to any existing micro-cluster, a new micro-cluster is created. In the offline phase, by considering the centers of the micro-clusters as the input points, a clustering algorithm is used to cluster the micro-clusters again. The distinction of this paper from previous work is that it considers the data density in the area between the micro-clusters and employs a shared density graph. Using shared density improves the clustering quality compared to other data stream clustering methods.
The G-DBSCAN algorithm (Kumar & Reddy, 2016) employs the Groups concept to speed up the nearest-neighbor search process. The Groups concept builds a distinct graph-based structure on the data such that each vertex represents a group; there is an edge between two groups that are reachable from each other, and samples that are close to each other are merged into one group. In this algorithm, each data sample in the dataset is classified as master or slave. G-DBSCAN is implemented in two phases: the Groups index is first constructed for a fast epsilon-neighborhood operation, and DBSCAN is then run on top of it. Improper values of the parameter used for constructing this hierarchical index reduce the performance compared to a real-time implementation.
Another algorithm, called K-DBSCAN, was presented in 2020 (Gholizadeh, Saadatfar, & Hanafi, 2020). This algorithm comprises three general steps. In the first step, the K-means++ algorithm is applied to the whole dataset in order to divide the data into smaller parts, where each part is called a group. In the second step, DBSCAN is applied to each K-means++ group independently. Dividing the data into smaller groups and applying DBSCAN to each group separately reduces the computations required to measure the distances from other points when deciding whether a point is a core point. In the third step, the clusters created in different groups are integrated; in other words, this step examines whether there are DBSCAN clusters in adjacent K-means++ groups that should be merged. To reduce the computations resulting from finding the clusters that can be merged, two pruning rules are presented for the cases that should be checked. The first prune examines the distance of the K-means++ groups from each other and eliminates the possibility of merging the internal clusters of groups whose distance from each other is large. The second prune examines the internal clusters of two groups to merge them or eliminate the ones that cannot be merged. Finally, to merge the selected clusters, the DBSCAN algorithm is run on the data of those clusters. One disadvantage of this method is that it ignores noise points while merging the border clusters, which reduces the quality of the clustering process.
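The partition-then-cluster idea behind K-DBSCAN can be sketched with standard library calls. The snippet below is only an illustration of the general scheme (grouping with K-means++ and clustering each group independently), not the authors' implementation, and it omits the merging and pruning steps described above; n_groups, eps and min_pts are illustrative values.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

def grouped_dbscan(X, n_groups=8, eps=0.05, min_pts=10):
    # Step 1: split the data into smaller groups with K-means++ (sklearn's default init).
    groups = KMeans(n_clusters=n_groups, init="k-means++", n_init=10).fit_predict(X)
    labels = np.full(len(X), -1)
    next_id = 0
    # Step 2: run DBSCAN inside each group; distances are only computed within a group.
    for g in range(n_groups):
        idx = np.flatnonzero(groups == g)
        local = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X[idx])
        keep = local != -1                      # leave noise labeled as -1
        labels[idx[keep]] = local[keep] + next_id
        if keep.any():
            next_id += local.max() + 1
    # Step 3 (omitted here): merge clusters that cross group boundaries.
    return labels
```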
In (Gunawan & Berg, 2013), a method has been presented that creates a grid structure on the data and assigns each sample to a grid cell whose side length is ε/√2. If a cell contains at least MinPts samples, all samples of that cell are identified as core points (because the maximum distance between two points in a cell is ε). If a cell contains fewer than MinPts samples, then instead of calculating the distance from all points of the database, it is sufficient to calculate the distance of each of its samples from the data in at most 21 neighboring cells. This technique reduces the computational cost, but it can only be applied to a 2D data space.
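A minimal sketch of this grid idea follows (an illustration written for this review, not the original implementation, and assuming a 2D NumPy array X): points are hashed into square cells of side ε/√2, so a cell with at least MinPts points consists solely of core points, and candidate ε-neighbors of any point can only lie in a small fixed block of surrounding cells.

```python
import numpy as np
from collections import defaultdict

def build_grid(X, eps):
    side = eps / np.sqrt(2.0)                   # cell diagonal equals eps
    cells = defaultdict(list)
    for i, p in enumerate(X):                   # assumes 2D data, as in the cited method
        cells[tuple((p // side).astype(int))].append(i)
    return cells, side

def grid_region_query(X, i, eps, cells, side):
    ci, cj = (X[i] // side).astype(int)
    out = []
    # Points within eps can only lie in cells whose index differs by at most 2 per axis;
    # the four extreme corner cells of that 5x5 block can never qualify -> 21 cells.
    for di in range(-2, 3):
        for dj in range(-2, 3):
            if abs(di) == 2 and abs(dj) == 2:
                continue
            for j in cells.get((ci + di, cj + dj), []):
                if np.linalg.norm(X[j] - X[i]) <= eps:
                    out.append(j)
    return out
```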
The HCA-DBSCAN algorithm (Mathur, Mehta, & Singh, 2019) employs a grid-based approach in which all points of a grid region lie in an area with a radius of epsilon. Therefore, if one of the points in such an area belongs to a specific cluster, the other points of the area also belong to that cluster. This key feature is used to achieve a significant computational speed-up compared to other improvements of the DBSCAN algorithm. Finally, by specifying representative points and using a layering concept, the grid is scanned in depth and thus the required computations are reduced. Among the advantages of this method is the reduction of the clustering run-time while preserving the clustering quality and accuracy. However, one disadvantage is that its time complexity for low-dimensional data is O(nlogn), while for high-dimensional data it increases to O(n^(3/2)); in other words, its run-time grows for higher-dimensional data.
In (Li, 2020), an improved DBSCAN algorithm based on neighbor similarity has been presented. Since the time-consuming part of the DBSCAN algorithm is finding the neighbors at the distance ε of each data point, this paper employs the cover tree to retrieve the neighbors of each data point in parallel. It also employs the triangle inequality to filter out many of the unnecessary distance calculations, which significantly reduces the distance computations in the clustering process. This idea accelerates the original DBSCAN algorithm to a great extent while its results remain accurate. The algorithm comprises five phases. In the first phase, a hierarchical cover tree is created for the dataset. In the second phase, the neighbor lists of all data points are initialized with null values and the unprocessed data are marked with −2. In the third phase, a number of the unprocessed data points are selected, a query tree is developed, and the cover tree is used to retrieve the nearest neighbors of each data point; based on the theorems presented in the paper, a part of the outliers and the core points are identified without unnecessary search, and this operation continues until all data points are identified. In the fourth phase, all core points are integrated and the different clusters are formed. Finally, in the fifth phase, the border points are allocated to the nearest cluster.
The NQ-DBSCAN algorithm (Chen, et al., 2018) is a novel local search technique that efficiently and accurately removes unnecessary distance calculations. This algorithm selects a random point p from the dataset that has not yet been clustered. If the point p is identified as a core point, it is expanded to find all density-reachable points, and all these points are considered to belong to one cluster. Two theorems are presented to find the non-core points faster. Also, to find the core points, only the area within the distance 2*ε is examined instead of the whole dataset when finding the ε-neighborhood of a point, which reduces the computation cost. The indexing method used in this paper is a hierarchical quadtree, which cannot be applied to distributed, high-dimensional datasets.
BLOCK-DBSCAN (Chen, et al., 2021), presented in 2021, obtains correct clusters for big data, even high-dimensional data, effectively and quickly. In this method, additional distance calculations are filtered out using two techniques. The first technique employs the ε/2 method to retrieve the inner core blocks faster; the second technique employs a fast and approximate method to find out whether two inner core blocks are density-reachable or not. In addition to these two techniques, a cover tree is used in the range query algorithm to speed up the density calculation process. The advantage of the BLOCK-DBSCAN method is that it outperforms the NQ-DBSCAN and AnyDBC methods. The time complexity of BLOCK-DBSCAN is O(n²) in the worst case and O(nlogn) in the average case.
The h-DBSCAN algorithm (Weng, Gou, & Fan, 2021) presents a simple and fast method to improve the efficiency of the DBSCAN algorithm. This method reduces the execution time in two ways: the first is to reduce the number of points presented to DBSCAN, and the second is to apply the HNSW technique instead of a linear search structure.
In (Sanchez, Castillo, Castro, & Melin, 2014), a method for finding fuzzy information granules from multivariate data through a gravitational-inspired clustering algorithm is proposed. The algorithm, named FGGCA, incorporates the theory of granular computing and adapts the cluster size with respect to the context of the given data. FGGCA is an unsupervised clustering algorithm, mainly because the algorithm finds and suggests the number of clusters itself.
IT2FPCM (Rubio, et al., 2017) is an extension of the Fuzzy Possibilistic C-Means (FPCM) algorithm based on Type-2 Fuzzy Logic concepts, which improves the efficiency of the FPCM algorithm, enhances its ability to handle uncertainty, and makes it less susceptible to noise. The parameters used in that article, however, are not the optimal ones.
BIRCHSCAN (de Moura Ventorim, Luchi, Loureiros Rodrigues, & Miguel Varejão, 2021) presents a new method to apply DBSCAN to a reduced set of elements in order to cluster the entire dataset. This method samples from large datasets to obtain an approximation of the clustering solution of the DBSCAN algorithm. BIRCHSCAN consists of four steps: the initial step clusters the data using BIRCH; the second step generates the sample by obtaining the centroids of the elements selected in the previous step; the third step runs the DBSCAN algorithm over this sample; and the last step clusters the entire dataset. The main difficulty of applying BIRCHSCAN concerns the definition of the parameter δ, which generates the threshold value used in BIRCH; this parameter directly influences the sampling and is not intuitively defined.

2.2. Clustering based on multiple machines

This class includes the algorithms that are implemented on multiple machines and employ the computational power and resources of multiple machines. Parallel clustering and MapReduce-based clustering are the two main techniques of this class.
Sinha et al. (Sinha & Jana, 2016) have presented a method in which the initial dataset is divided into smaller sections that are distributed among multiple nodes of a cluster of computers. The K-means algorithm is implemented in two phases. In the first phase, by selecting a high value for the number of clusters, the initial clusters are created by the K-means algorithm. In the second phase, the centers of the clusters obtained in the first phase are merged according to the formula presented in the paper and a predefined threshold. In other words, the distance between all cluster centers is calculated and compared with the threshold; if the distance between two or more centers is less than the threshold, they are merged with each other and form a bigger cluster. Unlike the original K-means algorithm, in which the initial centers are selected randomly, this paper employs probability sampling for specifying the cluster centers, so that better initial centers are selected and the number of convergence rounds is decreased. Among the disadvantages of this method, its high run-time compared to other methods can be mentioned.
The authors of (He, et al., 2011) have proposed a parallel DBSCAN algorithm whose procedure is divided into four steps. In the first step, the whole data record is summarized and a grid division is created. In the second step, the original DBSCAN is run on each subspace. The third step manages the cross-border issues while merging the subspaces obtained from the previous step; it finds, for each border, a list of clusters of the adjacent subspaces that should be merged. In the last step, a cluster-id mapping is first created for the whole dataset; then, the local ids are substituted by global ids for all dataset points.
Fig. 2. (a) Calculating distance of P from all points of the dataset. (b) Constituting the OP with center of P and radius of n*ε.
On the contrary, the smaller the OP size is, the lower the distance calculation cost would be. However, as the size of the OP decreases, the overhead associated with updating it increases. Therefore, there is a trade-off between the radius of the OP and its update overhead. The optimal radius of the OP is determined by measuring the run-time for various OP radiuses via trial and error. Increasing or decreasing the OP radius affects the algorithm's run-time, but it does not affect the clustering quality. The important point is that, although the size of the OP affects the run-time of the algorithm, using this set with any reasonable size greatly reduces the run-time compared to the original DBSCAN algorithm.

3.1.2. The potential dataset
The potential dataset includes the data that is too far from the samples that are currently under investigation and is thus ignored temporarily in order to reduce the calculations. In other words, the potential dataset includes all data of the initial dataset except the data existing in the OP. The name potential dataset is adopted because the data of this set has the potential to constitute another OP in the near future. The mathematical representation of the potential dataset is given in Eq. (2):

Potential Dataset = {x | (x ∈ D) && (x ∉ OP)}, or PD = D − OP    (2)

where D is the initial dataset, OP is the operational dataset, PD is the abbreviation of the potential dataset, and x is a sample in the initial dataset.
3.2. Phases of the OP-DBSCAN algorithm

Now, we describe the proposed algorithm. The proposed method is comprised of three main phases, which are described in the following.

3.2.1. First Phase: Constituting the OP and PD
In this step, the main dataset is divided into OP and PD. Similar to DBSCAN, a random data point P is selected from the initial dataset, and the distance of P from all data existing in the dataset is calculated. By calculating these distances and considering the definition of the OP, the OP and PD can be obtained from the main dataset without any additional calculations compared to the conventional DBSCAN.
Considering the parameters ε and MinPts, if the number of data points existing in the neighborhood of P is less than MinPts, the point P is labeled as noise and the algorithm moves to another random point of the database. If the data point P is identified as a core point, then, depending on the value of the parameter selected by the user as the radius of the OP, all data whose distance from P is smaller than or equal to this input parameter are considered as the OP and the rest of the data are considered as the PD. Therefore, to detect whether a data point in the OP is a core point or not (i.e., to identify the neighbors of each data point), the distances are calculated only from the OP samples instead of from the whole initial dataset, which causes a significant reduction in computation cost.
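A minimal sketch of this first phase is given below (an illustration of the description above, not the authors' code); it assumes a NumPy array X, the DBSCAN parameters eps and min_pts, and an OP radius of n*eps for a user-chosen integer n.

```python
import numpy as np

def build_op_pd(X, center_idx, eps, n):
    """Split the dataset into the operational set (OP) and the potential dataset (PD)
    around X[center_idx], using an OP radius of n*eps (PD = D - OP, cf. Eq. (2))."""
    dist = np.linalg.norm(X - X[center_idx], axis=1)   # one pass over the whole dataset
    op_idx = np.flatnonzero(dist <= n * eps)           # OP: data within n*eps of the center
    pd_idx = np.flatnonzero(dist > n * eps)            # PD: everything else
    return op_idx, pd_idx, dist

# Hypothetical usage: the center must itself be a core point, otherwise it is
# labeled as noise and another random center is drawn (see the text above).
# op_idx, pd_idx, dist = build_op_pd(X, center_idx=np.random.randint(len(X)), eps=0.05, n=10)
```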
3.2.2. Second Phase: Running the DBSCAN algorithm on the OP
After generating the OP, the DBSCAN algorithm is applied to it. In this algorithm, the samples are scanned in a depth-first manner. When the neighbors of a sample are identified and it turns out that the sample is a core point, all its neighbors are pushed onto a stack. Then, a neighbor is popped from the stack, its neighbors are found, and they are added to the stack under the same conditions. This process continues until the neighbors of all data inside the stack have been retrieved and the stack is empty. Thus, a list of the data that belongs to one cluster is identified. The ExpandCluster function in Fig. 7 shows this execution process of DBSCAN.
The important point while scanning the samples is that the distance of each scanned sample from the center of the current OP should be smaller than or equal to (n-1)*ε. In other words, only the data of the OP whose distance from the center of the OP is smaller than or equal to (n-1)*ε are scanned, and the rest of the data, which lie between (n-1)*ε and n*ε, are considered as the candid dataset, which is used to update the OP.
As shown in Fig. 3, scanning the samples between (n-1)*ε and n*ε is temporarily neglected because, considering the current OP, only a part of the neighbors of these samples (the candid set) is accessible and their other neighbors lie outside the OP. For example, in Fig. 3, the point M lies between (n-1)*ε and n*ε. As can be seen, a part of the ε-neighbors of M are outside the OP. Since only the distances to samples of the OP are calculated to identify the neighbors, the neighbors of M that are outside the OP (r and q) are not accessible (see Fig. 4).
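A sketch of this second phase follows; it reuses build_op_pd from the previous snippet and is only an illustration of the stack-based expansion restricted to the OP (the actual ExpandCluster of Fig. 7 may differ). Points of the candid ring, between (n-1)*eps and n*eps from the OP center, are labeled when reached but never expanded themselves.

```python
import numpy as np

def expand_within_op(X, op_idx, dist_to_center, labels, eps, min_pts, n, next_cluster_id):
    """Illustrative second phase: DBSCAN-style expansion restricted to one OP.
    Only points closer than (n-1)*eps to the OP center are scanned; points in the
    ring between (n-1)*eps and n*eps form the candid set used later for the update."""
    op_points = X[op_idx]
    inner_mask = dist_to_center[op_idx] <= (n - 1) * eps
    candid_set = op_idx[~inner_mask]

    def op_neighbors(local_i):
        # Distances are computed against the OP only, never the whole dataset.
        d = np.linalg.norm(op_points - op_points[local_i], axis=1)
        return np.flatnonzero(d <= eps)

    for local_i in np.flatnonzero(inner_mask):
        gi = op_idx[local_i]
        if labels[gi] != -1:
            continue
        seeds = op_neighbors(local_i)
        if len(seeds) < min_pts:
            continue                                   # not a core point (for now)
        labels[gi] = next_cluster_id
        stack = list(seeds)
        while stack:                                   # ExpandCluster-style loop
            lj = stack.pop()
            gj = op_idx[lj]
            if labels[gj] != -1:
                continue
            labels[gj] = next_cluster_id
            if inner_mask[lj]:                         # ring (candid) points are not expanded
                nbrs = op_neighbors(lj)
                if len(nbrs) >= min_pts:
                    stack.extend(nbrs)
        next_cluster_id += 1
    return candid_set, next_cluster_id
```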
Theorem: In the OP, all the ε-neighbors of any sample whose distance from the center of the OP is less than or equal to (n-1)*ε are accessible.
Proof: According to Fig. 3, let the point A be on a circle of radius (n-1)*ε centered at P. The circle OP has center P and radius n*ε, and the circle OP-1 has center P and radius (n-1)*ε; the difference between the radiuses of the two circles (OP and OP-1) is ε. The ε-neighborhood of the point A consists of all samples in a circle with center A and radius ε, and this circle is internally tangent to OP; therefore, all ε-neighbors of A are inside the OP circle. However, if a point is farther than (n-1)*ε from the center P (i.e., it lies between (n-1)*ε and n*ε), some of its ε-neighbors may fall outside the OP circle. If a point is at a distance smaller than (n-1)*ε from the central data point P (it is inside the OP-1 circle), then, in the worst case, a part of its ε-neighbors are inside the OP-1 circle and the other neighbors are inside the OP circle; since the OP-1 circle is a subset of the OP circle, all of its neighbors are inside the OP circle.
Fig. 3. Scanning the samples of the OP and constituting the candid set.

3.2.3. Third Phase: Updating the operational dataset
Calculating the distance from other samples and finding the neighbors of each data point is limited to the operational dataset that the sample belongs to. Therefore, to scan and retrieve the neighbors of the data that are not inside the current OP, the OP should be updated. Two main questions are: "when should the OP be updated?" and "which sample should be used to update the OP (as the center of the new OP)?".
On the other hand, the candid set can be represented as in Eq. (4) using the union and intersection of sets. The symbol OPn*ε denotes the OP of radius n*ε centered at P, and OP(n-1)*ε denotes the OP of radius (n-1)*ε centered at P.
If, at the time of the update, the center of the new OP is selected as a point outside the candid set and the old OP, a cluster might be sliced and the algorithm would not produce the correct number of clusters; in fact, it would identify more clusters than the real number of clusters. Therefore, to prevent cluster breakdown and to perform the clustering correctly and efficiently, the next OP center is selected among the samples of the candid set.
As seen in Fig. 5, the clusters A and B continue outside the current OP (OP1). The point P1 belongs to cluster A. In Fig. 5(a), while updating, the OP2 center is selected among the candid-set points (P1), and all accessible points of the new center adopt the cluster number of the center (cluster A). In Fig. 5(b), while updating the OP, a random point of the initial dataset (P2) is selected as the center of the second OP. As can be seen, the density-connected samples of P2 constitute an independent cluster C, while all samples of clusters A and C actually belong to the same cluster. Similarly, for clusters B and D, all samples of these two clusters belong to one cluster.
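The re-centering rule can be sketched as follows (illustrative only; preferring an already-labeled candid point is an assumption made here for clarity, not a rule stated in the text): the next OP is always centered on a point of the candid set, so the new OP overlaps the old one and already-assigned cluster ids carry over into it instead of being sliced (cf. Fig. 5).

```python
import numpy as np

def next_op_center(candid_set, labels, rng=np.random.default_rng()):
    """Pick the next OP center from the candid set (never from an arbitrary PD point)."""
    if len(candid_set) == 0:
        return None                                   # no candid points: start a fresh OP elsewhere
    labeled = candid_set[labels[candid_set] != -1]
    # Assumption for this sketch: prefer a candid point that already has a cluster label,
    # so the cluster continues into the new OP (as with P1 in Fig. 5(a)).
    pool = labeled if len(labeled) > 0 else candid_set
    return int(rng.choice(pool))
```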
3.2.3.2. Update time of the operational dataset. The operational dataset is updated when the neighbors of all samples of OP(n-1)*ε, i.e., all samples at a distance of (n-1)*ε or smaller from the OP center, have been retrieved.
After updating and obtaining the new OP, this set obviously overlaps with the previous OP (as shown in Fig. 6). Therefore, there are data points in the new OP that have been scanned before. Eventually, the OP is updated provided that inequality (5) is satisfied.
Fig. 5a. Using the samples of the candid set as the OP center while updating.
Fig. 5b. Using a random point of the initial dataset as the center of another OP while updating (instead of using the candid set).
VOP > M − VB    (5)

In Eq. (5), VOP is a variable that counts the data scanned in the current OP, M is the total number of samples at a distance of (n-1)*ε or smaller from the OP center, and VB is the number of samples that had already been scanned in the previous OP. The sum of VOP and VB is always equal to M. The pseudo-code of the OP-DBSCAN algorithm is shown in Fig. 7.

3.3. Time complexity analysis of the proposed algorithm

The DBSCAN algorithm calculates the distance of each sample from all samples of the main dataset; therefore, its time complexity is O(n²). In the OP-DBSCAN algorithm, however, the distance of each sample of an OP is calculated only from the other members of the same OP. On the other hand, this algorithm has the overhead of updating the OP. Therefore, the time complexity of the proposed algorithm is as given in Eq. (6):

O(nk + nu), k + u ≪ n    (6)

Since in each OP the distance of each data point from all other data of the same OP is calculated, the time complexity of this part is n*k, where k is the size of the OP. While updating the OP and constituting new OPs, the distance of a sample from all other samples of the main dataset is calculated; since the number of update operations is denoted by u, the time complexity of this part is n*u. Therefore, the time complexity of the OP-DBSCAN algorithm is O(nk + nu). The larger the OP is (larger k), the smaller the update overhead would be (smaller u), and vice versa: the smaller the OP size is (smaller k), the larger the update overhead would be. According to Eq. (6), the sum of k and u is much smaller than n.
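To give a feel for Eq. (6), the toy calculation below compares n² with n·k + n·u for purely illustrative (not measured) values of n, k and u:

```python
n, k, u = 1_000_000, 5_000, 200        # hypothetical dataset size, OP size, number of updates
dbscan_cost = n * n                    # O(n^2) distance computations
op_dbscan_cost = n * k + n * u         # O(nk + nu), with k + u << n
print(dbscan_cost / op_dbscan_cost)    # ~192x fewer distance computations in this example
```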
Table 1 shows the effect of changing the radius of the OP on the run-time of the OP-DBSCAN algorithm for several datasets. As can be inferred from Table 1, the radius of the OP should be neither very large nor very small, because both extremes increase the run-time of OP-DBSCAN.

4. Experiments setup

In this section, some evaluation metrics are presented first. Then, the proposed method is compared with DBSCAN (Ester, Kriegel, Sander, & Xu, 1996) and the other methods.
4.1. The datasets used in the experiments

Since the main idea of this paper is to reduce the run-time of the DBSCAN algorithm, several datasets of different dimensions and different numbers of samples are used to compare the proposed method with four other methods. The datasets used in this study are adopted from UCI and GitHub such that the required diversity in terms of the number of dimensions and the number of samples is satisfied.
Table 2 shows the characteristics of the 9 employed datasets, including the name of the dataset, the number of samples, the number of dimensions, and the epsilon and MinPts parameters for each dataset.
Considering the type of each dataset, its epsilon and MinPts are specified. When comparing the proposed algorithm with the other four algorithms, in order to have the same test conditions, the values of the epsilon and MinPts parameters are determined for each dataset based on its properties and are kept the same for all methods during the experiments (Table 2). The proposed method is then compared with the four other methods using these parameters under the same conditions.

4.2. Evaluation conditions

To compare the proposed algorithm with DBSCAN and the three other mentioned methods, each method was implemented and executed under the same conditions on a computational machine with a quad-core CPU, 40 MB disc, and 64 GB RAM. To obtain more accurate results, all unnecessary services of the operating system were disabled during the execution of the algorithms. All datasets are normalized to the range [0,1), and the OP-DBSCAN algorithm and the four other algorithms are applied to the normalized datasets. The purpose was to compare the clustering quality (in terms of the clustering evaluation metrics) and the run-time of the algorithms under the same conditions regarding hardware and common parameters. The results presented in all experiments are the average of ten different runs.

4.3. Qualitative evaluation metrics

Using proper evaluation metrics for examining the efficiency of a clustering algorithm is essential.

4.3.2. Silhouette index (Rousseeuw, 1986)

Another method used to evaluate clustering is the Silhouette index, which is represented by SI. This measure depends on the cohesion of the clusters and their separability. The SI value of each point represents its membership to its own cluster compared to the adjacent cluster. To calculate SI, we need two main concepts.
The mean distance of a point of a cluster from the other points of the same cluster: this value is represented by a(i) and calculated as in Eq. (9):

a(i) = (1/ni) Σ (l = 1..ni) d(xi, xl)    (9)

in which ni is the number of members of the ith cluster, and d(xi, xl) is the distance of the data point xi from the other data of the same cluster. a(i) can be considered as a measure of the membership of xi to its cluster: the smaller a(i) is, the higher the membership of the point to its cluster.
The mean distance of a point from the other clusters: for a point xi, its mean distance from the points of each other cluster is calculated. The cluster with the minimum mean distance to xi is called the adjacent cluster, and the mean distance of xi from the points of the adjacent cluster is represented by b(i):

b(i) = min (1 ≤ l ≤ k) (1/nl) Σ (ym ∈ cl) d(xi, ym)    (10)

where xi is a data point of the ith cluster, d(xi, ym) is the distance of xi from ym in the lth cluster, and nl is the number of members of the lth cluster. Therefore, the SI for xi is calculated using Eq. (11):

s(i) = (b(i) − a(i)) / max(a(i), b(i))    (11)

Therefore, if a(i) is smaller than b(i), SI is positive, and if b(i) is smaller than a(i), SI is negative, indicating weak clustering, because xi is more similar to the adjacent cluster than to its own cluster. Considering the above equation, SI varies between −1 and +1. Values close to 1 describe a good match between the point and its cluster compared to the adjacent cluster. If SI is close to 1 for all points of the clusters, the clustering has been performed correctly, while small SI values indicate a weak clustering.
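Eqs. (9)-(11) can be reproduced directly or obtained from scikit-learn's silhouette_samples; the short sketch below is illustrative, with placeholder arrays X and labels, and the mean over points gives a single summary value in [−1, +1].

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_per_point(X, labels):
    """s(i) of Eq. (11) for every non-noise point; noise (label -1) is excluded.
    Requires at least two clusters among the remaining points."""
    mask = labels != -1
    return silhouette_samples(X[mask], labels[mask])

# Example usage with placeholder data:
# s = silhouette_per_point(X, labels)
# print(s.mean())   # summary SI for the whole clustering
```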
Table 1
Effect of changes of radius of the OP on the run-time of the OP-DBSCAN.
Dataset | OP = 2*ε | OP = 4*ε | OP = 6*ε | OP = 8*ε | OP = 10*ε | OP = 20*ε
Table 3
Comparing the evaluation metrics for 5 algorithms on 9 different datasets.
Evaluation criteria | Algorithm | Abalone | MAGIC | Educational | Shuttle | MoCap | FARS | Skin Segmentation | News aggregator | Road Network | Cover Type | Poker hand
DB | OP-DBSCAN | 1.4170 | 1.0916 | 1.022 | 0.7693 | 0.7546 | 1.4351 | 0.8411 | 1.035 | 0.8085 | 0.9015 | 1.012
DB | DBSCAN | 2.0524 | 1.4197 | 1.054 | 1.8097 | 1.0755 | 1.7289 | – | – | – | – | –
DB | HCA-DBSCAN | 2.0560 | 1.5963 | 1.365 | 1.8602 | 1.0779 | 1.8516 | 1.4283 | 1.319 | 1.1746 | 1.132 | 1.2506
DB | K-DBSCAN | 2.0626 | 1.4795 | 1.289 | 1.8535 | 1.0783 | 1.8238 | 1.3912 | 1.218 | 1.2299 | 1.142 | 1.2381
DB | Density-grid | 2.0615 | 1.6359 | 1.519 | 1.8823 | 1.0786 | 1.8235 | 1.4681 | 1.345 | 1.2194 | 1.1415 | 1.2578
SI | OP-DBSCAN | −0.276 | −0.516 | −0.418 | −0.405 | −0.468 | −0.436 | −0.128 | −0.212 | −0.325 | −0.596 | −0.561
SI | DBSCAN | −0.638 | −0.776 | −0.430 | −0.431 | −0.645 | −0.519 | – | – | – | – | –
SI | HCA-DBSCAN | −0.651 | −0.778 | −0.437 | −0.439 | −0.646 | −0.731 | −0.148 | −0.456 | −0.541 | −0.712 | −0.832
SI | K-DBSCAN | −0.655 | −0.778 | −0.432 | −0.443 | −0.646 | −0.688 | −0.146 | −0.325 | −0.535 | −0.714 | −0.833
SI | Density-grid | −0.642 | −0.779 | −0.465 | −0.445 | −0.647 | −0.751 | −0.148 | −0.418 | −0.544 | −0.712 | −0.834
Table 4
Comparing the run-time of the proposed method and 4 other methods on 11 different datasets.
Running time (s) | Abalone | MAGIC | Educational | Shuttle | MoCap | FARS | Skin Segmentation | News aggregator | Road Network | Cover Type | Poker hand
OP-DBSCAN | 0.4197 | 7.0835 | 9.458 | 20.2553 | 56.8786 | 48.0892 | 123.3931 | 75.45 | 77.5716 | 122.1612 | 615.4721
DBSCAN | 0.3195 | 4.4409 | 9.895 | 128.28 | 3755.8 | 16950.74 | – | – | – | – | –
HCA-DBSCAN | 0.3726 | 4.0123 | 9.756 | 104.95 | 586.70 | 1193.5 | 2058.1 | 3745.8 | 3967.5 | 4108.3 | 12413.42
Density-Grid | 0.3905 | 5.5327 | 8.569 | 114.68 | 672.98 | 1627.6 | 3575.4 | 4536.7 | 4756.7 | 4825.7 | 13526.63
K-DBSCAN | 0.4537 | 6.2015 | 8.781 | 82.893 | 177.96 | 299.15 | 462.47 | 789.2 | 806.19 | 733.51 | 1789.7
Fig. 8. Comparing the run-time of 5 algorithms (DBSCAN, HCA-DBSCAN, Density-Grid DBSCAN, K-DBSCAN, and OP-DBSCAN).
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Ananthi, V., Balasubramaniam, P., & Kalaiselvi, T. (2015). A new fuzzy clustering algorithm for the segmentation of brain tumor. Soft Computing, 4859–4879.
Baranidharan, B., & Santhi, B. (2016). DUCF: Distributed load balancing unequal clustering in wireless. Applied Soft Computing, 495–506.
Berkhin, P. (2006). A survey of clustering data mining techniques. In I. J. Kogan, C. Nicholas, & M. Teboulle (Eds.), Grouping Multidimensional Data (pp. 25–71). Springer.
Brown, D., Japa, A., & Shi, Y. (2019). A fast density-grid based clustering method. Las Vegas, NV, USA: IEEE.
Chen, Y., Tang, S., Bouguila, N., Wang, C., Du, J., & Li, H. (2018). A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data. Pattern Recognition, 83, 375–387.
Chen, Y., Zhou, L., Bouguila, N., Wang, C., Chen, Y., & Du, J. (2021). BLOCK-DBSCAN: Fast clustering for large scale data. Pattern Recognition, 109, Article 107624.
Davies, D., & Bouldin, D. (1979). A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1, 224–227.
de Moura Ventorim, I., Luchi, D., Loureiros Rodrigues, A., & Miguel Varejão, F. (2021). BIRCHSCAN: A sampling method for applying DBSCAN to large datasets. Expert Systems with Applications, 184, Article 115518.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. KDD-96 Proceedings.
Feyyad, U. (1996). Data mining and knowledge discovery: Making sense out of data. IEEE Expert, 11, 20–25.
Gholizadeh, N., Saadatfar, H., & Hanafi, N. (2020). K-DBSCAN: An improved DBSCAN algorithm for big data. The Journal of Supercomputing, 77, 6214–6235.
Gunawan, A., & Berg, M. (2013). A faster algorithm for DBSCAN. Master's thesis.
Hahsler, M., & Bolaños, M. (2016). Clustering data streams based on shared density between micro-clusters. IEEE Transactions on Knowledge and Data Engineering, 28, 1449–1461.
He, Y., Tan, H., Luo, W., Mao, H., Ma, D., Feng, S., & Fan, J. (2011). MR-DBSCAN: An efficient parallel density-based clustering algorithm using MapReduce. 2011 IEEE 17th International Conference on Parallel and Distributed Systems. Tainan, Taiwan: IEEE.
Hou, J., Liu, W., E, X., & Cui, H. (2016). Towards parameter-independent data clustering and image segmentation. Pattern Recognition, 60, 25–36.
Kesavaraj, G., & Sukumaran, S. (2013). A study on classification techniques in data mining. 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT). Tiruchengode, India: IEEE.
Kumar, K., & Reddy, A. M. (2016). A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method. Pattern Recognition, 39–48.
Li, S.-S. (2020). An improved DBSCAN algorithm based on the neighbor similarity and fast nearest neighbor query. IEEE Access, 47468–47476.
Mahesh, B. (2020). Machine learning algorithms - A review. International Journal of Science and Research (IJSR), 9(1).
Mai, S., Assent, I., & Storgaard, M. (2016). AnyDBC: An efficient anytime density-based clustering algorithm for very large complex datasets (pp. 1025–1034). ACM Press.
Mathur, V., Mehta, J., & Singh, S. (2019). HCA-DBSCAN: HyperCube accelerated density based spatial clustering for applications with noise. Sets and Partitions workshop at NeurIPS 2019.
Qiao, S., Li, T., Li, H., Peng, J., & Chen, H. (2012). A new blockmodeling based hierarchical clustering algorithm for web social networks. Engineering Applications of Artificial Intelligence, 25, 640–647.
Rousseeuw, P. (1986). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 53–65.
Rubio, E., Castillo, O., Valdez, F., Melin, P., Gonzalez, I. C., & Martinez, G. (2017). An extension of the fuzzy possibilistic clustering algorithm using type-2 fuzzy logic technique. Advances in Fuzzy Systems, 2017, 1–23.
Sanchez, M., Castillo, O., Castro, J., & Melin, P. (2014). Fuzzy granular gravitational clustering algorithm for multivariate data. Information Sciences, 279, 498–511.
Shirkhorshidi, A., Aghabozorgi, S., Wah, T., & Herawan, T. (2014). Big data clustering: A review (pp. 707–720). Springer International Publishing.
Sinha, A., & Jana, P. (2016). A novel K-means based clustering algorithm for big data. 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). Jaipur, India: IEEE.
Weng, S., Gou, J., & Fan, Z. (2021). h-DBSCAN: A simple fast DBSCAN algorithm for big data. In Proceedings of The 13th Asian Conference on Machine Learning (pp. 81–96).
Zerhari, B., Lahcen, A., & Mouline, S. (2015). Big data clustering: Algorithms and challenges. International Conference on Big Data, Cloud and Applications. Tetuan, Morocco.