
Expert Systems With Applications 203 (2022) 117501


A fast DBSCAN algorithm for big data based on efficient density calculation
Nooshin Hanafi, Hamid Saadatfar *
Computer Engineering Department, University of Birjand, Birjand, Iran

A R T I C L E  I N F O

Keywords: Data Mining, Clustering, Big Data, DBSCAN Algorithm

A B S T R A C T

Today, data is being generated at high speed, and managing large volumes of data has become a challenge of the current age. Clustering is a method for analyzing the data generated on the Internet, and various approaches to data clustering have been presented so far. Among them, DBSCAN is the most well-known density-based clustering algorithm. It can detect clusters of different shapes and does not require prior knowledge about the number of clusters. A major part of the DBSCAN run-time is spent calculating the distances between data points in order to find the neighbors of each sample in the dataset. The time complexity of this algorithm is O(n²); therefore, it is not suitable for processing big datasets.
In this paper, DBSCAN is improved so that it can be applied to big datasets. The proposed method calculates each sample's density exactly, but based on a reduced set of data. This reduced set is called the operational set and is updated periodically. Using local samples to calculate the density greatly reduces the computational cost of clustering. Empirical results on various datasets of different sizes and dimensions show that the proposed algorithm increases the clustering speed compared to recent related works while having accuracy similar to that of the original DBSCAN algorithm.

1. Introduction

Today, data is a valuable asset. The volume of data generated every day is increasing dramatically, so big data and its analysis are of great importance. One of the essential issues with big data is the ability to process it. Clustering is a data mining and data analysis technique that aims to detect clusters and groups in a dataset. It is a type of data modeling that originates from statistics and mathematics (Berkhin, 2006) and is classified as an unsupervised learning method. Clustering is applied in image processing (Hou, Liu, E, & Cui, 2016), medicine (Ananthi, Balasubramaniam, & Kalaiselvi, 2015), knowledge extraction (Feyyad, 1996), analysis of web social networks (Qiao, Li, Li, Peng, & Chen, 2012), and wireless sensor networks (Baranidharan & Santhi, 2016). Clustering algorithms are classified as hierarchical, partitioning, density-based, model-based, and grid-based (Kesavaraj & Sukumaran, 2013).
In recent years, much effort has been made to improve the performance of existing algorithms to make them applicable to big data. The DBSCAN algorithm is a pioneering and well-known technique in density-based clustering (Ester, Kriegel, Sander, & Xu, 1996). It has several advantages over other classical clustering algorithms, among them the ability to detect clusters of arbitrary shapes, efficient noise detection, and automatic detection of the number of clusters. One disadvantage, however, is its high time complexity. Its low speed on big data has attracted the attention of researchers seeking to reduce its execution time. The run-time of the DBSCAN algorithm is dominated by finding the neighbors of each point to obtain its density; as the data size increases, the algorithm slows down and its run-time grows. Therefore, with the ever-increasing volume of data, the algorithm should be improved so that it can operate on big data in a reasonable time with acceptable quality. The current study aims to reduce the run-time of the DBSCAN algorithm and, to this end, breaks the dataset down into an operational and a potential dataset. The operational dataset contains the data adjacent to the sample whose density is being calculated; it is this set that is explored to find the ε-neighborhood, instead of the whole dataset. Since distance calculations are carried out only against samples in the operational dataset, the calculations required for finding the neighbors are largely pruned. After all samples of the operational dataset have been scanned, the set is updated. The proposed approach creates and updates the operational set with low computational overhead, and the update is done in such a way that the density of every sample is still calculated exactly.

* Corresponding author.
E-mail addresses: [email protected] (N. Hanafi), [email protected] (H. Saadatfar).

https://doi.org/10.1016/j.eswa.2022.117501
Received 27 December 2021; Received in revised form 10 April 2022; Accepted 1 May 2022
Available online 6 May 2022

The method proposed in this paper is classified as a computation reduction method and can be applied to big datasets.
This paper is structured as follows: Section 2 reviews previous studies in the context of big data clustering and examines the pros and cons of the previous methods. Section 3 presents the proposed method for big data clustering; the algorithm is described and the related definitions are given. Section 4 presents the experimental setup, including the datasets used in the experiments, the evaluation conditions, and the qualitative evaluation metrics. Section 5 presents the evaluation results of the proposed method and, at the end of that section, compares the proposed method with other recent studies. Finally, the paper is concluded and some suggestions are given for future work.

2. Literature review

With the development of the web, social networks, and mobile phones, there exists more data than ever before, and it is growing every day (Zerhari, Lahcen, & Mouline, 2015). Clustering is a tool used for big data analysis, but traditional clustering techniques cannot handle this large volume of data due to their high complexity and computational costs (Mahesh, 2020). Therefore, the main purpose of the studies reviewed below has been to increase clustering speed. The DBSCAN algorithm (Ester, Kriegel, Sander, & Xu, 1996) is a pioneering technique in the context of density-based clustering and has several advantages over other classical clustering algorithms. Unlike supervised approaches (e.g., classification algorithms), clustering is an unsupervised technique that does not rely on any prior knowledge. DBSCAN is a traditional density-based clustering method that makes it possible to identify clusters of different shapes while managing noise patterns in the data, and it usually offers good results. However, the high time complexity of the original DBSCAN algorithm makes it inefficient for high-dimensional databases of large volume. Various methods have been presented in recent years to improve the performance of DBSCAN in handling big data.
In general, the fast clustering techniques presented for big data can be divided into two main groups (Shirkhorshidi, Aghabozorgi, Wah, & Herawan, 2014): single machine clustering techniques and multiple machine clustering techniques (parallel clustering algorithms). In the following, the papers and studies belonging to these two groups are reviewed. Within this classification, the proposed method is a single machine clustering method, so this class of methods is studied in more detail.

2.1. Single Machine clustering

This class of algorithms is implemented on a single machine and employs the computational power of one machine (Shirkhorshidi, Aghabozorgi, Wah, & Herawan, 2014). These algorithms are based on two main approaches: data reduction techniques (reducing the number of samples or dimensions), and techniques that reduce the computations by approximating them or by optimizing the algorithm itself. Some of these algorithms are described below.
The AnyDBC algorithm was presented in 2016 (Mai, Assent, & Storgaard, 2016). In that study, a novel "anytime" approach was presented to address the run-time of DBSCAN by reducing the number of range queries and decreasing the label propagation time. AnyDBC compresses the data into smaller density-connected subsets, called primitive clusters, and labels the objects based on connected components of the primitive clusters to reduce the label propagation time. Also, AnyDBC learns the current cluster structure of the data iteratively and actively instead of issuing range queries for all objects; it selects some of the best samples to refine the clusters in each iteration. As a result, the number of queries is decreased significantly compared to the DBSCAN algorithm, and the clustering quality of DBSCAN is preserved.
Brown et al. (Brown, Japa, & Shi, 2019) presented a method aiming to increase the processing speed for big datasets. This method reduces the number of computations by using the grid concept and comprises three phases. In the first phase, the feature space of the dataset is divided into a grid structure such that every data point is located in the grid; the grid size is an input parameter. The method then determines which grid cell each data point belongs to, and the density of each cell is calculated separately. In the second phase, the densest neighbor of each cell is specified. Finally, in the third phase, a chain of densest neighbors is formed to constitute a cluster. In this method, a large amount of time is spent finding the densest neighbor of each cell, and the clustering quality on some datasets is reduced.
Hahsler et al. (Hahsler & Bolaos, 2016) proposed a method, called DBSTREAM, to cluster data streams. A data stream is an ordered and unbounded sequence of data points. Since permanently storing all data of the stream and accessing it repeatedly are impossible, and the shape and position of the clusters in the stream change over time, clustering algorithms specific to data streams are required. Most data stream clustering algorithms have an online and an offline phase. In the online phase, the data stream is summarized into a large number of micro-clusters in real time. Micro-clusters represent sets of similar data points and are usually represented by a cluster center together with information such as density and dispersion. Each new data point that enters the system is allocated to the nearest micro-cluster according to a similarity function, and if it cannot be assigned to an existing micro-cluster, a new micro-cluster is created. In the offline phase, a clustering algorithm is applied to the centers of the micro-clusters, taken as input points, to cluster the micro-clusters again. The distinction of this work from previous ones is that it considers the data density in the area between the micro-clusters and employs a shared density graph. Using shared density improves the clustering quality compared to other data stream clustering methods.
The G-DBSCAN algorithm (Kumar & Reddy, 2016) employs the Groups concept to speed up the nearest-neighbor search. The Groups concept builds a distinct graph-based structure on the data such that each vertex represents a group; there is an edge between two groups that are reachable from each other, and samples that are close to each other are merged into one group. In this algorithm, each data sample in the dataset is classified as master or slave. G-DBSCAN is implemented in two phases, in which DBSCAN uses this structure for a fast epsilon-neighborhood operation. Improper values of the parameter used for constructing the hierarchical index reduce its performance in practice.
Another algorithm, called K-DBSCAN, was presented in 2020 (Gholizadeh, Saadatfar, & Hanafi, 2020). This algorithm comprises three general steps. In the first step, the K-means++ algorithm is applied to the whole dataset, aiming to divide the data into smaller parts, where each part is called a group. In the second step, DBSCAN is applied to each K-means++ group independently. Dividing the data into smaller groups and applying DBSCAN to each group separately reduces the computations required to measure the distances from other points when deciding whether a point is a core. In the third step, the clusters created in different groups are merged; in other words, this step examines whether there are DBSCAN clusters in adjacent K-means++ groups that should be merged. To reduce the computations needed to find mergeable clusters, two pruning rules are applied to the cases that have to be checked. The first prune examines the distance between K-means++ groups and rules out merging the internal clusters of groups that are far from each other. The second prune examines the internal clusters of two groups and eliminates the pairs that cannot be merged. Finally, to merge the selected clusters, the DBSCAN algorithm is run on their data. One disadvantage of this method is that it ignores noise points while merging border clusters, which reduces the quality of the clustering.


In (Gunawan & Berg, 2013), a method was presented that creates a grid structure on the data and assigns each sample to a grid cell whose side length is ε/√2. If a cell contains at least MinPts samples, all samples of the cell are identified as core points (because the maximum distance between two points in such a cell is ε). If a cell contains fewer than MinPts samples, then instead of calculating the distances to all points of the database, it is sufficient to calculate the distance of each of its samples to the data in at most 21 neighboring cells. This technique reduces the computational cost, but it can only be applied to a 2D data space.
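As a rough, hedged illustration of this grid idea (our sketch, not the authors' procedure), the snippet below assigns 2D points to square cells of side ε/√2 and counts per-cell populations; any cell holding at least MinPts points can declare all of its points core immediately, since two points of the same cell are at most ε apart.

```python
import numpy as np
from collections import defaultdict

def grid_core_points(points, eps, min_pts):
    """points: (n, 2) array of 2D samples. Returns a boolean mask of the points
    that are declared core from their own cell's population alone."""
    side = eps / np.sqrt(2.0)                      # cell diagonal equals eps
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(int(np.floor(x / side)), int(np.floor(y / side)))].append(idx)
    is_core = np.zeros(len(points), dtype=bool)
    for members in cells.values():
        if len(members) >= min_pts:                # dense cell: every member is a core point
            is_core[members] = True
    # Points of sparse cells still require distance checks against (at most 21) neighboring cells.
    return is_core
```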
The HCA-DBSCAN algorithm (Mathur, Mehta, & Singh, 2019) employs a grid-based partitioning of the dataset such that all points of a cell lie within an area of radius epsilon. Therefore, if one of the points in such an area belongs to a specific cluster, the other points of the area also belong to that cluster. This key property is used to achieve a significant computational speed-up compared to other improvements of the DBSCAN algorithm. Finally, by specifying representative points and using a layering concept, the grid is scanned in depth, further reducing the required computations. Among the advantages of this method is a reduced clustering run-time while preserving clustering quality and accuracy. One disadvantage, however, is that although its time complexity for low-dimensional data is O(n log n), it increases to O(n^(3/2)) for high-dimensional data; in other words, its run-time grows for higher-dimensional data.
In (Li, 2020), an improved DBSCAN algorithm based on neighbor similarity was presented. Since the time-consuming part of the DBSCAN algorithm is finding the neighbors within distance ε of each data point, this work employs a cover tree to retrieve the neighbors of the data points in parallel. It also uses the triangle inequality to filter out many unnecessary distance calculations, which significantly reduces the number of distance computations in the clustering process. This idea accelerates the original DBSCAN algorithm to a great extent while its results remain accurate. The algorithm comprises five phases. In the first phase, a hierarchical cover tree is created for the dataset. In the second phase, the neighbor list of each data point is initialized to null and the unprocessed data are initialized with −2. In the third phase, a number of unprocessed data points are selected, a query tree is built, and the cover tree is used to retrieve the nearest neighbors of each point. Based on the theorems presented in the paper, a part of the outliers and the core points are identified without unnecessary search; this operation continues until all data points are identified. In the fourth phase, all core points are merged and the different clusters are formed. Finally, in the fifth phase, the border points are allocated to the nearest cluster.
The NQ-DBSCAN algorithm (Chen, et al., 2018) is a novel local search technique that efficiently removes unnecessary distance calculations while remaining exact. The algorithm selects a random unclustered point p from the dataset; if p is identified as a core point, it is expanded to find all density-reachable points, and all of these points are considered to belong to one cluster. Two theorems are presented to identify non-core points faster. Also, to find the core points, only points within distance 2*ε are examined, instead of the whole dataset, when searching the ε-neighborhood of a point, which reduces the computational cost. The indexing method used in that paper is a hierarchical quadtree, which cannot be applied to distributed, high-dimensional datasets.
BLOCK-DBSCAN (Chen, et al., 2021), presented in 2021, obtains correct clusters for big data, even high-dimensional data, effectively and quickly. In this method, additional distance calculations are filtered out using two techniques. The first technique employs the ε/2 method to retrieve inner core blocks faster; the second employs a fast, approximate method to decide whether two inner core blocks are density-reachable from each other. In addition to these two techniques, a cover tree is used in the range-query algorithm to speed up the density calculation. The advantage of BLOCK-DBSCAN is that it outperforms the NQ-DBSCAN and AnyDBC methods. The time complexity of BLOCK-DBSCAN is O(n²) in the worst case and O(n log n) in the average case.
The h-DBSCAN method (Weng, Gou, & Fan, 2021) presents a simple and fast way to improve the efficiency of the DBSCAN algorithm. It reduces the execution time in two ways: the first is to reduce the number of points presented to DBSCAN, and the second is to apply the HNSW technique instead of a linear search structure.
In (Sanchez, Castillo, Castro, & Melin, 2014), a method for finding fuzzy information granules in multivariate data through a gravitation-inspired clustering algorithm is proposed. The algorithm, named FGGCA, incorporates the theory of granular computing and adapts the cluster size to the context of the given data. FGGCA is an unsupervised clustering algorithm, mainly because it finds and suggests the number of clusters by itself.
IT2FPCM (Rubio, et al., 2017) is an extension of the Fuzzy Possibilistic C-Means (FPCM) algorithm based on Type-2 Fuzzy Logic concepts, intended to improve the efficiency of FPCM, enhance its ability to handle uncertainty, and make it less susceptible to noise. The parameters used in that article are not the optimal ones.
BIRCHSCAN (de Moura Ventorim, Luchi, Loureiros Rodrigues, & Miguel Varejão, 2021) presents a new method for applying DBSCAN to a reduced set of elements in order to cluster the entire dataset. The method samples large datasets to obtain an approximation of the DBSCAN clustering solution and consists of four steps. The initial step clusters the data using BIRCH. The second step generates the sample by taking the centroids of the elements selected in the previous step. The third step runs the DBSCAN algorithm over this sample. The last step clusters the entire dataset. The main difficulty in applying BIRCHSCAN concerns the definition of the parameter δ, which generates the threshold value used in BIRCH; it directly influences the sampling and is not intuitively defined.

2.2. Clustering based on multiple Machines

This class includes algorithms that are implemented on multiple machines and employ the computational power and resources of several machines. Parallel clustering and map-reduce based clustering are the two main techniques of this class.
Sinha et al. (Sinha & Jana, 2016) presented a method in which the initial dataset is divided into smaller sections and distributed among multiple nodes of a computer cluster. The K-means algorithm is applied in two phases. In the first phase, by selecting a high value for the number of clusters, the initial clusters are created by the K-means algorithm. In the second phase, the centers of the clusters obtained in the first phase are merged according to a formula presented in the paper and a predefined threshold: the distances between all cluster centers are calculated and compared with the threshold, and if the distance between two or more centers is less than the threshold, they are merged to form a bigger cluster. Unlike the standard K-means algorithm, in which the initial centers are selected randomly, this work employs probability sampling to specify the cluster centers, so that better initial centers are chosen and the number of convergence rounds is decreased. Among the disadvantages of this work is its high run-time compared to other methods.
The authors of (He, et al., 2011) proposed a parallel DBSCAN algorithm whose procedure is divided into four steps. In the first step, the whole data record is summarized and a grid division is created. In the second step, the original DBSCAN is run on each subspace. The third step manages the cross-border issues that arise when the subspaces obtained from the previous step are merged; it finds, for each border, the list of clusters of adjacent subspaces that should be merged. In the last step, a cluster-id mapping is first created for the whole dataset, and then the local ids are replaced by global ids for all points of the dataset.


2.3. The DBSCAN algorithm

The DBSCAN algorithm (Ester, Kriegel, Sander, & Xu, 1996) starts from a random point p of the dataset that has not been visited previously. The number of data points in the ε-neighborhood of p is determined. If the number of neighbors in the ε-neighborhood of p is less than MinPts, p is not a core point; it is labeled as noise, and the algorithm selects another random point of the dataset. Otherwise, if the number of neighbors of p in the ε-neighborhood is greater than or equal to MinPts, p is marked as a core point, a new cluster is formed, and p takes the label of this cluster.
As can be seen in the pseudo-code of the DBSCAN algorithm, the point p and its neighbors in the ε-neighborhood are added to a list N and passed to the expandCluster function for further expansion. In the expandCluster function, each neighbor in the ε-neighborhood of p, say p1, is examined; if it has not been visited before and it is a core point, its neighbors are also added to the list N for further expansion. Otherwise, if p1 is not a core point, it takes the cluster's label and is not expanded any further. This process continues until the list N is empty. Then, the DBSCAN algorithm selects another random unvisited point of the dataset and continues identifying another cluster according to the above procedure. The process terminates when all points of the database have been visited and no further cluster can be found.
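To make the procedure concrete, the following is a minimal, self-contained Python sketch of the classic DBSCAN flow described above, using a brute-force ε-range query; the function and variable names are ours (illustrative), not the pseudo-code of the original paper.

```python
import numpy as np

NOISE, UNVISITED = -1, 0

def region_query(data, i, eps):
    # Brute-force epsilon-neighborhood: indices of all points within eps of point i.
    return np.flatnonzero(np.linalg.norm(data - data[i], axis=1) <= eps)

def dbscan(data, eps, min_pts):
    labels = np.full(len(data), UNVISITED)
    cluster_id = 0
    for p in range(len(data)):
        if labels[p] != UNVISITED:
            continue
        neighbors = region_query(data, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE              # may later be relabeled as a border point
            continue
        cluster_id += 1
        labels[p] = cluster_id
        work = list(neighbors)             # the list N of the text
        while work:
            q = work.pop()
            if labels[q] == NOISE:
                labels[q] = cluster_id     # border point of the current cluster
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(data, q, eps)
            if len(q_neighbors) >= min_pts:  # q is a core point, keep expanding
                work.extend(q_neighbors)
        # the cluster is complete; the outer loop picks the next unvisited point
    return labels
```

Each call to region_query costs O(n), which is exactly the O(n²) behavior the rest of the paper sets out to avoid.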
3. The proposed Algorithm: OP-DBSCAN

The main idea in designing the proposed method is to use local data to find neighbors. Because the DBSCAN algorithm usually moves softly and locally while growing clusters, this idea can be implemented by defining a smaller set of data called the operational dataset.

3.1. Using the concept of operational dataset

As mentioned, the DBSCAN algorithm has time complexity O(n²). Its run-time is dominated by finding the neighbors of each data point in order to obtain its density. Therefore, DBSCAN requires heavy computation on big data, which reduces the clustering speed and increases the run-time. The proposed method tries to reduce the distance calculations as much as possible. To explain the proposed algorithm, the operational dataset and the potential dataset must first be described.

3.1.1. Operational dataset

To find the data samples of a cluster and extend the cluster properly, the DBSCAN algorithm operates locally and makes no hops; in other words, the next data point this algorithm selects to examine (core or not) is a neighbor of the previously examined points.
Since DBSCAN has to find the neighbors of each data sample, it must calculate the distance of this sample from all other samples. It should be noted, however, that since the goal is to find neighbors, data samples that are too far away can be neglected; in other words, distance calculations to very distant samples can be pruned. To achieve this, a local, spherical space of closer data samples with a specific radius is considered as the operational dataset. For each sample, instead of calculating the distance to all data points, only the distances to the data of the same operational dataset are calculated to identify its neighbors. In this way the calculations are pruned and the superfluous distance calculations to far-away data are eliminated, which significantly reduces the computational overhead and run-time of the DBSCAN algorithm. Fig. 1 shows the operational dataset, which contains the data that will be examined in the near future according to the DBSCAN scan (see also Fig. 2).
In Fig. 1, P is the starting point and the center of the operational dataset, and size is the radius that determines how large the operational dataset is. The operational dataset, called OP for short, therefore includes all data in a spherical space of radius size around the center P. The mathematical representation of the operational dataset is given in Eq. (1):

Operational Dataset = { x | x ∈ D, |x − P| ≤ size },  size = n*ε,  n ∈ R, n > 1    (1)

in which D is the initial database, x is a sample of the database, P is the center of the operational dataset, |x − P| is the Euclidean distance between P and x, and n*ε is the radius of the operational dataset.
The radius of the OP can be decreased or increased; its value is adjusted by the user. The radius of the OP is a multiple of epsilon (the radius parameter of the DBSCAN algorithm), namely n*ε.
If the OP is large (large n), it covers a larger part of the initial dataset, so more points are likely to fall inside it. Although a larger operational dataset leads to less update overhead, the distance to more data samples must then be calculated to determine the label of each data sample.

Fig. 1. The operational dataset.


Fig. 2. (a) Calculating the distance of P from all points of the dataset. (b) Constituting the OP with center P and radius n*ε.

Conversely, the smaller the OP is, the lower the distance calculation cost, but as the size of the OP decreases, the overhead associated with updating it increases. There is therefore a trade-off between the radius of the OP and its update overhead. The optimal radius of the OP is found by measuring the run-time for various radii via trial and error. Increasing or decreasing the OP radius affects the algorithm's run-time, but it does not affect the clustering quality. The important point is that, although the size of the OP affects the run-time of the algorithm, using this set with any reasonable size greatly reduces the run-time compared to the original DBSCAN algorithm.

3.1.2. The potential dataset

The potential dataset includes the data that is too far from the samples currently under investigation and is therefore ignored temporarily in order to reduce the calculations. In other words, the potential dataset includes all data of the initial dataset except the data in the OP. The name is chosen because the data of this set has the potential to constitute another OP in the near future. The mathematical representation of the potential dataset is given in Eq. (2):

Potential Dataset = { x | x ∈ D and x ∉ OP },  i.e.  PD = D − OP    (2)

where D is the initial dataset, OP is the operational dataset, PD is the abbreviation of the potential dataset, and x is a sample of the initial dataset.
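Read literally, Eqs. (1) and (2) amount to a single distance pass over the dataset that splits it around the current center P. The sketch below is our own illustration of that split (not the authors' implementation); n_factor stands for the user-chosen n in size = n*ε.

```python
import numpy as np

def split_op_pd(data, center, eps, n_factor):
    """Partition the initial dataset D into the operational set (OP) and the
    potential set (PD) around the center P, following Eqs. (1) and (2)."""
    size = n_factor * eps                            # radius of the OP, size = n * eps
    dists = np.linalg.norm(data - center, axis=1)    # Euclidean distance |x - P| for every x in D
    op_idx = np.flatnonzero(dists <= size)           # OP = {x in D : |x - P| <= size}
    pd_idx = np.flatnonzero(dists > size)            # PD = D - OP
    return op_idx, pd_idx, dists

# Neighbor queries for points of the OP are then answered against data[op_idx]
# only, which is where the pruning of distance calculations comes from.
```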
3.2. Phases of OP-DBSCAN algorithm

We now describe the proposed algorithm. The proposed method comprises three main phases, described in the following.

3.2.1. First Phase: Constituting the OP and PD

In this phase, the main dataset is divided into OP and PD. As in DBSCAN, a random data point P is selected from the initial dataset, and the distance of P to all data in the dataset is calculated. Given these distances and the definition of the OP, the OP and PD can be obtained from the main dataset without any additional calculations compared to conventional DBSCAN.
Considering the parameters ε and MinPts, if the number of data points in the ε-neighborhood of P is less than MinPts, the point P is labeled as noise and the algorithm moves on to another random point of the database. If P is identified as a core point, then, depending on the value of the parameter selected by the user as the radius of the OP, all data whose distance from P is smaller than or equal to this radius is taken as the OP and the rest of the data is taken as PD. Therefore, to detect whether a data point in the OP is a core point or not (i.e., to identify its neighbors), the distances are calculated only to the samples of the OP instead of to the whole initial dataset. This causes a significant reduction in computation cost.

3.2.2. Second Phase: Running the DBSCAN algorithm on the OP

After generating the OP, the DBSCAN algorithm is applied to it. The samples are scanned depth-first: when the neighbors of a sample have been identified and the sample turns out to be a core point, all of its neighbors are pushed onto a stack. Then a neighbor is popped from the stack, its neighbors are found, and they are added to the stack under the same conditions. This process continues until the neighbors of all data on the stack have been retrieved and the stack is empty. In this way, the list of data belonging to one cluster is identified. The ExpandCluster function in Fig. 7 shows this execution process of DBSCAN.
The important point while scanning the samples is that the distance of each scanned sample from the center of the current OP must be smaller than or equal to (n−1)*ε. In other words, only the data of the OP whose distance from the center is at most (n−1)*ε are scanned; the rest of the data, lying between (n−1)*ε and n*ε, form the candid dataset, which is used to update the OP.
As shown in Fig. 3, scanning the samples between (n−1)*ε and n*ε is postponed because, within the current OP, only a part of the neighbors of these samples (the candid set) is accessible; the other neighbors lie outside the OP. For example, in Fig. 3 the point M lies between (n−1)*ε and n*ε, and a part of the ε-neighbors of M are outside the OP. Since only the distances to samples of the OP are calculated when identifying neighbors, the neighbors of M that are outside the OP (r and q) are not accessible (see also Fig. 4).
Theorem: Within the OP, all neighbors of any sample whose distance from the center of the OP is at most (n−1)*ε are accessible.
Proof: According to Fig. 3, let the point A lie on the circle of radius (n−1)*ε centered at P. The circle OP has center P and radius n*ε, and the circle OP-1 has center P and radius (n−1)*ε; the difference between the radii of the two circles is ε. The ε-neighborhood of A consists of all samples in a circle of radius ε centered at A, and this circle is internally tangent to OP; therefore, all ε-neighbors of A lie inside the OP circle. If a point is farther than (n−1)*ε from the center P (i.e., between (n−1)*ε and n*ε), some of its ε-neighbors may lie outside the OP circle. If a point is at a distance smaller than (n−1)*ε from the central point P (i.e., inside the OP-1 circle), then in the worst case some of its ε-neighbors are inside the OP-1 circle and the rest are inside the OP circle; since the OP-1 circle is a subset of the OP circle, all of its ε-neighbors lie inside the OP circle.


Fig. 3. Scanning the samples of the OP and constituting the candid set.

3.2.3. Third Phase: Updating the operational dataset

For each data point, the distance calculations and neighbor retrieval are limited to the operational dataset the point belongs to. Therefore, in order to scan and retrieve the neighbors of the data that are not inside the current OP, the OP must be updated. Two main questions arise: "when should the OP be updated?", and "which sample should be used to update the OP (as the center of the new OP)?".

3.2.3.1. Updating location of the OP. First, the candid set must be introduced.
Candid set: the samples whose distance from the center P lies between (n−1)*ε and n*ε are called the candid set. They are called candidates because, when the OP is updated, the center of the next OP is selected from this set if it is not empty; if the candid set is empty, the center of the next OP is a random point of the initial dataset. As mentioned before, the samples of the OP whose distance from the center of the OP exceeds (n−1)*ε enter the candid set. The mathematical representation of the candid set is given in Eq. (3):

Candid Set = { x ∈ D | d(x, P) < size  and  d(x, P) > size − ε }    (3)

Candid Set = [ OP(n*ε) ∪ OP((n−1)*ε) ] − [ OP(n*ε) ∩ OP((n−1)*ε) ]    (4)

On the other hand, the candid set can be represented as in Eq. (4) using the union and intersection of sets, where OP(n*ε) denotes the OP of radius n*ε centered at P and OP((n−1)*ε) denotes the OP of radius (n−1)*ε centered at P.
If, at update time, the center of the new OP were selected as a point outside both the candid set and the old OP, a cluster might be sliced and the algorithm would not report the correct number of clusters; in fact, it would identify more clusters than really exist. Therefore, to prevent clusters from being broken apart and to perform the clustering correctly and efficiently, the next OP center is selected from the samples of the candid set.
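Reusing the distances to the current center P computed when the OP was built, the candid set of Eq. (3) is simply the ring between (n−1)*ε and n*ε, and the next center is drawn from it whenever it is non-empty. The helpers below are our sketch; split_op_pd is the hypothetical helper from the previous snippet, and the boundary handling follows Eq. (3).

```python
import numpy as np

def candid_set(op_idx, dists, eps, n_factor):
    # Samples of the OP whose distance to P lies strictly between (n-1)*eps and n*eps, Eq. (3).
    size = n_factor * eps
    d_op = dists[op_idx]
    return op_idx[(d_op > size - eps) & (d_op < size)]

def next_center(data, candid_idx, unvisited_idx, rng):
    # Center of the next OP: a candid-set sample if one exists (so growing clusters
    # are not sliced, Fig. 5a); otherwise a random still-unvisited point of the dataset.
    pool = candid_idx if len(candid_idx) > 0 else unvisited_idx
    return data[rng.choice(pool)]
```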
As seen in Fig. 5, clusters A and B continue outside the current OP (OP1), and the point P1 belongs to cluster A. In Fig. 5(a), the center of OP2 is selected from the candid set (P1) during the update, so all points reachable from the new center adopt the cluster label of the center (cluster A). In Fig. 5(b), a random point of the initial dataset (P2) is instead selected as the center of the second OP during the update. As can be seen, the density-connected samples of P2 then constitute an independent cluster C, although all samples of clusters A and C actually belong to the same cluster; similarly, all samples of clusters B and D belong to one cluster.

3.2.3.2. Update time of the operational dataset. The operational dataset is updated when the neighbors of all samples of OP((n−1)*ε), i.e. all samples at distance (n−1)*ε or less from the OP center, have been retrieved.
After the update, the new OP obviously overlaps with the previous OP (as shown in Fig. 6), so the new OP contains data that has already been scanned. Accordingly, the OP is updated provided that inequality (5) is satisfied.

Fig. 4. Representation of the candid set.


Fig. 5a. Using the samples of the candid set as the OP center while updating.

Fig. 5b. Using a random point of the initial dataset as the center of another OP while updating (instead of using the candid set).

V_OP > M − V_B    (5)

In Eq. (5), V_OP is a variable that counts the data scanned in the current operational dataset, M is the total number of samples at distance (n−1)*ε or less from the OP center, and V_B is the number of samples that were already scanned in the previous OP. The sum of V_OP and V_B is always equal to M. The pseudo-code of the OP-DBSCAN algorithm is shown in Fig. 7.
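The pseudo-code of Fig. 7 is not reproduced in this text, so the following is only a simplified reconstruction of the idea of Sections 3.2.1–3.2.3: every ε-query is answered from a local operational set, which is rebuilt lazily whenever the queried point lies more than (n−1)*ε from the current center. By the theorem of Section 3.2.2 it produces exactly the same labels as DBSCAN, but it does not implement the authors' exact update policy (candid-set centers and inequality (5)); all names are ours.

```python
import numpy as np

def op_dbscan_sketch(data, eps, min_pts, n_factor=6):
    """Exact DBSCAN labels, with every eps-query answered from a local
    operational set (OP) instead of the full dataset; n_factor must be > 1."""
    m = len(data)
    labels = np.zeros(m, dtype=int)        # 0 = unvisited, -1 = noise, >0 = cluster id
    size = n_factor * eps
    op_center, op_idx = None, None

    def neighbors(q):
        nonlocal op_center, op_idx
        if op_center is None or np.linalg.norm(data[q] - op_center) > (n_factor - 1) * eps:
            # Rebuild the OP around q: one pass over the full dataset (the update overhead u).
            op_center = data[q]
            op_idx = np.flatnonzero(np.linalg.norm(data - op_center, axis=1) <= size)
        # Exact eps-neighborhood of q, computed only against OP members (theorem of Section 3.2.2).
        return op_idx[np.linalg.norm(data[op_idx] - data[q], axis=1) <= eps]

    cluster_id = 0
    for p in range(m):
        if labels[p] != 0:
            continue
        nb = neighbors(p)
        if len(nb) < min_pts:
            labels[p] = -1                 # noise (may be relabeled as a border point later)
            continue
        cluster_id += 1
        labels[p] = cluster_id
        work = list(nb)
        while work:
            q = work.pop()
            if labels[q] == -1:
                labels[q] = cluster_id     # border point
            if labels[q] != 0:
                continue
            labels[q] = cluster_id
            q_nb = neighbors(q)
            if len(q_nb) >= min_pts:
                work.extend(q_nb)
    return labels
```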
3.3. Time complexity analysis of the proposed algorithm

The DBSCAN algorithm calculates the distance of each sample to all samples of the main dataset, so its time complexity is O(n²). In the OP-DBSCAN algorithm, by contrast, the distance of each sample of an OP is calculated only to the other members of the same OP; on the other hand, the algorithm also has the OP update overhead. The time complexity of the proposed algorithm is therefore as given in Eq. (6):

O(nk + nu),  k + u ≪ n    (6)

Since in each OP the distance of every data point to all other data of the same OP is calculated, the time complexity of this part is n*k, where k is the size of the OP. When the OP is updated and new OPs are constituted, the distance of a sample to all other samples of the main dataset is calculated; since the number of update operations is u, the time complexity of this phase is n*u. Therefore, the time complexity of the OP-DBSCAN algorithm is O(nk + nu). The larger the OP (larger k), the smaller the update overhead (smaller u), and vice versa: the smaller the OP (smaller k), the larger the update overhead. As Eq. (6) indicates, the sum of k and u is much smaller than n.
Table 1 shows the effect of changing the radius of the OP on the run-time of the OP-DBSCAN algorithm for several datasets. As can be inferred from Table 1, the radius of the OP should be neither very large nor very small, because both extremes increase the run-time of OP-DBSCAN.

4. Experiments setup

In this section, some evaluation metrics are presented first. Then, the proposed method is compared with DBSCAN (Ester, Kriegel, Sander, & Xu, 1996), HCA-DBSCAN (Mathur, Mehta, & Singh, 2019), fast Density-Grid (Brown, Japa, & Shi, 2019), and K-DBSCAN (Gholizadeh, Saadatfar, & Hanafi, 2020) in terms of qualitative evaluation metrics and run-time.


Fig. 6. Overlapping samples of two sets (OP1 and OP2).

4.1. The datasets used in the experiments

Since the main idea of this paper is to reduce the run-time of the DBSCAN algorithm, several datasets with different numbers of dimensions and samples are used to compare the proposed method with the four other methods. The datasets used in this study are taken from UCI and GitHub, so that the required diversity in the number of dimensions and the number of samples is obtained.
Table 2 shows the characteristics of the employed datasets, including the name of the dataset, the number of samples, the number of dimensions, and the epsilon and MinPts parameters for each dataset.
The epsilon and MinPts values of each dataset are specified according to the type of the dataset. When comparing the proposed algorithm with the other four algorithms, in order to have the same test conditions, the values of the epsilon and MinPts parameters are determined for each dataset based on its properties and are kept the same for all methods during the experiments (Table 2). The proposed method is then compared with the four other methods using these parameters under the same conditions.

4.2. Evaluation conditions

To compare the proposed algorithm with DBSCAN and the three other mentioned methods, each method was implemented and executed under the same conditions on a computational machine with a quad-core CPU, 40 MB disc, and 64 GB RAM. To obtain more accurate results, all unnecessary services of the operating system were disabled during execution of the algorithms. All datasets are normalized into the range [0, 1), and the OP-DBSCAN algorithm and the four other algorithms are applied to the normalized datasets. The purpose is to compare the clustering quality (in terms of the clustering evaluation metrics) and the run-time of the algorithms under the same hardware conditions and common parameters. The results presented in all experiments are averages over ten different runs.
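The normalization into [0, 1) mentioned above is presumably a per-feature min-max scaling; the following is a small sketch of that preprocessing step under this assumption (the exact formula is not stated in the paper).

```python
import numpy as np

def minmax_normalize(data, pad=1e-12):
    # Scale every feature independently; the small pad keeps the maximum
    # strictly below 1 (range [0, 1)) and guards against constant columns.
    lo, hi = data.min(axis=0), data.max(axis=0)
    return (data - lo) / (hi - lo + pad)
```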
4.3. Qualitative evaluation metrics

Using proper evaluation metrics to examine the efficiency of a clustering method in finding exact clusters, and to compare the performance of different methods, is essential. To compare the efficiency and quality of the proposed method with the four other methods, two clustering quality validation metrics are used, described in the following.

4.3.1. Davies–Bouldin index (Davies & Bouldin, 1979)

This index, called DB for short, does not depend on the number of clusters or on the clustering algorithm. It calculates the average, over all clusters, of the maximum ratio of intra-cluster scattering to inter-cluster separation; the smaller the DB value, the better the clustering. In other words, the DB index employs a similarity measure R_ij between two clusters, defined in terms of the scattering of a cluster (S_i) and the dissimilarity between two clusters (d_ij).
The DB index is defined as in Eq. (7):

DB = (1/k) Σ_{i=1..k} max_{j≠i} R_ij    (7)

where k is the number of clusters and R_ij is the ratio of intra-cluster scattering to inter-cluster separation for clusters i and j, defined as follows:

R_ij = (S_i + S_j) / d_ij    (8)

in which S_i is the mean distance between each point of the i-th cluster and its center, and d_ij is the distance between the centers of clusters i and j.
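A direct implementation of Eqs. (7)–(8) under the usual reading (S_i as the mean distance of cluster members to their centroid, d_ij as the distance between centroids). This is our sketch; noise points (label −1) are simply excluded, and scikit-learn's davies_bouldin_score can serve as a cross-check.

```python
import numpy as np

def davies_bouldin(data, labels):
    ids = [c for c in np.unique(labels) if c >= 0]       # ignore noise (-1)
    if len(ids) < 2:
        raise ValueError("DB index needs at least two clusters")
    cents = np.array([data[labels == c].mean(axis=0) for c in ids])
    s = np.array([np.linalg.norm(data[labels == c] - cents[i], axis=1).mean()
                  for i, c in enumerate(ids)])           # S_i: mean distance to the center
    k = len(ids)
    worst = []
    for i in range(k):
        r_ij = [(s[i] + s[j]) / np.linalg.norm(cents[i] - cents[j])
                for j in range(k) if j != i]             # Eq. (8)
        worst.append(max(r_ij))
    return float(np.mean(worst))                         # Eq. (7)
```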
4.3.2. Silhouette index (Rousseeuw, 1986)

Another method used to evaluate clustering is the Silhouette index, denoted SI. This measure depends on the cohesion of the clusters and on their separability. The SI value of each point expresses how well the point belongs to its own cluster compared to the adjacent cluster. To calculate SI, two main quantities are needed.
The mean distance of a point to the other points of its cluster: this value is denoted a(i) and calculated as in Eq. (9):

a(i) = (1/n_i) Σ_{l=1..n_i} d(x_i, x_l)    (9)

in which n_i is the number of members of the i-th cluster and d(x_i, x_l) is the distance of x_i to the other data of the same cluster. a(i) can be regarded as a measure of the membership of x_i in its cluster: the smaller a(i), the stronger the membership of the point in its cluster.
The mean distance of a point to the other clusters: for a point x_i, its mean distance to the points of every other cluster is calculated. The cluster with the minimum mean distance to x_i is called the adjacent cluster, and the mean distance of x_i to the points of the adjacent cluster is denoted b(i):

b(i) = min_{1≤l≤k} (1/n_l) Σ_{y_m ∈ C_l} d(x_i, y_m)    (10)

where x_i is a data point of the i-th cluster, d(x_i, y_m) is the distance of x_i to y_m in the l-th cluster, and n_l is the number of members of the l-th cluster. The SI of x_i is then calculated using Eq. (11):

s(i) = (b(i) − a(i)) / max(a(i), b(i))    (11)

Therefore, if a(i) is smaller than b(i), SI is positive, and if b(i) is smaller than a(i), SI is negative, indicating weak clustering, because x_i is more similar to the adjacent cluster than to its own cluster. From the above equation, SI varies between −1 and +1. Values close to 1 indicate a good match between a point and its cluster compared to the adjacent cluster; if SI is close to 1 for all points of the clusters, the clustering is performed correctly, while small SI values indicate weak clustering, which may be due to an improper choice of the number of clusters. In other words, if the mean SI is calculated over the samples of each cluster, an index is obtained to evaluate that cluster, and the average SI over all samples is a metric for the overall clustering performance.
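Eqs. (9)–(11) translate directly into the sketch below, which averages s(i) over all non-noise points. Note that Eq. (9), as written, averages over all n_i members of the cluster (the self-distance being 0), whereas the common convention divides by n_i − 1; scikit-learn's silhouette_score (which uses the latter) can be used as a cross-check.

```python
import numpy as np

def silhouette(data, labels):
    mask = labels >= 0                                   # drop noise points
    X, y = data[mask], labels[mask]
    ids = np.unique(y)
    if len(ids) < 2:
        raise ValueError("SI needs at least two clusters")
    scores = []
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        own = y == y[i]
        if own.sum() < 2:
            scores.append(0.0)                           # singleton cluster: s(i) taken as 0
            continue
        a = d[own].mean()                                # Eq. (9), as written in the paper
        b = min(d[y == c].mean() for c in ids if c != y[i])   # Eq. (10): adjacent cluster
        scores.append((b - a) / max(a, b))               # Eq. (11)
    return float(np.mean(scores))
```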


Fig. 7. Pseudo-code of the OP-DBSCAN Algorithm.

Table 1
Effect of changes of the radius of the OP on the run-time of OP-DBSCAN (run-time in seconds).

Dataset            OP = 2*ε    OP = 4*ε    OP = 6*ε    OP = 8*ε    OP = 10*ε   OP = 20*ε
Abalone            0.6072      0.4850      0.4328      0.4254      0.5085      0.58017
MAGIC              20.5358     8.4419      7.4395      7.0835      7.8031      28.2929
FARS               103.26      48.0892     180.5895    265.3311    368.0851    789.6889
3D Road Network    882.57992   289.1028    134.7067    88.0222     87.1958     195.3034


Table 2
Characteristics of the datasets used in this study.

Dataset                       Number of samples   Number of dimensions   Epsilon   MinPts
Abalone                       4177                8                      0.035     4
MAGIC                         19,020              10                     0.06      4
Educational Process mining    230,318             13                     0.05      4
Shuttle                       58,000              9                      0.02      5
MoCap Hand Postures           78,095              11                     0.007     4
FARS                          106,565             10                     0.08      4
Skin Segmentation             245,057             3                      0.01      4
News Aggregator               422,937             5                      0.01      4
3D Road Network               434,874             4                      0.01      4
Cover type                    581,012             5                      0.01      4
Poker hand                    1,000,000           10                     0.25      8
5. Results analysis

The results of comparing the evaluation metrics and the run-times of the five algorithms are given in Table 3 and Table 4.
According to Table 3 and Table 4, the original DBSCAN algorithm runs out of memory on large datasets (datasets with more than 200,000 samples), while the other methods do not face this problem. The proposed method generally has a shorter execution time than the other competing methods, especially on large datasets. For example, on the FARS dataset (the largest dataset on which the original DBSCAN ran successfully), the execution time of the proposed method is more than 140 times shorter than that of the original DBSCAN and more than 6 times shorter than that of the K-DBSCAN method. For the two smaller datasets (Abalone and MAGIC), due to the small number of samples, the overhead caused by the OP updates dominates, and the execution time of the proposed method is, in general, longer than that of the other methods.
Based on the times reported in Table 4, the percentage of performance improvement (PPI) of the OP-DBSCAN algorithm with respect to DBSCAN can be calculated through Eq. (12):

PPI = |t_DBSCAN − t_p| / t_DBSCAN    (12)

In this equation, t_DBSCAN is the DBSCAN execution time and t_p is the execution time of the proposed algorithm. Table 5 shows the percentage of performance improvement achieved by the proposed algorithm.
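As a check on Eq. (12), plugging in the FARS run-times of Table 4 gives PPI = |16950.74 − 48.0892| / 16950.74 ≈ 0.997, i.e. roughly the 99.71% improvement reported in Table 5; the Shuttle entry works out the same way: 1 − 20.2553/128.28 ≈ 0.842, i.e. 84.21%.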

By comparing the qualitative evaluation metrics of the proposed algorithm and the four other methods, it can be concluded that the proposed algorithm is almost identical to the DBSCAN algorithm in terms of clustering quality, while it performs the clustering faster than DBSCAN and the three other algorithms.
Fig. 8 compares the run-times of the five algorithms, Fig. 9 compares their SI values, and Fig. 10 compares their DB values. These diagrams are drawn to better show the trend of changes in the performance of the proposed method relative to the other methods across the different datasets and data sizes.

Table 3
Comparing the evaluation metrics of the 5 algorithms on the different datasets.

Criterion  Method        Abalone   MAGIC    Educational   Shuttle   MoCap    FARS     Skin Segmentation   News aggregator   Road Network   Cover Type   Poker hand
DB         OP-DBSCAN     1.4170    1.0916   1.022         0.7693    0.7546   1.4351   0.8411              1.035             0.8085         0.9015       1.012
           DBSCAN        2.0524    1.4197   1.054         1.8097    1.0755   1.7289   –                   –                 –              –            –
           HCA-DBSCAN    2.0560    1.5963   1.365         1.8602    1.0779   1.8516   1.4283              1.319             1.1746         1.132        1.2506
           K-DBSCAN      2.0626    1.4795   1.289         1.8535    1.0783   1.8238   1.3912              1.218             1.2299         1.142        1.2381
           Density-grid  2.0615    1.6359   1.519         1.8823    1.0786   1.8235   1.4681              1.345             1.2194         1.1415       1.2578
SI         OP-DBSCAN     −0.276    −0.516   −0.418        −0.405    −0.468   −0.436   −0.128              −0.212            −0.325         −0.596       −0.561
           DBSCAN        −0.638    −0.776   −0.430        −0.431    −0.645   −0.519   –                   –                 –              –            –
           HCA-DBSCAN    −0.651    −0.778   −0.437        −0.439    −0.646   −0.731   −0.148              −0.456            −0.541         −0.712       −0.832
           K-DBSCAN      −0.655    −0.778   −0.432        −0.443    −0.646   −0.688   −0.146              −0.325            −0.535         −0.714       −0.833
           Density-grid  −0.642    −0.779   −0.465        −0.445    −0.647   −0.751   −0.148              −0.418            −0.544         −0.712       −0.834

6. Conclusion and future works

In this paper, a clustering algorithm for big data, called OP-DBSCAN, is presented as an improved version of the well-known DBSCAN algorithm. The main idea is to use the concepts of the operational dataset and the potential dataset to prune the data space and reduce the calculations needed to find the neighbors, ensuring that the search space used to find the ε-neighborhood of a sample is always restricted to a space much smaller than the whole dataset. Because the DBSCAN algorithm usually moves softly and locally while growing clusters, this idea could be implemented by defining a smaller set of data, the operational dataset.
OP-DBSCAN consists of three phases: the first constitutes the operational dataset and the potential dataset, the second runs the DBSCAN algorithm on the operational dataset, and the third updates the operational dataset. The OP-DBSCAN algorithm is therefore classified as a calculation reduction method. The experiments on different datasets show that the proposed algorithm is faster than DBSCAN and three other algorithms, namely HCA-DBSCAN, fast Density-Grid, and K-DBSCAN, while preserving the clustering quality.
Among the advantages of the proposed algorithm is the increase in clustering speed while preserving quality. Also, in addition to the parameters of DBSCAN, OP-DBSCAN requires only one extra input parameter (the operational dataset size) from the user.
Since multiple machine-based clustering techniques are more scalable and faster than single machine-based techniques, using the map-reduce framework in the proposed algorithm can be studied in future work. Furthermore, the density of samples within larger radii could be used to guess the location of samples in a cluster, so the process of labeling samples (for example, as cores or noise) could be accelerated and a considerable amount of calculation omitted; this idea can also be used to improve the proposed algorithm in future studies. Considering a dynamic size for the operational dataset based on the local density is another idea that can be pursued in future work.


Table 4
Comparing the run-time of the proposed method and the 4 other methods on the 11 different datasets.

Running time (s)   Abalone   MAGIC    Educational   Shuttle   MoCap     FARS       Skin Segmentation   News aggregator   Road Network   Cover Type   Poker hand
OP-DBSCAN          0.4197    7.0835   9.458         20.2553   56.8786   48.0892    123.3931            75.45             77.5716        122.1612     615.4721
DBSCAN             0.3195    4.4409   9.895         128.28    3755.8    16950.74   –                   –                 –              –            –
HCA-DBSCAN         0.3726    4.0123   9.756         104.95    586.70    1193.5     2058.1              3745.8            3967.5         4108.3       12413.42
Density-Grid       0.3905    5.5327   8.569         114.68    672.98    1627.6     3575.4              4536.7            4756.7         4825.7       13526.63
K-DBSCAN           0.4537    6.2015   8.781         82.893    177.96    299.15     462.47              789.2             806.19         733.51       1789.7

Table 5
Percentage of improvement by the proposed algorithm in the execution time of DBSCAN.

Dataset   Educational   Shuttle   MoCap    FARS
PPI       4.41%         84.21%    98.48%   99.71%

Fig. 8. Comparing the run-time of the 5 algorithms (DBSCAN, HCA-DBSCAN, Density-Grid DBSCAN, K-DBSCAN, and OP-DBSCAN).

Fig. 9. Changes of SI for the 5 algorithms on different datasets.

CRediT authorship contribution statement

Nooshin Hanafi: Methodology, Software, Investigation, Validation, Writing – original draft. Hamid Saadatfar: Conceptualization, Formal analysis, Writing – review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Fig. 10. Changes of DB for the 5 algorithms on different datasets.

References

Ananthi, V., Balasubramaniam, P., & Kalaiselvi, T. (2015). A new fuzzy clustering algorithm for the segmentation of brain tumor. Soft Computing, 4859–4879.
Baranidharan, B., & Santhi, B. (2016). DUCF: Distributed load balancing Unequal Clustering in wireless. Applied Soft Computing, 495–506.
Berkhin, P. (2006). A Survey of Clustering Data Mining Techniques. In I. J. Kogan, C. Nicholas, & M. Teboulle (Eds.), Grouping Multidimensional Data (pp. 25–71). Springer.
Brown, D., Japa, A., & Shi, Y. (2019). A Fast Density-Grid Based Clustering Method. Las Vegas, NV, USA: IEEE.
Chen, Y., Tang, S., Bouguila, N., Wang, C., Du, J., & Li, H. (2018). A Fast Clustering Algorithm based on pruning unnecessary distance computations in DBSCAN for High-Dimensional Data. Pattern Recognition, 83, 375–387.
Chen, Y., Zhou, L., Bouguila, N., Wang, C., Chen, Y., & Du, J. (2021). BLOCK-DBSCAN: Fast clustering for large scale data. Pattern Recognition, 109, Article 107624.
Davies, D., & Bouldin, D. (1979). A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-1, 224–227.
de Moura Ventorim, I., Luchi, D., Loureiros Rodrigues, A., & Miguel Varejão, F. (2021). BIRCHSCAN: A sampling method for applying DBSCAN to large datasets. Expert Systems with Applications, 184, Article 115518.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. KDD-96 Proceedings.
Feyyad, U. (1996). Data mining and knowledge discovery: Making sense out of data. IEEE Expert, 11, 20–25.
Gholizadeh, N., Saadatfar, H., & Hanafi, N. (2020). K-DBSCAN: An improved DBSCAN algorithm for big data. The Journal of Supercomputing, 77, 6214–6235.
Gunawan, A., & Berg, M. (2013). A faster algorithm for DBSCAN. Master's thesis.
Hahsler, M., & Bolaos, M. (2016). Clustering Data Streams Based on Shared Density Between Micro-Clusters. IEEE Transactions on Knowledge and Data Engineering, 28, 1449–1461.
He, Y., Tan, H., Luo, W., Mao, H., Ma, D., Feng, S., & Fan, J. (2011). MR-DBSCAN: An Efficient Parallel Density-Based Clustering Algorithm Using MapReduce. 2011 IEEE 17th International Conference on Parallel and Distributed Systems. Tainan, Taiwan: IEEE.
Hou, J., Liu, W., E, X., & Cui, H. (2016). Towards parameter-independent data clustering and image segmentation. Pattern Recognition, 60, 25–36.
Kesavaraj, G., & Sukumaran, S. (2013). A study on classification techniques in data mining. 2013 Fourth International Conference on Computing, Communications and Networking Technologies (ICCCNT). Tiruchengode, India: IEEE.
Kumar, K., & Reddy, A. M. (2016). A fast DBSCAN clustering algorithm by accelerating neighbor searching using Groups method. Pattern Recognition, 39–48.
Li, S.-S. (2020). An Improved DBSCAN Algorithm Based on the Neighbor Similarity and Fast Nearest Neighbor Query. IEEE Access, 47468–47476.
Mahesh, B. (2020). Machine Learning Algorithms - A Review. International Journal of Science and Research (IJSR), 9(1).
Mai, S., Assent, I., & Storgaard, M. (2016). AnyDBC: An Efficient Anytime Density-based Clustering Algorithm for Very Large Complex Datasets (pp. 1025–1034). ACM Press.
Mathur, V., Mehta, J., & Singh, S. (2019). HCA-DBSCAN: HyperCube Accelerated Density Based Spatial Clustering for Applications with Noise. Sets and Partitions workshop at NeurIPS 2019.
Qiao, S., Li, T., Li, H., Peng, J., & Chen, H. (2012). A new blockmodeling based hierarchical clustering algorithm for web social networks. Engineering Applications of Artificial Intelligence, 25, 640–647.
Rousseeuw, P. (1986). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 53–65.
Rubio, E., Castillo, O., Valdez, F., Melin, P., Gonzalez, I. C., & Martinez, G. (2017). An Extension of the Fuzzy Possibilistic Clustering Algorithm Using Type-2 Fuzzy Logic Technique. Advances in Fuzzy Systems, 2017, 1–23.
Sanchez, M., Castillo, O., Castro, J., & Melin, P. (2014). Fuzzy granular gravitational clustering algorithm for multivariate data. Information Sciences, 279, 498–511.
Shirkhorshidi, A., Aghabozorgi, S., Wah, T., & Herawan, T. (2014). Big Data Clustering: A Review (pp. 707–720). Springer International Publishing.
Sinha, A., & Jana, P. (2016). A Novel K-Means based Clustering Algorithm for Big Data. 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI). Jaipur, India: IEEE.
Weng, S., Gou, J., & Fan, Z. (2021). h-DBSCAN: A simple fast DBSCAN algorithm for big data. In Proceedings of The 13th Asian Conference on Machine Learning (pp. 81–96).
Zerhari, B., Lahcen, A., & Mouline, S. (2015). Big Data Clustering: Algorithms and Challenges. International Conference on Big Data, Cloud and Applications. Tetuan, Morocco.

