Parallel Implementation of OPTICS Algorithm
ϵ-neighborhood of a point p
The ϵ-neighborhood of a point p is the set of points within a radius of ϵ from p (including p itself).
Core point
If the number of points lying inside the ϵ-neighborhood of p is ≥ MinPts, then p is a core point. In Fig 1, p is
a core point.
Border point
A point q is a border point if it lies within the ϵ-neighborhood of a core point but is not itself a core point.
In Fig 1, q is a border point.
Noise point
A point r is a noise point if it is neither a core point nor a border point. In Fig 1, s and r are noise points.
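The three point types defined above can be illustrated with a minimal sketch. The function name `classify_points`, the use of Euclidean distance, and the toy coordinates are assumptions made for illustration only; they are not taken from the project code.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each row of X as 'core', 'border', or 'noise' (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances; O(n^2) memory, acceptable for a small illustration.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # within[i, j] is True when j lies in the eps-neighborhood of i (i itself included).
    within = dist <= eps
    is_core = within.sum(axis=1) >= min_pts
    labels = []
    for i in range(len(X)):
        if is_core[i]:
            labels.append("core")
        elif np.any(within[i] & is_core):
            # Not core, but inside the neighborhood of at least one core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Toy data (assumed): one dense cross around the origin plus two outliers.
points = [(0, 0), (1, 0), (0, 1), (-1, 0), (2, 0), (10, 10)]
labels = classify_points(points, eps=1.0, min_pts=4)
print(labels)
```

With these toy values, only the origin has four points in its neighborhood, its three direct neighbors become border points, and the two remaining points are noise.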
Direct Density-Reachability
A point q is directly density-reachable from a point p if p is a core point and q is in p’s ϵ-neighborhood. In Fig
1, q is directly density-reachable from p.
Density-Reachability
A point q is density-reachable from a point p with respect to ϵ if there is a chain of points p1, p2, ..., pn
with p1 = p and pn = q such that each p(i+1) is directly density-reachable from pi. In Fig 1, s is density-reachable from p.
Figure 1: An illustration of core, border, and noise points, direct density-reachability, and density-reachability,
when MinPts = 4. p is a core point, q is a border point, s and r are noise points. q is directly
density-reachable from p. s is density-reachable from p.
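Density-reachability can be checked by a breadth-first search that only expands core points, since every link in the chain must start at a core point. The sketch below is a minimal illustration under stated assumptions: the name `density_reachable`, Euclidean distance, and the one-dimensional toy data are all hypothetical.

```python
from collections import deque
import math

def density_reachable(points, eps, min_pts, source, target):
    """Return True if points[target] is density-reachable from points[source]."""
    def neighbors(i):
        # eps-neighborhood of point i, including i itself.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    def is_core(i):
        return len(neighbors(i)) >= min_pts

    if not is_core(source):
        return False  # a chain must start from a core point
    seen = {source}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        # i is a core point, so every neighbor j is directly density-reachable from i.
        for j in neighbors(i):
            if j == target:
                return True
            if j not in seen:
                seen.add(j)
                if is_core(j):
                    queue.append(j)  # only core points can extend the chain
    return False

# Toy data (assumed): four points one unit apart on a line, plus an outlier.
pts = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
forward = density_reachable(pts, 1.0, 3, source=1, target=3)
backward = density_reachable(pts, 1.0, 3, source=3, target=1)
outlier = density_reachable(pts, 1.0, 3, source=1, target=4)
```

In this toy example the relation is not symmetric: point 3 is reachable from point 1 through the core point 2, but point 3 is not a core point, so nothing is density-reachable from it.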
Core distance
For a point p and given ϵ and MinPts, the core distance is the smallest radius that makes p a core point,
i.e., the distance from p to its MinPts-th nearest neighbor in Nϵ(p). If p is not a core point, the core distance is undefined.
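\[
cd(p) =
\begin{cases}
\text{undefined} & \text{if } |N_\epsilon(p)| < MinPts \\
d\left(p, N_\epsilon^{MinPts}(p)\right) & \text{if } |N_\epsilon(p)| \ge MinPts
\end{cases}
\tag{1}
\]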
Reachability-distance
For points p and q and given ϵ and MinPts, the reachability-distance of q with respect to p is defined as the
maximum of the core distance of p and the distance between p and q. If p is not a core
point, the reachability-distance is undefined.
\[
rd(q, p) =
\begin{cases}
\text{undefined} & \text{if } |N_\epsilon(p)| < MinPts \\
\max\{cd(p), d(p, q)\} & \text{if } |N_\epsilon(p)| \ge MinPts
\end{cases}
\tag{2}
\]
Note that every point p in the dataset carries these two values: a core distance and a reachability-distance.
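As a small worked example of the two definitions, the sketch below computes both quantities by brute force. The helper names, the use of Euclidean distance, the convention that p counts as its own nearest neighbor (because the ϵ-neighborhood includes p), and the use of `None` for "undefined" are assumptions made for illustration.

```python
import math

def core_distance(points, p, eps, min_pts):
    """Distance from p to its min_pts-th nearest neighbor within eps, or None if undefined."""
    # Distances from p to every point in its eps-neighborhood (p itself contributes 0).
    dists = sorted(d for d in (math.dist(p, q) for q in points) if d <= eps)
    if len(dists) < min_pts:
        return None  # p is not a core point
    return dists[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """Reachability-distance of q with respect to p, or None if p is not a core point."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, math.dist(p, q))

# Toy data (assumed).
pts = [(0, 0), (1, 0), (0, 2), (3, 0)]
p = (0, 0)
cd_inf = core_distance(pts, p, math.inf, 3)                     # third-smallest of 0, 1, 2, 3
rd_near = reachability_distance(pts, (1, 0), p, math.inf, 3)    # max(2, 1)
rd_far = reachability_distance(pts, (3, 0), p, math.inf, 3)     # max(2, 3)
cd_small = core_distance(pts, p, 1.5, 3)                        # only two points within 1.5
```

A point closer to p than the core distance is assigned the core distance itself, which is what smooths the reachability values inside a dense region.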
The objective of OPTICS is to output an ordered list of the points in the dataset, each annotated with its computed
core distance and reachability distance. The input of OPTICS is a dataset of points together with the parameters
ϵ and MinPts, where ϵ defaults to infinity. First, the algorithm initializes the set of core points Ω = ∅.
Second, it traverses the dataset and adds every core point to Ω. Third, OPTICS picks an unprocessed object o
from Ω at random, marks it as processed, and appends it to the ordered list. It then computes the
reachability-distance of each unprocessed point in the ϵ-neighborhood of o with respect to o and inserts these
points into a priority queue Seeds keyed by reachability-distance. Next, OPTICS removes the seed s with the
smallest reachability-distance from Seeds, marks it as processed, and appends it to the ordered list. If s is a core
point, OPTICS adds all unprocessed neighbors of s to Seeds and updates their reachability-distances. These steps
repeat until both Ω and Seeds are empty.
Algorithm 1 gives the pseudocode of OPTICS, and Algorithm 2 gives the pseudocode of the update procedure.
Algorithm 1: OPTICS
Input: Dataset D, ϵ, MinPts
Output: OrderedList
// Find all core points in D
CorePoints = CorePointsQuery(D, ϵ, MinPts);
// Compute the core distance of each core point
CoreDists = ComputeCoreDists(D, ϵ, MinPts);
foreach unprocessed point p in CorePoints do
    // Find all ϵ-neighbors of p, including p
    N = RegionQuery(p, ϵ);
    mark p as processed;
    append p to the OrderedList;
    if |N| ⩾ MinPts then
        Seeds = empty priority queue;
        Update(N, p, Seeds, CoreDists);
        while Seeds is not empty do
            q = remove the entry with the smallest reachability distance from Seeds;
            N’ = RegionQuery(q, ϵ);
            mark q as processed;
            append q to the OrderedList;
            if |N’| ⩾ MinPts then
                Update(N’, q, Seeds, CoreDists);
            end
        end
    end
end
return OrderedList
Algorithm 2: Update
Input: N, p, Seeds, CoreDists
// Look up the core distance of p
CoreDist = CoreDists[p];
foreach o in N do
    if o is not processed then
        // Compute the reachability-distance of o with respect to p
        ReachDist = max(CoreDist, dist(p, o));
        if reachability distance of o == NULL then
            reachability distance of o = ReachDist;
            Insert (o, ReachDist) into Seeds;
        else if ReachDist < reachability distance of o then
            reachability distance of o = ReachDist;
            Update the priority of o in Seeds to ReachDist;
        end
    end
end
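Algorithms 1 and 2 can be sketched sequentially in Python as follows. This is a minimal illustration under stated assumptions, not the project's Spark code: neighborhoods are found by brute force rather than with a ball tree, distances are Euclidean, and the priority update in Seeds is emulated by pushing a new heap entry and skipping stale ones when they are popped. All function names are hypothetical.

```python
import heapq
import math

def optics(points, eps, min_pts):
    """Return (ordered indices, reachability distances) following Algorithms 1 and 2."""
    n = len(points)
    processed = [False] * n
    reach = [None] * n   # None plays the role of NULL / undefined
    ordered = []

    def region_query(i):
        # Brute-force eps-neighborhood of point i, including i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    def core_dist(i, nbrs):
        if len(nbrs) < min_pts:
            return None
        return sorted(math.dist(points[i], points[j]) for j in nbrs)[min_pts - 1]

    def update(nbrs, p, cd, seeds):
        for o in nbrs:
            if processed[o]:
                continue
            new_rd = max(cd, math.dist(points[p], points[o]))
            if reach[o] is None or new_rd < reach[o]:
                reach[o] = new_rd
                # Lazy decrease-key: older, larger entries for o stay in the heap
                # and are skipped when popped after o has been processed.
                heapq.heappush(seeds, (new_rd, o))

    for p in range(n):
        if processed[p]:
            continue
        nbrs = region_query(p)
        cd = core_dist(p, nbrs)
        if cd is None:
            continue  # as in Algorithm 1, expansion only starts from core points
        processed[p] = True
        ordered.append(p)
        seeds = []
        update(nbrs, p, cd, seeds)
        while seeds:
            _, q = heapq.heappop(seeds)
            if processed[q]:
                continue  # stale heap entry
            q_nbrs = region_query(q)
            processed[q] = True
            ordered.append(q)
            q_cd = core_dist(q, q_nbrs)
            if q_cd is not None:
                update(q_nbrs, q, q_cd, seeds)
    return ordered, reach

# Toy data (assumed): two groups of three points on a line.
pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0), (12, 0)]
order, reach = optics(pts, eps=math.inf, min_pts=2)
```

On this toy input the ordering visits the first group, then jumps to the second; the jump shows up as a single large reachability value (8.0 for the first point of the second group), which is how cluster boundaries appear in a reachability plot.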
3 Spark Implementation and Parallelization
3.1 Ball Tree
Organizing the data in a spatial index improves our algorithm’s efficiency. In this project, we apply a ball tree
to speed up the computation. A ball tree is a binary tree in which every node corresponds to a hypersphere (ball)
enclosing a subset of the points, and every internal node partitions its data into two disjoint sets. If two
hyperspheres intersect, each point is assigned according to its distance to the balls’ centers. The following
algorithm describes how to construct a ball tree.
Algorithm 3: Construct balltree
Input: D, an array of data points.
Output: B, the root of a constructed ball tree.
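For illustration only, one common way to build a ball tree is sketched below; it is not necessarily the construction used in this project. The assumptions are: a non-empty input, a median split along the dimension of greatest spread (so the two children are disjoint by construction), the centroid as ball center, and a hypothetical `leaf_size` parameter that stops the recursion.

```python
import numpy as np

class BallNode:
    """A node of the (hypothetical) ball tree: a ball plus either points or two children."""
    def __init__(self, center, radius, points=None, left=None, right=None):
        self.center = center
        self.radius = radius
        self.points = points   # set only on leaves
        self.left = left
        self.right = right

def build_ball_tree(X, leaf_size=2):
    """Recursively build a ball tree over the rows of a non-empty array X."""
    X = np.asarray(X, dtype=float)
    center = X.mean(axis=0)
    # Smallest radius around the centroid that encloses every point of this node.
    radius = float(np.max(np.linalg.norm(X - center, axis=1)))
    if len(X) <= leaf_size:
        return BallNode(center, radius, points=X)
    # Split at the median of the dimension with the greatest spread.
    dim = int(np.argmax(X.max(axis=0) - X.min(axis=0)))
    order = np.argsort(X[:, dim])
    mid = len(X) // 2
    left = build_ball_tree(X[order[:mid]], leaf_size)
    right = build_ball_tree(X[order[mid:]], leaf_size)
    return BallNode(center, radius, left=left, right=right)

def count_points(node):
    """Total number of points stored in the leaves below node."""
    if node.points is not None:
        return len(node.points)
    return count_points(node.left) + count_points(node.right)

# Toy data (assumed).
X = np.array([[0, 0], [1, 0], [5, 5], [6, 5], [9, 9]], dtype=float)
root = build_ball_tree(X, leaf_size=2)
total = count_points(root)
```

A neighborhood query can then skip any subtree whose ball lies entirely farther than ϵ from the query point, which is where the speed-up over a brute-force scan comes from.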
sortBy() function in Spark easily. Figure 2 demonstrates the whole process of seed update.
However, after each seed update, the key with the largest value is popped and appended to the ordered list. The
process then moves to the next round of seed update and continues until the seed is empty. Since the seed RDD
changes at every step and the termination condition depends on the seed itself, this loop cannot be parallelized.
Hence, we keep this part unchanged.
Theoretically, the time complexity is reduced from O(N ∗ N) to O(N ∗ Np).
4 Experiment Results
4.1 Dataset description
In this part, we test our program on several clustering benchmark datasets [3]. These datasets cover
different shapes and dimensions of node distribution and therefore show how our program performs in some
extreme situations. To compare the speed of Spark-based OPTICS and vanilla OPTICS, we chose four
clustering datasets of various sizes. Table 1 lists the statistics of each dataset.
Clustering Dataset
Dataset            Nodes      Clusters   Dimensions
A1 set [3]         3,000      20         2
S1 set [3]         5,000      15         2
t4.8k [3]          8,000      6          2
Birch-set2 [3]     100,000    100        2
For each dataset, the number of partitions is set to 32, 48, and 64 to test the relation between partition number
and speed.
Figure 4: Visualization of OPTICS Cluster on S1 dataset
4.4 Result Analysis
Figure 7 is a snapshot of CPU and RAM usage while running Spark OPTICS on the S1 dataset with partition =
32. The average CPU usage is stable at around 50%, which means that roughly half of the CPU resources are idle
during the run.
Figure 7: CPU and memory usage when running on S1 set with partition = 32
Figure 8 shows the percentage of time spent on computing the reachability matrix, on seed update, and on
others. “Others” includes the time spent on library imports, data loading, building the ball tree, and computing
core distances. For all datasets, the program spends most of its time computing the reachability matrix, and this
share tends to grow as the dataset size increases. In practice, the map function for the reachability
matrix consists of two steps: the first creates a 1 ∗ n vector to store one row of the matrix; the second is
a for loop that fills the vector with the maximum of the core distance of the given core point and the distance
between each input point and that core point, which is obtained by calling a distance function. Since all points are
core points in OPTICS, the distance calculation and comparison occur exactly N ∗ N times. This may
explain why computing the reachability matrix takes most of the time.
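The two-step row computation described above can be sketched in plain Python as follows. This is a hedged illustration rather than the project's Spark code: `reachability_row` stands in for the function passed to the map step, `reachability_matrix` applies it sequentially to every point, and both names, the Euclidean distance, and the toy data are assumptions.

```python
import math

def reachability_row(core_point, core_dist, data):
    """Build one row of the reachability matrix for a given core point."""
    # Step 1: allocate a 1 x n vector for the row.
    row = [0.0] * len(data)
    # Step 2: fill it with max(core distance, distance to each input point).
    for j, x in enumerate(data):
        row[j] = max(core_dist, math.dist(core_point, x))
    return row

def reachability_matrix(data, core_dists):
    """Apply the row function to every core point (sequential stand-in for the map step)."""
    return [reachability_row(p, cd, data) for p, cd in zip(data, core_dists)]

# Toy data (assumed): three collinear points, each 5 units from its nearest neighbor,
# so with MinPts = 2 (the point itself included) every core distance is 5.
data = [(0, 0), (3, 4), (6, 8)]
core_dists = [5.0, 5.0, 5.0]
matrix = reachability_matrix(data, core_dists)
# The inner loop body runs len(data) * len(data) times in total.
calls = len(data) * len(data)
```

Because each row depends only on one core point and the shared input data, rows can be computed independently of one another, while the total number of distance evaluations stays at N ∗ N.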
5 Future Works
In our project, we parallelized OPTICS and compared the computational cost of building the reachability distance
matrix, of seed update, and of the remaining steps, including computing core distances and constructing the ball
tree. Only two-dimensional datasets were used for comparison in our experiments. As a further improvement,
combining the reachability distance matrix with the ball tree structure could make the computation more
efficient. Moreover, for better performance on Spark, memory usage could be optimized, for example by reducing
the number of collects. Finally, beyond two-dimensional data, experiments on higher-dimensional data could be
run to test accuracy.
Figure 8: Percentage of time spent on the reachability matrix, seed update, and others for each dataset
References
[1] Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering
clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226-231).
[2] Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify
the clustering structure. ACM SIGMOD Record, 28(2), 49-60.
[3] Fränti, P., & Sieranoja, S. (2018, December). K-means properties on six clustering benchmark datasets.
Applied Intelligence, 48(12), 4743-4759.
[4] Reiss, A., & Stricker, D. (2012). Introducing a new benchmarked dataset for activity monitoring. In The 16th
IEEE International Symposium on Wearable Computers (ISWC).
[5] Liu, T., Moore, A., & Gray, A. (2006). New algorithms for efficient high-dimensional nonparametric
classification. Journal of Machine Learning Research, 7, 1135-1158.