Spark-Optimized OPTICS Algorithm

This document describes the parallel implementation of the OPTICS clustering algorithm using Spark. The original OPTICS algorithm is sequential, but the authors implemented it using Spark to optimize performance on large datasets. They utilized Spark RDDs, partitioning, and parallelization designs to derive local results from each partition and combine them into the global result. A ball tree index structure was also used to improve search efficiency. The authors compared the Spark implementation to the original OPTICS algorithm to demonstrate reduced time and space consumption for clustering large datasets using Spark.

Parallel Implementation of OPTICS Algorithm Based on Spark

QIU Yaowen, 20784389

WANG Wanying, 20788725
GUO Yuchen, 20793419
LONG Yuepeng, 20806228

May 14, 2022

1 Introduction and Motivation

Cluster analysis is an essential technique in data mining. Clustering is an effective way to process data and gives users a deeper, visual understanding of a dataset. It can also serve as a preprocessing step for other algorithms, and it is widely used in anomaly detection and other fields.
A popular density-based method is DBSCAN (Density-Based Spatial Clustering of Applications with Noise) [1]. However, the result of DBSCAN is very sensitive to its input parameters, which makes parameter setting difficult without expert knowledge. To reduce this sensitivity, the OPTICS (Ordering Points To Identify the Clustering Structure) algorithm [2] was proposed, which is friendlier to general users.
In the big data era, huge amounts of data flow into databases ceaselessly. OPTICS still scales poorly to large datasets because of its high time and space complexity. With the progress of cloud and parallel computing, Spark may provide an effective way to address this problem.
In this project, we implemented a Spark version of the OPTICS algorithm. The original OPTICS algorithm is executed sequentially. We used Spark RDDs, partitioning, and parallelization to build a transformation pipeline that derives a local result from each partition and combines the local results into the global result. We compared performance on our selected datasets and evaluation metrics. Furthermore, we show how the Spark version of OPTICS speeds up clustering by reducing time and space consumption.

2 OPTICS Algorithm Description

OPTICS, or Ordering points to identify the clustering structure, is a density-based clustering algorithm that
improves the DBSCAN algorithm by reducing the sensitivity of input parameters. Before describing OPTICS,
it is necessary to give definitions of terms used in OPTICS.

ϵ-neighborhood of a point p
The ϵ-neighborhood of a point p is the set of points within a radius ϵ of p (including p itself).
Core point
If the number of points inside the ϵ-neighborhood of p is ≥ MinPts, then p is a core point. In Fig 1, p is a core point.
Border point
A point q is a border point if q falls within the ϵ-neighborhood of a core point but is not itself a core point. In Fig 1, q is a border point.
Noise point
A point r is a noise point if it is neither a core point nor a border point. In Fig 1, s and r are noise points.
Direct density-reachability
A point q is directly density-reachable from a point p if p is a core point and q is in the ϵ-neighborhood of p. In Fig 1, q is directly density-reachable from p.
Density-reachability
A point q is density-reachable from a point p with respect to ϵ if there is a chain of points p_1, p_2, ..., p_n with p_1 = p and p_n = q such that p_(i+1) is directly density-reachable from p_i. In Fig 1, s is density-reachable from p.

Figure 1: An illustration of core point, border point, noise point, direct density-reachability, and density-reachability when MinPts = 4. p is a core point, q is a border point, s and r are noise points. q is directly density-reachable from p. s is density-reachable from p.
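The definitions above can be tried out with a small brute-force sketch. The function names `region_query` and `classify` and the six sample points are illustrative assumptions, not part of the project code; the neighborhood includes the point itself, as in the definition of the ϵ-neighborhood.

```python
from math import dist

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], the point itself included."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def classify(points, eps, min_pts):
    """Label every point as 'core', 'border' or 'noise' (brute force, O(N^2))."""
    neighborhoods = [region_query(points, i, eps) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighborhoods) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):   # inside the neighborhood of a core point
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical sample data: only (1, 1) has four points within distance 1.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
labels = classify(points, eps=1.0, min_pts=4)
```

With these sample points, (0, 0) is labeled noise even though it has two neighbors, because none of its neighbors is a core point.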

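Core distance
For a point p and given ϵ and MinPts, the core distance is defined as the minimum radius that makes p a core point. If p is not a core point, the core distance is undefined.

$$
cd(p) =
\begin{cases}
\text{undefined} & |N_\epsilon(p)| < MinPts \\
d\big(p, N_\epsilon^{MinPts}(p)\big) & |N_\epsilon(p)| \ge MinPts
\end{cases}
\tag{1}
$$

Here N_ϵ(p) is the ϵ-neighborhood of p and N_ϵ^MinPts(p) is the MinPts-th nearest neighbor of p in it.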

Reachability-distance
For points p and q and given ϵ and MinPts, the reachability-distance of q with respect to p is defined as the maximum of the core distance of p and the distance between p and q. If p is not a core point, the reachability-distance is undefined.

$$
rd(q, p) =
\begin{cases}
\text{undefined} & |N_\epsilon(p)| < MinPts \\
\max\{cd(p),\, d(p, q)\} & |N_\epsilon(p)| \ge MinPts
\end{cases}
\tag{2}
$$

Note that every point in the dataset has these two properties.
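Equations (1) and (2) translate almost directly into code. The following is a minimal brute-force sketch; the function names and the sample points are assumptions made for illustration, and a point is counted as a member of its own neighborhood, following the definition of the ϵ-neighborhood.

```python
from math import dist, inf

def core_distance(points, p, eps, min_pts):
    """Eq. (1): distance from points[p] to its MinPts-th nearest neighbor
    inside the eps-neighborhood, or None (undefined) if p is not a core point."""
    within = sorted(d for d in (dist(points[p], q) for q in points) if d <= eps)
    if len(within) < min_pts:
        return None
    return within[min_pts - 1]

def reachability_distance(points, q, p, eps, min_pts):
    """Eq. (2): max of cd(p) and d(p, q), or None if p is not a core point."""
    cd = core_distance(points, p, eps, min_pts)
    if cd is None:
        return None
    return max(cd, dist(points[p], points[q]))

# Hypothetical sample data on a line.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
cd0 = core_distance(points, 0, eps=inf, min_pts=3)                 # 2.0
rd_near = reachability_distance(points, 1, 0, eps=inf, min_pts=3)  # 2.0
rd_far = reachability_distance(points, 3, 0, eps=inf, min_pts=3)   # 10.0
```

A nearby point inherits the core distance (2.0) as its reachability-distance, while a distant point keeps its actual distance (10.0).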
The objective of OPTICS is to output an ordered list of the points in the dataset, each with its computed core distance and reachability distance. The input of OPTICS is a dataset of points together with the parameters ϵ and MinPts, where ϵ defaults to infinity. First, the algorithm initializes the set of core points Ω = ∅. Second, it traverses each point in the dataset and adds all core points to Ω. Third, OPTICS randomly picks an unprocessed object o in Ω, marks it as processed, and appends it to the ordered list. Then, it computes the reachability-distance of each unprocessed point in the ϵ-neighborhood of o and inserts these points into a set of seeds, Seeds, keyed by reachability-distance. Next, OPTICS picks the seed s with the smallest reachability-distance from Seeds, marks it as processed, and appends it to the ordered list. If s is a core point, OPTICS adds all unprocessed neighbors of s to Seeds and recomputes their reachability-distances. The objects in Ω and Seeds are processed repeatedly until both are empty.
Algorithm 1 shows the pseudocode of OPTICS and Algorithm 2 shows the pseudocode of the update process.
Algorithm 1: OPTICS
Input: Dataset D, ϵ, MinPts
Output: OrderedList
// Find all core points in D
CorePoints = CorePointsQuery(D, ϵ, MinPts);
// Compute the core distance of each core point
CoreDists = ComputeCoreDists(D, ϵ, MinPts);
foreach unprocessed point p in CorePoints do
    // Find all ϵ-neighbors of p, including p
    N = RegionQuery(p, ϵ);
    mark p as processed;
    append p to the OrderedList;
    if |N| ⩾ MinPts then
        Seeds = empty priority queue;
        Update(N, p, Seeds, CoreDists);
        foreach next q in Seeds (smallest reachability distance first) do
            N' = RegionQuery(q, ϵ);
            mark q as processed;
            append q to the OrderedList;
            if |N'| ⩾ MinPts then
                Update(N', q, Seeds, CoreDists);
            end
        end
    end
end
return OrderedList
Algorithm 2: Update
Input: N, p, Seeds, CoreDists
// Look up the core distance of p
CoreDist = CoreDists[p];
foreach o in N do
    if o is not processed then
        // Compute the reachability-distance of o with respect to p
        ReachDist = max(CoreDist, dist(p, o));
        if reachability distance of o == NULL then
            reachability distance of o = ReachDist;
            Insert (o, ReachDist) into Seeds;
        else if ReachDist < reachability distance of o then
            reachability distance of o = ReachDist;
            Update the entry of o in Seeds to (o, ReachDist);
        end
    end
end
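Algorithms 1 and 2 can be sketched as a short sequential Python program. This is an illustrative single-machine version, not the Spark code of this project. It makes several assumptions: brute-force region queries instead of a ball tree, a loop over all points rather than a precomputed core-point set (non-core points are appended to the ordering without expanding seeds), and a `heapq` priority queue where an improved reachability distance is pushed as a new entry and outdated entries are skipped when popped.

```python
import heapq
from math import dist, inf

def optics(points, min_pts, eps=inf):
    """Return (ordered list of indices, reachability distance per point)."""
    n = len(points)
    reach = [None] * n          # None plays the role of NULL / undefined
    processed = [False] * n
    ordered = []

    def region_query(i):
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    def core_dist(i, neighbors):
        if len(neighbors) < min_pts:
            return None
        return sorted(dist(points[i], points[j]) for j in neighbors)[min_pts - 1]

    def update(neighbors, p, cd, seeds):
        for o in neighbors:
            if processed[o]:
                continue
            new_reach = max(cd, dist(points[p], points[o]))
            if reach[o] is None or new_reach < reach[o]:
                reach[o] = new_reach
                heapq.heappush(seeds, (new_reach, o))   # older entries become stale

    for p in range(n):
        if processed[p]:
            continue
        neighbors = region_query(p)
        processed[p] = True
        ordered.append(p)
        cd = core_dist(p, neighbors)
        if cd is None:
            continue
        seeds = []
        update(neighbors, p, cd, seeds)
        while seeds:
            _, q = heapq.heappop(seeds)     # smallest reachability distance first
            if processed[q]:
                continue                    # stale entry, already handled
            q_neighbors = region_query(q)
            processed[q] = True
            ordered.append(q)
            q_cd = core_dist(q, q_neighbors)
            if q_cd is not None:
                update(q_neighbors, q, q_cd, seeds)
    return ordered, reach

# Hypothetical sample data: two well-separated groups of three points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ordered, reach = optics(points, min_pts=2)
```

In the resulting ordering, the first point of the second group carries a large reachability distance, which is the jump that separates the two clusters in a reachability plot.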

3 Spark Implementation and Parallelization
3.1 Ball Tree
Structuring the data improves the efficiency of our algorithm. In this project, we use a ball tree [5] to speed up the computation. A ball tree is a binary tree in which every node defines a hypersphere (ball) containing a subset of the points. Every node in the ball tree partitions its data into two disjoint sets. If two hyperspheres intersect, each point is assigned to a ball according to its distance to the ball's center. The following algorithm describes how to construct a ball tree.
Algorithm 3: Construct balltree
Input: D, an array of data points.
Output: B, the root of a constructed ball tree.
if a single point remains then
    create a leaf B containing the single point in D
    return B
else
    let c be the dimension of greatest spread
    let p be the central point selected considering c
    let L, R be the sets of points lying to the left and right of the median along dimension c
    create B with two children:
        [Link] := p
        B.child1 := Construct balltree(L)
        B.child2 := Construct balltree(R)
        let [Link] be maximum distance from p among children
    return B
end
During the search process in OPTICS, the ball tree exploits the following property: for any query point outside a ball, the distance to any point inside the ball is greater than or equal to the distance between the query point and the ball's surface. With this bound, the search can skip entire balls that cannot contain a relevant neighbor, which speeds up neighbor queries when the clustering algorithm runs on the ball tree.
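The construction and the pruning property can be sketched in a few lines of Python. This is an illustrative sketch, not the project's implementation: the field names `pivot` and `radius`, the choice of the median point as the center, and the radius-bounded query `range_query` are assumptions made here.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class Ball:
    pivot: Point                       # assumed name for the center of the ball
    radius: float                      # max distance from pivot to any point inside
    point: Optional[Point] = None      # set only on leaves
    child1: Optional["Ball"] = None
    child2: Optional["Ball"] = None

def build(points):
    """Recursively split along the dimension of greatest spread at the median."""
    if len(points) == 1:
        return Ball(pivot=points[0], radius=0.0, point=points[0])
    dims = range(len(points[0]))
    c = max(dims, key=lambda k: max(p[k] for p in points) - min(p[k] for p in points))
    ordered = sorted(points, key=lambda p: p[c])
    mid = len(ordered) // 2
    pivot = ordered[mid]
    radius = max(dist(pivot, p) for p in ordered)
    return Ball(pivot, radius, None, build(ordered[:mid]), build(ordered[mid:]))

def lower_bound(ball, q):
    """No point inside the ball can be closer to q than this value."""
    return max(0.0, dist(q, ball.pivot) - ball.radius)

def range_query(ball, q, eps, out=None):
    """Collect all points within eps of q, skipping balls that are too far away."""
    out = [] if out is None else out
    if lower_bound(ball, q) > eps:
        return out                     # the whole subtree is pruned
    if ball.point is not None:
        out.append(ball.point)         # leaf: the bound equals the exact distance
        return out
    range_query(ball.child1, q, eps, out)
    range_query(ball.child2, q, eps, out)
    return out

# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
tree = build(points)
near_origin = sorted(range_query(tree, (0.0, 0.0), 1.0))
```

The query returns the same points as a brute-force scan, but any ball whose lower bound exceeds ϵ is discarded without visiting its points.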

3.2 Core points, Core distance, and Reachability Distance

Compared to DBSCAN, computing the core points is simple here. Since ϵ defaults to infinity, all points in the data are core points, so it suffices to assign the list of indices of all points to the variable that holds the core points.
The core distance of a point is defined as the distance between the point and its MinPts-th nearest neighbor. With the ball tree structure, it is easy to find the MinPts-th nearest neighbor of each point. Furthermore, the list of the MinPts nearest neighbors of a point can be built during the search. The total complexity is O((N/p) · log(N/p)), where N is the number of points and p is the number of workers.
The reachability distances form an N × N matrix, where each element rd(q, p) is defined as max{cd(p), d(p, q)} when ϵ is infinity. To compute it in parallel, the point indices are partitioned, and within each partition every index is mapped to one row of the matrix. Since an RDD cannot reference variables in other RDDs, the core distances are collected into a Python dictionary on the driver and broadcast to every worker. The total complexity is O((N/p) · N).
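The partition-and-map pattern described above can be imitated in plain Python, so that it runs without a Spark cluster. In this sketch, a list of index chunks stands in for the partitions of an RDD and an ordinary dictionary stands in for the broadcast variable; `split_indices`, `reachability_rows` and the sample points are hypothetical names and data, and core distances are computed by brute force rather than with the ball tree.

```python
from math import dist

def split_indices(n, num_partitions):
    """Split the indices 0..n-1 into contiguous chunks, one per partition."""
    size = max(1, -(-n // num_partitions))          # ceiling division
    return [list(range(s, min(s + size, n))) for s in range(0, n, size)]

def reachability_rows(partition, points, core_dists):
    """Map step for one partition: row p holds rd(q, p) for every q."""
    return [(p, [max(core_dists[p], dist(points[p], q)) for q in points])
            for p in partition]

points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (0.0, 1.0)]
min_pts = 2

# With eps = infinity every point is a core point; its core distance is the
# distance to the MinPts-th nearest point, counting the point itself.
# This dictionary plays the role of the value collected on the driver and broadcast.
core_dists = {i: sorted(dist(p, q) for q in points)[min_pts - 1]
              for i, p in enumerate(points)}

partitions = split_indices(len(points), num_partitions=2)
rows = dict(row
            for part in partitions
            for row in reachability_rows(part, points, core_dists))
```

Each partition only needs the point data and the shared core distances, so the rows can be computed independently, which is what makes the N × N matrix amenable to a map over partitions.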

3.3 Update Seed

The update part is divided into two stages. The first is the seed update, which can be done in parallel; the second is seed pruning, which must be done sequentially. Recall from the pseudocode of Update that each (key, value) pair is updated if the reachability distance computed from the current point is smaller than the stored value. The modification of one key-value pair does not influence the others, which allows the update of Seed to be done in parallel. Afterwards, the seeds are sorted by value in ascending order, which can be implemented easily with the sortBy() function in Spark. Figure 2 demonstrates the whole process of the seed update.

Figure 2: The process of updating Seed

After the seed update, however, the key with the smallest value, i.e. the smallest reachability distance as in Algorithm 1, is popped and appended to the ordered list. The next round of seed update then starts, and this repeats until the seed set is empty. Since the seed RDD changes dynamically and the termination condition depends on the seed set itself, this loop cannot be parallelized. Hence, we keep this part unchanged. Theoretically, the time complexity is reduced from O(N · N) to O(N · N/p).
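The two stages can be sketched in plain Python as follows. The sketch follows the rule from Section 2 that the seed with the smallest reachability-distance is processed next; the function names and sample values are hypothetical, and dictionaries stand in for the key-value RDD.

```python
def update_seeds(seeds, candidates):
    """Stage 1: per key, keep the smaller of the stored and the new value.
    Every key is handled independently, so this step can run in parallel."""
    merged = dict(seeds)
    for key, value in candidates.items():
        if key not in merged or value < merged[key]:
            merged[key] = value
    return merged

def pop_next(seeds):
    """Stage 2: sort by value and remove the seed with the smallest value.
    This step depends on the current state of the seed set, so it is sequential."""
    ordered = sorted(seeds.items(), key=lambda kv: kv[1])
    key, value = ordered[0]
    return key, value, dict(ordered[1:])

seeds = {"a": 3.0, "b": 5.0}
candidates = {"b": 2.0, "c": 4.0, "a": 9.0}   # new reachability distances
merged = update_seeds(seeds, candidates)
key, value, remaining = pop_next(merged)
```

Here the stored value of `a` is kept because the candidate is larger, `b` is lowered, `c` is inserted, and `b` is popped first.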

4 Experiment Results
4.1 Dataset description
In this part, we test our program on several clustering benchmark datasets [3]. These datasets cover different shapes and dimensions of point distributions, which shows how our program performs in some extreme situations. To compare the speed of Spark-based OPTICS and vanilla OPTICS, we chose four clustering datasets of various sizes. Table 1 lists the statistics of each dataset.

Clustering Dataset
Dataset           Nodes      Clusters   Dimensions
A1 set [3]        3,000      20         2
S1 set [3]        5,000      15         2
t4.8k [3]         8,000      6          2
Birch-set2 [3]    100,000    100        2

Table 1: Some clustering datasets tested in the project

4.2 Experiment settings

OPTICS still requires MinPts, a parameter that has a significant impact on the clustering result and is time-consuming to tune for each dataset. In addition, this project focuses on the performance of the Spark-based algorithm rather than on the vanilla one. Hence, we fix MinPts at 15, which is not the best value for every dataset but is still good enough.
We built a Spark cluster on the Azure Databricks platform. Due to the quota limitation, at most 3 workers are allowed. The driver and all workers run on Standard DS3 v2 machines, each with 14.00 GB RAM and a 4-core Intel Xeon E5-2673 v3 CPU.
For each dataset, the number of partitions is set to 32, 48, and 64 to test the relation between the number of partitions and speed.

4.3 Experiment Result

Table 2 shows the running time of Spark-based OPTICS for each number of partitions on each dataset. On every dataset, the Spark-based version is faster than the run on a single Standard DS3 v2 machine. The number of partitions also has an impact on the speed. On the A1 dataset, 48 partitions are faster than both 32 and 64 partitions; on the S1 dataset, 48 and 64 partitions take the same time and both are faster than 32 partitions. A possible reason why more partitions do not always help is that assigning data to each partition costs more time on the driver program. The running times on the t4.8k dataset are close for all partition numbers, which is not the case on Birch-set2: with 64 partitions, the run is about half an hour faster than with 32 partitions. On this dataset, most of the time is spent on the calculation of the reachability matrix and the update, while the fixed cost of assignment by the driver program is a small percentage of the total running time. Since Birch-set2 has 100,000 nodes, it is slow to run on a single machine, so we stopped the execution once the running time exceeded 2 hours.
Figures 3, 4, 5 and 6 show the visualization results of OPTICS using MinPts = 15 with the threshold selection principle. MinPts = 15 works well on Birch-set2 but leads to many unexpected outliers on the other datasets.

                  Time (Ours with partitions)
Dataset           32         48         64         Time (Single machine)
A1 set [3]        51s        45s        47s        1m 32s
S1 set [3]        1m 30s     1m 26s     1m 26s     2m 51s
t4.8k [3]         3m 15s     3m 14s     3m 15s     5m 47s
Birch-set2 [3]    1h 47m     1h 23m     1h 16m     >2h

Table 2: Running Time on each dataset
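The speedup implied by Table 2 can be computed directly from the reported times. The short script below is only a convenience calculation; the helper `seconds` is a hypothetical name, and Birch-set2 is left out because its single-machine run was stopped after 2 hours, so only a lower bound is known.

```python
def seconds(text):
    """Parse a duration such as '1m 32s' or '51s' into seconds."""
    units = {"h": 3600, "m": 60, "s": 1}
    return sum(int(token[:-1]) * units[token[-1]] for token in text.split())

# (times with 32, 48, 64 partitions, time on a single machine), from Table 2
table2 = {
    "A1": (["51s", "45s", "47s"], "1m 32s"),
    "S1": (["1m 30s", "1m 26s", "1m 26s"], "2m 51s"),
    "t4.8k": (["3m 15s", "3m 14s", "3m 15s"], "5m 47s"),
}

speedup = {name: [round(seconds(single) / seconds(t), 2) for t in spark]
           for name, (spark, single) in table2.items()}
```

On these three datasets, the speedup over the single machine lies between roughly 1.8 and 2.0 for every partition setting.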

Figure 3: Visualization of OPTICS Cluster on A1 dataset

Figure 4: Visualization of OPTICS Cluster on S1 dataset

Figure 5: Visualization of OPTICS Cluster on t4.8k dataset

Figure 6: Visualization of OPTICS Cluster on Birch-set2 dataset

4.4 Result Analysis
Figure 7 is a snapshot of CPU and RAM usage when running Spark OPTICS on the S1 dataset with 32 partitions. The average CPU usage is stable at around 50%, which means that around half of the CPU resources are idle during the run.

Figure 7: CPU and memory usage when running on S1 set with partition = 32

Figure 8 shows the percentage of time spent on the computation of the reachability matrix, on the seed update, and on others. "Others" includes the time spent on library import, data loading, building the ball tree, and computing the core distances. For all datasets, the program spends most of its time on computing the reachability matrix, and this share tends to increase with the size of the dataset. In practice, the map function of the reachability matrix consists of two steps: the first creates a 1 × n vector to store one row of the matrix; the second is a for loop that fills the vector with the maximum of the core distance of the given core point and the distance between each input point and the given core point, which is obtained by calling a distance function. Since all points are core points in OPTICS when ϵ is infinity, the distance calculation and comparison occur exactly N × N times. This may explain why the computation of the reachability matrix takes most of the time.
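To give a sense of scale, the N × N count can be evaluated for the dataset sizes listed in Table 1. This is plain arithmetic on the reported node counts, not a measurement.

```python
# Node counts from Table 1.
sizes = {"A1": 3_000, "S1": 5_000, "t4.8k": 8_000, "Birch-set2": 100_000}

# Number of distance evaluations and comparisons when every point is a core point.
pairs = {name: n * n for name, n in sizes.items()}

# Work relative to the smallest dataset.
relative = {name: round(count / pairs["A1"], 1) for name, count in pairs.items()}
```

Birch-set2 requires 10^10 distance evaluations, about 1,111 times as many as A1, which is consistent with the reachability matrix dominating the running time on larger datasets.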

5 Future Works
In our project, we parallelized OPTICS and compared the computational cost of building the reachability distance matrix, of the seed update, and of other parts of the running time, including computing the core distances and constructing the ball tree. Only two-dimensional datasets were used in our experiments. As a further improvement, the reachability distance matrix could be combined with the ball tree structure to make the computation more efficient. Moreover, for better performance on Spark, memory usage could be optimized, for example by reducing the number of collect operations. Finally, beyond two-dimensional data, the accuracy on higher-dimensional data remains to be tested.

Figure 8: Percentage of time cost on reachability matrix, update function on each dataset

References

[1] Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996, August). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, No. 34, pp. 226-231).

[2] Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49-60.

[3] Fränti, P., & Sieranoja, S. (2018). K-means properties on six clustering benchmark datasets. Applied Intelligence, 48(12), 4743-4759.

[4] Reiss, A., & Stricker, D. (2012). Introducing a new benchmarked dataset for activity monitoring. The 16th IEEE International Symposium on Wearable Computers (ISWC).

[5] Liu, T., Moore, A., & Gray, A. (2006). New algorithms for efficient high-dimensional nonparametric classification. Journal of Machine Learning Research, 7, 1135-1158.
