Designing a Machine Learning algorithm for Apache Spark
Marco Gaido
Software Engineer and Apache Spark contributor
2017-10-17
© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
Apache Spark and Machine Learning
What is Apache Spark?
 A fast and general-purpose cluster computing system
– Fast because it supports in-memory computing
 It was created with Machine Learning algorithms in mind
– They are very slow on MapReduce
– They are iterative
 Easy to use
– Users can implement their business logic with a high-level API
– Several APIs: Scala, Java, Python, SQL, R
 Four main modules are built on top of it:
– Spark Streaming
– Spark SQL
– MLlib
– GraphX
MLlib
 A complete ML library, which aims to cover all ML phases
– Featurization
– Training
– Evaluation
– Persistence
– Prediction
 High-level API
 Great performance
Agenda
Apache Spark and Machine Learning
Implementing an algorithm on Apache Spark
How to write an ML algorithm in MLlib?
 Spark is open source: anybody can contribute or create their own version
 It looks as easy as rewriting the implementation using RDDs or DataFrames
 Trivial implementations of many algorithms can be written in a few lines of code
 Yet many well-known algorithms are still missing…
WHY?
DBSCAN
 DBSCAN is a widespread density-based clustering algorithm
– Two inputs: a radius (ε) and a number of points (minPts), used to decide whether an area is dense or sparse
 Naïve implementation:
– Find the ε (eps) neighbors of a point p
– If there are at least minPts of them
• If p already belongs to a cluster, then assign the neighbors to the same cluster
• Otherwise, create a new cluster containing p and its neighbors
– Repeat until all points have been processed
 Computational complexity: O(N²) in either computation or memory
 A parallel (and reliable) implementation is not trivial at all
[Figure: example with points labelled A, B and C]
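The naïve procedure can be sketched on a single machine as follows. This is an illustrative Python sketch, not code from the talk: it follows the textbook formulation (a point counts as its own neighbor, points in sparse areas are marked as noise, clusters grow outwards from dense points), which is slightly more detailed than the steps listed above. The names `region_query`, `naive_dbscan` and `NOISE` are made up for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that do not belong to any cluster


def region_query(points, p, eps):
    """Brute-force eps-neighborhood of point p: one distance per point, O(N)."""
    return [q for q in range(len(points)) if dist(points[p], points[q]) <= eps]


def naive_dbscan(points, eps, min_pts):
    """Return one cluster label per point (NOISE for sparse points)."""
    labels = [None] * len(points)  # None means "not processed yet"
    cluster = -1
    for p in range(len(points)):
        if labels[p] is not None:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE  # sparse area (may become a border point later)
            continue
        cluster += 1  # p is in a dense area: start a new cluster
        labels[p] = cluster
        queue = [q for q in neighbors if q != p]
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point reached from a dense point
            if labels[q] is not None:
                continue  # already assigned to a cluster
            labels[q] = cluster
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:
                queue.extend(q_neighbors)  # q is dense too: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(naive_dbscan(points, eps=1.5, min_pts=3))
```

Every point triggers one `region_query`, and each query scans all N points, which is where the O(N²) cost comes from. The expansion step also depends on labels written while processing other points, which hints at why a parallel version is hard to get right.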
Agenda
Apache Spark and Machine Learning
Implementing an algorithm on Apache Spark
Designing an algorithm for Apache Spark
Key points
 Shared state should be small (or, better, absent)
– It has to be kept in memory on all the executors
 The target computational complexity is O(N/W), where W is the number of executors
– The algorithm can then scale out simply by adding executors
– O(N²) is not suitable for Big Data (1M input records mean 1T operations; 1T mean 1Y)
 Iterating multiple times over the same dataset is fine
– The dataset can be cached in memory
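The arithmetic behind these points can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the function names and the executor count are made up for the example.

```python
def linear_work_per_executor(n, executors):
    """Operations each executor performs for an O(N/W) algorithm."""
    return n / executors


def quadratic_work(n):
    """Total operations for an O(N^2) algorithm, e.g. all pairwise distances."""
    return n * n


MILLION = 10**6
TRILLION = 10**12
YOTTA = 10**24

# A quadratic algorithm turns 1M inputs into 1T operations, and 1T into 1Y.
print(quadratic_work(MILLION) == TRILLION)
print(quadratic_work(TRILLION) == YOTTA)

# With a linear algorithm, 100 executors each handle 1% of the input.
print(linear_work_per_executor(MILLION, executors=100))
```

Adding executors divides the linear cost by W, but it cannot compensate for a quadratic term: going from 1M to 1T inputs multiplies the quadratic work by 10¹², far more than any realistic cluster can absorb.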
An example: Silhouette
 SPARK-14516: introduced in the next Apache Spark release (2.3.0)
 Measure of the quality of a clustering result
 Implementation of Silhouette algorithm using squared Euclidean distance
 References:
– Design document: https://goo.gl/7cJV64
– Code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala
Silhouette
 Definition
– For each datum i, compute the average dissimilarity to all the other data in the same cluster (a(i))
– Compute the average dissimilarity of i to each of the other clusters and pick the smallest one (b(i))
– Then compute the Silhouette coefficient for i: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
– Compute the average of the Silhouette coefficients over all points
 Computational complexity
– O(N²): for each point, we need to compute its distance to all the other points
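The definition translates almost literally into the following single-machine Python sketch. It is an illustration rather than the talk's code: it uses the squared Euclidean distance as the dissimilarity, assumes at least two clusters, and assigns a coefficient of 0 to points in single-point clusters (a common convention, assumed here). The function names are made up for the example.

```python
def sq_dist(x, y):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def naive_silhouette(points, labels):
    """Average Silhouette coefficient, computed from all pairwise distances."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # assumed convention: s(i) = 0 for single-point clusters
        # a(i): average dissimilarity to the *other* points of the same cluster
        # (the point's distance to itself is 0, so only the divisor changes).
        a = sum(sq_dist(point, q) for q in own) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster.
        b = min(
            sum(sq_dist(point, q) for q in other) / len(other)
            for other_label, other in clusters.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
print(naive_silhouette(points, labels))
```

The two nested loops over the data (one explicit, one inside the sums) are the source of the O(N²) cost.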
Squared Euclidean Silhouette
 The problem is computing the average distance of a point X to a cluster C:
$$\frac{\sum_{i=1}^{N_C} \sum_{j=1}^{D} (x_j - c_{ij})^2}{N_C}$$
… after some old but gold algebra …
$$\frac{N_C \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j}{N_C}$$
where $\xi_X$ is a constant that can be precomputed for each point X, and $\Psi_C$, $Y_C$ and $N_C$ are constants ($Y_C$ is actually a vector) that can be precomputed for each cluster.
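The skipped algebra can be reconstructed by expanding the square. The definitions of $\xi_X$, $\Psi_C$ and $Y_C$ below are not spelled out on the slide; they are inferred here as the natural choice that makes the two expressions equal, with $c_{ij}$ denoting the j-th coordinate of the i-th point of cluster C.

```latex
\begin{aligned}
\sum_{i=1}^{N_C} \sum_{j=1}^{D} (x_j - c_{ij})^2
  &= \sum_{i=1}^{N_C} \sum_{j=1}^{D} x_j^2
   + \sum_{i=1}^{N_C} \sum_{j=1}^{D} c_{ij}^2
   - 2 \sum_{j=1}^{D} x_j \sum_{i=1}^{N_C} c_{ij} \\
  &= N_C \, \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j
\end{aligned}
\qquad \text{with} \qquad
\xi_X = \sum_{j=1}^{D} x_j^2, \quad
\Psi_C = \sum_{i=1}^{N_C} \sum_{j=1}^{D} c_{ij}^2, \quad
Y_{C_j} = \sum_{i=1}^{N_C} c_{ij}
```

Dividing by $N_C$ gives the average distance. Only $\xi_X$ and the dot product with $Y_C$ depend on the point X, so no point-to-point distance is ever needed.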
Squared Euclidean Silhouette (2)
$$\frac{N_C \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j}{N_C}$$
 With the previous equation, each point's Silhouette coefficient can be computed without computing the distance to all the other points
– We precompute the cluster values (i.e. the state)
– We use the above formula for each point, for all the clusters
– We compute the average of the Silhouette coefficients
 We can assume the number of clusters is rather small
– Then, our shared state is small
 The overall complexity is O(N C D / W)
– We can assume that C and D are much smaller than N, so this is effectively O(N/W) and the computation scales out with the number of executors
[Figure: three clusters C1, C2 and C3, each annotated with its precomputed values Ψ_C, Y_C and N_C]
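The two-pass scheme can be sketched on a single machine as follows. This is an illustrative Python sketch, not the Spark implementation: the function names are made up, and it assumes ξ_X is the squared norm of X, Ψ_C the sum of the squared norms of the points in C, and Y_C their coordinate-wise sum (the definitions obtained by expanding the squared distance). It also assumes at least two clusters and gives single-point clusters a coefficient of 0.

```python
from collections import defaultdict


def cluster_stats(points, labels):
    """Pass 1: build the small shared state (N_C, Psi_C, Y_C) per cluster."""
    dim = len(points[0])
    stats = defaultdict(lambda: {"n": 0, "psi": 0.0, "y": [0.0] * dim})
    for x, label in zip(points, labels):
        s = stats[label]
        s["n"] += 1
        s["psi"] += sum(v * v for v in x)
        s["y"] = [acc + v for acc, v in zip(s["y"], x)]
    return dict(stats)


def avg_sq_dist_to_cluster(x, xi, s):
    """(N_C * xi_X + Psi_C - 2 * Y_C . x) / N_C, using only the cluster state."""
    dot = sum(y * v for y, v in zip(s["y"], x))
    return (s["n"] * xi + s["psi"] - 2.0 * dot) / s["n"]


def squared_euclidean_silhouette(points, labels):
    """Pass 2: one coefficient per point, O(C * D) work each, then the average."""
    stats = cluster_stats(points, labels)
    total = 0.0
    for x, label in zip(points, labels):
        xi = sum(v * v for v in x)
        own = stats[label]
        if own["n"] == 1:
            continue  # assumed convention: coefficient 0 for single-point clusters
        # The own-cluster average includes x itself (distance 0),
        # so rescale the divisor from N_C to N_C - 1.
        a = avg_sq_dist_to_cluster(x, xi, own) * own["n"] / (own["n"] - 1)
        b = min(
            avg_sq_dist_to_cluster(x, xi, s)
            for other_label, s in stats.items()
            if other_label != label
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
print(squared_euclidean_silhouette(points, labels))
```

The state built by `cluster_stats` has size O(C D), independent of N, which is what makes it cheap to share across executors. In a distributed setting each pass would become an aggregation over the dataset. One caveat of this sketch is that the subtraction in the formula can lose floating-point precision when coordinates are large relative to the distances involved.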
Performance comparison
[Figure: single-thread tests on different datasets. X-axis: dataset cardinality (N), from 0 to 160,000; y-axis: time (seconds), logarithmic scale from 1 to 10,000; series: Naïve Silhouette and Squared Euclidean Silhouette]
Agenda
Apache Spark and Machine Learning
Implementing an algorithm on Apache Spark
Designing an algorithm for Apache Spark
Takeaways
Takeaways
 Think first: design your algorithms for Apache Spark
– Don’t just implement them with Spark
 In everything you do, consider parallelism
 Shared state and shared information are a bottleneck to scalability
– Keep them small!
 If your algorithm is O(N²), rethink it
Thank You, Q&A


Editor's Notes

  • #5: High-level API
      – DataFrame abstraction
      – Meant to be used by data scientists without Spark knowledge
      – But Spark expertise is needed to get good performance
    Great performance
      – Parallel and scalable
      – In-memory caching and computing
  • #14: With the previous equation, each point's Silhouette coefficient can be computed without computing the distance to all the other points.
      – We precompute the needed values for the clusters (i.e. we precompute our state): for each cluster, two constants and one vector
      – We can assume the number of clusters is rather small, so our shared state is small
      – We compute the above formula for all the clusters for each point and, with these average distances, we compute the Silhouette coefficient for each point
      – The average of all the Silhouette coefficients is computed
    Thus, the computational complexity of the needed steps is:
      – O(N D / W) to precompute the state: it requires a one-pass aggregation over the entire dataset
      – O(N C D / W) to compute the coefficients: for each point we compute the average distance to all the clusters
      – O(N / W) to average the coefficients: it requires a one-pass aggregation over the entire dataset
    We need 2 passes over the dataset: one to precompute the state, and one to compute the coefficients and their average.
  • #15: The comparison is fair: no parallelism is exploited, so the speed-up comes only from the lower computational complexity. This test uses small datasets; our implementation also makes it possible to compute the Silhouette over larger ones.