Designing a machine learning algorithm for Apache Spark

Marco Gaido
Software Engineer and Apache Spark contributor
2017-10-17

© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Agenda
Apache Spark and Machine Learning

What is Apache Spark?
 A fast and general-purpose cluster computing system
– Fast because it allows in-memory computing
 It was created with Machine Learning algorithms in mind
– They are iterative
– Iterative jobs are very slow on MapReduce
 Easy to use
– Users can implement their business logic using high-level APIs
– Several APIs: Scala, Java, Python, SQL, R
 4 main modules built on top of it:
– Spark Streaming
– Spark SQL
– MLlib
– GraphX

MLlib
 A complete ML library, which aims to cover all ML phases
– Featurization
– Training
– Evaluation
– Persistence
– Prediction
 High-level API
 Great performance

Agenda
Implementing an algorithm on Apache Spark

How to write an ML algorithm in MLlib?
 Spark is open source: anybody can contribute or create their own version
 As easy as rewriting the implementation using RDDs or DataFrames
 For many algorithms, a trivial implementation can be written in a few lines of code
 Yet many well-known algorithms are still missing…
WHY?

DBSCAN
 DBSCAN is a widespread density-based clustering algorithm
– Two inputs: a radius (ε) and a number of points (minPts), used to decide whether an area is dense or sparse
 Naïve implementation:
– Find the ε (eps) neighbors of a point p
– If there are at least minPts of them
• If p already belongs to a cluster, assign the neighbors to the same cluster
• Otherwise, create a new cluster containing p and its neighbors
– Repeat until all points have been processed
 Computational complexity: O(N²) in computation or memory
 A parallel (and reliable) implementation is not trivial at all
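The naïve steps listed above can be sketched on a single machine as follows. This is a literal, hedged transcription rather than a full DBSCAN: the function name `naive_dbscan` is hypothetical, a point is assumed to count as its own neighbor, and no cluster merging is attempted, so the labels can depend on the order in which points are visited. That fragility is part of why a reliable parallel version is hard.

```python
import math


def naive_dbscan(points, eps, min_pts):
    """Single-pass sketch of the naive steps; returns one label per point.

    A label of None means the point was never assigned to a cluster.
    Assumption: a point counts as one of its own eps-neighbors.
    """
    labels = [None] * len(points)
    next_cluster = 0
    for p, point in enumerate(points):
        # Scanning every other point costs O(N) per point, O(N^2) overall.
        neighbors = [
            q for q, other in enumerate(points)
            if math.sqrt(sum((a - b) ** 2 for a, b in zip(point, other))) <= eps
        ]
        if len(neighbors) < min_pts:
            continue  # sparse area: nothing to do for this point
        if labels[p] is None:
            # p is not in a cluster yet: open a new one
            labels[p] = next_cluster
            next_cluster += 1
        for q in neighbors:
            # the neighbors join p's cluster (overwriting any previous label)
            labels[q] = labels[p]
    return labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
    print(naive_dbscan(data, eps=1.5, min_pts=3))
```

Every point triggers a scan of the whole dataset, and the `labels` list is a shared state as large as the input, which are exactly the two properties the following slides warn against.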

Agenda
Designing an algorithm for Apache Spark

Key points
 Shared state should be small (or, better, absent)
– It has to be kept in memory on every executor
 The target computational complexity is O(N/W), where W is the number of executors
– This ensures (virtually) unlimited horizontal scalability
– O(N²) is not suitable for Big Data: 1M (10⁶) input items become 1T (10¹²) pairs to analyze, and 1T become 1Y (10²⁴)
 Iterating multiple times over the same dataset is fine
– The dataset can be cached in memory
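The arithmetic behind these key points can be checked with a few lines. This is only an illustration; the two helper names are hypothetical, and the second one assumes the work splits evenly across executors.

```python
def pairwise_operations(n):
    """Number of ordered point pairs an O(N^2) algorithm has to examine."""
    return n * n


def per_executor_operations(n, workers):
    """Work per executor for an O(N) algorithm, assuming an even split across W executors."""
    return n / workers


if __name__ == "__main__":
    print(pairwise_operations(10**6))            # 1M items -> 10**12 pairs (1T)
    print(pairwise_operations(10**12))           # 1T items -> 10**24 pairs (1Y)
    print(per_executor_operations(10**6, 100))   # 10**4 items per executor
```

Doubling the executors halves the per-executor work in the linear case, while no realistic number of executors compensates for the quadratic growth.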

An example: Silhouette
 SPARK-14516: introduced in the next Apache Spark release (2.3.0)
 Measure of the quality of a clustering result
 Implementation of Silhouette algorithm using squared Euclidean distance
 References:
– Design document: https://goo.gl/7cJV64
– Code: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/evaluation/ClusteringEvaluator.scala

Silhouette
 Definition
– For each datum i, compute the average dissimilarity to all the other data in the same cluster (a(i))
– Compute the average dissimilarity of i to each of the other clusters and pick the smallest one (b(i))
– Then compute the Silhouette coefficient for i: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
– Compute the average of the Silhouette coefficient over all points
 Computational complexity
– O(N²): for each point, we need to compute its distance to all the other points
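The definition can be turned into code almost word for word. The sketch below is a single-machine illustration, not the Spark implementation. The function names are hypothetical, at least two clusters are assumed, a(i) is averaged over the other members of the cluster, and a point alone in its cluster is assumed to get a coefficient of 0.

```python
from collections import defaultdict


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equally sized sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def naive_silhouette(points, labels, dissimilarity=squared_euclidean):
    """Mean Silhouette coefficient computed straight from the definition (O(N^2))."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    total = 0.0
    for i, x in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # assumption: a singleton cluster contributes s(i) = 0
        # a(i): average dissimilarity to the other members of i's own cluster
        a = sum(dissimilarity(x, points[k]) for k in own if k != i) / (len(own) - 1)
        # b(i): smallest average dissimilarity to any other cluster
        b = min(
            sum(dissimilarity(x, points[k]) for k in members) / len(members)
            for label, members in clusters.items()
            if label != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    print(naive_silhouette(pts, [0, 0, 1, 1]))
```

Each point is compared with every other point, which is where the quadratic cost comes from.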

Squared Euclidean Silhouette
 The problem is computing the average distance of a point X to a cluster C:
$$\frac{\sum_{i=1}^{N_C} \sum_{j=1}^{D} (x_j - c_{ij})^2}{N_C}$$
… after some old but gold algebra …
$$\frac{N_C \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j}{N_C}$$
Where $\xi_X$ is a constant which can be precomputed for each point X, and $\Psi_C$, $Y_C$, $N_C$ are constants (actually $Y_C$ is a vector) precomputed for each cluster
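The algebra step can be spelled out by expanding the square. The definitions of $\xi_X$, $\Psi_C$ and $Y_{C_j}$ below are assumptions inferred from the final formula, not taken from the slide; $c_{ij}$ denotes the j-th coordinate of the i-th point of cluster C.

```latex
\begin{aligned}
\sum_{i=1}^{N_C} \sum_{j=1}^{D} (x_j - c_{ij})^2
  &= \sum_{i=1}^{N_C} \sum_{j=1}^{D} x_j^2
   + \sum_{i=1}^{N_C} \sum_{j=1}^{D} c_{ij}^2
   - 2 \sum_{j=1}^{D} x_j \sum_{i=1}^{N_C} c_{ij} \\
  &= N_C \, \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j ,
\end{aligned}
\qquad
\xi_X = \sum_{j=1}^{D} x_j^2 , \quad
\Psi_C = \sum_{i=1}^{N_C} \sum_{j=1}^{D} c_{ij}^2 , \quad
Y_{C_j} = \sum_{i=1}^{N_C} c_{ij}
```

Dividing both sides by $N_C$ gives the average distance. Only $\xi_X$ depends on the point, while everything else depends on the cluster alone.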

Squared Euclidean Silhouette (2)
$$\frac{N_C \xi_X + \Psi_C - 2 \sum_{j=1}^{D} Y_{C_j} x_j}{N_C}$$
 With the previous equation, each point's Silhouette coefficient can be computed without computing its distance to all the other points
– We precompute the cluster values (i.e. the state)
– We apply the above formula to each point, for every cluster
– We compute the average of the Silhouette coefficients
 We can assume the number of clusters is rather small
– Then, our shared state is small
 The overall complexity is O(N·C·D / W), where C is the number of clusters and D the number of dimensions
– Assuming C and D are much smaller than N, this reduces to O(N/W) → infinite scalability
[Figure: each cluster carries its own precomputed state: C1 → ($\Psi_{C_1}$, $Y_{C_1}$, $N_{C_1}$), C2 → ($\Psi_{C_2}$, $Y_{C_2}$, $N_{C_2}$), C3 → ($\Psi_{C_3}$, $Y_{C_3}$, $N_{C_3}$)]
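The two-pass scheme described on this slide can be sketched on a single machine as below. It is an illustration under stated assumptions, not the code of Spark's `ClusteringEvaluator`. The function name is hypothetical, $\xi_X$, $\Psi_C$ and $Y_C$ are assumed to be the squared norm of the point, the sum of squared norms in the cluster, and the coordinate-wise sum of the cluster, and a(i) is rescaled to exclude the point itself. Singleton clusters are assumed to get a coefficient of 0, and at least two clusters are required.

```python
from collections import defaultdict


def squared_euclidean_silhouette(points, labels):
    """Mean Silhouette coefficient using per-cluster precomputed state (O(N*C*D))."""
    # Pass 1: build the small per-cluster state (N_C, Psi_C, Y_C).
    count = defaultdict(int)    # N_C
    psi = defaultdict(float)    # Psi_C: sum of squared norms of the cluster's points
    y = {}                      # Y_C: coordinate-wise sum of the cluster's points
    for x, label in zip(points, labels):
        count[label] += 1
        psi[label] += sum(v * v for v in x)
        if label not in y:
            y[label] = [0.0] * len(x)
        for j, v in enumerate(x):
            y[label][j] += v

    def avg_sq_distance(x, xi, label):
        # (N_C * xi_X + Psi_C - 2 * sum_j Y_Cj * x_j) / N_C
        dot = sum(yc * v for yc, v in zip(y[label], x))
        return (count[label] * xi + psi[label] - 2.0 * dot) / count[label]

    # Pass 2: one formula evaluation per (point, cluster) pair.
    total = 0.0
    for x, own in zip(points, labels):
        n_own = count[own]
        if n_own == 1:
            continue  # assumption: a singleton cluster contributes s(i) = 0
        xi = sum(v * v for v in x)
        # The formula averages over all N_C members, including x itself
        # (distance 0), so rescale to average over the other members only.
        a = avg_sq_distance(x, xi, own) * n_own / (n_own - 1)
        b = min(avg_sq_distance(x, xi, c) for c in count if c != own)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0,), (1.0,), (10.0,), (11.0,)]
    print(squared_euclidean_silhouette(pts, [0, 0, 1, 1]))
```

In a distributed setting, the per-cluster state built in the first pass is the only information every executor needs, and it stays small as long as the number of clusters and dimensions is small.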

Performance comparison
[Figure: single-thread tests on different datasets. Time in seconds (logarithmic axis, 1 to 10000) versus dataset cardinality N (0 to 160000), comparing Naïve Silhouette with Squared Euclidean Silhouette.]

Agenda
Designing an algorithm for Apache Spark
Takeaways

Takeaways
 Think: design your algorithms for Apache Spark
– Don’t just implement them with Spark
 Whatever you do, keep parallelism in mind
 Shared state and shared information are a bottleneck to scalability
– Keep them small!
 If your algorithm is O(N²), rethink it

Thank You, Q&A
