Lecture 6 - Spark ML
Spark MLlib
MACHINE LEARNING
MACHINE LEARNING?
How about automating it?
A program learns to play Mario:
• Observes the game & presses keys
• Maximises the score
So?
• The program learnt to play Mario and other games
• Without any need for programming
MACHINE LEARNING
Spark MLlib
Machine Learning
Reinforcement: dynamic environment; learn to achieve a certain goal
[Diagram: machine learning types: Supervised, Regression, Reinforcement]
[Figure: spam classification example ("Spam? No")]
Regression: predicting a continuous-valued attribute associated with an object
MLlib STRUCTURE
• ML Algorithms: common learning algorithms, e.g. classification, regression, clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.
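As a taste of the high-level API, a minimal sketch combining the "ML Algorithms" and "Persistence" pillars; the DataFrame dataset (with a "features" vector column) and the save path are assumptions for illustration:

import org.apache.spark.ml.clustering.{KMeans, KMeansModel}

// ML Algorithms: fit a KMeans estimator to get a model back
val kmeans = new KMeans().setK(3).setSeed(1L)
val model  = kmeans.fit(dataset)   // dataset: DataFrame with a "features" vector column

// Persistence: save the fitted model and load it back later
model.write.overwrite().save("/tmp/kmeans-model")
val reloaded = KMeansModel.load("/tmp/kmeans-model")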
PIPELINES
DataFrame: This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors, true labels, and predictions.
Transformer: A Transformer is an algorithm which can transform one DataFrame into another DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a DataFrame with predictions.
Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.
Parameter: All Transformers and Estimators share a common API for specifying parameters.
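Putting these pieces together, a minimal Pipeline sketch adapted from the standard Spark ML example; the DataFrames training and test, with "text" and "label" columns, are assumed:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Two Transformers (Tokenizer, HashingTF) feeding one Estimator (LogisticRegression)
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)  // parameter set via the shared API

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model    = pipeline.fit(training)   // fitting the Pipeline yields a Transformer
val scored   = model.transform(test)    // adds a "prediction" column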
MORE DETAILS
• Dimensionality reduction: https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
• Feature extraction and transformation: https://spark.apache.org/docs/latest/mllib-feature-extraction.html
• Frequent pattern mining: https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
• Evaluation metrics: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
• PMML model export: https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
• Optimization (developer): https://spark.apache.org/docs/latest/mllib-optimization.html
• Clustering
• Regression
• Active learning
• Collaborative filtering
CLUSTERING
Grouping data according to similarity
E.g. archaeological dig
[Scatter plot: Distance North vs. Distance East]
K-MEANS ALGORITHM
E.g. archaeological dig
Benefits:
• Popular
• Fast
• Conceptually straightforward
K-MEANS: PRELIMINARIES
Data: collection of values

data = lines.map(line => parseVector(line))
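The parseVector helper is assumed by the slides; a minimal sketch, assuming one point per line as whitespace-separated numbers:

// One point per line, e.g. "1.0 2.5"
def parseVector(line: String): Array[Double] =
  line.split(' ').map(_.toDouble)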
K-MEANS: PRELIMINARIES
Dissimilarity: squared Euclidean distance

dist = p.squaredDist(q)
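A hand-rolled sketch of this dissimilarity measure (MLlib's Vectors.sqdist computes the same for MLlib vectors):

// Squared Euclidean distance between two points
def squaredDist(p: Array[Double], q: Array[Double]): Double =
  p.zip(q).map { case (a, b) => (a - b) * (a - b) }.sum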
K-MEANS: PRELIMINARIES
K = number of clusters
Data assignments to clusters: S1, S2, ..., SK
K-MEANS ALGORITHM
• Initialize K cluster centers
• Repeat until convergence:
  - Assign each data point to the cluster with the closest center
  - Assign each cluster center to be the mean of its cluster's data points
K-MEANS ALGORITHM (in Spark, step by step)
• Initialize K cluster centers:

centers = data.takeSample(false, K, seed)
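takeSample(withReplacement, num, seed) is a standard RDD action; a toy illustration of this step (the data values are made up):

// Toy data: 2-D points, one Array[Double] per point
val data = sc.parallelize(Seq(
  Array(1.0, 2.0), Array(1.5, 1.8), Array(8.0, 9.0), Array(8.5, 9.5)))
val K = 2
val centers = data.takeSample(false, K, 42L)  // K initial centers, drawn without replacement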
• Repeat until convergence. First, assign each data point to the cluster with the closest center:

closest = data.map(p => (closestPoint(p, centers), p))
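closestPoint is another helper the slides assume; a minimal sketch, reusing squaredDist from above, returning the index of the nearest center:

def closestPoint(p: Array[Double], centers: Array[Array[Double]]): Int =
  centers.indices.minBy(i => squaredDist(p, centers(i)))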
• Group the points by their assigned cluster:

pointsGroup = closest.groupByKey()
• Assign each cluster center to be the mean of its cluster's data points:

newCenters = pointsGroup.mapValues(ps => average(ps))
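average is likewise assumed; a sketch computing the element-wise mean of a cluster's points:

def average(ps: Iterable[Array[Double]]): Array[Double] = {
  val n = ps.size
  ps.reduce((a, b) => a.zip(b).map { case (x, y) => x + y }).map(_ / n)
}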
K-MEANS ALGORITHM (complete)
• Initialize K cluster centers
• Repeat until convergence: loop while the centers move by more than ɛ

centers = data.takeSample(false, K, seed)
var d = Double.PositiveInfinity   // center movement in the last iteration
while (d > ɛ) {
  closest = data.map(p => (closestPoint(p, centers), p))
  pointsGroup = closest.groupByKey()
  newCenters = pointsGroup.mapValues(ps => average(ps)).collectAsMap()
  d = distance(centers, newCenters)           // how far did the centers move?
  for ((i, c) <- newCenters) centers(i) = c   // update each center in place
}
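Finally, the distance helper used as the convergence test is assumed too; a sketch measuring the total squared movement of the centers (newCenters is the Map returned by collectAsMap):

def distance(centers: Array[Array[Double]],
             newCenters: scala.collection.Map[Int, Array[Double]]): Double =
  newCenters.map { case (i, c) => squaredDist(centers(i), c) }.sum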
EASE OF USE
§ Interactive shell: useful for featurization and pre-processing of data
§ Lines of code for K-Means:
- Spark: ~90 lines (part of the hands-on tutorial!)
- Hadoop/Mahout: 4 files, >300 lines
PERFORMANCE
[Bar charts: iteration time (s) vs. number of machines (25, 50, 100) for K-Means and Logistic Regression, comparing HadoopBinMem and Spark; Spark shows the lowest iteration times throughout]
[Zaharia et al., NSDI'12]
CONCLUSION
§ Spark: Framework for cluster computing