0% found this document useful (0 votes)
9 views31 pages

Lecture 6 - Spark ML

The document provides an overview of machine learning, its definitions, types, and applications, particularly focusing on Spark MLlib. It discusses various machine learning techniques such as classification, regression, clustering, and collaborative filtering, as well as the K-means algorithm. Additionally, it highlights the tools and libraries available for machine learning, emphasizing the scalability and efficiency of Spark for data processing.

Uploaded by

Tuân Nguyễn
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views31 pages

Lecture 6 - Spark ML

The document provides an overview of machine learning, its definitions, types, and applications, particularly focusing on Spark MLlib. It discusses various machine learning techniques such as classification, regression, clustering, and collaborative filtering, as well as the K-means algorithm. Additionally, it highlights the tools and libraries available for machine learning, emphasizing the scalability and efficiency of Spark for data processing.

Uploaded by

Tuân Nguyễn
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 31

06/10/2024

Spark Mllib

Instructor: Van-Dang Tran, Ph.D.

MACHINE LEARNING

“Programming Computers to optimize


performance using Example Data or Past
Experience”

1
06/10/2024

MACHINE LEARNING?

Field of study that gives "computers


the ability to learn without being
explicitly programmed."

-- Arthur Samuel, 1959

HAVE YOU PLAYED MARIO?


How much time did it take you to learn & win the princess?

2
06/10/2024

HOW ABOUT AUTOMATING IT?

How about
automating it?
Program Learns to Play Mario
Observes the game & presses keys
Maximises Score

3
06/10/2024

So?
• Program Learnt to play Mario and other games
• Without any need of programming

4
06/10/2024

Question: To make this program learn any other games such


as PacMan we will have to …

1. Write new rules as per the game


2. Just hook it to new game and let it play for a while

Question: To make this program learn any other games such as


PacMan we will have to …

1. Write new rules as per the game


2. Just hook it to new game and let it play for a while

10

5
06/10/2024

MACHINE LEARNING

• Branch of Artificial Intelligence


• Design and Development of Algorithms
• Computers Evolve Behaviour based on Empirical Data

Spark-MLb
il

11

MACHINE LEARNING - APPLICATIONS


Recommend Friends, Dates, Products to end-user.

12

6
06/10/2024

MACHINE LEARNING - APPLICATIONS

Classify content into predefined groups.

13

MACHINE LEARNING - APPLICATIONS


Identify key topics in large Collections of Text.

14

7
06/10/2024

MACHINE LEARNING - APPLICATIONS


Computer Vision - Identifying Objects

15

MACHINE LEARNING - APPLICATIONS


Natural Language Processing

16

8
06/10/2024

MACHINE LEARNING - APPLICATIONS

• Find Similar content based on Object Properties.


• Detect Anomalies within given data.
• Ranking Search Results with User Feedback Learning.
• Classifying DNA sequences.
• Sentiment Analysis/ Opinion Mining
• BioInformatics.
• Speech and HandWriting Recognition.

17

MACHINE LEARNING - TYPES?

Given example inputs & outputs, learn to


Supervised
map inputs to outputs

Machine Learning

18

9
06/10/2024

MACHINE LEARNING - TYPES?

Supervised Given example inputs & outputs, learn


to map inputs to outputs

Machine Learning Unsupervised No labels given, find structure

19

MACHINE LEARNING - TYPES?


Supervised
Given example inputs & outputs, learn
to map inputs to outputs

Machine Learning Unsupervised No labels given, find structure

Reinforcement
Dynamic environment, perform a certain
goal

20

10
06/10/2024

MACHINE LEARNING - TYPES?


Classification

Supervised

Regression

Machine Learning Unsupervised Clustering

Reinforcement

21

MACHINE LEARNING - CLASSIFICATION?


Check
Email

Spam? No

Yes We Use Logistic Regression

22

11
06/10/2024

MACHINE LEARNING - REGRESSION?

Predicting a continuous-valued
attribute associated with an object.

In linear regression, we draw all possible lines


going through the points such that it is closest
to all.

23

MACHINE LEARNING - CLUSTERING?

• To form a cluster based on


some definition of nearness

24

12
06/10/2024

MACHINE LEARNING - TOOLS

DATA SIZE CLASSFICATION TOOLS

Lines Sample Data Analysis and Whiteboard,…


Visualization
KBs - low MBs Prototype Analysis and Matlab, Octave, R,
Data Visualization Processing,
MBs - low GBs NumPy, SciPy,
Analysis
Online Data Weka,
Flare, AmCharts,
Visualization
Raphael, Protovis
GBs - TBs - PBs Analysis MLlib, SparkR, GraphX,
Big Data Mahout, Giraph

25

MACHINE LEARNING USING SPARK


• Spark RDDs à efficient data sharing

• In-memory caching accelerates performance


• Up to 20x faster than Hadoop

• Easy to use high-level programming interface


• Express complex algorithms ~100 lines.

26

13
06/10/2024

MACHINE LEARNING LIBRARY (MLlib)


Goal is to make practical machine learning scalable and easy

Consists of common learning algorithms and utilities, including:


• Classification
• Regression
• Clustering
• Collaborative filtering
• Dimensionality reduction
• Lower-level optimization primitives
• Higher-level pipeline APIs

27

MlLib STRUCTURE

ML Algorithms Featurization
Common learning algorithms
e.g. classification, regression, clustering, Feature extraction, Transformation, Dimensionality
and collaborative filtering reduction, and Selection

Pipelines Persistence
Tools for constructing, evaluating, Saving and load algorithms, models,
and tuning ML Pipelines and Pipelines

Utilities
Linear algebra, statistics, data handling, etc.

28

14
06/10/2024

MLLIB - COLLABORATIVE FILTERING


• Commonly used for recommender systems
• Techniques aim to fill in the missing entries of a user-item association
matrix

• Supports model-based collaborative filtering,


• Users and products are described by a small set of latent factors that can
be used to predict missing entries.
• MLlib uses the alternating least squares (ALS) algorithm to learn these
latent factors.

29

PIPELINES
DataFrame:This ML API uses DataFrame from Spark SQL as an ML dataset, which can hold a
variety of data types. E.g., a DataFrame could have different columns storing text, feature vectors,
true labels, and predictions.

Transformer: A Transformer is an algorithm which can transform one DataFrame into another
DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame with features into a
DataFrame with predictions.

Estimator: An Estimator is an algorithm which can be fit on a DataFrame to produce a Transformer.


E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

Pipeline: A Pipeline chains multiple Transformers and Estimators together to specify an ML


workflow.

Parameter: All Transformers and Estimators now share a common API for specifying parameters.

30

15
06/10/2024

PIPELINES

31

spark.mllib - BASIC STATISTICS


Summary statistics
Correlations
Stratified sampling
Hypothesis testing
Random data generation
Kernel density estimation
See https://spark.apache.org/docs/latest/mllib-statistics.html

32

16
06/10/2024

MLlib - CLASSIFICATION AND REGRESSION


MLlib supports various methods:
Binary Classification
linear SVMs, logistic regression, decision trees, random forests,
gradient-boosted trees, naive Bayes
Multiclass Classification
logistic regression, decision trees, random forests, naive Bayes
Regression
linear least squares, Lasso, ridge regression, decision trees,
random forests, gradient-boosted trees, isotonic regression

More Details>>

33

MlLib - Other Classes of Algorithms

Dimensionality reduction:
https://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html
Feature extraction and transformation:
https://spark.apache.org/docs/latest/mllib-feature-extraction.html
Frequent pattern mining:
https://spark.apache.org/docs/latest/mllib-frequent-pattern-mining.html
Evaluation metrics:
https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html
PMML model export:
https://spark.apache.org/docs/latest/mllib-pmml-model-export.html
Optimization (developer):
https://spark.apache.org/docs/latest/mllib-optimization.html

34

17
06/10/2024

MACHINE LEARNING TECHNIQUES


Classification

Clustering

Regression

Active learning

Collaborative filtering

35

K-Means Clustering using Spark

Focus: Implementation and Performance

36

18
06/10/2024

CLUSTERING
E.g. archaeological dig

Distance North
Grouping data according
to similarity

Distance East

37

CLUSTERING
E.g. archaeological dig
Distance North

Grouping data
according to
similarity

Distance East

38

19
06/10/2024

K-MEANS ALGORITHM
Benefits E.g. archaeological dig

Distance North
• Popular
• Fast
• Conceptually
straightforward

Distance East

39

K-MEANS: PRELIMINARIES
Data: Collection of values

data = lines.map(line=>
Feature 2

parseVector(line))

Feature 1

40

20
06/10/2024

K-MEANS: PRELIMINARIES
Dissimilarity:
Squared Euclidean distance

Feature 2
dist = p.squaredDist(q)

Feature 1

41

K-MEANS: PRELIMINARIES
K = Number of clusters
Feature 2

Data assignments to clusters


S1, S2,. . ., SK

Feature 1

42

21
06/10/2024

K-MEANS: PRELIMINARIES
K = Number of clusters

Feature 2
Data assignments to clusters
S1, S2,. . ., SK

Feature 1

43

K-MEANS ALGORITHM
• Initialize K cluster centers
• Repeat until convergence:
Assign each data point to
the cluster with the closest
Feature 2

center.
Assign each cluster center to
be the mean of its cluster’s
data points.

Feature 1

44

22
06/10/2024

K-MEANS ALGORITHM
• Initialize K cluster centers
• Repeat until convergence:
Assign each data point to
the cluster with the closest

Feature 2
center.
Assign each cluster center to
be the mean of its cluster’s
data points.

Feature 1

45

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)
Feature 2

• Repeat until convergence:


Assign each data point to
the cluster with the closest
center.
Assign each cluster center to
be the mean of its cluster’s
data points.
Feature 1

46

23
06/10/2024

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)

Feature 2
• Repeat until convergence:
Assign each data point to
the cluster with the closest
center.
Assign each cluster center to
be the mean of its cluster’s
data points.
Feature 1

47

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)
Feature 2

• Repeat until convergence:


Assign each data point to
the cluster with the closest
center.
Assign each cluster center to
be the mean of its cluster’s
data points.
Feature 1

48

24
06/10/2024

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)

Feature 2
• Repeat until convergence:
closest = data.map(p =>

(closestPoint(p,centers),p))
Assign each cluster center to
be the mean of its cluster’s
data points.
Feature 1

49

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)
Feature 2

• Repeat until convergence:


closest = data.map(p =>

(closestPoint(p,centers),p))
Assign each cluster center to
be the mean of its cluster’s
data points.
Feature 1

50

25
06/10/2024

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)

Feature 2
• Repeat until convergence:
closest = data.map(p =>

(closestPoint(p,centers),p))
Assign each cluster center to
be the mean of its cluster’s
data points.
Feature 1

51

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)
Feature 2

• Repeat until convergence:


closest = data.map(p =>

(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()

Feature 1

52

26
06/10/2024

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)

Feature 2
• Repeat until convergence:
closest = data.map(p =>

(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()
newCenters = pointsGroup.mapValues(
ps => average(ps))
Feature 1

53

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)
Feature 2

• Repeat until convergence:


closest = data.map(p =>

(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()
newCenters = pointsGroup.mapValues(
ps => average(ps))
Feature 1

54

27
06/10/2024

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)

Feature 2
• Repeat until convergence:

closest = data.map(p =>

(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()
newCenters = pointsGroup.mapValues(
ps => average(ps))
Feature 1

55

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)
Feature 2

• Repeat until convergence:


while (dist(centers, newCenters) > ɛ)
closest = data.map(p =>
(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()
newCenters =pointsGroup.mapValues(
ps => average(ps))
Feature 1

56

28
06/10/2024

K-MEANS ALGORITHM
• Initialize K cluster centers
centers = data.takeSample(
false, K, seed)

Feature 2
• Repeat until convergence:
while (dist(centers, newCenters) > ɛ)
closest = data.map(p =>
(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()
newCenters =pointsGroup.mapValues(
ps => average(ps))
Feature 1

57

K-MEANS ALGORITHM
centers = data.takeSample(
false, K, seed)
while (d > ɛ)
{
closest = data.map(p =>
Feature 2

(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()
newCenters =pointsGroup.mapValues(
ps => average(ps))
d = distance(centers, newCenters)

centers = newCenters.map(_)
}
Feature 1

58

29
06/10/2024

EASE OF USE
§ Interactive shell:
Useful for featurization, pre-processing data
§ Lines of code for K-Means
- Spark ~ 90 lines – (Part of hands-on tutorial !)
- Hadoop/Mahout ~ 4 files, > 300 lines

59

PERFORMANCE
K-Means Logistic Regression
274

300 Hadoop 250 H ad oo p

HadoopBinMem H ad oo pB inMem
184

250
Iteration time (s)

200
Iteration time (s)

Spark
197

Spark
200
157

150
116
143

111
121

150
106

100
80

76
87

100
62
61

50
33

50
15

0 0
25 50 100
25 50 100
Number of machines Number of machines
[Zaharia et. al, NSDI’12]

60

30
06/10/2024

CONCLUSION
§ Spark: Framework for cluster computing

§ Fast and easy machine learning programs

§ K means clustering using Spark

Examples and more: www.spark-project.org

61

31

You might also like