Slide 11 Spark ML
Spark ML
Instructor: Trong-Hop Do
August 15th 2021
S3Lab
Smart Software System Laboratory
“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems
Big Data
Spark Machine Learning
Spark MLlib Utilities
Linear Algebra
● API Guide
http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html#module-pyspark.ml.linalg
http://spark.apache.org/docs/latest/mllib-data-types.html
Data Sources
http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html#module-pyspark.ml.image
https://spark.apache.org/docs/latest/ml-datasource#image-data-source
Sample Data
GitHub: https://github.com/apache/spark/tree/v3.0.1/data/mllib
Helpers
● http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html#module-pyspark.ml.util
Spark MLlib
Learning Resources
● Main guide
○ http://spark.apache.org/docs/3.0.1/ml-guide.html
● API guide
○ http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html
● GitHub
○ https://github.com/apache/spark/tree/v3.0.1/python/pyspark/ml
Spark MLlib Pipelines
● Pipelines API
○ https://spark.apache.org/docs/latest/ml-pipeline.html
5 Key Concepts
● DataFrame
● Estimator
○ Predictor
● Transformer
● Parameters
● Pipeline
Transformer
Estimator
Estimator vs Transformer
Pipeline
Example
Classification and regression
● https://spark.apache.org/docs/latest/ml-classification-regression.html
DecisionTreeClassifier
DecisionTreeClassifier – Feature Transformers
VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically
decide which features are categorical and convert original values to category indices.
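The decision rule described above can be illustrated with a small plain-Python sketch. It does not call Spark and is not Spark's implementation. It assumes a threshold named max_categories (modeled on VectorIndexer's maxCategories parameter), treats any feature with at most that many distinct values as categorical, and maps the sorted distinct values to indices 0, 1, 2, and so on. The function names are hypothetical, and the index order Spark actually assigns may differ.

```python
def fit_vector_indexer(rows, max_categories):
    """Return {feature_index: {original_value: category_index}} for categorical features."""
    num_features = len(rows[0])
    category_maps = {}
    for j in range(num_features):
        distinct = sorted({row[j] for row in rows})
        # Assumption: a feature is categorical when it has few distinct values.
        if len(distinct) <= max_categories:
            category_maps[j] = {value: idx for idx, value in enumerate(distinct)}
    return category_maps


def transform_rows(rows, category_maps):
    """Replace categorical values by their indices; leave other features unchanged."""
    return [
        [float(category_maps[j][v]) if j in category_maps else v
         for j, v in enumerate(row)]
        for row in rows
    ]


# Toy data: feature 0 has two distinct values, features 1 and 2 have more.
rows = [
    [-1.0, 1.5, 10.0],
    [0.0, 2.7, 20.0],
    [-1.0, 3.1, 10.0],
    [0.0, 4.8, 30.0],
]
category_maps = fit_vector_indexer(rows, max_categories=2)
indexed = transform_rows(rows, category_maps)
print(category_maps)
print(indexed)
```

With max_categories=2, only feature 0 is treated as categorical here. Raising the threshold to 3 would also index feature 2.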
Random Forest Regression
Spark MLlib Persistence
Spark MLlib Parameters
• MLlib Estimators and Transformers use a uniform API for specifying parameters.
• A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.
• There are two main ways to pass parameters to an algorithm:
    • Set parameters on an instance. E.g., if lr is an instance of LogisticRegression, calling lr.setMaxIter(10) makes lr.fit() use at most 10 iterations. This API resembles the one used in the spark.mllib package.
    • Pass a ParamMap to .fit() or .transform(). Any parameters in the ParamMap override parameters previously specified via setter methods.
• Parameters belong to specific instances of Estimators and Transformers. E.g., given two LogisticRegression instances lr1 and lr2, we can build a ParamMap that sets both maxIter parameters: ParamMap({lr1.maxIter: 10, lr2.maxIter: 20}). This is useful when a Pipeline contains two algorithms that both have a maxIter parameter.
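The two rules above (a ParamMap overrides setter values, and each parameter belongs to one instance) can be modeled with a toy class in plain Python. This is only a sketch and does not use Spark: ToyEstimator, set_max_iter, and the (instance, name) tuple keys are hypothetical stand-ins, and fit() here simply returns the parameters it would use. In PySpark itself the setter is lr.setMaxIter(10), and the param map is a dict keyed by Param objects such as lr.maxIter, passed as the second argument to fit().

```python
class ToyEstimator:
    """Hypothetical stand-in for an MLlib Estimator; not part of Spark."""

    def __init__(self, name):
        self.name = name
        self._params = {"max_iter": 100}  # assumed default, for illustration only

    def set_max_iter(self, value):
        # Setter style: the value is stored on this instance.
        self._params["max_iter"] = value
        return self

    def fit(self, data, param_map=None):
        # Start from the instance's own settings.
        effective = dict(self._params)
        # Entries in the param map win, but only those owned by this instance.
        for (owner, param_name), value in (param_map or {}).items():
            if owner is self:
                effective[param_name] = value
        return effective


lr1 = ToyEstimator("lr1").set_max_iter(10)
lr2 = ToyEstimator("lr2")

# One map carrying a max_iter value for each instance.
param_map = {(lr1, "max_iter"): 30, (lr2, "max_iter"): 20}

setter_only = lr1.fit(data=None)
lr1_with_map = lr1.fit(data=None, param_map=param_map)
lr2_with_map = lr2.fit(data=None, param_map=param_map)
print(setter_only, lr1_with_map, lr2_with_map)
```

The map changes only the single fit call it is passed to. The value stored by the setter on lr1 is left as it was.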
Spark MLlib Featurization
Spark MLlib Statistics
Spark MLlib Recommendation System
Content-based recommendation system
● Disadvantages
○ Recommendation quality can be low, because the method does not consider what the user's neighbors (similar users) think
○ Unless the user indicates a preference for a new category (e.g. a movie genre), items from new categories will never be recommended to the user
Collaborative filtering
Collaborative filtering – memory based
Collaborative filtering – model based
Collaborative filtering – Alternating Least Squares
Collaborative filtering – pyspark.ml
Spark MLlib NLP
Data exploration
Data exploration: test data
Data exploration: train data
Data exploration: distribution of polarity
Data cleaning - Converting Date column
Data cleaning - Cleaning the tweet text
Building model
Run PySpark on Kaggle Notebook
● After finishing, click Save Version -> Save & Run All (Commit).
● Scroll down your notebook or click Output to see your output file, then click Submit.
● After the competition has finished, click Share -> Public -> Save.
Projects
Q&A