
Slide 11 Spark ML

Uploaded by

putinphuc
Big Data

Spark ML
Instructor: Trong-Hop Do
August 15th 2021

S3Lab
Smart Software System Laboratory

“Big data is at the foundation of all the
megatrends that are happening today, from
social to mobile to cloud to gaming.”
– Chris Lynch, Vertica Systems

Spark Machine Learning

Spark MLlib Utilities
Linear Algebra
● API Guide

http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html#module-pyspark.ml.linalg

● MLlib RDD-based Guide

http://spark.apache.org/docs/latest/mllib-data-types.html

Spark MLlib Utilities
Data Sources

http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html#module-pyspark.ml.image

https://spark.apache.org/docs/latest/ml-datasource#image-data-source

Spark MLlib Utilities
Sample Data
GitHub: https://github.com/apache/spark/tree/v3.0.1/data/mllib

Spark MLlib Utilities
Helpers
● http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html#module-pyspark.ml.util

Spark MLlib
Learning Resources
● Main guide

○ http://spark.apache.org/docs/3.0.1/ml-guide.html

● API guide

○ http://spark.apache.org/docs/3.0.1/api/python/pyspark.ml.html

● GitHub

○ https://github.com/apache/spark/tree/v3.0.1/python/pyspark/ml

Spark MLlib Pipelines

● Pipelines API

○ Provide a Uniform Set of High-level APIs Built on Top of DataFrames that Help Users Create and Tune Practical Machine Learning Pipelines

○ https://spark.apache.org/docs/latest/ml-pipeline.html

Spark MLlib Pipelines

● Algorithms Working Together: Assembled in a Pipeline

Spark MLlib Pipelines
5 Key Concepts

● DataFrame
● Estimator
Predictor
● Transformer
● Parameters
● Pipeline
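
How these concepts fit together can be sketched in plain Python. This is a toy model and not the pyspark.ml API: every class name below is a hypothetical stand-in, and lists of dicts play the role of DataFrames. It only illustrates the contract: a Transformer maps a dataset to a new dataset, an Estimator's fit() returns a fitted Transformer (a model), and a Pipeline runs its stages in order.

```python
# Toy model of the Pipeline concepts; rows are dicts standing in for DataFrame rows.
# Every class here is a hypothetical illustration, not the pyspark.ml API.

class Tokenizer:
    """Transformer: adds a 'words' column derived from the 'text' column."""

    def transform(self, rows):
        return [{**r, "words": r["text"].lower().split()} for r in rows]


class KeywordModel:
    """Transformer produced by fitting KeywordEstimator."""

    def __init__(self, keywords):
        self.keywords = keywords

    def transform(self, rows):
        return [
            {**r, "prediction": 1.0 if self.keywords & set(r["words"]) else 0.0}
            for r in rows
        ]


class KeywordEstimator:
    """Estimator: learns the words that occur only in positively labeled rows."""

    def fit(self, rows):
        pos = {w for r in rows if r["label"] == 1.0 for w in r["words"]}
        neg = {w for r in rows if r["label"] == 0.0 for w in r["words"]}
        return KeywordModel(pos - neg)


class PipelineModel:
    """Fitted pipeline: every stage is a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Runs stages in order; each Estimator is fitted and replaced by its model."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(rows)
            rows = stage.transform(rows)
            fitted.append(stage)
        return PipelineModel(fitted)


train = [
    {"text": "spark is great", "label": 1.0},
    {"text": "hadoop is slow", "label": 0.0},
]
model = Pipeline([Tokenizer(), KeywordEstimator()]).fit(train)
predictions = model.transform([{"text": "I like Spark"}, {"text": "slow job"}])
print([p["prediction"] for p in predictions])
```

Note that fitting the pipeline yields a separate model object made only of Transformers, so the same fitted stages can be applied to new data without refitting.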

Spark MLlib Pipelines
Transformer

Spark MLlib Pipelines
Estimator

Spark MLlib Pipelines
Estimator vs Transformer

Spark MLlib Pipelines
Pipeline

Spark MLlib Pipelines
Example

Spark MLlib Pipelines
Classification and regression
● https://spark.apache.org/docs/latest/ml-classification-regression.html

Spark MLlib Pipelines
DecisionTreeClassifier

Spark MLlib Pipelines
DecisionTreeClassifier – Feature Transformers

StringIndexer encodes a string column of labels to a column of label indices
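
The encoding can be illustrated without Spark. The following is a plain-Python sketch of the idea, not the StringIndexer implementation: the function name is hypothetical, and the alphabetical tie-break is an assumption of this sketch. It follows StringIndexer's default behavior of ordering labels by descending frequency, so the most frequent label gets index 0.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Return a label -> index mapping, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption made for this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


labels = ["a", "b", "c", "a", "a", "c"]
mapping = fit_string_indexer(labels)
indexed = [mapping[label] for label in labels]
print(mapping)
print(indexed)
```

Here "a" occurs three times, "c" twice and "b" once, so they receive the indices 0.0, 1.0 and 2.0 respectively.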


VectorIndexer helps index categorical features in datasets of Vectors. It can both automatically
decide which features are categorical and convert original values to category indices.
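
A rough plain-Python sketch of that decision rule follows. It is not Spark code: the function names are hypothetical, and assigning category indices in sorted order of the original values is a simplifying assumption. The threshold mirrors VectorIndexer's maxCategories parameter: a feature with at most that many distinct values is treated as categorical and re-indexed, while any other feature is left unchanged as continuous.

```python
def fit_vector_indexer(vectors, max_categories):
    """Return {feature_index: {original_value: category_index}} for categorical features."""
    category_maps = {}
    for j in range(len(vectors[0])):
        distinct = sorted({v[j] for v in vectors})
        # Few distinct values: treat the feature as categorical.
        if len(distinct) <= max_categories:
            category_maps[j] = {value: i for i, value in enumerate(distinct)}
    return category_maps


def index_vectors(vectors, category_maps):
    """Replace categorical feature values by their category index."""
    return [
        [float(category_maps[j][x]) if j in category_maps else x
         for j, x in enumerate(v)]
        for v in vectors
    ]


data = [
    [10.0, 0.5],
    [20.0, 1.7],
    [10.0, 2.9],
    [30.0, 4.1],
]
maps = fit_vector_indexer(data, max_categories=3)
print(maps)
print(index_vectors(data, maps))
```

With max_categories=3, feature 0 (three distinct values) is indexed, while feature 1 (four distinct values) stays continuous.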

Spark MLlib Pipelines
Random Forest Regression

Spark MLlib Persistence

Spark MLlib Parameters

• MLlib Estimators and Transformers use a uniform API for specifying parameters.
• A Param is a named parameter with self-contained documentation. A ParamMap is a set of (parameter, value) pairs.
• There are two main ways to pass parameters to an algorithm:

• Set parameters for an instance. E.g., if lr is an instance of LogisticRegression, one could call lr.setMaxIter(10) to
make lr.fit() use at most 10 iterations. This API resembles the API used in the spark.mllib package.

• Pass a ParamMap to .fit() or .transform(). Any parameters in the ParamMap will override parameters previously
specified via setter methods.

• Parameters belong to specific instances of Estimators and Transformers. E.g., if we have two instances lr1 and lr2,
then we can build a ParamMap with both maxIter parameters: ParamMap({lr1.maxIter: 10, lr2.maxIter: 20}). This is
useful if there are two algorithms with the maxIter parameter in a Pipeline.
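
The precedence and ownership rules above can be modeled with a small toy in plain Python. This is an illustrative sketch, not the pyspark.ml implementation: ToyEstimator, its default of 100 and the fit() return value are all hypothetical. The one thing it borrows from PySpark is that a ParamMap is written as a dict keyed by Param objects.

```python
class Param:
    """A named parameter owned by one specific estimator instance."""

    def __init__(self, parent, name, doc):
        self.parent, self.name, self.doc = parent, name, doc


class ToyEstimator:
    """Hypothetical stand-in for an Estimator such as LogisticRegression."""

    def __init__(self):
        self.maxIter = Param(self, "maxIter", "maximum number of iterations")
        self._values = {self.maxIter: 100}  # hypothetical default value

    def setMaxIter(self, value):
        self._values[self.maxIter] = value
        return self

    def fit(self, data, params=None):
        # ParamMap entries override setter values, but only for the
        # Params that this particular instance owns.
        effective = dict(self._values)
        for param, value in (params or {}).items():
            if param.parent is self:
                effective[param] = value
        # The sketch does no training; it reports the maxIter that would be used.
        return effective[self.maxIter]


lr1, lr2 = ToyEstimator(), ToyEstimator()
lr1.setMaxIter(10)
param_map = {lr1.maxIter: 30, lr2.maxIter: 20}

print(lr1.fit(data=None))                    # value set through the setter
print(lr1.fit(data=None, params=param_map))  # ParamMap overrides the setter
print(lr2.fit(data=None, params=param_map))  # each instance picks up its own entry
```

Because each Param object belongs to one instance, a single map can carry different maxIter values for two stages of the same Pipeline without ambiguity.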

Spark MLlib Featurization

Spark MLlib Statistic

Spark MLlib Recommendation System

Spark MLlib Recommendation System
Content-based recommendation system


● Disadvantages

○ Finding appropriate features is hard

○ Recommending to new users is difficult, because their profile is not yet known

○ Low-quality recommendations might occur, as the method doesn't consider what the user's neighbors think

○ Unless the user indicates a preference for a new category (e.g. a movie genre), no new categories will ever get recommended to the user

Spark MLlib Recommendation System
Collaborative filtering

● Make filtered predictions about the interests of a user by collecting preferences or information from many users
● Works under the assumption that users might like items that are popular among their neighbors
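
The neighbor assumption can be made concrete with a small user-based sketch in plain Python. This is one simple memory-based variant chosen for illustration; the ratings, user names and item names are made up, and cosine similarity with a similarity-weighted average is an assumption of this sketch rather than the method used in the slides.

```python
from math import sqrt

# Hypothetical ratings: user -> {item: rating}
ratings = {
    "alice": {"m1": 5.0, "m2": 3.0, "m3": 4.0},
    "bob":   {"m1": 4.0, "m2": 3.0, "m3": 5.0, "m4": 5.0},
    "carol": {"m1": 1.0, "m2": 5.0, "m4": 2.0},
}


def cosine(u, v):
    """Cosine similarity computed over the items both users rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def predict(user, item):
    """Similarity-weighted average of the neighbors' ratings for `item`."""
    num = den = 0.0
    for other, their_ratings in ratings.items():
        if other == user or item not in their_ratings:
            continue
        sim = cosine(ratings[user], their_ratings)
        num += sim * their_ratings[item]
        den += abs(sim)
    return num / den if den else None


print(predict("alice", "m4"))
```

Alice's tastes are closer to Bob's than to Carol's, so Bob's high rating of m4 weighs more heavily in the prediction than Carol's low one.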

Spark MLlib Recommendation System
Collaborative filtering – memory based

Spark MLlib Recommendation System
Collaborative filtering – model based

Spark MLlib Recommendation System
Collaborative filtering – Alternating Least Squares

Spark MLlib Recommendation System
Collaborative filtering – pyspark.ml

Spark MLlib NLP

Spark MLlib NLP
Data exploration

Spark MLlib NLP
Data exploration: test data

Spark MLlib NLP
Data exploration: train data

Spark MLlib NLP
Data exploration: distribution of polarity

Spark MLlib NLP
Data cleaning - Converting Date column

Spark MLlib NLP
Data cleaning - Cleaning the tweet text

● Remove email addresses and URLs
● Extract and then remove usernames (@mentions)
● Extract and then remove hashtags (#hashtag)
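
The three cleaning steps can be sketched with Python's re module. The regular expressions below are simplified assumptions made for illustration (real email addresses and URLs are more varied), and clean_tweet is a hypothetical helper name, not code from the slides. In a Spark job, similar patterns could be applied through DataFrame regex functions or a UDF.

```python
import re

# Simplified patterns; assumptions of this sketch, not exhaustive matchers.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"@(\w+)")
HASHTAG_RE = re.compile(r"#(\w+)")


def clean_tweet(text):
    """Return (cleaned_text, mentions, hashtags) for one tweet."""
    # Emails go first so the '@' in an address is not mistaken for a mention.
    text = EMAIL_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    # Extract before removing, so the usernames and tags are kept as features.
    mentions = MENTION_RE.findall(text)
    text = MENTION_RE.sub(" ", text)
    hashtags = HASHTAG_RE.findall(text)
    text = HASHTAG_RE.sub(" ", text)
    # Collapse the whitespace left behind by the removals.
    return " ".join(text.split()), mentions, hashtags


tweet = "@bob check http://example.com or mail me@example.org #spark #ml rocks"
cleaned, mentions, hashtags = clean_tweet(tweet)
print(cleaned)
print(mentions, hashtags)
```

The order of the steps matters: removing email addresses before extracting mentions prevents the domain part of an address from being captured as a username.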

Spark MLlib NLP
Building model

Run PySpark on Kaggle Notebook

● Click Code -> New Notebook
● Make sure Internet is turned On in the Kaggle Notebook
● Run !pip install pyspark
● After finishing -> click Save Version -> click Save & Run All (Commit)
● Scroll down your notebook or click on Output to see your output file -> click Submit
● Check that your submission is Successful and check your Score
● After the competition has finished, click Share -> Public -> Save
● Pin default version

Projects

Q&A

Thank you for your attention.

We hope to reach success together.
