Spark MLlib
Spark MLlib
• MLlib is Spark’s machine learning (ML) library
• It is built on Apache Spark, a fast and general engine for large-
scale data processing
• Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, or Python.
2
MLlib Libraries
• It makes practical machine learning scalable and easy.
• At a high level, it provides tools such as:
• ML Algorithms: common learning algorithms such as
classification, regression, clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality
reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML
Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.
3
Spark MLlib
• Initially, MLlib was RDD-based.
• As of Spark 2.0, the RDD-based APIs in the spark.mllib package have
entered maintenance mode.
• The primary Machine Learning API for Spark is now the DataFrame-based
API in the spark.ml package.
• DataFrames provide a more user-friendly API than RDDs. The many benefits
of DataFrames include Spark Datasources, SQL/DataFrame queries,
Tungsten and Catalyst optimizations, and uniform APIs across languages.
• The DataFrame-based API for MLlib provides a uniform API across ML
algorithms and across multiple languages.
• DataFrames facilitate practical ML Pipelines, particularly feature
transformations
4
Main Concept
• DataFrame
• Flexible data type from Spark SQL that allows parallelization
• Transformer
• Algorithm that transforms one DataFrame into another
• Estimator
• Machine learning algorithm that is fit on a DataFrame, returning a model
• Parameter
• Uniform way of specifying parameters for Transformers and Estimators
• Pipeline
• Chain of Transformers and Estimators
5
DataFrame
• Similar to a table in an RDBMS
• It can be created from files, RDDs, and other data
sources
• It supports different data types such as text,
images, and structured data
• Data can be accessed in columns
6
Transformers
• An algorithm that transforms one DataFrame into another, typically by
appending one or more columns
• It has a transform() method
• Transformers include:
• Feature transformers
• Tokenization, normalization, hashing
• Learned models
• The result of fitting an Estimator
7
Estimator
• It is a learning algorithm
• It gets fitted on training data
• It has a fit() method which takes a DataFrame as input
• It returns a trained model, which is a Transformer
8
Parameters
• There are two main ways to pass parameters, as sketched below.
• We can set a parameter directly on the algorithm, or we can pass a
ParamMap to fit() or transform().
• Param is a named parameter.
• ParamMap is a set of (parameter, value) pairs.
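A minimal sketch of both ways, assuming a LogisticRegression estimator and a labeled training DataFrame (here called training):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
val lr = new LogisticRegression()
// 1) Set parameters directly with setter methods.
lr.setMaxIter(10).setRegParam(0.01)
// 2) Or collect (parameter, value) pairs in a ParamMap and pass it to fit().
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.1)
// val model = lr.fit(training, paramMap)  // 'training' is an assumed labeled DataFrame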
9
Pipeline
• A Pipeline is specified as a sequence of stages, and each stage is
either a Transformer or an Estimator
10
Pipeline example
• This is a simple text-document workflow containing three stages.
• The first two (Tokenizer and HashingTF) are Transformers, and
the third (LogisticRegression) is an Estimator.
• Pipelines and PipelineModels help to ensure that training and test
data go through identical feature processing steps.
11
Pipeline
• A Pipeline is an Estimator.
• Thus, after a Pipeline’s fit() method runs, it produces a
PipelineModel, which is a Transformer.
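A minimal sketch of the three-stage text pipeline described above, assuming a training DataFrame with "text" and "label" columns:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// fit() runs the whole chain and returns a PipelineModel, which is a Transformer.
// val model = pipeline.fit(training)
// val predictions = model.transform(test)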
12
Extracting, transforming and
selecting features
• Extraction: Extracting features from “raw” data
• Transformation: Scaling, converting, or modifying features
• Selection: Selecting a subset from a larger set of features
13
MinMaxScaler
• MinMaxScaler transforms a dataset of Vector rows, rescaling each
feature to a specific range (often [0, 1]). It takes two parameters: min (the
lower bound after transformation, 0.0 by default) and max (the upper
bound after transformation, 1.0 by default).
14
Min-max Scaler
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -1.0)),
  (1, Vectors.dense(2.0, 1.1, 1.0)),
  (2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")
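Applying the scaler to the DataFrame above (a sketch following the standard MinMaxScaler setters):
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
// Compute summary statistics and rescale each feature to [0, 1].
val scalerModel = scaler.fit(dataFrame)
val scaledData = scalerModel.transform(dataFrame)
scaledData.select("features", "scaledFeatures").show()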
15
StandardScaler
• StandardScaler transforms a dataset of Vector rows, normalizing each feature
to have unit standard deviation and/or zero mean.
16
StandardScaler
import org.apache.spark.ml.feature.StandardScaler
val dataFrame =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
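A sketch of fitting and applying the scaler to the loaded DataFrame, using the standard StandardScaler setters:
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)   // scale to unit standard deviation
  .setWithMean(false) // do not center (keeps sparse vectors sparse)
val scalerModel = scaler.fit(dataFrame)
scalerModel.transform(dataFrame).show()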
17
VectorAssembler
• It is a transformer that combines a given list of columns into a
single vector column.
• It is useful for combining raw features and features generated by
different feature transformers into a single feature vector, in order
to train ML models
18
VectorAssembler
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
  Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")
19
StringIndexer
• It encodes a string column of labels to a column of label indices.
StringIndexer can encode multiple columns.
• Suppose a DataFrame with columns id and category is given, where
category has three labels: a, b, and c.
• Applying StringIndexer with category as the input column and
categoryIndex as the output column, “a” gets index 0 because it
is the most frequent, followed by “c” with index 1 and “b” with
index 2.
20
StringIndexer
import org.apache.spark.ml.feature.StringIndexer
val df = spark.createDataFrame(
Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")) )
.toDF("id", "category")
21
OneHotEncoder
• A one-hot encoder that maps a column of category indices
to a column of binary vectors, with at most a single one-
value per row that indicates the input category index.
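A minimal sketch, assuming Spark 3.x (where OneHotEncoder is an Estimator with multi-column setters) and the indexed DataFrame produced by the StringIndexer example above:
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
val encoded = encoder.fit(indexed).transform(indexed)
encoded.show()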
22
PCA
• PCA is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated
variables called principal components.
• A PCA class trains a model to project vectors to a low-dimensional
space using PCA.
• The given example shows how to project 5-dimensional feature
vectors into 3-dimensional principal components
23
PCA Example
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
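Fitting a PCA model with k = 3 and projecting the vectors (a sketch):
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3) // keep 3 principal components
  .fit(df)
pca.transform(df).select("pcaFeatures").show(false)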
24
More Feature Transformers
• Tokenizer: Tokenization is the process of taking text (such as a
sentence) and breaking it into individual terms (usually words). A
simple Tokenizer class provides this functionality.
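A minimal Tokenizer sketch on a small, assumed example DataFrame:
import org.apache.spark.ml.feature.Tokenizer
val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "Logistic regression models are neat")
)).toDF("id", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(sentenceData).show(false)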
25
Binarizer
• Binarization is the process of thresholding numerical features to
binary (0/1) features.
26
Binarizer
import org.apache.spark.ml.feature.Binarizer
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = spark.createDataFrame(data).toDF("id", "feature")
val binarizer: Binarizer = new Binarizer()
.setInputCol("feature")
.setOutputCol("binarized_feature") .setThreshold(0.5)
val binarizedDataFrame = binarizer.transform(dataFrame)
binarizedDataFrame.show()
27
Feature Selectors
• VectorSlicer
• VectorSlicer is a transformer that takes a feature vector and outputs
a new feature vector with a sub-array of the original features.
• It is useful for extracting features from a vector column.
28
VectorSlicer
import java.util.Arrays
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType
29
VectorSlicer
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
// Example input rows (assumed here so the snippet runs end to end).
val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)
val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))
val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))
val output = slicer.transform(dataset)
output.show(false)
30
VectorIndexer
• VectorIndexer helps index categorical features in datasets of
Vectors.
• It can both automatically decide which features are categorical and
convert original values to category indices
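A minimal sketch using the standard VectorIndexer setters on a libsvm-format dataset:
import org.apache.spark.ml.feature.VectorIndexer
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10) // features with more than 10 distinct values are treated as continuous
val indexedData = indexer.fit(data).transform(data)
indexedData.show()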
31
Basic Statistics
• Correlation: It calculates the pairwise correlations among many
series
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._
val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)
val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")
32
Summarizer
• It provides vector-column summary statistics for DataFrames
through Summarizer.
• Available metrics are the column-wise max, min, mean, sum,
variance, std, and number of nonzeros, as well as the total count.
33
Summarizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import spark.implicits._
val data = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
)
val df = data.toDF("features", "weight")
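Computing weighted mean and variance for the features column (a sketch, assuming spark.implicits._ is imported as above):
val (meanVal, varianceVal) = df.select(
  Summarizer.metrics("mean", "variance")
    .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)].first()
println(s"with weight: mean = $meanVal, variance = $varianceVal")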
34
Multi-Layer Perceptron
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val data = spark.read.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// Specify layers for the neural network:
// input layer of size 4 (features), two intermediate layers of size 5 and 4,
// and output layer of size 3 (classes).
val layers = Array[Int](4, 5, 4, 3)
// Create the trainer and set its parameters.
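A sketch of the trainer, using the standard MultilayerPerceptronClassifier parameters:
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)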
35
Multi-Layer Perceptron
// Train the model.
val model = trainer.fit(train)
// compute accuracy on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
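Evaluating the predictions with a multiclass accuracy metric (a sketch):
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
println(s"Test set accuracy = ${evaluator.evaluate(predictionAndLabels)}")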
36
Naïve Bayes
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") /
/ Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
val model = new NaiveBayes() .fit(trainingData)
val predictions = model.transform(testData)
predictions.show()
// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator() .setLabelCol("label")
.setPredictionCol("prediction") .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy)
37
Linear Support Vector Machine
import org.apache.spark.ml.classification.LinearSVC
// Load training data
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
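A sketch of fitting the linear SVM on the loaded data, using the standard LinearSVC setters:
val lsvc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)
// Fit the model
val lsvcModel = lsvc.fit(training)
println(s"Coefficients: ${lsvcModel.coefficients} Intercept: ${lsvcModel.intercept}")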
38
Decision Tree
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
39
Decision Tree
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
40
Decision Tree
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
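Before fitting, the indexers and the tree are chained into a Pipeline; a sketch (the IndexToString stage maps predicted indices back to the original labels):
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labelsArray(0)) // Spark 3.x; use labelIndexer.labels on Spark 2.x
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))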
41
Decision Tree
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions on the test set.
val predictions = model.transform(testData)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
42
Random Forest
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
43
Random Forest
// Automatically identify categorical features, and index them.
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
44
Random Forest
// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
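A sketch of the remaining steps, following the same pattern as the decision-tree example above (feature indexing, pipeline assembly, and evaluation):
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)
// Chain indexers and forest in a Pipeline, train, and evaluate.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
println(s"Test Error = ${1.0 - evaluator.evaluate(predictions)}")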
45
K-means Clustering
// This example uses the older RDD-based API (spark.mllib).
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans.
val numIterations = 20
val clusters = KMeans.train(parsedData, 2, numIterations)
46
K-means clustering
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
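A sketch of training the model and scoring the clustering with a silhouette measure, using the DataFrame-based KMeans API:
// Train a k-means model with k = 2.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Make predictions and evaluate clustering by computing the silhouette score.
val predictions = model.transform(dataset)
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")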
47
Logistic Regression
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val labeledDF = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")
val seed = 5043
val Array(trainingData, testData) = labeledDF.randomSplit(Array(0.7, 0.3), seed)
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(trainingData)
val prediction = lrModel.transform(testData)
48
Model Evaluation
// Extract the summary from the returned LogisticRegressionModel
val trainingSummary = lrModel.binarySummary
// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))
50