Spark MLlib

1

Spark - MLlib
• MLlib is Spark’s machine learning (ML) library
• It is built on Apache Spark, a fast and general engine for large-
scale data processing
• Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, or Python.

2
MLlib Tools
• Its goal is to make practical machine learning scalable and easy.
• At a high level, it provides tools such as:
• ML Algorithms: common learning algorithms such as
classification, regression, clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality
reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML
Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.

3
Spark MLlib
• Initially, MLlib was RDD-based.
• As of Spark 2.0, the RDD-based APIs in the spark.mllib package have
entered maintenance mode.
• The primary Machine Learning API for Spark is now the DataFrame-based
API in the spark.ml package.
• DataFrames provide a more user-friendly API than RDDs. The many benefits
of DataFrames include Spark Datasources, SQL/DataFrame queries,
Tungsten and Catalyst optimizations, and uniform APIs across languages.
• The DataFrame-based API for MLlib provides a uniform API across ML
algorithms and across multiple languages.
• DataFrames facilitate practical ML Pipelines, particularly feature
transformations

4
Main Concepts
• DataFrame
• A flexible, distributed dataset from Spark SQL that allows parallelization
• Transformer
• An algorithm that transforms one DataFrame into another
• Estimator
• A machine learning algorithm that is fit on a DataFrame and returns a model
• Parameter
• A uniform way to specify parameters for Transformers and Estimators
• Pipeline
• A chain of Transformers and Estimators

5
DataFrame
• Similar to a table in an RDBMS
• It can be created from files, RDDs, and other data sources
• It supports different data types, such as text, images, and structured data
• Data can be accessed by column (a short sketch follows)
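• A minimal sketch of both creation paths; the SparkSession setup and the file path "people.csv" are illustrative placeholders, not part of the original slides:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DataFrameBasics")
  .master("local[*]")   // local setup; adjust for a cluster
  .getOrCreate()
import spark.implicits._

// From an in-memory collection
val people = Seq((1, "alice", 29), (2, "bob", 31)).toDF("id", "name", "age")
people.printSchema()
people.select("name", "age").show()

// From a file-based data source (hypothetical path)
val csvDf = spark.read.option("header", "true").csv("people.csv")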

6
Transformers
• An algorithm that transforms one DataFrame into another, typically by appending one or more columns
• It exposes a transform() method
• Two kinds of Transformers (a Tokenizer sketch follows this list):
• Feature transformers: tokenization, normalization, hashing
• Learned models: the result of fitting an Estimator
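• A small illustration with Tokenizer as the feature Transformer; it assumes an active SparkSession named spark, as in the later slides:

import org.apache.spark.ml.feature.Tokenizer
import spark.implicits._

val sentences = Seq((0, "spark makes machine learning scalable")).toDF("id", "text")

// A feature Transformer: transform() appends a new "words" column
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
tokenizer.transform(sentences).show(false)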

7
Estimator
• It is a learning algorithm
• It gets fitted on training data
• It has a fit() method which takes a DataFrame as input
• fit() returns a trained model, which is itself a Transformer (see the sketch below)
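• A minimal sketch, assuming an active SparkSession named spark: LogisticRegression plays the Estimator role, and the model returned by fit() acts as a Transformer (the tiny training DataFrame is illustrative):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

val training = Seq(
  (1.0, Vectors.dense(0.0, 1.1, 0.1)),
  (0.0, Vectors.dense(2.0, 1.0, -1.0)),
  (0.0, Vectors.dense(2.0, 1.3, 1.0)),
  (1.0, Vectors.dense(0.0, 1.2, -0.5))
).toDF("label", "features")

val lr = new LogisticRegression().setMaxIter(10)

// fit() consumes the DataFrame and returns a trained model ...
val model = lr.fit(training)

// ... and the model is itself a Transformer
model.transform(training).select("features", "label", "prediction").show()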

8
Parameters
• There are two main ways to pass parameters
• We can set a parameter on the instance, or we can pass a ParamMap to fit() or transform()
• Param is a named parameter
• ParamMap is a set of (parameter, value) pairs (see the sketch below)
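• A short sketch of both styles; it reuses the training DataFrame from the previous sketch, and the chosen parameter values are illustrative:

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap

val lr = new LogisticRegression()

// 1. Set parameters directly on the instance
lr.setMaxIter(10).setRegParam(0.01)

// 2. Pass a ParamMap of (parameter, value) pairs to fit();
//    these values override the setters above
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.1)
val model = lr.fit(training, paramMap)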

9
Pipeline
• A Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator.
• These stages are run in order, and the input DataFrame is transformed as it passes through each stage.
• For Transformer stages, the transform() method is called on the DataFrame.
• For Estimator stages, the fit() method is called to produce a Transformer (which becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s transform() method is called on the DataFrame.

10
Pipeline example
• This is a simple text-document workflow containing three stages.
• The first two (Tokenizer and HashingTF) are Transformers, and the third (LogisticRegression) is an Estimator; a sketch follows this list.
• Pipelines and PipelineModels help ensure that training and test data go through identical feature-processing steps.
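• A minimal sketch of this three-stage pipeline, assuming an active SparkSession named spark; the tiny training DataFrame is illustrative:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import spark.implicits._

val training = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)

// The Pipeline (an Estimator) chains the three stages
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// fit() produces a PipelineModel, which is a Transformer
val model = pipeline.fit(training)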

11
Pipeline
• A Pipeline is an Estimator.
• Thus, after a Pipeline’s fit() method runs, it produces a
PipelineModel, which is a Transformer.

• The PipelineModel has the same number of stages as the original Pipeline, but all Estimators in the original Pipeline have become Transformers.
• When the PipelineModel’s transform() method is called on a test
dataset, the data are passed through the fitted pipeline in order.
• Each stage’s transform() method updates the dataset and passes it
to the next stage.

12
Extracting, transforming and selecting features
• Extraction: Extracting features from “raw” data
• Transformation: Scaling, converting, or modifying features
• Selection: Selecting a subset from a larger set of features

13
MinMaxScaler
• MinMaxScaler transforms a dataset of Vector rows, rescaling each feature to a specific range (often [0, 1]). It takes parameters:
• min: 0.0 by default. Lower bound after transformation, shared by all features.
• max: 1.0 by default. Upper bound after transformation, shared by all features.

14
Min-max Scaler
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -1.0)),
  (1, Vectors.dense(2.0, 1.1, 1.0)),
  (2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")

// Compute summary statistics and generate a MinMaxScalerModel
val scalerModel = scaler.fit(dataFrame)

// Rescale each feature to the range [min, max]
val scaledData = scalerModel.transform(dataFrame)
println(s"Features scaled to range: [${scaler.getMin}, ${scaler.getMax}]")
scaledData.select("features", "scaledFeatures").show()

15
StandardScaler
• StandardScaler transforms a dataset of Vector rows, normalizing each feature
to have unit standard deviation and/or zero mean.

16
StandardScaler
import org.apache.spark.ml.feature.StandardScaler

val dataFrame = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)
  .setWithMean(false)

// Compute summary statistics by fitting the StandardScaler
val scalerModel = scaler.fit(dataFrame)

// Normalize each feature to have unit standard deviation
val scaledData = scalerModel.transform(dataFrame)
scaledData.show()

17
VectorAssembler
• It is a transformer that combines a given list of columns into a
single vector column.
• It is useful for combining raw features and features generated by
different feature transformers into a single feature vector, in order
to train ML models

18
VectorAssembler
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors

val dataset = spark.createDataFrame(
  Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")

val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")

val output = assembler.transform(dataset)
println("Assembled columns 'hour', 'mobile', 'userFeatures' to vector column 'features'")
output.select("features", "clicked").show(false)

19
StringIndexer
• It encodes a string column of labels into a column of label indices. StringIndexer can encode multiple columns.
• Suppose a DataFrame with columns id and category is given, where category has three labels: a, b, and c.
• Applying StringIndexer with category as the input column and categoryIndex as the output column, “a” gets index 0 because it is the most frequent, followed by “c” with index 1 and “b” with index 2.

20
StringIndexer
import org.apache.spark.ml.feature.StringIndexer

val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")

val indexed = indexer.fit(df).transform(df)
indexed.show()

21
OneHotEncoder
• A one-hot encoder maps a column of category indices to a column of binary vectors, with at most a single one-value per row indicating the input category index (a sketch follows).
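• A hedged sketch using the Spark 3.x API, where OneHotEncoder is an Estimator configured with setInputCols/setOutputCols and applied after the StringIndexer from the previous slides:

import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}

val df = spark.createDataFrame(
  Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c"))
).toDF("id", "category")

val indexed = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("categoryIndex")
  .fit(df).transform(df)

val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))

val encoded = encoder.fit(indexed).transform(indexed)
encoded.show()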

22
PCA
• PCA is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated
variables called principal components.
• A PCA class trains a model to project vectors to a low-dimensional
space using PCA.
• The given example shows how to project 5-dimensional feature
vectors into 3-dimensional principal components

23
PCA Example
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors

val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)

val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")

val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3)
  .fit(df)

val result = pca.transform(df).select("pcaFeatures")
result.show(false)

24
More Feature Transformers
• Tokenizer: Tokenization is the process of taking text (such as a sentence) and breaking it into individual terms (usually words). A simple Tokenizer class provides this functionality.
• StopWordsRemover: Stop words are words which should be excluded from the input, typically because they appear frequently and don’t carry as much meaning. A sketch chaining both transformers follows.
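• A minimal sketch chaining the two transformers; it assumes an active SparkSession named spark, as in the other examples:

import org.apache.spark.ml.feature.{StopWordsRemover, Tokenizer}
import spark.implicits._

val sentences = Seq((0, "the quick brown fox jumps over the lazy dog")).toDF("id", "sentence")

val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")

// Tokenize, then drop stop words such as "the" and "over"
val tokenized = tokenizer.transform(sentences)
remover.transform(tokenized).select("words", "filtered").show(false)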

25
Binarizer
• Binarization is the process of thresholding numerical features to binary (0/1) features.
• Binarizer takes the common parameters inputCol and outputCol, as well as the threshold for binarization.
• Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported for inputCol.

26
Binarizer
import org.apache.spark.ml.feature.Binarizer

val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = spark.createDataFrame(data).toDF("id", "feature")

val binarizer: Binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)

val binarizedDataFrame = binarizer.transform(dataFrame)
binarizedDataFrame.show()

27
Feature Selectors
• VectorSlicer
• VectorSlicer is a transformer that takes a feature vector and outputs
a new feature vector with a sub-array of the original features.
• It is useful for extracting features from a vector column.

28
VectorSlicer
import java.util.Arrays

import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType

val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)

29
VectorSlicer
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])

val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))

val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))

val output = slicer.transform(dataset)
output.show(false)

30
VectorIndexer
• VectorIndexer helps index categorical features in datasets of
Vectors.
• It can both automatically decide which features are categorical and
convert original values to category indices
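• A hedged sketch of VectorIndexer on the sample LIBSVM file used elsewhere in these slides; the maxCategories value of 10 is an illustrative choice:

import org.apache.spark.ml.feature.VectorIndexer

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10)   // features with <= 10 distinct values are treated as categorical

// fit() decides which features are categorical; transform() converts them to indices
val indexerModel = indexer.fit(data)
indexerModel.transform(data).show()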

31
Basic Statistics
• Correlation: It calculates the pairwise correlations among many
series
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._

val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)

val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")

32
Summarizer
• It provides vector-column summary statistics for DataFrames through Summarizer.
• Available metrics are the column-wise max, min, mean, sum, variance, std, and number of nonzeros, as well as the total count.

33
Summarizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer

import spark.implicits._
import Summarizer._  // brings metrics, mean, and variance into scope

val data = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
)
val df = data.toDF("features", "weight")

val (meanVal, varianceVal) = df.select(metrics("mean", "variance")
  .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)].first()
println(s"with weight: mean = ${meanVal}, variance = ${varianceVal}")

val (meanVal2, varianceVal2) = df.select(mean($"features"), variance($"features"))
  .as[(Vector, Vector)].first()
println(s"without weight: mean = ${meanVal2}, variance = ${varianceVal2}")

34
Multi-Layer Perceptron
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// Load the data stored in LIBSVM format as a DataFrame
val data = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

// Split the data into training and test sets
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)

// Specify layers for the neural network:
// input layer of size 4 (features), two intermediate layers of size 5 and 4,
// and output layer of size 3 (classes)
val layers = Array[Int](4, 5, 4, 3)

// Create the trainer and set its parameters
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)

35
Multi-Layer Perceptron
// Train the model
val model = trainer.fit(train)

// Compute accuracy on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")

val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")

println(s"Test set accuracy = ${evaluator.evaluate(predictionAndLabels)}")

36
Naïve Bayes
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)

val model = new NaiveBayes().fit(trainingData)
val predictions = model.transform(testData)
predictions.show()

// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy")
37
Linear Support Vector Machine
import org.apache.spark.ml.classification.LinearSVC

// Load training data
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

val lsvc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)

// Fit the model
val lsvcModel = lsvc.fit(training)

// Print the coefficients and intercept for linear SVC
println(s"Coefficients: ${lsvcModel.coefficients} Intercept: ${lsvcModel.intercept}")

38
Decision Tree
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load the data stored in LIBSVM format as a DataFrame
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on the whole dataset to include all labels in the index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

39
Decision Tree
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

40
Decision Tree
// Train a DecisionTree model
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")

// Convert indexed labels back to original labels
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labelsArray(0))

// Chain indexers and tree in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))

41
Decision Tree
// Train the model. This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions on the test set
val predictions = model.transform(testData)

// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")

val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
42
Random Forest
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassificationModel
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load the data stored in LIBSVM format as a DataFrame
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on the whole dataset to include all labels in the index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

43
Random Forest
// Automatically identify categorical features, and index them
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4) // features with > 4 distinct values are treated as continuous
  .fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

44
Random Forest
// Train a RandomForest model
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labelsArray(0))

// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

45
K-means Clustering
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data (RDD-based spark.mllib API)
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Compute the sum of squared errors
val cost = clusters.computeCost(parsedData)
println("Sum of squared errors = " + cost)

46
K-means clustering
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Load data
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")

// Train a k-means model
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
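• A short continuation of the example: the ClusteringEvaluator imported above can score the clustering, and the fitted model exposes its cluster centers:

// Make predictions
val predictions = model.transform(dataset)

// Evaluate the clustering with the Silhouette score
val evaluator = new ClusteringEvaluator()
val silhouette = evaluator.evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")

// Inspect the cluster centers
println("Cluster Centers: ")
model.clusterCenters.foreach(println)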

47
Logistic Regression
import org.apache.spark.ml.classification.LogisticRegression

// Load training data
val labeledDf = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

val seed = 5043
val Array(trainingData, testData) = labeledDf.randomSplit(Array(0.7, 0.3), seed)

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

// Fit the model
val lrModel = lr.fit(trainingData)
val prediction = lrModel.transform(testData)

48
Model Evaluation
import org.apache.spark.sql.functions.max
import spark.implicits._

// Extract the summary from the returned LogisticRegressionModel
val trainingSummary = lrModel.binarySummary

// Obtain the objective per iteration
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))

// Obtain the receiver-operating characteristic as a DataFrame and areaUnderROC
val roc = trainingSummary.roc
roc.show()
println(s"areaUnderROC: ${trainingSummary.areaUnderROC}")

// Set the model threshold to maximize F-Measure
val fMeasure = trainingSummary.fMeasureByThreshold
val maxFMeasure = fMeasure.select(max("F-Measure")).head().getDouble(0)
val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure)
  .select("threshold").head().getDouble(0)
lrModel.setThreshold(bestThreshold)
49
• https://sparkbyexamples.com/

50
