Spark MLlib
Spark MLlib
• MLlib is Spark’s machine learning (ML) library
• It is built on Apache Spark, a fast and general engine for large-
scale data processing
• Run programs up to 100x faster than Hadoop MapReduce in
memory, or 10x faster on disk.
• Write applications quickly in Java, Scala, or Python.
2
MLlib Libraries
• It makes practical machine learning scalable and easy.
• At a high level, it provides tools such as:
• ML Algorithms: common learning algorithms such as
classification, regression, clustering, and collaborative filtering
• Featurization: feature extraction, transformation, dimensionality
reduction, and selection
• Pipelines: tools for constructing, evaluating, and tuning ML
Pipelines
• Persistence: saving and loading algorithms, models, and Pipelines
• Utilities: linear algebra, statistics, data handling, etc.
3
Spark MLlib
• Initially, MLlib was RDD-based.
• As of Spark 2.0, the RDD-based APIs in the spark.mllib package have
entered maintenance mode.
• The primary Machine Learning API for Spark is now the DataFrame-based
API in the spark.ml package.
• DataFrames provide a more user-friendly API than RDDs. The many benefits
of DataFrames include Spark Datasources, SQL/DataFrame queries,
Tungsten and Catalyst optimizations, and uniform APIs across languages.
• The DataFrame-based API for MLlib provides a uniform API across ML
algorithms and across multiple languages.
• DataFrames facilitate practical ML Pipelines, particularly feature
transformations
4
Main Concept
• DataFrame
• Flexible data type from Spark SQL that allows parallelization
• Transformer
• Algorithm that transforms one DataFrame into another
• Estimator
• Machine learning algorithm that is fit on a DataFrame, returning a model
• Parameter
• Uniform way of specifying parameters for Transformers and Estimators
• Pipeline
• Chain of Transformers and Estimators
5
DataFrame
• Similar to a table in an RDBMS
• It can be created from files, RDDs, and other data
sources
• It supports different data types such as text,
images, and structured data
• Data can be accessed in columns
6
Transformers
• An algorithm that transforms one DataFrame into another, typically by
appending one or more columns
• It has a transform() method
• Transformers include:
• Feature transformers
• Tokenization, normalization, hashing
• Learned models
• The result of fitting an Estimator
7
Estimator
• It is a learning algorithm
• It gets fitted on training data
• It has a fit() method which takes a DataFrame as input
• It returns a trained model, which is a Transformer
8
Parameters
• There are two main ways to pass parameters, as sketched below.
• We can set a parameter directly on the algorithm, or we can pass a
ParamMap to fit() or transform().
• Param is a named parameter.
• ParamMap is a set of (parameter, value) pairs.
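A minimal sketch of both ways, assuming a LogisticRegression estimator and a labeled training DataFrame (here called training):
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.param.ParamMap
val lr = new LogisticRegression()
// 1) Set parameters directly with setter methods.
lr.setMaxIter(10).setRegParam(0.01)
// 2) Or collect (parameter, value) pairs in a ParamMap and pass it to fit().
val paramMap = ParamMap(lr.maxIter -> 20, lr.regParam -> 0.1)
// val model = lr.fit(training, paramMap)  // 'training' is an assumed labeled DataFrame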
9
Pipeline
• A Pipeline is specified as a sequence of stages, and each stage is
either a Transformer or an Estimator
10
Pipeline example
• This is a simple text-document workflow containing three stages.
• The first two (Tokenizer and HashingTF) are Transformers, and
the third (LogisticRegression) is an Estimator.
• Pipelines and PipelineModels help to ensure that training and test
data go through identical feature processing steps.
11
Pipeline
• A Pipeline is an Estimator.
• Thus, after a Pipeline’s fit() method runs, it produces a
PipelineModel, which is a Transformer.
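A minimal sketch of the three-stage text pipeline described above, assuming a training DataFrame with "text" and "label" columns:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
// fit() runs the whole chain and returns a PipelineModel, which is a Transformer.
// val model = pipeline.fit(training)
// val predictions = model.transform(test)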
12
Extracting, transforming and
selecting features
• Extraction: Extracting features from “raw” data
• Transformation: Scaling, converting, or modifying features
• Selection: Selecting a subset from a larger set of features
13
MinMaxScaler
• MinMaxScaler transforms a dataset of Vector rows, rescaling each
feature to a specific range (often [0, 1]). It takes two parameters: min (the
lower bound after transformation, 0.0 by default) and max (the upper
bound after transformation, 1.0 by default).
14
Min-max Scaler
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors
val dataFrame = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 0.1, -1.0)),
  (1, Vectors.dense(2.0, 1.1, 1.0)),
  (2, Vectors.dense(3.0, 10.1, 3.0))
)).toDF("id", "features")
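Applying the scaler to the DataFrame above (a sketch following the standard MinMaxScaler setters):
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
// Compute summary statistics and rescale each feature to [0, 1].
val scalerModel = scaler.fit(dataFrame)
val scaledData = scalerModel.transform(dataFrame)
scaledData.select("features", "scaledFeatures").show()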
15
StandardScaler
• StandardScaler transforms a dataset of Vector rows, normalizing each feature
to have unit standard deviation and/or zero mean.
16
StandardScaler
import org.apache.spark.ml.feature.StandardScaler
val dataFrame =
spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
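A sketch of fitting and applying the scaler to the loaded DataFrame, using the standard StandardScaler setters:
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)   // scale to unit standard deviation
  .setWithMean(false) // do not center (keeps sparse vectors sparse)
val scalerModel = scaler.fit(dataFrame)
scalerModel.transform(dataFrame).show()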
17
VectorAssembler
• It is a transformer that combines a given list of columns into a
single vector column.
• It is useful for combining raw features and features generated by
different feature transformers into a single feature vector, in order
to train ML models
18
VectorAssembler
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vectors
val dataset = spark.createDataFrame(
  Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
).toDF("id", "hour", "mobile", "userFeatures", "clicked")
19
StringIndexer
• It encodes a string column of labels to a column of label indices.
StringIndexer can encode multiple columns.
• Suppose a DataFrame with columns id and category is given, where
category has three labels: a, b, and c.
• Applying StringIndexer with category as the input column and
categoryIndex as the output column, “a” gets index 0 because it
is the most frequent, followed by “c” with index 1 and “b” with
index 2.
20
StringIndexer
import org.apache.spark.ml.feature.StringIndexer
val df = spark.createDataFrame(
Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")) )
.toDF("id", "category")
21
OneHotEncoder
• A one-hot encoder that maps a column of category indices
to a column of binary vectors, with at most a single one-
value per row that indicates the input category index.
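A minimal sketch, assuming Spark 3.x (where OneHotEncoder is an Estimator with multi-column setters) and the indexed DataFrame produced by the StringIndexer example above:
import org.apache.spark.ml.feature.OneHotEncoder
val encoder = new OneHotEncoder()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
val encoded = encoder.fit(indexed).transform(indexed)
encoded.show()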
22
PCA
• PCA is a statistical procedure that uses an orthogonal
transformation to convert a set of observations of possibly
correlated variables into a set of values of linearly uncorrelated
variables called principal components.
• A PCA class trains a model to project vectors to a low-dimensional
space using PCA.
• The given example shows how to project 5-dimensional feature
vectors into 3-dimensional principal components
23
PCA Example
import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = spark.createDataFrame(data.map(Tuple1.apply)).toDF("features")
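Fitting a PCA model with k = 3 and projecting the vectors (a sketch):
val pca = new PCA()
  .setInputCol("features")
  .setOutputCol("pcaFeatures")
  .setK(3) // keep 3 principal components
  .fit(df)
pca.transform(df).select("pcaFeatures").show(false)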
24
More Feature Transformers
• Tokenizer: Tokenization is the process of taking text (such as a
sentence) and breaking it into individual terms (usually words). A
simple Tokenizer class provides this functionality.
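A minimal Tokenizer sketch on a small, assumed example DataFrame:
import org.apache.spark.ml.feature.Tokenizer
val sentenceData = spark.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "Logistic regression models are neat")
)).toDF("id", "sentence")
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
tokenizer.transform(sentenceData).show(false)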
25
Binarizer
• Binarization is the process of thresholding numerical features to
binary (0/1) features.
26
Binarizer
import org.apache.spark.ml.feature.Binarizer
val data = Array((0, 0.1), (1, 0.8), (2, 0.2))
val dataFrame = spark.createDataFrame(data).toDF("id", "feature")
val binarizer: Binarizer = new Binarizer()
.setInputCol("feature")
.setOutputCol("binarized_feature") .setThreshold(0.5)
val binarizedDataFrame = binarizer.transform(dataFrame)
binarizedDataFrame.show()
27
Feature Selectors
• VectorSlicer
• VectorSlicer is a transformer that takes a feature vector and outputs
a new feature vector with a sub-array of the original features.
• It is useful for extracting features from a vector column.
28
VectorSlicer
import java.util.Arrays
import org.apache.spark.ml.attribute.{Attribute, AttributeGroup, NumericAttribute}
import org.apache.spark.ml.feature.VectorSlicer
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.StructType
29
VectorSlicer
val defaultAttr = NumericAttribute.defaultAttr
val attrs = Array("f1", "f2", "f3").map(defaultAttr.withName)
val attrGroup = new AttributeGroup("userFeatures", attrs.asInstanceOf[Array[Attribute]])
// Example input rows (assumed here so the snippet runs end to end).
val data = Arrays.asList(
  Row(Vectors.sparse(3, Seq((0, -2.0), (1, 2.3)))),
  Row(Vectors.dense(-2.0, 2.3, 0.0))
)
val dataset = spark.createDataFrame(data, StructType(Array(attrGroup.toStructField())))
val slicer = new VectorSlicer().setInputCol("userFeatures").setOutputCol("features")
slicer.setIndices(Array(1)).setNames(Array("f3"))
// or slicer.setIndices(Array(1, 2)), or slicer.setNames(Array("f2", "f3"))
val output = slicer.transform(dataset)
output.show(false)
30
VectorIndexer
• VectorIndexer helps index categorical features in datasets of
Vectors.
• It can both automatically decide which features are categorical and
convert original values to category indices
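A minimal sketch using the standard VectorIndexer setters on a libsvm-format dataset:
import org.apache.spark.ml.feature.VectorIndexer
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(10) // features with more than 10 distinct values are treated as continuous
val indexedData = indexer.fit(data).transform(data)
indexedData.show()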
31
Basic Statistics
• Correlation: It calculates the pairwise correlations among many
series
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._
val data = Seq(
  Vectors.sparse(4, Seq((0, 1.0), (3, -2.0))),
  Vectors.dense(4.0, 5.0, 0.0, 3.0),
  Vectors.dense(6.0, 7.0, 0.0, 8.0),
  Vectors.sparse(4, Seq((0, 9.0), (3, 1.0)))
)
val df = data.map(Tuple1.apply).toDF("features")
val Row(coeff1: Matrix) = Correlation.corr(df, "features").head
println(s"Pearson correlation matrix:\n $coeff1")
32
Summarizer
• It provides vector-column summary statistics for DataFrames
through Summarizer.
• Available metrics are the column-wise max, min, mean, sum,
variance, std, and number of nonzeros, as well as the total count.
33
Summarizer
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.stat.Summarizer
import spark.implicits._
val data = Seq(
  (Vectors.dense(2.0, 3.0, 5.0), 1.0),
  (Vectors.dense(4.0, 6.0, 7.0), 2.0)
)
val df = data.toDF("features", "weight")
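Computing weighted mean and variance for the features column (a sketch, assuming spark.implicits._ is imported as above):
val (meanVal, varianceVal) = df.select(
  Summarizer.metrics("mean", "variance")
    .summary($"features", $"weight").as("summary"))
  .select("summary.mean", "summary.variance")
  .as[(Vector, Vector)].first()
println(s"with weight: mean = $meanVal, variance = $varianceVal")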
34
Multi-Layer Perceptron
import org.apache.spark.ml.classification.MultilayerPerceptronClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val data = spark.read.format("libsvm")
.load("data/mllib/sample_multiclass_classification_data.txt")
val splits = data.randomSplit(Array(0.6, 0.4), seed = 1234L)
val train = splits(0)
val test = splits(1)
// Specify layers for the neural network:
// input layer of size 4 (features), two intermediate layers of size 5 and 4,
// and output layer of size 3 (classes).
val layers = Array[Int](4, 5, 4, 3)
// Create the trainer and set its parameters.
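A sketch of the trainer, using the standard MultilayerPerceptronClassifier parameters:
val trainer = new MultilayerPerceptronClassifier()
  .setLayers(layers)
  .setBlockSize(128)
  .setSeed(1234L)
  .setMaxIter(100)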
35
Multi-Layer Perceptron
// Train the model.
val model = trainer.fit(train)
// compute accuracy on the test set
val result = model.transform(test)
val predictionAndLabels = result.select("prediction", "label")
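Evaluating the predictions with a multiclass accuracy metric (a sketch):
val evaluator = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
println(s"Test set accuracy = ${evaluator.evaluate(predictionAndLabels)}")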
36
Naïve Bayes
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt") /
/ Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
val model = new NaiveBayes() .fit(trainingData)
val predictions = model.transform(testData)
predictions.show()
// Select (prediction, true label) and compute test error
val evaluator = new MulticlassClassificationEvaluator() .setLabelCol("label")
.setPredictionCol("prediction") .setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test set accuracy = $accuracy)
37
Linear Support Vector Machine
import org.apache.spark.ml.classification.LinearSVC
// Load training data
val training = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
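A sketch of fitting the linear SVM on the loaded data, using the standard LinearSVC setters:
val lsvc = new LinearSVC()
  .setMaxIter(10)
  .setRegParam(0.1)
// Fit the model
val lsvcModel = lsvc.fit(training)
println(s"Coefficients: ${lsvcModel.coefficients} Intercept: ${lsvcModel.intercept}")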
38
Decision Tree
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassificationModel
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer,
VectorIndexer}
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
39
Decision Tree
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
.setInputCol("features")
.setOutputCol("indexedFeatures")
.setMaxCategories(4)
.fit(data)
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
40
Decision Tree
// Train a DecisionTree model.
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
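Before fitting, the indexers and the tree are chained into a Pipeline; a sketch (the IndexToString stage maps predicted indices back to the original labels):
// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labelsArray(0)) // Spark 3.x; use labelIndexer.labels on Spark 2.x
// Chain indexers and tree in a Pipeline.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, dt, labelConverter))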
41
Decision Tree
// Train model. This also runs the indexers.
val model = pipeline.fit(trainingData)
// Make predictions on the test set.
val predictions = model.transform(testData)
// Select (prediction, true label) and compute test error.
val evaluator = new MulticlassClassificationEvaluator()
.setLabelCol("indexedLabel")
.setPredictionCol("prediction")
.setMetricName("accuracy")
val accuracy = evaluator.evaluate(predictions)
println(s"Test Error = ${(1.0 - accuracy)}")
val treeModel = model.stages(2).asInstanceOf[DecisionTreeClassificationModel]
println(s"Learned classification tree model:\n ${treeModel.toDebugString}")
42
Random Forest
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}
// Load the data stored in LIBSVM format as a DataFrame.
val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)
43
Random Forest
// Automatically identify categorical features, and index them.
// Split the data into training and test sets (30% held out for testing).
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))
44
Random Forest
// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)
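A sketch of the remaining steps, following the same pattern as the decision-tree example above (feature indexing, pipeline assembly, and evaluation):
// Automatically identify categorical features, and index them.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)
// Chain indexers and forest in a Pipeline, train, and evaluate.
val pipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, rf))
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setPredictionCol("prediction")
  .setMetricName("accuracy")
println(s"Test Error = ${1.0 - evaluator.evaluate(predictions)}")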
45
K-means Clustering
// This example uses the older RDD-based API (spark.mllib).
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Load and parse the data.
val data = sc.textFile("kmeans_data.txt")
val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
// Cluster the data into two classes using KMeans.
val numIterations = 20
val clusters = KMeans.train(parsedData, 2, numIterations)
46
K-means clustering
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.ClusteringEvaluator
// Loads data.
val dataset = spark.read.format("libsvm").load("data/mllib/sample_kmeans_data.txt")
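A sketch of training the model and scoring the clustering with a silhouette measure, using the DataFrame-based KMeans API:
// Train a k-means model with k = 2.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(dataset)
// Make predictions and evaluate clustering by computing the silhouette score.
val predictions = model.transform(dataset)
val silhouette = new ClusteringEvaluator().evaluate(predictions)
println(s"Silhouette with squared euclidean distance = $silhouette")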
47
Logistic Regression
import org.apache.spark.ml.classification.LogisticRegression
// Load training data
val labeledDF = spark.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")
val seed = 5043
val Array(trainingData, testData) = labeledDF.randomSplit(Array(0.7, 0.3), seed)
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Fit the model
val lrModel = lr.fit(trainingData)
val prediction = lrModel.transform(testData)
48
Model Evaluation
// Extract the summary from the returned LogisticRegressionModel
val trainingSummary = lrModel.binarySummary
// Obtain the objective per iteration.
val objectiveHistory = trainingSummary.objectiveHistory
println("objectiveHistory:")
objectiveHistory.foreach(loss => println(loss))
50