Practical 3.4 Spark Machine Learning
1. Introduction
spark.ml is Spark’s machine learning (ML) library, inspired by scikit-learn. It
provides a uniform set of high-level APIs built on top of DataFrames for constructing and
tuning machine learning pipelines. Some related terminology:
● Transformer: an algorithm which can transform one DataFrame into another
DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame
with features into a DataFrame with predictions.
● Estimator: an algorithm which can be fit on a DataFrame to produce a
Transformer. E.g., a learning algorithm is an Estimator which trains on a
DataFrame and produces a model.
● Pipeline: A Pipeline chains multiple Transformers and Estimators together to
specify an ML workflow.
2. Pipelines
A Pipeline is specified as a sequence of stages, and each stage is either a
Transformer or an Estimator. These stages are run in order, and the input DataFrame
is transformed as it passes through each stage.
● For Transformer stages, the transform() method is called on the DataFrame.
● For Estimator stages, the fit() method is called to produce a Transformer (which
becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s
transform() method is called on the DataFrame.
Figure 5.3b shows the details of what happens as the different stages of the
pipeline are executed:
● The Pipeline.fit() method is called on the original DataFrame, which
has raw text documents and labels.
○ The Tokenizer.transform() method splits the raw text documents
into words, adding a new column with words to the DataFrame.
○ The HashingTF.transform() method converts the words column into
feature vectors, adding a new column with those vectors to the
DataFrame.
○ The LogisticRegression.fit() method is called to produce a
LogisticRegressionModel.
● After the Pipeline’s fit() method runs, it produces a PipelineModel,
which is a Transformer.