Practical 3.4 Spark Machine Learning
1. Introduction
spark.ml is Spark’s machine learning (ML) library, inspired by scikit-learn. It
provides a uniform set of high-level APIs built on top of DataFrames for constructing and
tuning machine learning pipelines. Some related terminology:
● Transformer: an algorithm which can transform one DataFrame into another
DataFrame. E.g., an ML model is a Transformer which transforms a DataFrame
with features into a DataFrame with predictions.
● Estimator: an algorithm which can be fit on a DataFrame to produce a
Transformer. E.g., a learning algorithm is an Estimator which trains on a
DataFrame and produces a model.
● Pipeline: A Pipeline chains multiple Transformers and Estimators together to
specify an ML workflow.
2. Pipelines
A Pipeline is specified as a sequence of stages, and each stage is either a
Transformer or an Estimator. These stages are run in order, and the input DataFrame
is transformed as it passes through each stage.
● For Transformer stages, the transform() method is called on the DataFrame.
● For Estimator stages, the fit() method is called to produce a Transformer (which
becomes part of the PipelineModel, or fitted Pipeline), and that Transformer’s
transform() method is called on the DataFrame.
Figure 5.3b shows the details of what happens as the different stages of the
pipeline are executed:
● The Pipeline.fit() method is called on the original DataFrame, which
has raw text documents and labels.
○ The Tokenizer.transform() method splits the raw text documents
into words, adding a new column with words to the DataFrame.
○ The HashingTF.transform() method converts the words column into
feature vectors, adding a new column with those vectors to the
DataFrame.
○ The LogisticRegression.fit() method is called to produce a
LogisticRegressionModel.
● After the Pipeline’s fit() method runs, it produces a PipelineModel,
which is a Transformer.