Machine Learning in PySpark

The document outlines the data mining process, emphasizing the importance of defining the purpose, obtaining and cleaning data, and determining the appropriate machine learning task. It details the steps involved in applying methods, evaluating performance, and deploying models, with a focus on supervised learning techniques such as regression and classification. The document also describes the supervised learning pipeline in PySpark, including data splitting, model estimation, prediction, and evaluation.

Bharti Motwani
The Data Mining Process

Consists of multiple steps, from problem definition to model deployment:

Define purpose → Obtain data → Explore & clean data → Determine DM task → Choose DM methods → Apply methods → Evaluate performance → Deploy model
Defining Purpose

• Should focus on understanding the business and the problem

• Managers are often not clear about what the goal of a data mining project is

• Determining this requires iteration between data exploration and defining the problem
Obtaining Data
Define purpose → Obtain data

• Most real-world applications combine data from multiple sources

Explore, Clean and Preprocess
Define purpose → Obtain data → Explore & clean data

Exploring, understanding and visualizing data are perhaps the most important steps in the data mining process.

Visualize and explore the data:

• Are there missing values? If yes, how should we handle them?
• Are there outliers? How should we handle them?
• Are the data summaries what we would expect? Are ranges of values reasonable?
• What does the data look like? Visualize the data using graphing techniques
Some of the key tasks that may be performed are (apply domain knowledge!):
• Eliminate variables or otherwise reduce data
• Transform variables (“feature engineering”)
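
The exploration questions above can be checked with a few lines of code. The following is a minimal, library-free sketch on a made-up column of monthly sales; the values, the variable names and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not part of the slides. On a Spark DataFrame, comparable summaries are available through `DataFrame.describe()`.

```python
import statistics

# Hypothetical column with one missing value (None) and one suspicious entry.
sales = [120.0, 135.0, None, 128.0, 131.0, 950.0, 125.0, 138.0]

# Missing values: count them, then work with the observed values only.
missing = sum(1 for v in sales if v is None)
observed = [v for v in sales if v is not None]

# Summaries: are the range and the centre what we would expect?
summary = {
    "count": len(observed),
    "min": min(observed),
    "max": max(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
}

# Outliers: flag values more than 1.5 * IQR outside the quartiles
# (a common rule of thumb, assumed here).
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
outliers = [v for v in observed if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("missing:", missing)
print("summary:", summary)
print("outliers:", outliers)
```

How to handle what the checks find (drop, impute, cap, or keep) remains a judgment call that depends on domain knowledge.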
Determine Task
Define purpose → Obtain data → Explore & clean data → Determine DM task

• Is it supervised or unsupervised learning (or something else)?

• Is it Regression? Is it Classification?
Apply Methods and Evaluate
Define purpose → Obtain data → Explore & clean data → Determine DM task → Apply methods → Evaluate performance

• Typically apply multiple methods and compare their performance

• Models will be judged based on how good they are at making predictions for test data.

Training data
• Portion of the data used to develop a model

Validation data (Tune!)
• Portion of the data used to assess how well the model fits
• Used to adjust parameters

Test data
• Portion of the data used only at the end of the model building and selection process
• Used to assess how well the final model performs on data that was ‘unseen’ during training
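
The three-way partition described above can be sketched in plain Python. The function name `three_way_split`, the 60/20/20 fractions and the seed are assumptions made for this illustration. In PySpark, `randomSplit()` accepts a list of weights, so passing three weights would be one way to get the same kind of partition on a DataFrame.

```python
import random


def three_way_split(records, fractions=(0.6, 0.2, 0.2), seed=42):
    """Shuffle the records and cut them into train, validation and test lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])

    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # whatever remains is held out until the end
    return train, valid, test


# Hypothetical data: 100 record ids standing in for real rows.
train, valid, test = three_way_split(range(100))
print(len(train), len(valid), len(test))
```

Each record lands in exactly one partition, so the test data stays unseen while the model is fitted on the training data and tuned on the validation data.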
Model Deployment

Define purpose → Obtain data → Explore & clean data → Determine DM task → Choose DM methods → Apply methods → Evaluate performance → Model deployment
Overarching Framework

Machine Learning
• Supervised Learning: Regression, Classification
• Unsupervised Learning: Clustering, Recommendation System, Frequent Pattern Mining
Supervised Learning

• The process of providing an algorithm with records for which an output variable of interest is known, so that the algorithm “learns” how to predict this value for new records where the output is not known
• Goal is to predict an outcome, such as purchase/no purchase, fraud/no fraud, sales, salary and others
Supervised Learning Models
• We build a model that understands how to correctly assign a label to an example
• Supervised learning models are mathematical functions that map input data (i.e., features) to predicted outcome labels (referred to as outcome/output/target variables)

x (input features) → f(x) (model) → y (predicted outcome)
Regression
• When the dependent variable (label) is a real number.
Example:
• Predicting sales
• Predicting the cost of coffee in 2022
Regression Problem: Input features → Outcome
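
As a small illustration of a regression problem, the sketch below fits a straight line by ordinary least squares to a single feature and then predicts a real-valued outcome for a new record. The data (advertising spend versus sales) and the helper name `fit_line` are made up for this example; a real PySpark model would use an estimator from `pyspark.ml.regression` instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training records: advertising spend (feature) and sales (label).
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 4.0, 6.2, 7.9, 10.1]

intercept, slope = fit_line(ad_spend, sales)

# Predict the real-valued outcome for a new record whose label is unknown.
predicted_sales = intercept + slope * 6.0
print(round(slope, 2), round(intercept, 2), round(predicted_sales, 2))
```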
Classification

• When the dependent variable (label) is a specific class (i.e., category)
Example:
• Determining if a customer will churn or not
• Determining if a patient is a current smoker, former smoker, or non-smoker
Classification Problem:
Input features: Subscription, Tenure in months, Primary Phone. Outcome: Churn.

Subscription | Tenure in months | Primary Phone | Churn
2-line plan | 12 | Samsung S8 | Yes
Family plan | 36 | iPhone X | No
Individual | 18 | Pixel 4A | No
Supervised Learning Pipeline
1. Split the complete data into training and test/validation datasets
Use randomSplit() to split the data
2. Estimate a model on the training dataset
pyspark.ml.regression for regression problems
pyspark.ml.classification for classification problems
3. Predict using the test dataset
4. Evaluate the model using accuracy/error metrics
pyspark.ml.evaluation for evaluation
5. Create and select the best model
pyspark.ml.tuning for hyperparameter tuning