Machine Learning in PySpark
Bharti Motwani
The Data Mining Process
Define purpose → Obtain data → Explore & clean data → Determine DM task → Choose DM methods → Apply methods → Evaluate performance → Deploy model
Defining Purpose
Exploring, understanding and visualizing data are perhaps the most important steps in the data mining process.
• Is it Regression? Is it Classification?
Apply Methods and Evaluate
• Models are judged on how well they make predictions for the test data.
Train
• Portion of the data used to develop a model
Test
• Portion of the data used only at the end of the model building and selection process
• Assesses how well the final model performs on data that was ‘unseen’ during training
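The train/test idea above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the deck's own code: the helper name `train_test_split`, the 70/30 ratio, and the seed are assumptions chosen for the example. In PySpark, `DataFrame.randomSplit` serves a similar purpose on distributed data.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and return (train, test).

    Hypothetical helper for illustration; the ratio and seed are assumptions.
    """
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(records)               # copy so the input is not modified
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    # The test portion is set aside and never used while developing the model.
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(10))  # stand-in for ten data records
train, test = train_test_split(records, test_fraction=0.3, seed=42)
print(len(train), len(test))
```

Every record lands in exactly one of the two portions, so performance measured on `test` reflects data the model did not see during training.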
Model Deployment
Overarching Framework
Machine Learning: Regression, Clustering
Supervised Learning
• The process of providing an algorithm with records for which an output variable of interest is known, so that the algorithm “learns” how to predict this value for new records where the output is not known
• The goal is to predict an outcome, such as purchase/no purchase, fraud/no fraud, sales, salary and others
Supervised Learning Models
• We build a model that understands how to correctly assign a
label to an example
• Supervised learning models are mathematical functions that
map input data (i.e., features) to predicted outcome labels
(also referred to as outcome/output/target variables)
x → f(x) → y
Input features → Model → Predicted outcome
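To make the mapping from x through f(x) to y concrete, here is a small sketch in plain Python. It uses a 1-nearest-neighbor rule purely as a stand-in learner; the slides do not name a specific algorithm. The function name `fit_nearest_neighbor` and the fraud records are hypothetical and invented for illustration.

```python
import math


def fit_nearest_neighbor(features, labels):
    """Return a model f(x) that predicts the label of the closest known record.

    1-nearest-neighbor is an assumption chosen for brevity, not the deck's method.
    """
    training = list(zip(features, labels))

    def model(x):
        # Find the labeled record whose features are closest to x.
        _, nearest_label = min(training, key=lambda pair: math.dist(pair[0], x))
        return nearest_label

    return model


# Hypothetical labeled records: (transaction amount, hour of day) with a known outcome
X_train = [(20.0, 14), (35.0, 10), (900.0, 3), (1200.0, 2)]
y_train = ["no fraud", "no fraud", "fraud", "fraud"]

f = fit_nearest_neighbor(X_train, y_train)  # "learning" from known outputs
print(f((1000.0, 4)))                        # prediction for a new, unlabeled record
```

The returned `f` plays the role of the model in the diagram: it takes input features and produces a predicted outcome for a record whose true label is unknown.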
Regression
• When the dependent variable (label) is a real number.
Examples:
• Predicting sales
• Predicting the cost of coffee in 2022
Regression Problem: Input features → Outcome
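As a rough sketch of a regression problem, the snippet below fits a straight line by ordinary least squares in plain Python. The advertising and sales figures are made up for illustration, and `fit_line` is a hypothetical helper; in PySpark a comparable model is available as `pyspark.ml.regression.LinearRegression`.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one input feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up illustrative data: advertising spend (input feature) and sales (outcome)
ad_spend = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(ad_spend, sales)
predicted_sales = slope * 5.0 + intercept  # real-valued prediction for a new input
print(slope, intercept, predicted_sales)
```

Because the toy data lies exactly on the line sales = 2 × spend + 1, the fit recovers a slope of 2 and an intercept of 1, and the prediction for a spend of 5 is 11.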
Classification