
Machine Learning in PySpark

The document outlines the data mining process, emphasizing the importance of defining the purpose, obtaining and cleaning data, and determining the appropriate machine learning task. It details the steps involved in applying methods, evaluating performance, and deploying models, with a focus on supervised learning techniques such as regression and classification. The document also describes the supervised learning pipeline in PySpark, including data splitting, model estimation, prediction, and evaluation.

Uploaded by

BraveAF
Machine Learning in PySpark

Bharti Motwani
The Data Mining Process

Consists of multiple steps from problem definition to model deployment:

Define purpose → Obtain data → Explore & clean data → Determine DM task → Choose DM methods → Apply methods → Evaluate performance → Deploy model
Defining Purpose
• Should focus on business understanding and problem
• Managers are often not clear about what the goal of a data mining project is
• Determining this requires iteration between data exploration and defining the problem
Obtaining Data
• Most real-world applications combine data from multiple sources
Explore, Clean and Preprocess
Exploring, understanding and visualizing data are perhaps the most important steps in the data mining process.

Visualize and explore the data:
• Are there missing values? If so, how should we handle them?
• Are there outliers? How should we handle them?
• Are the data summaries what we would expect? Are the ranges of values reasonable?
• What does the data look like? Visualize the data using graphing techniques
Some of the key tasks that may be performed are (apply domain knowledge!):
• Eliminate variables or otherwise reduce data
• Transform variables (“feature engineering”)
Determine Task
• Is it supervised or unsupervised learning (or something else)?
• Is it regression? Is it classification?
Apply Methods and Evaluate
• Typically apply multiple methods and compare their performance
• Models will be judged based on how good they are at making predictions for test data
Apply Methods and Evaluate
Train
• Portion of the data used to develop a model
Validation (Tune!)
• Portion of the data used to assess how well the model fits
• Used to adjust parameters
Test
• Portion of the data used only at the end of the model building and selection process
• Used to assess how well the final model performs on data that was ‘unseen’ during training
Model Deployment

Deployment is the final step of the process, following performance evaluation.
Overarching Framework

Machine Learning

Supervised Learning
• Regression
• Classification

Unsupervised Learning
• Clustering
• Recommendation System
• Frequent Pattern Mining
Supervised Learning

• The process of providing an algorithm with records for which an output variable of interest is known, so that the algorithm “learns” how to predict this value for new records where the output is not known
• The goal is to predict an outcome, such as purchase/no purchase, fraud/no fraud, sales, salary and others
Supervised Learning Models
• We build a model that understands how to correctly assign a label to an example
• Supervised learning models are mathematical functions that map input data (i.e., features) to predict outcome labels (referred to as outcome/output/target variables)

x (input features) → f(x) (model) → y (predicted outcome)
Regression
• When the dependent variable (label) is a real number
Example:
• Predicting sales
• Predicting the cost of coffee in 2022
Regression Problem:
Input features Outcome
Classification

• When the dependent variable (label) is a specific class (i.e., category)
Example:
• Determining if a customer will churn or not
• Determining if a patient is a current smoker, former smoker, or non-smoker
Classification Problem:
Input features                                      Outcome
Subscription    Tenure in months    Primary Phone    Churn
2-line plan     12                  Samsung S8       Yes
Family plan     36                  iPhone X         No
Individual      18                  Pixel 4A         No
Supervised Learning Pipeline
1. Split the complete data into training and test/validation datasets
   Use randomSplit() to split the data
2. Estimate a model on the training dataset
   pyspark.ml.regression for regression problems
   pyspark.ml.classification for classification problems
3. Predict using the test dataset
4. Evaluate the model using accuracy/error metrics
   pyspark.ml.evaluation for evaluating
5. Create and select the best model
   pyspark.ml.tuning for hyperparameter tuning
