Scalable Machine Learning with Apache Spark™
▪ Introductions
▪ Name
▪ Spark/ML/Databricks Experience
▪ Professional Responsibilities
▪ Fun Personal Interest/Fact
▪ Expectations for the Course
Programming Apache Spark

[Figure: the Catalyst Optimizer pipeline. A DataFrame query is parsed into an Unresolved Logical Plan, analyzed into a Logical Plan, rewritten by the Catalyst Optimizer into an Optimized Logical Plan, expanded into candidate Physical Plans ranked by a Cost Model, and the Selected Physical Plan is executed as RDDs (Physical Execution).]
Machine Learning

[Figure: machine learning as a learned function mapping input Features to an Output.]
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)

Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce the feature space)

Semi-Supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly tries to label the unlabeled data for use in another round of training

Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative reward
▪ Frequently utilizes neural networks and deep learning
The ML workflow: Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collection → Feature Engineering → Modeling → Deployment
ŷ = w₀ + w₁x,  y ≈ ŷ + ϵ
where...
x: feature
y: label
w₀: y-intercept
w₁: slope of the line of best fit
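The simple linear regression model above can be sketched in a few lines of Python; the coefficient values here are illustrative, not fitted:

```python
# A minimal sketch of simple linear regression: y-hat = w0 + w1 * x.
# The coefficient values are assumed for illustration only.
w0 = 1.0   # y-intercept (illustrative)
w1 = 2.0   # slope of the line of best fit (illustrative)

def predict(x):
    """Return the model's prediction y-hat for a feature value x."""
    return w0 + w1 * x

print(predict(3.0))  # 7.0
```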
Evaluation Metrics
▪ Loss: (y − ŷ)
▪ Absolute loss: |y − ŷ|
▪ Squared loss: (y − ŷ)²
Do we want these to be higher or lower?
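The three loss measures above can be computed directly; a minimal sketch in plain Python:

```python
# Minimal sketches of the three per-example loss measures above.
def loss(y, y_hat):
    return y - y_hat            # signed error

def absolute_loss(y, y_hat):
    return abs(y - y_hat)       # magnitude of the error

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2     # penalizes large errors more heavily

# Example: label 5.0, prediction 3.0
print(loss(5.0, 3.0))           # 2.0
print(absolute_loss(5.0, 3.0))  # 2.0
print(squared_loss(5.0, 3.0))   # 4.0
```

All three measure error, so lower is better.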
One-hot encoding (OHE):

Animal | OHE
Dog    | 1 0 0
Cat    | 0 1 0
Fish   | 0 0 1

▪ But what if we have an entire zoo of animals? That would result in really wide data!
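One-hot encoding can be sketched in plain Python; the function name here is illustrative (in Spark ML you would typically use StringIndexer followed by OneHotEncoder, which produces sparse vectors):

```python
# A minimal one-hot encoding sketch in plain Python (illustrative, not a
# Spark API): each category becomes a vector with a single 1.
def one_hot_encode(categories):
    vocab = sorted(set(categories))                 # stable category ordering
    index = {c: i for i, c in enumerate(vocab)}     # category -> column index
    return {
        c: [1 if i == index[c] else 0 for i in range(len(vocab))]
        for c in categories
    }

encoding = one_hot_encode(["Dog", "Cat", "Fish"])
print(encoding["Cat"])   # [1, 0, 0]  (columns are sorted: Cat, Dog, Fish)
print(encoding["Dog"])   # [0, 1, 0]
print(encoding["Fish"])  # [0, 0, 1]
```

With a whole zoo of animals, the vector length equals the vocabulary size, which is exactly the "really wide data" problem noted above.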
Track ML development with one line of code: parameters, metrics, data lineage, model, and environment.

mlflow.autolog() captures:
▪ Metrics
▪ Parameters and tags, including data version
▪ Model, environment, and artifacts
Serving options:
▪ In-line code
▪ Containers
▪ Batch & stream scoring
▪ OSS inference solutions
▪ Cloud inference services
[Figure: a decision tree with a series of Yes/No splits evaluating a job offer (Salary: 61,000; Commute: 30 mins; Free Coffee: No), ending in Accept Offer or Decline Offer.]
©2023 Databricks Inc. — All rights reserved
Decision Making

[Figure: decision tree for the job offer. Root Node: Salary > $50,000? No → Decline Offer; Yes → Commute > 1 hr?, with further Yes/No splits (e.g., Bonus?) below. The candidate being evaluated: Salary: 61,000; Commute: 30 mins; Free Coffee: No.]

[Figure: the same splits drawn as axis-aligned decision boundaries on a plot of Commute (split at 1 hour) versus Salary (split at $50,000).]
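The tree's splits are just nested if/else rules; a minimal sketch with the thresholds from the figure (the structure below the commute split is assumed for illustration):

```python
# A minimal sketch of the decision tree above as plain if/else rules.
# Thresholds come from the figure; the ordering below the root and the
# handling of deeper splits are assumed for illustration.
def decide(salary, commute_minutes):
    if salary <= 50_000:          # Root node: Salary > $50,000?
        return "Decline Offer"
    if commute_minutes > 60:      # Commute > 1 hr?
        return "Decline Offer"
    return "Accept Offer"

# The candidate from the slide: Salary 61,000, Commute 30 mins.
print(decide(salary=61_000, commute_minutes=30))  # Accept Offer
```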
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
[Figure: bootstrap aggregation (bagging). Sampling the one training set with replacement yields many bootstrap datasets; each trains its own tree, and the trees' predictions are aggregated into a final prediction.]
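The bagging idea above can be sketched in plain Python; the names and the stand-in "model" are illustrative, not Spark's RandomForest API:

```python
import random

# A minimal bagging sketch (illustrative, not a Spark API): bootstrap-sample
# the training set with replacement, fit one "model" per sample, then
# average the models' predictions into a final prediction.
random.seed(0)

training_set = [5, 2, 8, 4]  # toy labels; a real set holds (features, label) rows

def bootstrap_sample(data):
    # Sample len(data) rows with replacement.
    return [random.choice(data) for _ in data]

def fit_mean_model(sample):
    # Stand-in "model": always predicts the sample mean.
    mean = sum(sample) / len(sample)
    return lambda: mean

models = [fit_mean_model(bootstrap_sample(training_set)) for _ in range(500)]
final_prediction = sum(m() for m in models) / len(models)
print(round(final_prediction, 2))  # close to the training-set mean of 4.75
```

Averaging 500 such predictions has far lower variance than any single bootstrap model, which is the point of the bullets above.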
Question: With 3-fold cross validation, how many models will this build?
▪ Bayesian process
▪ Creates a meta-model that maps hyperparameters to the probability of a score on the objective function
▪ Provide a range and distribution for continuous and discrete values
▪ Adaptive TPE better tunes the search space by:
▪ Freezing hyperparameters
▪ Tuning the number of random trials run before TPE takes over
AutoML workflow:
▪ UI and API to start AutoML training
▪ Auto-created MLflow Experiment to track metrics
▪ Easily deploy models to Model Registry
▪ Data exploration notebook: a generated notebook with feature summary statistics and distributions, used to understand and debug data quality and preprocessing
“Can this dataset be used to predict customer churn?”

“What direction should I go in for this ML project, and what benchmark should I aim to beat?”
[Figure: AutoML Configuration → AutoML Training (“Opaque Box”) → Returned Best Model → “Production Cliff” → Deployed Model]

Problem:
1. A “production cliff” exists where data scientists need to modify the returned “best” model using their domain expertise before deployment
2. Data scientists need to be able to explain how they trained a model for regulatory purposes (e.g., FDA, GDPR), and most AutoML solutions have “opaque box” models

Result / Pain Points:
● The “best” model returned is often not good enough to deploy
● Data scientists must spend time and energy reverse-engineering these “opaque-box” returned models so that they can modify them and/or explain them
“Glass-Box” AutoML

[Figure: the “glass-box” AutoML workflow, from Configure through training to Deploy.]
Boosting on residuals: the second model is trained to predict the first model's errors, and its output is added to the first model's predictions.

y  | ŷ₁ | r₁ = y − ŷ₁ | r̂₁ | r₁ − r̂₁ | ŷ₂ = ŷ₁ + r̂₁
40 | 35 |  5          |  3 |  2       | 38
60 | 67 | −7          | −4 | −3       | 63
30 | 28 |  2          |  3 | −1       | 31
33 | 32 |  1          |  0 |  1       | 32
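The residual arithmetic in the table can be reproduced in a few lines; the predicted residuals are taken as given, standing in for a second fitted tree:

```python
# A minimal sketch of one boosting step, reproducing the table above:
# the second model predicts the first model's residuals, and its output
# is added to the first model's predictions.
labels = [40, 60, 30, 33]
pred_1 = [35, 67, 28, 32]        # first model's predictions
residuals = [y - p for y, p in zip(labels, pred_1)]       # targets for model 2
pred_residuals = [3, -4, 3, 0]   # second model's predicted residuals (given)
pred_2 = [p + r for p, r in zip(pred_1, pred_residuals)]  # updated predictions

print(residuals)  # [5, -7, 2, 1]
print(pred_2)     # [38, 63, 31, 32]
```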
Optimum Model Complexity

[Figure: the bias-variance trade-off. As model complexity grows, variance rises and bias² falls; total error is minimized at an intermediate, optimum complexity.]
Deployment options (ordered by latency requirements):
1. Batch
▪ 80-90% of deployments
▪ Leverages databases and object storage
▪ Fast retrieval of stored predictions
2. Streaming (continuous)
▪ 10-15% of deployments
▪ Moderately fast scoring on new data
3. Real time
▪ 5-10% of deployments
▪ Usually uses REST (Azure ML, SageMaker, containers)
4. On-device (edge)
▪ Logistic function: σ(z) = 1 / (1 + e⁻ᶻ)
▪ Large positive inputs → 1
▪ Large negative inputs → 0
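The logistic function's behavior at the extremes is easy to verify; a minimal sketch:

```python
import math

# A minimal sketch of the logistic (sigmoid) function described above:
# sigma(z) = 1 / (1 + e^(-z)).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5
print(sigmoid(10))   # close to 1 (large positive input)
print(sigmoid(-10))  # close to 0 (large negative input)
```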
Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly
complementary
▪ So let’s set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is 𝐏[y = 1 | x] where x represents the features
But we need class predictions, not probability predictions
▪ Set a threshold on the probability predictions
▪ 𝐏[y = 1 | x] < 0.5 → y = 0
▪ 𝐏[y = 1 | x] ≥ 0.5 → y = 1
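The thresholding rule above is a one-liner; a minimal sketch with 1 = Red and 0 = Blue as defined earlier:

```python
# A minimal sketch of thresholding the model output P[y = 1 | x] at 0.5
# to turn a probability into a class prediction (1 = Red, 0 = Blue).
def predict_class(probability, threshold=0.5):
    return 1 if probability >= threshold else 0

print(predict_class(0.73))  # 1 (Red)
print(predict_class(0.49))  # 0 (Blue)
print(predict_class(0.5))   # 1 (the threshold itself maps to y = 1)
```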
Accuracy = (TP + TN) / (TP + FP + TN + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1 = (2 × Precision × Recall) / (Precision + Recall)
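The four formulas above can be computed from the confusion-matrix counts; a minimal sketch with illustrative counts:

```python
# A minimal sketch computing the four metrics above from confusion-matrix
# counts (the example counts are illustrative).
def classification_metrics(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Example: 8 true positives, 2 false positives, 7 true negatives, 3 false negatives
acc, prec, rec, f1 = classification_metrics(tp=8, fp=2, tn=7, fn=3)
print(acc)   # 0.75
print(prec)  # 0.8
```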