Slides Scalable Machine Learning With Apache Spark
LET’S GET STARTED
Apache Spark™ Overview
Apache Spark Background
▪ Founded as a research project at UC Berkeley in 2009
▪ Open-source unified data analytics engine for big data
▪ Built-in APIs in SQL, Python, Scala, R, and Java
Have you ever counted the number of M&Ms in a jar?
Spark Cluster
[Diagram: one driver coordinating many worker/executor nodes]
Logical Plan
Catalyst Optimizer
Physical Execution
Under the Catalyst Optimizer’s Hood
[Diagram: DataFrame → Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → Cost Model selects a Physical Plan → RDDs]
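A minimal PySpark sketch of inspecting the plans Catalyst produces for a DataFrame query (the `sales` DataFrame and its columns are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Hypothetical DataFrame used only for illustration
sales = spark.createDataFrame(
    [("US", 100.0), ("CA", 250.0), ("US", 75.0)],
    ["country", "amount"],
)

# Transformations only build up a logical plan; nothing executes yet
summary = sales.groupBy("country").agg(F.sum("amount").alias("total"))

# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan that Catalyst selected
summary.explain(True)
```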
When to Use Spark
[Diagram: Features → Machine Learning → Output]
Types of Machine Learning
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)
Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)
Types of Machine Learning
Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training
Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning
Machine Learning Workflow
Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collection → Feature Engineering → Modeling → Deployment
Defining and Measuring Success: Establish a baseline!
DATA CLEANSING DEMO
Importance of Data Visualization
How do we build and evaluate models?
DATA EXPLORATION LAB
Linear Regression
Goal: Find the line of best fit.
ŷ = w0 + w1x
y = ŷ + ϵ
where...
x: feature
y: label
w0: y-intercept
w1: slope of the line of best fit
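A minimal sketch of fitting this model with Spark ML (assumes an active SparkSession named `spark`; the tiny `train_df` dataset is made up for illustration):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Hypothetical training data: one feature column "x" and a label column "y"
train_df = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)], ["x", "y"]
)

# Spark ML expects all features packed into a single vector column
assembler = VectorAssembler(inputCols=["x"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="y")

model = lr.fit(assembler.transform(train_df))
print(model.intercept, model.coefficients)  # w0 and w1
```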
Minimizing the Residuals
Evaluation Metrics
▪ Loss: (y - ŷ)
▪ Absolute loss: |y - ŷ|
▪ Squared loss: (y - ŷ)²
Evaluation Metric: Root Mean Squared Error (RMSE)
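Continuing the hypothetical linear-regression sketch above, RMSE can be computed with Spark ML's RegressionEvaluator:

```python
from pyspark.ml.evaluation import RegressionEvaluator

# Predictions from the fitted model defined in the earlier sketch
pred_df = model.transform(assembler.transform(train_df))

evaluator = RegressionEvaluator(
    labelCol="y", predictionCol="prediction", metricName="rmse"
)
print(evaluator.evaluate(pred_df))  # root mean squared error
```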
Linear Regression Assumptions
▪ Linear relationship between X and the mean of Y (linearity)
▪ Observations are independent from one another (independence)
▪ Y is normally distributed for any fixed observation (normality)
▪ The variance of the residuals is the same for any feature (homoscedasticity)
Linear Regression Assumptions
So, which datasets are suited for linear regression?
Train vs. Test RMSE
Evaluation Metric: R²
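For reference, the standard definition (not spelled out on the slide): R² = 1 − SS_res / SS_tot = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², i.e., the fraction of the label's variance explained by the model.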
OHE
Dog → 1 0 0
Cat → 0 1 0
Fish → 0 0 1
▪ But what if we have an entire zoo of animals? That would result in really wide data!
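A minimal Spark ML sketch of this encoding (the `animals` DataFrame is a made-up example; note that Spark's OneHotEncoder emits sparse vectors, which keeps an entire "zoo" of categories from blowing up memory):

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Hypothetical DataFrame with a single categorical column
animals = spark.createDataFrame([("Dog",), ("Cat",), ("Fish",)], ["animal"])

# Map category strings to indices, then one-hot encode the indices
indexer = StringIndexer(inputCol="animal", outputCol="animal_idx")
encoder = OneHotEncoder(inputCols=["animal_idx"], outputCols=["animal_ohe"])

indexed = indexer.fit(animals).transform(animals)
encoded = encoder.fit(indexed).transform(indexed)
encoded.show(truncate=False)
```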
Model Lifecycle
[Diagram: Data scientists log runs and experiments to MLflow Tracking (metadata, models); models are promoted through the Model Registry as versions (v1, v2, v3); deployment engineers serve them via in-line code, containers, batch & stream scoring, OSS serving solutions, cloud inference services, and custom serving.]
MLFLOW TRACKING DEMO
MLflow Model Registry
▪ Collaborative, centralized model hub
▪ Facilitate experimentation, testing, and production
▪ Integrate with approval and governance workflows
▪ Monitor ML deployments and their performance
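A minimal sketch of logging a run and registering a model version with MLflow (the parameter, metric value, and registered-model name `offer_model` are illustrative; `model` stands for any fitted Spark ML model, such as the linear regression above):

```python
import mlflow
import mlflow.spark

with mlflow.start_run():
    mlflow.log_param("maxDepth", 5)   # illustrative hyperparameter
    mlflow.log_metric("rmse", 0.78)   # illustrative metric value
    # Passing registered_model_name creates a new version (v1, v2, ...) in the registry
    mlflow.spark.log_model(model, "model", registered_model_name="offer_model")
```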
Decision Making
[Example decision tree for a job offer (Salary: $61,000, Commute: 30 mins, Free Coffee: No): the root node asks "Offers Free Coffee?"; further splits ask "Salary > $50,000?", "Commute?", and "Bonus?"; the leaves are "Accept Offer" or "Decline Offer".]
Create Split Candidates
Feature values
Lines vs. Boundaries
Linear Regression
▪ Lines through data
▪ Assumed linear relationship
Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships
[Plot: decision boundaries over Commute (1 hour) vs. Salary ($50,000)]
Linear Regression or Decision Tree?
Tree Depth: the length of the longest path from the root node to a leaf node
[Diagram: "Salary > $50,000" root node at depth 0, with Yes/No branches]
Note: shallow trees tend to underfit, and deep trees tend to overfit
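A minimal Spark ML sketch showing where tree depth is controlled (a `train_df` with an assembled `features` vector column and a `label` column is assumed):

```python
from pyspark.ml.classification import DecisionTreeClassifier

# maxDepth bounds the longest root-to-leaf path: too shallow underfits, too deep overfits
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=3)
dt_model = dt.fit(train_df)
print(dt_model.toDebugString)  # prints the learned splits level by level
```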
Underfitting vs. Overfitting
Underfitting Just Right Overfitting
Additional Resource
Model Complexity
https://www.explainxkcd.com/wiki/index.php/2021:_Software_Development
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
Bootstrap Sampling
A method for simulating N new datasets:
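A minimal sketch of bootstrap sampling with the DataFrame API (`full_df` is a placeholder for the full training set); each sample draws rows with replacement at roughly the original size:

```python
N = 5  # number of simulated datasets
bootstrap_samples = [
    full_df.sample(withReplacement=True, fraction=1.0, seed=i)
    for i in range(N)
]
```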
Random Forest Algorithm
[Diagram: Full Training Data → bootstrap samples → one decision tree per sample → Aggregation → Final Prediction]
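A minimal sketch of the corresponding Spark ML estimator (column names and hyperparameter values are illustrative):

```python
from pyspark.ml.regression import RandomForestRegressor

# Each of the numTrees trees is trained on a bootstrap sample; predictions are averaged
rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                           numTrees=500, maxDepth=5)
rf_model = rf.fit(train_df)  # train_df is assumed to have a vector "features" column
```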
[Hyperparameter grid: combinations (5, 2), (5, 4), (8, 2), (8, 4)]
Question: With 3-fold cross validation, how many models will this build?
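A hedged sketch of how such a grid is wired up in Spark ML, reusing the `rf` estimator from the sketch above (a hypothetical 2 × 2 grid with 3 folds trains 4 × 3 = 12 models during the search, plus one final refit of the best configuration on the full training set):

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 8])
        .addGrid(rf.numTrees, [2, 4])
        .build())

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(labelCol="label", metricName="rmse"),
                    numFolds=3)
cv_model = cv.fit(train_df)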
HYPERPARAMETER TUNING
LAB
Hyperparameter Tuning with Hyperopt
Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn’t used
▪ So what do you do if…
▪ You have a training budget
▪ You have many hyperparameters to tune
▪ You want to pick your hyperparameters based on past results
Hyperopt
▪ Open-source Python library
▪ Optimization over awkward search spaces (real-valued, discrete,
and conditional dimensions)
▪ Supports serial or parallel optimization
▪ Spark integration
▪ Three core algorithms for optimization:
▪ Random Search
▪ Tree of Parzen Estimators (TPE)
▪ Adaptive TPE
Paper
Optimizing Hyperparameter Values
Random Search vs. Tree of Parzen Estimators (TPE)
▪ TPE is a Bayesian process
▪ It builds a meta-model that maps hyperparameters to the probability of a score on the objective function
▪ Provide a range and distribution for continuous and discrete values
▪ Adaptive TPE better tunes the search space by:
▪ Freezing hyperparameters
▪ Tuning the number of random trials before TPE
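A minimal Hyperopt sketch (the search space, toy objective, and parallelism value are illustrative assumptions; in practice the objective would train a model and return its validation loss):

```python
from hyperopt import fmin, tpe, hp, SparkTrials

# Mixed search space: a discrete and a continuous hyperparameter
search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    # Toy loss standing in for "train a model and return validation loss"
    return (params["max_depth"] - 6) ** 2 + params["learning_rate"]

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,          # Tree of Parzen Estimators
    max_evals=20,
    trials=SparkTrials(parallelism=4),  # distribute trials on a Spark cluster
)
print(best)
```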
HYPEROPT
DEMO
HYPEROPT
LAB
AutoML
Databricks AutoML
A glass-box solution that empowers data teams without taking away control
“Can this dataset be used to predict customer churn?”
“What direction should I go in for this ML project and what benchmark should I aim to beat?”
Problems with Existing AutoML Solutions
Opaque-Box and Production Cliff Problems in AutoML
[Diagram: AutoML Configuration → AutoML Training (“Opaque Box”) → Returned Best Model → Production Cliff → Deployed Model]
1. A “production cliff” exists where data scientists need to modify the returned “best” model using their domain expertise before deployment
2. Data scientists need to be able to explain how they trained a model for regulatory purposes (e.g., FDA, GDPR, etc.), and most AutoML solutions have “opaque box” models
● The “best” model returned is often not good enough to deploy
● Data scientists must spend time and energy reverse engineering these “opaque-box” returned models so that they can modify them and/or explain them
“Glass-Box” AutoML
Configure
Deploy
AutoML Lab
Feature Store
The first Feature Store codesigned with a Data and MLOps Platform
[Diagram: Feature Store — a Feature Provider and Feature Registry serving features for Batch (high throughput) and Online (low latency) access]
Boosting vs. Bagging
GBDT (Boosting)
▪ Starts with high bias, low variance
▪ Works right (toward higher model complexity)
RF (Bagging)
▪ Starts with high variance, low bias
▪ Works left (toward lower model complexity)
[Plot: Bias² and Variance vs. Model Complexity, with the optimum model complexity where total error is minimized]
Gradient Boosted Decision Trees Implementations
▪ Spark ML
▪ Built into Spark
▪ Utilizes Spark’s existing decision tree implementation
▪ XGBoost
▪ Designed and built specifically for gradient boosted trees
▪ Regularized to prevent overfitting
▪ Highly parallel
▪ Works nicely with Spark in Scala
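A minimal sketch of the Spark ML implementation (hyperparameter values are illustrative; `train_df` with a vector `features` column and a `label` column is assumed):

```python
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(
    featuresCol="features",
    labelCol="label",
    maxIter=100,  # number of boosting iterations (trees)
    maxDepth=3,   # boosting typically uses shallow trees (high bias, low variance)
)
gbt_model = gbt.fit(train_df)
```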
XGBOOST DEMO
Appendix
MLlib Deployment Options
Data Science vs. Data Engineering
▪ Data Science != Data Engineering
▪ Data Science
▪ Scientific
▪ Art
▪ Business problems
▪ Model mathematically
▪ Optimize performance
▪ Data Engineering
▪ Reliability
▪ Scalability
▪ Maintainability
▪ SLAs
Model Operations (ModelOps)
▪ DevOps
▪ Software development and IT operations
▪ Manages deployments
▪ CI/CD of features, patches, updates, and rollbacks
▪ Agile vs. waterfall
▪ ModelOps
▪ Data modeling and deployment operations
▪ Java environments
▪ Containers
▪ Model performance monitoring
The Four ML Deployment Options
▪ Batch
▪ 80-90 percent of deployments
▪ Leverages databases and object storage
▪ Fast retrieval of stored predictions
▪ Continuous/Streaming
▪ 10-15 percent of deployments
▪ Moderately fast scoring on new data
▪ Real-time
▪ 5-10 percent of deployments
▪ Usually using REST (Azure ML, SageMaker, containers)
▪ On-device
Overview of a typical Databricks CI/CD pipeline
Continuous integration → Continuous delivery
▪ Logistic Function:
▪ Large positive inputs → 1
▪ Large negative inputs → 0
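For reference, the logistic function behind these bullets is σ(z) = 1 / (1 + e^(−z)): large positive z pushes σ(z) toward 1, and large negative z pushes it toward 0.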
Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly complementary
▪ So let’s set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is 𝐏[y = 1 | x] where x represents the features
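A minimal Spark ML sketch (hypothetical `train_df` and `test_df` with an assembled `features` column and a 0/1 `label`, where Red = 1 and Blue = 0):

```python
from pyspark.ml.classification import LogisticRegression

lr_clf = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr_clf.fit(train_df)

preds = lr_model.transform(test_df)
# "probability" holds [P(y=0 | x), P(y=1 | x)]; "prediction" applies the default 0.5 threshold
preds.select("probability", "prediction").show(truncate=False)
```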
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
K-Means
Clustering
▪ Unsupervised learning
▪ Unlabeled data (no known function output)
▪ Categorize records based on features
K-Means Clustering
Global minimum
Local minimum
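A minimal Spark ML sketch (the number of clusters and the `feature_df` DataFrame are assumptions); note that different seeds can converge to different local minima of the clustering cost:

```python
from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", k=3, seed=42)
km_model = kmeans.fit(feature_df)
print(km_model.clusterCenters())
```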
Other Clustering Techniques
Collaborative Filtering
Recommendation Systems
Naive Approaches to Recommendation
▪ Hand-curated
▪ Aggregates
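Beyond these naive approaches, Spark ML's collaborative filtering uses alternating least squares (ALS). A minimal sketch (the `ratings_df` schema and hyperparameters are illustrative):

```python
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=10, coldStartStrategy="drop")
als_model = als.fit(ratings_df)
recs = als_model.recommendForAllUsers(5)  # top-5 recommendations per user
```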