Scalable Machine Learning With Apache Spark en

Uploaded by

Ankit Kabi

Scalable Machine Learning with Apache Spark™

©2023 Databricks Inc. — All rights reserved


Introductions

▪ Introductions
▪ Name
▪ Spark/ML/Databricks Experience
▪ Professional Responsibilities
▪ Fun Personal Interest/Fact
▪ Expectations for the Course



Course Objectives
1 Create data processing pipelines with Spark

2 Build and tune machine learning models with Spark ML

3 Track, version, and deploy machine learning models with MLflow

4 Perform distributed hyperparameter tuning with Hyperopt

5 Scale the inference of single-node models with Spark



Agenda (half-days)
Day 1
1. Spark Review*
2. Delta Lake Review*
3. ML Overview*
4. Break
5. Data Cleansing
6. Data Exploration Lab
7. Break
8. Linear Regression, pt. 1

Day 2
1. Linear Regression, pt. 1 Lab
2. Linear Regression, pt. 2
3. Break
4. Linear Regression, pt. 2 Lab
5. MLflow Tracking
6. Break
7. MLflow Model Registry
8. MLflow Lab

Day 3
1. Decision Trees
2. Break
3. Random Forest and Hyperparameter Tuning
4. Hyperparameter Tuning Lab
5. Break
6. Hyperopt
7. Hyperopt Lab

Day 4
1. AutoML
2. AutoML Lab
3. Feature Store
4. Break
5. XGBoost
6. Inference with Pandas UDFs
7. Pandas UDFs Lab
8. Break
9. Training with Pandas Function API
10. Pandas API on Spark

*Optional


Agenda (full days)
Day 1
1. Spark Review*
2. Delta Lake Review*
3. ML Overview*
4. Break
5. Data Cleansing
6. Data Exploration Lab
7. Break
8. Linear Regression, pt. 1
9. Linear Regression, pt. 1 Lab
10. Linear Regression, pt. 2
11. Break
12. Linear Regression, pt. 2 Lab
13. MLflow Tracking
14. Break
15. MLflow Model Registry
16. MLflow Lab

Day 2
1. Decision Trees
2. Break
3. Random Forest and Hyperparameter Tuning
4. Hyperparameter Tuning Lab
5. Break
6. Hyperopt
7. Hyperopt Lab
8. AutoML
9. AutoML Lab
10. Feature Store
11. Break
12. XGBoost
13. Inference with Pandas UDFs
14. Pandas UDFs Lab
15. Break
16. Training with Pandas Function API
17. Koalas

*Optional


Survey

▪ Apache Spark
▪ Machine Learning
▪ Programming Language



Databricks Certified ML Associate
Certification helps you gain industry recognition, competitive
differentiation, greater productivity, and results.

• This course helps you prepare for the Databricks Certified Machine Learning Associate exam
• Please see the Databricks Academy for additional prep materials

For more information visit: databricks.com/learn/certification


LET’S GET STARTED



Apache Spark™
Overview



Apache Spark Background
▪ Founded as a research project at
UC Berkeley in 2009
▪ Open-source unified data analytics
engine for big data
▪ Built-in APIs in SQL, Python, Scala,
R, and Java



Have you ever counted the number of M&Ms in a jar?


Spark Cluster
[Diagram: one driver coordinating four workers. Each worker runs an executor, and each executor is its own JVM, so a cluster has one driver and many executor JVMs.]


Spark’s Structured Data APIs

RDD (2011)
▪ Distributed collection of JVM objects
▪ Functional operators (map, filter, etc.)

DataFrame (2013)
▪ Distributed collection of row objects
▪ Expression-based operations and UDFs
▪ Logical plans and optimizer
▪ Fast/efficient internal representations

Dataset (2015)
▪ Internally rows, externally JVM objects
▪ Almost the “best of both worlds”: type safe + fast
▪ But still slower than DataFrames


Spark DataFrame Execution
PySpark DataFrame / Java/Scala DataFrame / SparkR DataFrame
→ Logical Plan
→ Catalyst Optimizer
→ Physical Execution


Under the Catalyst Optimizer’s Hood

Input: SQL Query or DataFrame, parsed into an Unresolved Logical Plan
1. Analysis: Unresolved Logical Plan → Logical Plan
2. Logical Optimization: Logical Plan → Optimized Logical Plan
3. Physical Planning: Optimized Logical Plan → Physical Plans → (Cost Model) → Selected Physical Plan
4. Code Generation: Selected Physical Plan → RDDs


When to Use Spark
▪ Scaling Out: Data or model is too large to process on a single machine, commonly resulting in out-of-memory errors
▪ Speeding Up: Data or model is processing slowly and could benefit from shorter processing times and faster results


Delta Lake Overview



Open-source Storage Layer



Delta Lake’s Key Features
▪ ACID transactions
▪ Time travel (data versioning)
▪ Schema enforcement and evolution
▪ Audit history
▪ Parquet format
▪ Compatible with Apache Spark API



Machine Learning
Overview (Optional)



What is Machine Learning
▪ Learn patterns and relationships in your data without explicitly
programming them
▪ Derive an approximation function to map features to an output or relate
them to each other

Features → Machine Learning → Output


Types of Machine Learning
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)

Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)


Types of Machine Learning
Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training

Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning


Machine Learning Workflow

Define Business Use Case → Define Success, Constraints and Infrastructure → Data Collation → Feature Engineering → Modeling → Deployment


Defining and Measuring Success: Establish baseline!


DATA CLEANSING DEMO



Importance of Data Visualization



How do we build and evaluate models?



DATA EXPLORATION LAB



Linear Regression



Linear Regression
Goal: Find the line of best fit.

$\hat{y} = w_0 + w_1 x$

$y \approx \hat{y} + \epsilon$

where...

$x$: feature
$y$: label
$w_0$: y-intercept
$w_1$: slope of the line of best fit
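
As a minimal sketch of what "finding the line of best fit" means for a single feature, the intercept and slope can be computed in closed form from the data (slope = covariance of x and y divided by the variance of x). The data values and the helper name `fit_line` are made up for illustration; the course itself uses Spark ML rather than hand-written code.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (w0, w1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = mean_y - w1 * mean_x
    return w0, w1


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]

w0, w1 = fit_line(xs, ys)
predictions = [w0 + w1 * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predictions)]
```

With an intercept in the model, the least-squares residuals sum to zero (up to floating-point error), which is a quick sanity check on any fit.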



Minimizing the Residuals
▪ Red point: True value
▪ Purple & Orange dotted lines: Residuals
▪ Green line: Line of best fit

The goal is to draw a line that minimizes the sum of the squared residuals.


Regression Evaluators
Measure the “closeness” between the actual value and the predicted value.

Evaluation Metrics

▪ Loss: $(y - \hat{y})$
▪ Absolute loss: $|y - \hat{y}|$
▪ Squared loss: $(y - \hat{y})^2$


Evaluation Metric: Root mean-squared-error
(RMSE)
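
RMSE is the square root of the mean of the squared losses, so it is expressed in the same units as the label. A small sketch with made-up values (the function name `rmse` is illustrative, not a Spark API):

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical labels and predictions: the errors are 1, 0 and -2.
example_rmse = rmse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones.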



Linear Regression Assumptions
▪ Linear relationship between X and the mean of Y (linearity)
▪ Observations are independent from one another (independence)
▪ Y is normally distributed for any fixed value of X (normality)
▪ The variance of the residuals is the same for any value of X (homoscedasticity)


Linear Regression Assumptions
So, which datasets are suited for linear regression?



Train vs. Test RMSE

Which is more important? Why?


Train

Test



Evaluation Metric: $R^2$

What is the range of $R^2$?

Do we want it to be higher or lower?
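
One way to explore these questions is to compute the metric by hand. The sketch below uses the standard definition, 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean label; the three sets of predictions are invented to show a perfect model, a model that always predicts the mean, and a model that does worse than the mean.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


labels = [1.0, 2.0, 3.0]
perfect = r_squared(labels, [1.0, 2.0, 3.0])    # predictions match the labels
mean_only = r_squared(labels, [2.0, 2.0, 2.0])  # always predict the mean label
reversed_fit = r_squared(labels, [3.0, 2.0, 1.0])  # worse than the mean
```

The three cases land at the top of the range, at zero, and below zero respectively, which shows that the metric has an upper bound but no lower bound.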



Machine Learning Libraries

Scikit-learn is a popular single-node machine learning library.

But what if our data or model gets too big?


Machine Learning in Spark
Scale Out and Speed Up
Machine learning in Spark allows us to work with bigger data and train models faster by distributing the data and computations across multiple workers.

Spark Machine Learning Libraries

MLlib
▪ Original ML API for Spark
▪ Based on RDDs
▪ Maintenance Mode

Spark ML
▪ Newer ML API for Spark
▪ Based on DataFrames


LINEAR REGRESSION
DEMO I



LINEAR REGRESSION
LAB I



Non-numeric Features
Two primary types of non-numeric features

Categorical Features
▪ A series of categories of a single feature
▪ No intrinsic ordering
▪ e.g. Dog, Cat, Fish

Ordinal Features
▪ A series of categories of a single feature
▪ Relative ordering, but not necessarily consistent spacing
▪ e.g. Infant, Toddler, Adolescent, Teen, Young Adult, etc.


Non-numeric Features in Linear Regression
How do we handle non-numeric features for linear
regression?

▪ X-axis is numeric, so features need to be numeric


▪ Convert our non-numeric features to numeric features?

Could we assign numeric values to each of the categories?

▪ “Dog” = 1, “Cat” = 2, “Fish” = 3, etc.


▪ Does this make sense?

This implies 1 Cat is equal to 2 Dogs!


Non-numeric Features in Linear Regression
Instead, we commonly use a practice known as one-hot encoding (OHE).
▪ Creates a binary “dummy” feature for each category

Animal   Dog   Cat   Fish
Dog      1     0     0
Cat      0     1     0
Fish     0     0     1

▪ Doesn’t force a uniformly-spaced, ordered numeric representation
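
The encoding in the table can be sketched in plain Python. This is an illustration of the idea only: `one_hot_encode` is a hypothetical helper, not the Spark ML transformer, and it keeps categories in order of first appearance so the output lines up with the Dog, Cat, Fish columns.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary dummy column per category."""
    # dict.fromkeys keeps the first-seen order and drops duplicates.
    categories = list(dict.fromkeys(values))
    column_of = {category: i for i, category in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[column_of[value]] = 1  # exactly one "hot" position per record
        rows.append(row)
    return categories, rows


categories, encoded = one_hot_encode(["Dog", "Cat", "Fish", "Dog"])
```

Each row has exactly one 1, so no category is treated as numerically larger or smaller than another.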



One-hot Encoding at Scale
You might be thinking...
▪ Okay, I see what’s happening here … this works for a handful of animals.

▪ But what if we have an entire zoo of animals? That would result in really wide
data!

Spark uses sparse vectors for this…


DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)
SparseVector(10, [3, 5], [7, 2])

▪ Sparse vectors take the form:

(Number of elements, [indices of non-zero elements], [values of non-zero elements])
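
To make the (size, indices, values) form concrete, here is a small sketch that converts between dense and sparse representations using plain lists. The helper names are hypothetical stand-ins and do not reproduce Spark's vector classes.

```python
def to_sparse(dense):
    """Return (size, indices of non-zero elements, values of non-zero elements)."""
    indices = [i for i, value in enumerate(dense) if value != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values


def to_dense(size, indices, values):
    """Rebuild the full list from its sparse form."""
    dense = [0] * size
    for i, value in zip(indices, values):
        dense[i] = value
    return dense


dense_vector = [0, 0, 0, 7, 0, 2, 0, 0, 0, 0]
sparse_vector = to_sparse(dense_vector)
round_trip = to_dense(*sparse_vector)
```

A one-hot encoded row has a single non-zero entry, so its sparse form stores one index and one value no matter how many categories exist.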



LINEAR REGRESSION
DEMO II



LINEAR REGRESSION
LAB II



MLflow Tracking



MLflow

▪ Open-source platform for the machine learning lifecycle
▪ Operationalizing machine learning
▪ Developed by Databricks
▪ Pre-installed on the Databricks Runtime for ML


Core Machine Learning Issues
▪ Keeping track of experiments or model development
▪ Reproducing code
▪ Comparing models
▪ Standardization of packaging and deploying models

MLflow addresses these issues.



MLflow Components

▪ Tracking: Record and query experiments: code, data, config, results
▪ Projects: Packaging format for reproducible runs on any platform
▪ Models: General model format that supports diverse deployment tools
▪ Model Registry: Centralized and collaborative model lifecycle management

▪ APIs: CLI, Python, R, Java, REST


MLflow tracking and autologging

Ensure reproducibility

Track ML development with one line of code: parameters, metrics, data lineage, model, and environment.

mlflow.autolog()

▪ Model, environment, and artifacts
▪ Metrics
▪ Parameters and tags, including data version

Analyze results in UI or programmatically

● How does tuning parameter X affect my metric?
● What is the best model?
● Did I run training for long enough?


Model Deployment Options

Serving
▪ In-Line Code
▪ Containers
▪ Batch & Stream Scoring
▪ OSS Inference Solutions
▪ Cloud Inference Services



The Full ML Lifecycle



MLFLOW TRACKING
DEMO



MLflow Model Registry



MLflow Model Registry
▪ Collaborative, centralized model hub
▪ Facilitate experimentation, testing, and production
▪ Integrate with approval and governance workflows
▪ Monitor ML deployments and their performance

Databricks MLflow Blog Post



One Collaborative Hub for Model Management

Centralized Model Management and Discovery
● Overview of all registered models, their versions at Staging and Production
● Search by name, tags, etc.
● Model-based ACLs

Full lineage from deployed models to training code / data
● Full lineage from Model Version to
○ Run that produced the model
○ Notebook that produced the run
○ Exact revision history of the notebook that produced the run


Version Control and Visibility into Deployment Process

Versioning of ML artifacts
● Overview of active model versions and their deployment stage
● Comparison of versions and their logged metrics, parameters, etc.

Visibility and auditability of the deployment process
● Audit log of stage transitions and requests per model


Review Processes and CI/CD Integration
Manual review process
● Stage-based Access Controls
● Request and approval workflow for stage transitions

[Diagram: model versions v1, v2, and v3 in the Staging, Production, and Archived stages, with Data Scientists and Deployment Engineers as the people involved.]

Automation through CI/CD integration
● Webhooks allow registering of callbacks (e.g. for tests / deployment) on events in the Model Registry
● Webhooks for events like model creation, version creation, transition request, etc.
● Mechanisms to store results / metadata through Tags and Comments


MLFLOW MODEL REGISTRY
DEMO



MLFLOW
LAB



Decision Trees



Decision Making
Salary > $50,000 (Root Node)
  No → Decline Offer
  Yes → Commute > 1 hr
    Yes → Decline Offer
    No → Offers Free Coffee
      Yes → Accept Offer (Leaf Node)
      No → Decline Offer (Leaf Node)

Example records:
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: Yes
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: No


Decision Making
Salary > $50,000
  No → Decline Offer
  Yes → Commute > 1 hr
    Yes → Decline Offer
    No → Offers Free Coffee
      Yes → Accept Offer
      No → Salary > $60,000 (new split in place of Decline Offer)
        Yes → Accept Offer
        No → Decline Offer

Example record:
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: No


Decision Making
Salary > $50,000 (Root Node)
  No → Decline Offer
  Yes → Commute > 1 hr
    Yes → Decline Offer
    No → Offers Free Coffee
      Yes → Accept Offer
      No → Salary > $60,000
        Yes → Accept Offer
        No → Decline Offer

Example record:
▪ Salary: 61,000; Commute: 30 mins; Free Coffee: No
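
A decision tree is a set of nested if/else rules. As a sketch, the final tree on this slide (including the extra Salary > $60,000 split under the no-free-coffee branch) can be written directly as a function; the function name and the choice to express the commute in hours are assumptions made for illustration.

```python
def decide(salary, commute_hours, free_coffee):
    """Walk the job-offer tree from the root node down to a leaf."""
    if salary <= 50_000:          # root node: Salary > $50,000?
        return "Decline Offer"
    if commute_hours > 1:         # Commute > 1 hr?
        return "Decline Offer"
    if free_coffee:               # Offers Free Coffee?
        return "Accept Offer"
    # Deepest split: Salary > $60,000?
    return "Accept Offer" if salary > 60_000 else "Decline Offer"


# The example record from the slide: 61,000 salary, 30 minute commute, no coffee.
decision = decide(salary=61_000, commute_hours=0.5, free_coffee=False)
```

A learned tree has the same shape; training chooses which feature and threshold to test at each node instead of having a person write them.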



Determining Splits

Commute? split at 1 hr (< 1 hr vs. > 1 hr) compared with Commute? split at 10 min (< 10 min vs. > 10 min)

1 hr is a better splitting point for Commute because it provides information about the classification.


Determining Splits

Commute? (< 1 hr vs. > 1 hr) compared with Bonus? (Yes vs. No)

Commute is a better choice because it provides information about the classification.


Creating Decision Boundaries
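Salary > $50,000
  No → Decline Offer
  Yes → Commute > 1 hr
    Yes → Decline Offer
    No → Accept Offer

[Plot: Salary on the x-axis with a boundary at $50,000, Commute on the y-axis with a boundary at 1 hour. The region with Salary above $50,000 and Commute under 1 hour is Accept Offer; the other regions are Decline Offer.]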


Lines vs. Boundaries
Linear Regression
▪ Lines through data
▪ Assumed linear relationship

Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships


Linear Regression or Decision Tree?

It depends on the data...


Tree Depth

Tree Depth: the length of the longest path from a root node to a leaf node

Salary > $50,000 (Root Node, depth 0)
  No → Decline Offer
  Yes → Commute > 1 hr (depth 1)
    Yes → Decline Offer
    No → Offers Free Coffee (depth 2)
      Yes → Accept Offer (Leaf Node, depth 3)
      No → Decline Offer (Leaf Node, depth 3)

This tree has a depth of 3.

Note: shallow trees tend to underfit, and deep trees tend to overfit


Underfitting vs. Overfitting
Underfitting Just Right Overfitting



Additional Resource

R2D3 has an excellent visualization of how decision trees work.


DECISION TREE DEMO



Random Forests



Decision Trees
Pros
▪ Interpretable
▪ Simple
▪ Classification/Regression
▪ Nonlinear relationships

Cons
▪ Poor accuracy
▪ High variance


Bias vs. Variance



Bias-Variance Tradeoff
$\text{Error} = \text{Variance} + \text{Bias}^2 + \text{noise}$

▪ Reduce Bias
  ▪ Build more complex models
▪ Reduce Variance
  ▪ Use a lot of data
  ▪ Build simple models
▪ What about the noise?

[Plot: Error against Model Complexity, showing the Variance, Bias², and Total Error curves and the Optimum Model Complexity.]


Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?



Bootstrap Sampling
A method for simulating N new datasets:

1. Take a sample with replacement from the original training set
2. Repeat N times
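
The two steps can be sketched with the standard library. The function name, the seed, and the stand-in training set of 100 integer row ids are illustrative assumptions.

```python
import random


def bootstrap_samples(data, n_samples, seed=42):
    """Draw n_samples bootstrap samples, each the same size as data."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    # rng.choice picks one element at a time, so an element can repeat.
    return [[rng.choice(data) for _ in data] for _ in range(n_samples)]


training_set = list(range(100))  # stand-in for 100 training records
samples = bootstrap_samples(training_set, n_samples=4)
distinct_counts = [len(set(sample)) for sample in samples]
```

Each sample has 100 draws, but because draws are made with replacement, `distinct_counts` will generally show fewer than 100 distinct records per sample.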



Bootstrap Visualization
[Figure: a Training Set (N = 100) surrounded by four samples drawn from it: Bootstrap 1, Bootstrap 2, Bootstrap 3, and Bootstrap 4 (N = 100 each).]

Why are some points in the bootstrapped samples not selected?


Training Set Coverage
Assume we are bootstrapping N draws from a training set with N observations ...
▪ Probability of an element getting picked in each draw: $\frac{1}{N}$
▪ Probability of an element not getting picked in each draw: $1 - \frac{1}{N}$
▪ Probability of an element not getting drawn in the entire sample: $\left(1 - \frac{1}{N}\right)^N$

As N → ∞, the probability for each element of not getting picked in a sample approaches $\frac{1}{e} \approx 0.368$.
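
The limit can be checked numerically. The sketch below evaluates the probability that a given element is never drawn in N draws with replacement from N observations, for a few arbitrary values of N, and compares it with 1/e.

```python
import math


def prob_never_drawn(n):
    """Probability that one element is missed by all n draws from n items."""
    return (1 - 1 / n) ** n


limit = math.exp(-1)  # 1/e, roughly 0.368
gaps = {n: abs(prob_never_drawn(n) - limit) for n in (10, 100, 10_000)}
```

The gap to 1/e shrinks as N grows, so for realistic training set sizes roughly 37% of the records are left out of any one bootstrap sample.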



Bootstrap Aggregating
▪ Train a tree on each sample, and average the predictions
▪ This is bootstrap aggregating, commonly referred to as bagging

Bootstrap 1 → Decision Tree 1
Bootstrap 2 → Decision Tree 2
Bootstrap 3 → Decision Tree 3
Bootstrap 4 → Decision Tree 4
Decision Trees 1 to 4 → Final Prediction
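
The aggregation step can be sketched without a real tree learner. In the toy example below each "model" simply predicts the mean label of its bootstrap sample, which is a deliberate simplification standing in for a decision tree; the labels, seed, and function names are assumptions for illustration. The last helper shows the classification counterpart, where the models vote instead of being averaged.

```python
import random
from collections import Counter
from statistics import mean


def bootstrap(data, rng):
    """One sample with replacement, the same size as the input."""
    return [rng.choice(data) for _ in data]


def train_mean_model(sample):
    """Toy stand-in for a tree: always predicts the sample's mean label."""
    prediction = mean(sample)
    return lambda: prediction


def bagged_regression(labels, n_models=4, seed=0):
    """Train one toy model per bootstrap sample and average the predictions."""
    rng = random.Random(seed)
    models = [train_mean_model(bootstrap(labels, rng)) for _ in range(n_models)]
    return mean([model() for model in models])


def majority_vote(class_predictions):
    """Classification aggregation: the most common predicted class wins."""
    return Counter(class_predictions).most_common(1)[0][0]


labels = [40, 60, 30, 33]  # hypothetical label values
bagged_prediction = bagged_regression(labels)
voted_class = majority_vote(["Accept", "Decline", "Accept", "Accept"])
```

Averaging several models trained on different samples is what reduces the variance of the final prediction relative to any single model.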



Random Forest Algorithm
Full Training Data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K

At each split, a subset of features is considered to ensure each tree is different.


Random Forest Aggregation
Scoring Record

...

Aggregation

Final Prediction

▪ Majority-voting for classification
▪ Mean for regression


RANDOM FOREST DEMO



Hyperparameter
Tuning



What is a Hyperparameter?
A parameter whose value is used to control the training process.

▪ Examples for Random Forest:
▪ Tree depth
▪ Number of trees
▪ Number of features to consider


Selecting Hyperparameter Values
▪ Build a model for each hyperparameter value
▪ Evaluate each model to identify the optimal hyperparameter
value
▪ What dataset should we use to train and evaluate?

Training Validation Test

What if there isn’t enough data to split into three separate sets?


K-Fold Cross Validation

Pass 1: Training | Training | Validation
Pass 2: Training | Validation | Training
Pass 3: Validation | Training | Training

Average Validation Errors to Identify Optimal Hyperparameter Values

Final Pass: Training with Optimal Hyperparameters | Test
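
The passes above can be sketched as index bookkeeping: split the row indices into k folds, then let each fold take one turn as the validation set while the rest are used for training. This is an illustrative helper with no shuffling, not the Spark implementation, and the 9-row dataset is made up.

```python
def k_fold_indices(n_rows, k):
    """Yield (training_indices, validation_indices) for each of k passes."""
    indices = list(range(n_rows))
    fold_size, remainder = divmod(n_rows, k)
    folds, start = [], 0
    for i in range(k):
        # The first `remainder` folds take one extra row when k does not divide n_rows.
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(n_rows=9, k=3))
```

Every row is used for validation exactly once, so the averaged validation error draws on all of the non-test data.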


HYPERPARAMETER TUNING
DEMO



Optimizing Hyperparameter Values
Grid Search
▪ Train and validate every unique combination of
hyperparameters
Hyperparameter values to search:

Tree Depth   Number of Trees
5            2
8            4

Every unique combination:

Tree Depth   Number of Trees
5            2
5            4
8            2
8            4

Question: With 3-fold cross validation, how many models will this build?
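
The arithmetic behind the question can be sketched by enumerating the grid. The dictionary keys are illustrative names, not Spark parameter names.

```python
from itertools import product

grid = {"tree_depth": [5, 8], "num_trees": [2, 4]}

# Every unique combination of the listed values.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

num_folds = 3
models_during_cv = len(combinations) * num_folds
```

Each of the 4 combinations is trained once per fold, giving 12 models during cross validation. A tuner that then refits the best combination on the full training set, as Spark's CrossValidator does, trains one more.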



HYPERPARAMETER TUNING
LAB



Hyperparameter
Tuning
with Hyperopt



Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn’t used
▪ So what do you do if…
▪ You have a training budget
▪ You have many hyperparameters to tune
▪ You want to pick your hyperparameters based on past
results



Hyperopt
▪ Open-source Python library
▪ Optimization over awkward search spaces (real-valued,
discrete, and conditional dimensions)
▪ Supports serial or parallel optimization
▪ Spark integration
▪ Core algorithms for optimization:
▪ Random Search
▪ Adaptive Tree of Parzen Estimators (TPE)



Optimizing Hyperparameter Values
Random Search

Generally outperforms grid search
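
As a rough sketch of the idea, random search samples each hyperparameter from a range instead of stepping through a fixed grid, and keeps the best trial within a fixed budget. This is plain Python and does not use the Hyperopt API; the `objective` function is a hypothetical stand-in for "train a model and return its validation error", and the parameter names and ranges are made up.

```python
import random


def objective(params):
    """Hypothetical validation loss; a real one would train and score a model."""
    return (params["max_depth"] - 6) ** 2 + (params["learning_rate"] - 0.1) ** 2


def random_search(n_trials, seed=0):
    """Sample n_trials random configurations and return the best one found."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        params = {
            "max_depth": rng.randint(2, 10),            # discrete range
            "learning_rate": rng.uniform(0.01, 0.3),    # continuous range
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(n_trials=50)
```

The trial budget is set directly by `n_trials`, independent of how many hyperparameters are tuned, which is one reason random search scales better than an exhaustive grid.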



Optimizing Hyperparameter Values
Tree of Parzen Estimators

▪ Bayesian process
▪ Creates meta model that maps hyperparameters to
probability of a score on the objective function
▪ Provide a range and distribution for continuous and
discrete values
▪ Adaptive TPE better tunes the search space by
▪ Freezing hyperparameters
▪ Tuning number of random trials before TPE



HYPEROPT
DEMO



HYPEROPT
LAB



AutoML



Databricks AutoML
A glass-box solution that empowers data teams without taking away control

UI and API to start AutoML training

▪ MLflow experiment: Auto-created MLflow Experiment to track models and metrics. Easily deploy to Model Registry.
▪ Data exploration notebook: Generated notebook with feature summary statistics and distributions. Understand and debug data quality and preprocessing.
▪ Reproducible trial notebooks: Generated notebooks with source code for every model. Iterate further on models from AutoML, adding your expertise.


AutoML solves two key pain points for data scientists
Quickly Verify the Predictive Power of a Dataset
Marketing Team → Dataset → Data Science Team
“Can this dataset be used to predict customer churn?”

Get a Baseline Model to Guide Project Direction
Data Science Team → Dataset → Baseline Model
“What direction should I go in for this ML project and what benchmark should I aim to beat?”


Problems with Existing AutoML
Solutions
Opaque-Box and Production Cliff Problems in AutoML

AutoML Configuration → AutoML Training (“Opaque Box”) → Returned Best Model → Production Cliff → Deployed Model

Problem
1. A “production cliff” exists where data scientists need to modify the returned “best” model using their domain expertise before deployment
2. Data scientists need to be able to explain how they trained a model for regulatory purposes (e.g., FDA, GDPR, etc.) and most AutoML solutions have “opaque box” models

Result / Pain Points
● The “best” model returned is often not good enough to deploy
● Data scientists must spend time and energy reverse engineering these “opaque-box” returned models so that they can modify them and/or explain them


“Glass-Box” AutoML
Configure

Train and Evaluate with a UI


Customize

Deploy



AutoML Lab



Feature Store



Feature Store
The first Feature Store codesigned with a Data and MLOps Platform

Feature Registry
▪ Discoverability and Reusability
▪ Versioning
▪ Upstream and downstream Lineage

Co-designed with
▪ Open format
▪ Built-in data versioning and governance
▪ Native access through PySpark, SQL, etc.

Feature Provider
▪ Batch (high throughput) and Online (low latency)
▪ Batch and online access to Features
▪ Feature lookup packaged with Models
▪ Simplified deployment process

Co-designed with
▪ Open model format that supports all ML frameworks
▪ Feature version and lookup logic hermetically logged with Model


Gradient Boosted
Decision Trees



Decision Tree Ensembles
▪ Combine many decision trees
▪ Random Forest
  ▪ Bagging
  ▪ Independent trees
  ▪ Results aggregated to a final prediction
▪ There are other methods of ensembling decision trees

[Diagram: Full Training Data → Bootstrap 1, Bootstrap 2, ..., Bootstrap K]


Boosting
▪ Sequential (one tree at a time)
▪ Each tree learns from the last
▪ Sequence of trees is the final model

[Diagram: trees trained one after another, starting from the Full Training Data]


Gradient Boosted Decision Trees
▪ Common boosted trees algorithm
▪ Fits each tree to the residuals of the previous tree
▪ On the first iteration, residuals are the actual label values

Model 1
Y    Prediction   Residual
40   35           5
60   67           -7
30   28           2
33   32           1

Model 2 (fit to the residuals of Model 1)
Y    Prediction   Residual
5    3            2
-7   -4           -3
2    3            -1
1    0            1

Final Prediction
Y    Prediction
40   38
60   63
30   31
33   32
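
The numbers in the table can be reproduced with a few lines of arithmetic. The sketch takes the two models' predictions as given (copied from the slide) rather than training any trees, and shows that the second model's targets are the first model's residuals and that the final prediction is the sum of the two predictions.

```python
labels = [40, 60, 30, 33]

# Model 1 predicts the labels directly.
model_1_predictions = [35, 67, 28, 32]
residuals_1 = [y - p for y, p in zip(labels, model_1_predictions)]

# Model 2 is fit to Model 1's residuals, not to the original labels.
model_2_predictions = [3, -4, 3, 0]
residuals_2 = [r - p for r, p in zip(residuals_1, model_2_predictions)]

# The ensemble's prediction is the sum of the individual predictions.
final_predictions = [p1 + p2 for p1, p2 in zip(model_1_predictions, model_2_predictions)]
```

After the second model, the remaining residuals are smaller in magnitude than after the first, which is the sense in which each tree corrects the one before it.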



Boosting vs. Bagging
GBDT
▪ Starts with high bias, low variance
▪ Works right

RF
▪ Starts with high variance, low bias
▪ Works left

[Plot: Error against Model Complexity, showing the Variance, Bias², and Total Error curves and the Optimum Model Complexity.]


Gradient Boosted Decision Trees
Implementations
▪ Spark ML
▪ Built into Spark
▪ Utilizes Spark’s existing decision tree implementation
▪ XGBoost
▪ Designed and built specifically for gradient boosted trees
▪ Regularized to prevent overfitting
▪ Pre-installed in Databricks Runtime for ML (Python & Scala APIs)



XGBOOST DEMO



Appendix



ML Deployment
Options



What is ML Deployment?

▪ Data Science != Data Engineering


▪ Data science is scientific
▪ Business problems → data problems
▪ Model mathematically
▪ Optimize performance
▪ Data engineers are concerned with
▪ Reliability
▪ Scalability
▪ Maintainability
▪ SLAs
▪ ...



DevOps vs. ModelOps

▪ DevOps = software development + IT operations


▪ Manages deployments
▪ CI/CD of features, patches, updates, rollbacks
▪ ModelOps = data modeling + deployment operations
▪ Artifact management (Continuous Training)
▪ Model performance monitoring (Continuous Monitoring)
▪ Data management
▪ Use of containers and managed services



The Four Deployment Paradigms

1. Batch
▪ 80-90% of deployments
▪ Leverages databases and object storage
▪ Fast retrieval of stored predictions
2. Streaming (continuous)
▪ 10-15% of deployments
▪ Moderately fast scoring on new data
3. Real Time
▪ 5-10% of deployments
▪ Usually using REST (Azure ML, SageMaker, containers)
4. On-device (edge)



Latency Requirements (roughly)

Latency scale: 10 ms, 100 ms, 1 min, 1 hour, 1 day

Paradigms along the scale, from lowest to highest latency: Real Time, Streaming, Batch


Overview of a typical Databricks CI/CD pipeline

Continuous integration and continuous delivery:
Code → Build → Release → Deploy → Test → Operate

See CI/CD Templates for a starting point


Logistic Regression



Types of Supervised Learning
Regression
▪ Predicting a continuous output

Classification
▪ Predicting a categorical/discrete output


Types of Classification
Binary Classification
▪ Two label classes

Multiclass Classification
▪ Three or more label classes

Model output is commonly the probability of a record belonging to each of the classes.


Binary Classification
Two label classes

▪ Outputs:
  ▪ Probability that the record is Green given a set of features
  ▪ Probability that the record is Red given a set of features
▪ Reminders:
  ▪ Probabilities are bounded between 0 and 1
  ▪ And linear regression returns any real number


Bounding Binary Classification Probabilities
How can we keep model outputs between 0 and 1?

▪ Logistic Function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
▪ Large positive inputs → 1
▪ Large negative inputs → 0


Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly complementary
▪ So let’s set our Red class equal to 1, and our Green class equal to 0
▪ The model output is 𝐏[y = 1 | x] where x represents the features
But we need class predictions, not probability predictions
▪ Set a threshold on the probability predictions
▪ 𝐏[y = 1 | x] < 0.5 → y = 0
▪ 𝐏[y = 1 | x] ≥ 0.5 → y = 1
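
The two steps, squashing a real-valued score into a probability and then thresholding it, can be sketched as follows. The function names are illustrative, the input scores are arbitrary, and the logistic function is written in a form that avoids overflow for large negative inputs.

```python
import math


def logistic(z):
    """Map any real number to the interval (0, 1)."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    # Equivalent form that avoids computing exp of a large positive number.
    e = math.exp(z)
    return e / (1 + e)


def predict_class(probability, threshold=0.5):
    """Convert P[y = 1 | x] into a class label using the threshold rule."""
    return 1 if probability >= threshold else 0


scores = [-20.0, -1.0, 0.0, 1.0, 20.0]  # hypothetical linear-model outputs
probabilities = [logistic(z) for z in scores]
classes = [predict_class(p) for p in probabilities]
```

The 0.5 threshold is the default shown on the slide; moving it trades false positives against false negatives.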



Evaluating Binary Classification Models
▪ How can the model be wrong?
▪ Type I Error: False Positive
▪ Type II Error: False Negative
▪ Representing these errors with a confusion matrix.



Binary Classification Metrics
Accuracy: $\frac{TP + TN}{TP + FP + TN + FN}$

Precision: $\frac{TP}{TP + FP}$

Recall: $\frac{TP}{TP + FN}$

F1: $\frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$
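
A short sketch that computes the four metrics from confusion-matrix counts. The counts are invented for illustration, and the helper does not guard against empty denominators (for example, a model that never predicts the positive class).

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
```

F1 is the harmonic mean of precision and recall, so it always falls between the two and sits closer to the smaller one.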



Collaborative Filtering



Recommendation Systems



Naive Approaches to Recommendation
▪ Hand-curated
▪ Aggregates

Question: What are problems with these approaches?


Content-based Recommendation
▪ Idea: Recommend items to a customer that are similar to other
items the customer liked
▪ Creates a profile for each user or product
▪ User: demographic info, ratings, etc.
▪ Item: genre, flavor, brand, actor list, etc.



Content-based Recommendation
▪ Advantages
▪ No need for data from other users
▪ New item recommendations
▪ Disadvantages
▪ Cold-start problem
▪ Determining appropriate features
▪ Implicit information



Collaborative Filtering
▪ Idea: Make recommendations for one customer (filtering) by
collecting and analyzing the interests of many users
(collaboration)
▪ Advantages over content-based recommendation
▪ Relies only on past user behavior (no profile creation)
▪ Domain independent
▪ Generally more accurate
▪ Disadvantages
▪ Extremely susceptible to cold-start problem (user and item)



Types of Collaborative Filtering
▪ Neighborhood Methods: Compute relationships between items
or users
▪ Computationally expensive
▪ Not empirically as good
▪ Latent Factor Models: Explain the ratings by characterizing items and users by a small number of inferred factors
▪ Matrix factorization
▪ Characterizes both items and users by vectors of factors
from item-rating pattern
▪ Explicit feedback: sparse matrix
▪ Scalable



Latent Factor Approach



Ratings Matrix



Matrix Factorization



Alternating Least Squares
▪ Step 1: Randomly initialize user and movie factors
▪ Step 2: Repeat the following
1. Fix the movie factors, and optimize user factors
2. Fix the user factors, and optimize movie factors
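
The alternation can be sketched in plain Python for the simplest case of one latent factor per user and per movie. With one side fixed, each factor on the other side has a closed-form least-squares solution over the observed ratings. Everything here is a simplifying assumption for illustration: the ratings are made up, there is a single factor, and there is no regularization, whereas production implementations typically use several factors and a regularization term.

```python
import random


def als_rank_one(ratings, n_users, n_movies, n_iterations=20, seed=0):
    """ratings maps (user, movie) -> rating; returns (user_factors, movie_factors)."""
    rng = random.Random(seed)
    # Step 1: randomly initialize user and movie factors.
    user_factors = [rng.uniform(0.5, 1.5) for _ in range(n_users)]
    movie_factors = [rng.uniform(0.5, 1.5) for _ in range(n_movies)]
    # Step 2: alternate between the two least-squares problems.
    for _ in range(n_iterations):
        # Fix the movie factors, and optimize each user factor.
        for u in range(n_users):
            seen = [(m, r) for (uu, m), r in ratings.items() if uu == u]
            denominator = sum(movie_factors[m] ** 2 for m, _ in seen)
            if denominator > 0:
                user_factors[u] = sum(r * movie_factors[m] for m, r in seen) / denominator
        # Fix the user factors, and optimize each movie factor.
        for m in range(n_movies):
            seen = [(u, r) for (u, mm), r in ratings.items() if mm == m]
            denominator = sum(user_factors[u] ** 2 for u, _ in seen)
            if denominator > 0:
                movie_factors[m] = sum(r * user_factors[u] for u, r in seen) / denominator
    return user_factors, movie_factors


def squared_error(ratings, user_factors, movie_factors):
    """Sum of squared errors over the observed ratings only."""
    return sum((r - user_factors[u] * movie_factors[m]) ** 2 for (u, m), r in ratings.items())


# Hypothetical sparse ratings: 3 users, 3 movies, 6 observed entries.
ratings = {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 2.0, (1, 2): 1.5, (2, 1): 3.0, (2, 2): 4.5}

initial = als_rank_one(ratings, 3, 3, n_iterations=0)
fitted = als_rank_one(ratings, 3, 3, n_iterations=20)
error_before = squared_error(ratings, *initial)
error_after = squared_error(ratings, *fitted)
```

Each half-step exactly minimizes the squared error for the factors being updated, so the error over the observed ratings never increases from one iteration to the next. A missing rating is then predicted as the product of the corresponding user and movie factors.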
