Slides Scalable Machine Learning With Apache Spark
LET’S GET STARTED
Apache Spark™ Overview
Apache Spark Background
▪ Founded as a research project at UC Berkeley in 2009
▪ Open-source unified data analytics engine for big data
▪ Built-in APIs in SQL, Python, Scala, R, and Java
Have you ever counted the number of M&Ms in a jar?
Spark Cluster
[Diagram: one driver coordinating many worker/executor nodes]
Logical Plan
Catalyst Optimizer
Physical Execution
Under the Catalyst Optimizer’s Hood
[Diagram: DataFrame → Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plans → Cost Model selects a Physical Plan → RDDs]
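A minimal PySpark sketch of inspecting the plans Catalyst produces for a DataFrame query (the `sales` DataFrame and its columns are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

# Hypothetical DataFrame used only for illustration
sales = spark.createDataFrame(
    [("US", 100.0), ("CA", 250.0), ("US", 75.0)],
    ["country", "amount"],
)

# Transformations only build up a logical plan; nothing executes yet
summary = sales.groupBy("country").agg(F.sum("amount").alias("total"))

# explain(True) prints the parsed, analyzed, and optimized logical plans
# plus the physical plan that Catalyst selected
summary.explain(True)
```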
When to Use Spark
[Diagram: Features → Machine Learning → Output]
Types of Machine Learning
Supervised Learning
▪ Labeled data (known function output)
▪ Regression (a continuous/ordinal-discrete output)
▪ Classification (a categorical output)
Unsupervised Learning
▪ Unlabeled data (no known function output)
▪ Clustering (categorize records based on features)
▪ Dimensionality reduction (reduce feature space)
Types of Machine Learning
Semi-supervised Learning
▪ Labeled and unlabeled data, mostly unlabeled
▪ Combines supervised learning and unsupervised learning
▪ Commonly trying to label the unlabeled data to be used in another round of training
Reinforcement Learning
▪ States, actions, and rewards
▪ Useful for exploring spaces and exploiting information to maximize expected cumulative rewards
▪ Frequently utilizes neural networks and deep learning
Machine Learning Workflow
Define Business Use Case → Define Success, Constraints, and Infrastructure → Data Collection → Feature Engineering → Modeling → Deployment
Defining and Measuring Success: Establish a baseline!
DATA CLEANSING DEMO
Importance of Data Visualization
How do we build and evaluate models?
DATA EXPLORATION LAB
Linear Regression
Goal: Find the line of best fit.
ŷ = w0 + w1x
y = ŷ + ϵ
where...
x: feature
y: label
w0: y-intercept
w1: slope of the line of best fit
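A minimal sketch of fitting this model with Spark ML (assumes an active SparkSession named `spark`; the tiny `train_df` dataset is made up for illustration):

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

# Hypothetical training data: one feature column "x" and a label column "y"
train_df = spark.createDataFrame(
    [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)], ["x", "y"]
)

# Spark ML expects all features packed into a single vector column
assembler = VectorAssembler(inputCols=["x"], outputCol="features")
lr = LinearRegression(featuresCol="features", labelCol="y")

model = lr.fit(assembler.transform(train_df))
print(model.intercept, model.coefficients)  # w0 and w1
```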
Minimizing the Residuals
Evaluation Metrics
▪ Loss: (y - ŷ)
▪ Absolute loss: |y - ŷ|
▪ Squared loss: (y - ŷ)²
Evaluation Metric: Root Mean Squared Error (RMSE)
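Continuing the hypothetical linear-regression sketch above, RMSE can be computed with Spark ML's RegressionEvaluator:

```python
from pyspark.ml.evaluation import RegressionEvaluator

# Predictions from the fitted model defined in the earlier sketch
pred_df = model.transform(assembler.transform(train_df))

evaluator = RegressionEvaluator(
    labelCol="y", predictionCol="prediction", metricName="rmse"
)
print(evaluator.evaluate(pred_df))  # root mean squared error
```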
Linear Regression Assumptions
▪ Linear relationship between X and the mean of Y (linearity)
▪ Observations are independent from one another (independence)
▪ Y is normally distributed for any fixed observation (normality)
▪ The variance of the residuals is the same for any feature (homoscedasticity)
Linear Regression Assumptions
So, which datasets are suited for linear regression?
Train vs. Test RMSE
Evaluation Metric: R²
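For reference, the standard definition (not spelled out on the slide): R² = 1 − SS_res / SS_tot = 1 − Σ(yᵢ − ŷᵢ)² / Σ(yᵢ − ȳ)², i.e., the fraction of the label's variance explained by the model.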
OHE
Dog → 1 0 0
Cat → 0 1 0
Fish → 0 0 1
▪ But what if we have an entire zoo of animals? That would result in really wide data!
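A minimal Spark ML sketch of this encoding (the `animals` DataFrame is a made-up example; note that Spark's OneHotEncoder emits sparse vectors, which keeps an entire "zoo" of categories from blowing up memory):

```python
from pyspark.ml.feature import StringIndexer, OneHotEncoder

# Hypothetical DataFrame with a single categorical column
animals = spark.createDataFrame([("Dog",), ("Cat",), ("Fish",)], ["animal"])

# Map category strings to indices, then one-hot encode the indices
indexer = StringIndexer(inputCol="animal", outputCol="animal_idx")
encoder = OneHotEncoder(inputCols=["animal_idx"], outputCols=["animal_ohe"])

indexed = indexer.fit(animals).transform(animals)
encoded = encoder.fit(indexed).transform(indexed)
encoded.show(truncate=False)
```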
Model Lifecycle
[Diagram: Data scientists log runs and experiments to MLflow Tracking (metadata, models); models are promoted through the Model Registry as versions (v1, v2, v3); deployment engineers serve them via in-line code, containers, batch & stream scoring, OSS serving solutions, cloud inference services, and custom serving.]
MLFLOW TRACKING DEMO
MLflow Model Registry
▪ Collaborative, centralized model hub
▪ Facilitate experimentation, testing, and production
▪ Integrate with approval and governance workflows
▪ Monitor ML deployments and their performance
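A minimal sketch of logging a run and registering a model version with MLflow (the parameter, metric value, and registered-model name `offer_model` are illustrative; `model` stands for any fitted Spark ML model, such as the linear regression above):

```python
import mlflow
import mlflow.spark

with mlflow.start_run():
    mlflow.log_param("maxDepth", 5)   # illustrative hyperparameter
    mlflow.log_metric("rmse", 0.78)   # illustrative metric value
    # Passing registered_model_name creates a new version (v1, v2, ...) in the registry
    mlflow.spark.log_model(model, "model", registered_model_name="offer_model")
```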
Decision Making
[Example decision tree for a job offer (Salary: $61,000, Commute: 30 mins, Free Coffee: No): the root node asks "Offers Free Coffee?"; further splits ask "Salary > $50,000?", "Commute?", and "Bonus?"; the leaves are "Accept Offer" or "Decline Offer".]
Create Split Candidates
Feature values
Lines vs. Boundaries
Linear Regression
▪ Lines through data
▪ Assumed linear relationship
Decision Trees
▪ Boundaries instead of lines
▪ Learn complex relationships
[Plot: decision boundaries over Commute (1 hour) vs. Salary ($50,000)]
Linear Regression or Decision Tree?
Tree Depth: the length of the longest path from the root node to a leaf node
[Diagram: "Salary > $50,000" root node at depth 0, with Yes/No branches]
Note: shallow trees tend to underfit, and deep trees tend to overfit
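A minimal Spark ML sketch showing where tree depth is controlled (a `train_df` with an assembled `features` vector column and a `label` column is assumed):

```python
from pyspark.ml.classification import DecisionTreeClassifier

# maxDepth bounds the longest root-to-leaf path: too shallow underfits, too deep overfits
dt = DecisionTreeClassifier(featuresCol="features", labelCol="label", maxDepth=3)
dt_model = dt.fit(train_df)
print(dt_model.toDebugString)  # prints the learned splits level by level
```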
Underfitting vs. Overfitting
Underfitting Just Right Overfitting
Additional Resource
Model Complexity
https://www.explainxkcd.com/wiki/index.php/2021:_Software_Development
Building Five Hundred Decision Trees
▪ Using more data reduces variance for one model
▪ Averaging more predictions reduces prediction variance
▪ But that would require more decision trees
▪ And we only have one training set … or do we?
Bootstrap Sampling
A method for simulating N new datasets:
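A minimal sketch of bootstrap sampling with the DataFrame API (`full_df` is a placeholder for the full training set); each sample draws rows with replacement at roughly the original size:

```python
N = 5  # number of simulated datasets
bootstrap_samples = [
    full_df.sample(withReplacement=True, fraction=1.0, seed=i)
    for i in range(N)
]
```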
Random Forest Algorithm
[Diagram: Full Training Data → bootstrap samples → one decision tree per sample → Aggregation → Final Prediction]
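A minimal sketch of the corresponding Spark ML estimator (column names and hyperparameter values are illustrative):

```python
from pyspark.ml.regression import RandomForestRegressor

# Each of the numTrees trees is trained on a bootstrap sample; predictions are averaged
rf = RandomForestRegressor(featuresCol="features", labelCol="label",
                           numTrees=500, maxDepth=5)
rf_model = rf.fit(train_df)  # train_df is assumed to have a vector "features" column
```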
[Hyperparameter grid: combinations (5, 2), (5, 4), (8, 2), (8, 4)]
Question: With 3-fold cross validation, how many models will this build?
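A hedged sketch of how such a grid is wired up in Spark ML, reusing the `rf` estimator from the sketch above (a hypothetical 2 × 2 grid with 3 folds trains 4 × 3 = 12 models during the search, plus one final refit of the best configuration on the full training set):

```python
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

grid = (ParamGridBuilder()
        .addGrid(rf.maxDepth, [5, 8])
        .addGrid(rf.numTrees, [2, 4])
        .build())

cv = CrossValidator(estimator=rf,
                    estimatorParamMaps=grid,
                    evaluator=RegressionEvaluator(labelCol="label", metricName="rmse"),
                    numFolds=3)
cv_model = cv.fit(train_df)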
HYPERPARAMETER TUNING
LAB
Hyperparameter Tuning with Hyperopt
Problems with Grid Search
▪ Exhaustive enumeration is expensive
▪ Manually determined search space
▪ Past information on good hyperparameters isn’t used
▪ So what do you do if…
▪ You have a training budget
▪ You have many hyperparameters to tune
▪ You want to pick your hyperparameters based on past results
Hyperopt
▪ Open-source Python library
▪ Optimization over awkward search spaces (real-valued, discrete,
and conditional dimensions)
▪ Supports serial or parallel optimization
▪ Spark integration
▪ Three core algorithms for optimization:
▪ Random Search
▪ Tree of Parzen Estimators (TPE)
▪ Adaptive TPE
Paper
Optimizing Hyperparameter Values
Random Search vs. Tree of Parzen Estimators (TPE)
▪ TPE is a Bayesian process
▪ It builds a meta-model that maps hyperparameters to the probability of a score on the objective function
▪ Provide a range and distribution for continuous and discrete values
▪ Adaptive TPE better tunes the search space by:
▪ Freezing hyperparameters
▪ Tuning the number of random trials before TPE
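A minimal Hyperopt sketch (the search space, toy objective, and parallelism value are illustrative assumptions; in practice the objective would train a model and return its validation loss):

```python
from hyperopt import fmin, tpe, hp, SparkTrials

# Mixed search space: a discrete and a continuous hyperparameter
search_space = {
    "max_depth": hp.quniform("max_depth", 2, 10, 1),
    "learning_rate": hp.loguniform("learning_rate", -5, 0),
}

def objective(params):
    # Toy loss standing in for "train a model and return validation loss"
    return (params["max_depth"] - 6) ** 2 + params["learning_rate"]

best = fmin(
    fn=objective,
    space=search_space,
    algo=tpe.suggest,          # Tree of Parzen Estimators
    max_evals=20,
    trials=SparkTrials(parallelism=4),  # distribute trials on a Spark cluster
)
print(best)
```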
HYPEROPT
DEMO
HYPEROPT
LAB
AutoML
Databricks AutoML
A glass-box solution that empowers data teams without taking away control
“Can this dataset be used to predict customer churn?”
“What direction should I go in for this ML project and what benchmark should I aim to beat?”
Problems with Existing AutoML Solutions
Opaque-Box and Production Cliff Problems in AutoML
[Diagram: AutoML Configuration → AutoML Training (“Opaque Box”) → Returned Best Model → Production Cliff → Deployed Model]
1. A “production cliff” exists where data scientists need to modify the returned “best” model using their domain expertise before deployment
2. Data scientists need to be able to explain how they trained a model for regulatory purposes (e.g., FDA, GDPR, etc.), and most AutoML solutions have “opaque box” models
● The “best” model returned is often not good enough to deploy
● Data scientists must spend time and energy reverse engineering these “opaque-box” returned models so that they can modify them and/or explain them
“Glass-Box” AutoML
Configure
Deploy
AutoML Lab
Feature Store
The first Feature Store codesigned with a Data and MLOps Platform
[Diagram: Feature Store — a Feature Provider and Feature Registry serving features for Batch (high throughput) and Online (low latency) access]
Boosting vs. Bagging
GBDT (Boosting)
▪ Starts with high bias, low variance
▪ Works right (toward higher model complexity)
RF (Bagging)
▪ Starts with high variance, low bias
▪ Works left (toward lower model complexity)
[Plot: Bias² and Variance vs. Model Complexity, with the optimum model complexity where total error is minimized]
Gradient Boosted Decision Trees Implementations
▪ Spark ML
▪ Built into Spark
▪ Utilizes Spark’s existing decision tree implementation
▪ XGBoost
▪ Designed and built specifically for gradient boosted trees
▪ Regularized to prevent overfitting
▪ Highly parallel
▪ Works nicely with Spark in Scala
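A minimal sketch of the Spark ML implementation (hyperparameter values are illustrative; `train_df` with a vector `features` column and a `label` column is assumed):

```python
from pyspark.ml.regression import GBTRegressor

gbt = GBTRegressor(
    featuresCol="features",
    labelCol="label",
    maxIter=100,  # number of boosting iterations (trees)
    maxDepth=3,   # boosting typically uses shallow trees (high bias, low variance)
)
gbt_model = gbt.fit(train_df)
```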
XGBOOST DEMO
Appendix
MLlib Deployment Options
Data Science vs. Data Engineering
▪ Data Science != Data Engineering
▪ Data Science
▪ Scientific
▪ Art
▪ Business problems
▪ Model mathematically
▪ Optimize performance
▪ Data Engineering
▪ Reliability
▪ Scalability
▪ Maintainability
▪ SLAs
Model Operations (ModelOps)
▪ DevOps
▪ Software development and IT operations
▪ Manages deployments
▪ CI/CD of features, patches, updates, and rollbacks
▪ Agile vs. waterfall
▪ ModelOps
▪ Data modeling and deployment operations
▪ Java environments
▪ Containers
▪ Model performance monitoring
The Four ML Deployment Options
▪ Batch
▪ 80-90 percent of deployments
▪ Leverages databases and object storage
▪ Fast retrieval of stored predictions
▪ Continuous/Streaming
▪ 10-15 percent of deployments
▪ Moderately fast scoring on new data
▪ Real-time
▪ 5-10 percent of deployments
▪ Usually using REST (Azure ML, SageMaker, containers)
▪ On-device
Overview of a typical Databricks CI/CD pipeline
Continuous integration → Continuous delivery
▪ Logistic Function:
▪ Large positive inputs → 1
▪ Large negative inputs → 0
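For reference, the logistic function behind these bullets is σ(z) = 1 / (1 + e^(−z)): large positive z pushes σ(z) toward 1, and large negative z pushes it toward 0.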
Converting Probabilities to Classes
▪ In binary classification, the class probabilities are directly complementary
▪ So let’s set our Red class equal to 1, and our Blue class equal to 0
▪ The model output is 𝐏[y = 1 | x] where x represents the features
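A minimal Spark ML sketch (hypothetical `train_df` and `test_df` with an assembled `features` column and a 0/1 `label`, where Red = 1 and Blue = 0):

```python
from pyspark.ml.classification import LogisticRegression

lr_clf = LogisticRegression(featuresCol="features", labelCol="label")
lr_model = lr_clf.fit(train_df)

preds = lr_model.transform(test_df)
# "probability" holds [P(y=0 | x), P(y=1 | x)]; "prediction" applies the default 0.5 threshold
preds.select("probability", "prediction").show(truncate=False)
```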
Accuracy = (TP + TN) / (TP + FP + TN + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = (2 × Precision × Recall) / (Precision + Recall)
K-Means
Clustering
▪ Unsupervised learning
▪ Unlabeled data (no known function output)
▪ Categorize records based on features
K-Means Clustering
Global minimum
Local minimum
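A minimal Spark ML sketch (the number of clusters and the `feature_df` DataFrame are assumptions); note that different seeds can converge to different local minima of the clustering cost:

```python
from pyspark.ml.clustering import KMeans

kmeans = KMeans(featuresCol="features", k=3, seed=42)
km_model = kmeans.fit(feature_df)
print(km_model.clusterCenters())
```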
Other Clustering Techniques
Collaborative Filtering
Recommendation Systems
Naive Approaches to Recommendation
▪ Hand-curated
▪ Aggregates
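Beyond these naive approaches, Spark ML's collaborative filtering uses alternating least squares (ALS). A minimal sketch (the `ratings_df` schema and hyperparameters are illustrative):

```python
from pyspark.ml.recommendation import ALS

als = ALS(userCol="userId", itemCol="itemId", ratingCol="rating",
          rank=10, coldStartStrategy="drop")
als_model = als.fit(ratings_df)
recs = als_model.recommendForAllUsers(5)  # top-5 recommendations per user
```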