Distributed Linear Regression Class Notes
Machine Learning with Large Datasets (10-605/10-805), Fall 2024

[Course pipeline: raw data → data → ML method → model → deployed model]
Big O Notation

Formal definition: f(x) = O(g(x)) if |f(x)| ≤ C·|g(x)| for all x > N

Ignores constants and lower-order terms
• For large enough x, these terms won't matter
• E.g., x² + 3x ≤ C·x² for x > N
EXAMPLE 1:
O(1) complexity

Constant time algorithms perform the same # of operations every time they're called
• E.g., performing a fixed number of arithmetic operations

Constant space algorithms require fixed storage every time they're called
• E.g., storing the results of a fixed number of arithmetic operations
EXAMPLE 2:
O(n) complexity

O(n²) complexity

Quadratic space algorithms require storage proportional to the square of the size of the inputs
• E.g., the outer product of two n-dimensional vectors requires O(n²) storage (one entry per element of the resulting matrix)

Time and space complexity can differ
• Matrix-vector multiply
• Matrix-matrix multiply
• Goal: multiply an n × k matrix with a k × p matrix (see the sketch below)
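To make the time/space distinction concrete, here is a minimal plain-Python sketch (the helper name matmul_naive is mine, not from the course) of the triple loop behind an n × k times k × p multiply: it performs O(nkp) multiply-adds but only needs O(np) storage for the result.

def matmul_naive(A, B):
    # Multiply an n x k matrix A by a k x p matrix B from the definition.
    # Time:  O(n * k * p) multiply-adds.
    # Space: O(n * p) for the result (on top of the inputs).
    n, k = len(A), len(A[0])
    k2, p = len(B), len(B[0])
    assert k == k2, "inner dimensions must match"
    C = [[0.0] * p for _ in range(n)]          # O(np) storage
    for i in range(n):
        for j in range(p):
            for l in range(k):                 # k multiply-adds per output entry
                C[i][j] += A[i][l] * B[l][j]
    return C

# Example: the 2 x 3 times 3 x 2 product worked out later in these notes
A = [[9, 3, 5], [4, 1, 2]]
B = [[1, 2], [3, 5], [2, 3]]
print(matmul_naive(A, B))   # [[28.0, 48.0], [11.0, 19.0]]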
Regression

How much should you sell your house for?

[Figure: price ($) plotted against house size, with an unknown mapping to be learned]

Examples:
• Size, Location, Age → House price
• Audio features → Song year
• Processes, memory → Power consumption
• Historical financials → Future stock price
RECALL

We choose a function f ∈ F by minimizing the average loss over the training data: min_{f∈F} (1/n) Σ_{i=1}^{n} ℓ( f(x^(i)), y^(i) )

Question: how should we select our function class, F, and our loss function, ℓ?
COMPUTATIONAL CONSIDERATIONS
• Even when the class of functions, F, is simple (e.g., linear functions), the above
optimization problem might be non-convex and thus difficult to solve
• For example, this (non-convexity) is the case for the 0/1 loss
x = [x1  x2  x3]

We assume a linear mapping between features and label:
  y ≈ w0 + w1·x1 + w2·x2 + w3·x3

Linear least squares regression

x = [1  x1  x2  x3]

We can then rewrite this linear mapping as a scalar product:
  y ≈ ŷ = Σ_{i=0}^{3} wi·xi = w⊤x
Why a linear mapping?

Simple:
  y ≈ ŷ = w0 + w1·x

[Figure: a line fit to points in the (x, y) plane]

Question: how should we select our function class, F, and our loss function, ℓ?
Evaluating predictions

Absolute loss: ℓ( f(x), y) = | f(x) − y |
Square loss:  ℓ( f(x), y) = ( f(x) − y )²

Example (x is the house being priced):
  y           f(x)        Square Loss    Absolute Loss
  $300,000    $300,000    0              0
  $300,000    $300,500    250,000        500
  $300,000    $400,000    1E+10          100,000
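A quick plain-Python check of the table values (a sketch; the numbers are the ones from the table above):

def square_loss(pred, y):
    return (pred - y) ** 2

def absolute_loss(pred, y):
    return abs(pred - y)

y = 300_000
for pred in (300_000, 300_500, 400_000):
    print(pred, square_loss(pred, y), absolute_loss(pred, y))
# 300000 0 0
# 300500 250000 500
# 400000 10000000000 100000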
Question: how should we select our function class, F, and our loss function, ℓ?
How can we learn the model (w)?

Assume we have n training points, where x^(i) denotes the ith point. We pick the w whose predictions ŷ^(i) = w⊤x^(i) best fit the training labels, e.g., by minimizing the squared loss:
  min_w Σ_{i=1}^{n} (w⊤x^(i) − y^(i))²
Given n training points with k features, we define:
• X ∈ ℝ^{n×k}: matrix storing the points
• y ∈ ℝ^{n}: real-valued labels
• ŷ ∈ ℝ^{n}: predicted labels, where ŷ = Xw
• w ∈ ℝ^{k}: regression parameters / model to learn

Closed form solution: w = (X⊤X)⁻¹ X⊤y (if the inverse exists)
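A minimal NumPy sketch of the closed-form solution (synthetic data; np.linalg.solve is used on the normal equations rather than forming the inverse explicitly):

import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 5
X = rng.normal(size=(n, k))                # n points, k features
w_true = rng.normal(size=k)
y = X @ w_true + 0.1 * rng.normal(size=n)  # labels with a little noise

# w = (X^T X)^{-1} X^T y, computed by solving the normal equations
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w_hat, w_true, atol=0.1))   # recovers w_true up to the noise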
Overfitting and generalization
We want good predictions on new data, i.e., ‘generalization’
HOW TO PREVENT OVERFITTING?
Regularization
  θ̂ = arg min_θ (1/n) Σ_{i=1}^{n} ℓ( f_θ(x_i), y_i ) + λ·R(θ)

For least squares with an L2 penalty (ridge regression):
  min_w ||Xw − y||₂² + λ||w||₂²
[Pipeline figure: full dataset → split data → feature extraction → supervised learning → evaluation → predict; the training set produces the model, which is then applied to new entities]
Goal: Predict a song's release year from audio features

Raw Data: Millionsong Dataset from UCI ML Repository
• Western, commercial tracks from 1980-2014
Split Data: Train on the training set, evaluate with the test set
• The test set simulates unobserved data
Split Data
Feature Extraction: Quadratic features
• Compute pairwise feature interactions Feature Extraction
• Captures covariance of initial timbre features
• Leads to a non-linear model relative to raw features Supervised Learning
Evaluation
Predict
Given 2-dimensional data, the quadratic features are:
  x = [x1  x2]  ⇒  Φ(x) = [x1²  x1·x2  x2·x1  x2²]
  z = [z1  z2]  ⇒  Φ(z) = [z1²  z1·z2  z2·z1  z2²]

More succinctly:
  Φ′(x) = [x1²  √2·x1·x2  x2²]    Φ′(z) = [z1²  √2·z1·z2  z2²]

  Φ(x)⊤Φ(z) = x1²z1² + 2·x1x2·z1z2 + x2²z2² = Φ′(x)⊤Φ′(z)
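A small NumPy check of the identity above (a sketch; the function names are mine). Note that the succinct version needs the √2 factor for the dot products to agree:

import numpy as np

def phi(v):           # all pairwise products: [v1*v1, v1*v2, v2*v1, v2*v2]
    return np.array([v[0]*v[0], v[0]*v[1], v[1]*v[0], v[1]*v[1]])

def phi_compact(v):   # succinct version: [v1^2, sqrt(2)*v1*v2, v2^2]
    return np.array([v[0]**2, np.sqrt(2)*v[0]*v[1], v[1]**2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])
print(phi(x) @ phi(z), phi_compact(x) @ phi_compact(z))   # both 121.0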
Given n training points with k features (X ∈ ℝ^{n×k}, y ∈ ℝ^{n}, ŷ = Xw ∈ ℝ^{n}, and w ∈ ℝ^{k}, as before), ridge regression solves:

  min_w ||Xw − y||₂² + λ||w||₂²

Closed-form solution: w = (X⊤X + λI_k)⁻¹ X⊤y
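A minimal NumPy sketch of the ridge closed form (the data and λ value are made up for illustration):

import numpy as np

def ridge_closed_form(X, y, lam):
    # w = (X^T X + lam * I_k)^{-1} X^T y, via a linear solve
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = rng.normal(size=50)
w = ridge_closed_form(X, y, lam=1e-2)
print(w.shape)   # (10,)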
Ridge Regression: Learn a mapping (w) that minimizes the residual sum of squares (training error) along with a regularization term (model complexity):

  min_w ||Xw − y||₂² + λ||w||₂²

The free parameter λ trades off between training error and model complexity.

How do we choose a good value for this free parameter?
• Most methods have free parameters / ‘hyperparameters’ to tune

First thought: Search over multiple values, evaluate each on the test set
• But the goal of the test set is to simulate unobserved data
• We may overfit if we use it to choose hyperparameters

Second thought: Create another hold-out dataset for this search
Evaluation (Part 1): Hyperparameter tuning
• Training set: train various models (one per hyperparameter setting)
• Validation set: evaluate each of them and pick the best setting
We'll cover advanced hyperparameter tuning methods later in the course.

[Figure: grid search over Hyperparameter 1 × Hyperparameter 2, e.g., the regularization parameter λ swept over 10⁻⁸, 10⁻⁶, 10⁻⁴, 10⁻², 1]
But MSE's unit of measurement is the square of the quantity being measured, e.g., "squared years" for song-year prediction.
It is more natural to use the root-mean-square error (RMSE), i.e., √MSE.
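A sketch of the 'second thought' above: sweep λ over a grid, score each setting by RMSE on a held-out validation set, and only touch the test set at the very end (synthetic data and a ridge-only search for brevity; the lab also tunes a step size):

import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + rng.normal(size=300)

# train / validation / test split (the test set is held out until the end)
X_tr, y_tr = X[:200], y[:200]
X_val, y_val = X[200:250], y[200:250]
X_te, y_te = X[250:], y[250:]

lams = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]      # grid from the figure above
val_rmse = {lam: rmse(y_val, X_val @ ridge(X_tr, y_tr, lam)) for lam in lams}
best_lam = min(val_rmse, key=val_rmse.get)

w = ridge(X_tr, y_tr, best_lam)
print("best lambda:", best_lam, "test RMSE:", rmse(y_te, X_te @ w))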
Evaluation (Part 2): Evaluate the final model
• Training set: train various models
• Test set: evaluate the chosen final model on data untouched during tuning
Predict: The final model can then be used to make predictions on future observations, e.g., new songs.
LEAST SQUARES REGRESSION

Closed form solution: w = (X⊤X)⁻¹ X⊤y

Computation: O(nk² + k³) operations
• Matrix multiply X⊤X: O(nk²) operations
• Matrix inverse: O(k³) operations

Storage: O(nk + k²) floats
• X⊤X and its inverse: O(k²) floats
• X: O(nk) floats
LEAST SQUARES REGRESSION

Worked example (inner-product view):

  [9 3 5]   [1 2]   [28 48]
  [4 1 2] × [3 5] = [11 19]
            [2 3]

  e.g., 9·1 + 3·3 + 5·2 = 28

Matrix multiplication via outer products:

  [9 18]   [9 15]   [10 15]   [28 48]
  [4  8] + [3  5] + [ 4  6] = [11 19]
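A NumPy check that the sum of outer products reproduces the worked example (a sketch):

import numpy as np

A = np.array([[9, 3, 5],
              [4, 1, 2]])
B = np.array([[1, 2],
              [3, 5],
              [2, 3]])

# A @ B as a sum of outer products: (i-th column of A) outer (i-th row of B)
outer_sum = sum(np.outer(A[:, i], B[i, :]) for i in range(B.shape[0]))
print(outer_sum)                              # [[28 48], [11 19]]
print(np.array_equal(outer_sum, A @ B))       # True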
Each point x^(i) ∈ ℝ^{k} is a row of X, so X⊤X is a sum of n outer products:

  X⊤X = Σ_{i=1}^{n} x^(i) x^(i)⊤

Example: n = 6; 3 workers
• The points are stored in a distributed fashion across the workers: O(nk) distributed storage
• map: each worker computes x^(i) x^(i)⊤ for its local points — O(nk²) distributed computation, O(k²) local storage
• reduce: sum the k × k partial results and invert, (Σ_i x^(i) x^(i)⊤)⁻¹ — O(k³) local computation, O(k²) local storage
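A sketch of this map/reduce pattern with the three 'workers' simulated as index partitions in NumPy (no cluster framework assumed); it also sums the per-worker X⊤y pieces so the closed-form w can be solved on the driver:

import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
partitions = np.array_split(np.arange(n), 3)     # n = 6 points, 3 workers

def map_partition(idx):
    # map: each worker sums the outer products of its local points (O(k^2) storage)
    xtx_part = sum(np.outer(X[i], X[i]) for i in idx)
    xty_part = sum(y[i] * X[i] for i in idx)
    return xtx_part, xty_part

# reduce: add the k x k (and k-dim) partials, then solve on the driver (O(k^3))
partials = [map_partition(idx) for idx in partitions]
XtX = sum(p[0] for p in partials)
Xty = sum(p[1] for p in partials)
w = np.linalg.solve(XtX, Xty)
print(np.allclose(XtX, X.T @ X), np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0]))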
LEAST SQUARES REGRESSION

[Figure: X (n × k) ≈ the product of an n × r and an r × k matrix, i.e., a ‘low-rank’ approximation]

Large n and large k: we need methods that are linear in time and space.
Gradient descent for large n and large k

Scalar objective (1D): f(w) = ||wx − y||₂² = Σ_{j=1}^{n} (w·x^(j) − y^(j))²
Gradient descent

Start at a random point w0, then repeat:
• Determine a descent direction
• Choose a step size
• Update
until a stopping criterion is satisfied.

[Figure: iterates w0, w1, w2, … moving down f(w) toward the minimizer w*]
Where will we converge?

• Convex f(w): any local minimum is a global minimum
• Non-convex g(w): multiple local minima may exist

Least Squares, Ridge Regression, and Logistic Regression are all convex!
(Neural networks are non-convex.)
Choosing a descent direction (1D)

• f′(w0) positive → go left!
• f′(w0) negative → go right!
• f′(w0) zero → done!

We can only move in two directions, and the negative slope is the direction of descent.

Update Rule: w_{i+1} = w_i − α_i·(df/dw)(w_i), where α_i is the step size
Choosing a descent direction

2D example:
• Function values are shown in black/white, with black representing higher values
• Arrows are gradients

In ℝ^{k} we can move in any direction; the negative gradient is the direction of steepest descent.

Update Rule: w_{i+1} = w_i − α_i·∇f(w_i)
Gradient descent for least squares

Update Rule: w_{i+1} = w_i − α_i·(df/dw)(w_i)

Scalar objective: f(w) = ||wx − y||₂² = Σ_{j=1}^{n} (w·x^(j) − y^(j))²

Derivative (chain rule): (df/dw)(w) = 2·Σ_{j=1}^{n} (w·x^(j) − y^(j))·x^(j)

Scalar update (the 2 is absorbed into α): w_{i+1} = w_i − α_i·Σ_{j=1}^{n} (w_i·x^(j) − y^(j))·x^(j)

Vector update: w_{i+1} = w_i − α_i·Σ_{j=1}^{n} (w_i⊤x^(j) − y^(j))·x^(j)
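A minimal NumPy sketch of the vector update with a fixed step size α (the data, step size, and iteration count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 5
X = rng.normal(size=(n, k))
w_true = rng.normal(size=k)
y = X @ w_true + 0.1 * rng.normal(size=n)

w = np.zeros(k)
alpha = 1e-3
for it in range(500):
    # gradient of ||Xw - y||_2^2, with the factor of 2 absorbed into alpha
    grad = X.T @ (X @ w - y)        # = sum_j (w^T x^(j) - y^(j)) x^(j)
    w = w - alpha * grad

print(np.round(w - w_true, 2))      # close to zero after enough iterations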
Choosing a step size

• Too small: converge very slowly
• Too big: overshoot, and possibly diverge
• A common strategy: reduce the step size over time
Parallel gradient descent for least squares

Vector update: w_{i+1} = w_i − α_i·Σ_{j=1}^{n} (w_i⊤x^(j) − y^(j))·x^(j)

Compute the summands in parallel! (Note: all workers must have the current w_i.)

Example: n = 6; 3 workers
• map: each worker computes (w_i⊤x^(j) − y^(j))·x^(j) for its local points — O(nk) distributed computation, O(k) local storage
• reduce: sum the k-vectors, Σ_{j=1}^{n} (w_i⊤x^(j) − y^(j))·x^(j), and form w_{i+1} — O(k) local computation, O(k) local storage
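The same update with the summands computed per simulated 'worker' partition; each partition returns a k-vector, so only O(k) numbers are communicated in the reduce step:

import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=n)
partitions = np.array_split(np.arange(n), 3)        # 3 workers

def local_summand(idx, w):
    # map: each worker needs the current w, then emits sum_j (w^T x^(j) - y^(j)) x^(j)
    return X[idx].T @ (X[idx] @ w - y[idx])          # a k-vector

w = np.zeros(k)
alpha = 0.02
for it in range(1000):
    grad = sum(local_summand(idx, w) for idx in partitions)   # reduce: add k-vectors
    w = w - alpha * grad

w_exact = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(w - w_exact))   # shrinks toward 0 as iterations increase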
Gradient descent summary

Pros:
• Easily parallelized
• Cheap at each iteration
• Stochastic variants can make things even cheaper

Cons:
• Slow convergence (especially compared with the closed-form solution)
• Requires communication across nodes!
DISTRIBUTED MACHINE LEARNING
Communication Principles

RECALL: Communication hierarchy

Access rates fall sharply with distance
• Parallelism makes computation fast
• Network makes communication slow

[Figure: two nodes, each with a CPU, RAM, and disk — RAM access at 50 GB/s; disk and network links at 1 GB/s and 0.3 GB/s]
2nd Rule of thumb: Perform parallel and in-memory computation
• Persist the training data across iterations (so it is not re-read or re-sent over the network every iteration)
3rd Rule of thumb: Minimize network communication — stay local
• E.g., in parallel gradient descent, each worker computes its summands (w_i⊤x^(j) − y^(j))·x^(j) locally, and only k-dimensional partial sums cross the network

Extreme: Divide-and-conquer
• Fully process each partition locally, communicate only the final result
• Single iteration; minimal communication
• Approximate results
3rd Rule of thumb: Minimize network communication — reduce iterations

Latency:
• Memory: 1e-4 ms
• Hard disk: 10 ms
• Network (same datacenter): 0.25 ms
• Network (US to Europe): >5 ms

We can amortize latency!
• Send larger messages
• Batch their communication
• E.g., train multiple models together
1st Rule of thumb
Computation and storage should be linear (in n, k)
PCA vs. Linear Regression

[Figure: European Size (American Size plus noise) plotted against American Size, with a blue fitted line in each panel]

• Linear Regression: predict y from x. Evaluate the accuracy of the predictions (represented by the blue line) by the vertical distances between the points and the line.
• PCA: reconstruct the 2D data via 2D data with a single degree of freedom. Evaluate the reconstructions (represented by the blue line) by the Euclidean distances between the original points and their projections.
Computing PCA solution

Given: n × k matrix of uncentered raw data
Goal: Compute an r ≪ k dimensional representation

PCA at scale — Case 1: Big n and Small k
• O(k²) local storage, O(k³) local computation, O(kr) communication
• Similar strategy as closed-form linear regression
Step 0: Data-parallel storage of the data matrix (Example: n = 6; 3 workers)

Step 1: Center the data
• Compute the k feature means, m ∈ ℝ^{k}, via reduce: m = (1/n)·Σ_i x^(i)
• Communicate m to all workers, which subtract it from their local points
Step 2: Compute the k × k matrix X⊤X of the centered data (its top eigenvectors are the principal components)
• As before, X⊤X = Σ_{i=1}^{n} x^(i) x^(i)⊤ is a sum of outer products
• map: each worker computes x^(i) x^(i)⊤ for its local (centered) points — O(nk²) distributed computation
• reduce: sum the partial results, Σ_i x^(i) x^(i)⊤, on the driver
Step 3: Eigendecomposition (eigh)
• Perform it locally on the driver, since k is small
• Communicate the r principal components (P ∈ ℝ^{k×r}) to the workers

Step 4: Compute the PCA scores
• Each worker multiplies its local points by P — O(kr) local computation
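A local (small-k) NumPy sketch of Steps 1-4, without the map/reduce plumbing (synthetic data; eigh returns eigenvalues in ascending order, so the top-r eigenvectors are taken from the end):

import numpy as np

rng = np.random.default_rng(0)
n, k, r = 100, 5, 2
X = rng.normal(size=(n, k)) @ rng.normal(size=(k, k))   # correlated features

# Step 1: center the data (m would be computed by a reduce, then broadcast)
m = X.mean(axis=0)
Xc = X - m

# Step 2: the k x k matrix X^T X (the sum of outer products from the map/reduce)
C = Xc.T @ Xc

# Step 3: eigendecomposition on the driver; the top-r eigenvectors form P (k x r)
eigvals, eigvecs = np.linalg.eigh(C)
P = eigvecs[:, ::-1][:, :r]

# Step 4: PCA scores, computed on each worker as x^(i)^T P
scores = Xc @ P
print(scores.shape)   # (100, 2)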
PCA at scale — Case 2: Big n and Big k
• The k × k matrix X⊤X is now too large to store and eigendecompose locally
• Instead, use an iterative algorithm. Repeat for O(r) iterations:
  1. Communicate v_i ∈ ℝ^{k} to all workers
  2. Compute q_i = X⊤Xv_i in a distributed fashion
     • Step 1: b_i = Xv_i
     • Step 2: q_i = X⊤b_i
     • Perform both in a single map-reduce!
  3. The driver uses q_i = X⊤Xv_i to update its estimate of P
Compute q_i = X⊤Xv_i in a distributed fashion:
• b_ij = v_i⊤x^(j): each component of b_i = Xv_i is a dot product with a data point
• q_i = X⊤b_i is a sum of rescaled data points, i.e., q_i = Σ_{j=1}^{n} b_ij·x^(j)
• Locally compute each dot product and rescale each point, then sum all the rescaled points in the reduce step!

Example: n = 6; 3 workers
• map: each worker computes b_ij and b_ij·x^(j) for its local points — O(nk) distributed computation
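A sketch of one distributed multiply q = X⊤Xv with simulated partitions: each 'worker' forms its dot products b_j = v⊤x^(j), rescales its points, and emits only a k-vector; the reduce just adds them:

import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 4
X = rng.normal(size=(n, k))
v = rng.normal(size=k)
partitions = np.array_split(np.arange(n), 3)      # n = 6 points, 3 workers

def local_xtxv(idx, v):
    # b_j = v^T x^(j) for the local points, then sum_j b_j x^(j)  (a k-vector)
    b = X[idx] @ v
    return X[idx].T @ b

q = sum(local_xtxv(idx, v) for idx in partitions)  # reduce: add the k-vectors
print(np.allclose(q, X.T @ (X @ v)))               # True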
Putting it together: the song-year prediction pipeline

Goal: Predict a song's release year from audio features

Raw Data: Millionsong Dataset from UCI ML Repository
• Explore features

Split Data: Create training, validation, and test sets

Feature Extraction:
• Initially use raw features
• Subsequently compare with quadratic features

Evaluation (Part 1): Hyperparameter tuning
• Use grid search to find good values for the regularization and step size hyperparameters
• Evaluate using RMSE

Evaluation (Part 2): Evaluate the final model
• Evaluate using RMSE

Predict: The final model could be used to predict the release year for new songs (we won't do this, though).