Distributed Linear Regression Class Notes


10-605/10-805:

Machine Learning
with Large Datasets
Fall 2024

Distributed Linear Regression


Announcements

• HW2: due Friday, Sept 20 at 11:59pm ET


• please start early!
• use piazza, office hours, recitation for questions
• but don’t provide direct answers on these forums

• AWS / HW2 recitation


• This Friday!
• Will get you set up with AWS & walk you through HW2 data preprocessing
• We expect AWS credits to arrive before Friday’s recitation
RECALL

Machine learning pipeline


[Pipeline diagram: raw data → pre-processing (feature extraction) → data → ML method (optimization over model & parameters) → model → post-processing (evaluation) → deployed model]
PERFORMING THE ML PIPELINE AT SCALE

Key course topics

Data preparation
• Data cleaning
• Data summarization
• Visualization
• Dimensionality reduction

Training
• Distributed ML
• Large-scale optimization
• Scalable deep learning
• Efficient data structures
• Hyperparameter tuning

Inference
• Hardware for ML
• Techniques for low-latency inference (compression, pruning, distillation)

Infrastructure / Frameworks
• Apache Spark
• PyTorch
• AWS / Google Cloud / Azure

Advanced topics
• Data curation
• Efficient fine-tuning
• Scaling laws for FMs
• Safety at scale
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
REVIEW

Big O Notation

Describes how algorithms respond to changes in input size
• Both in terms of processing time and space requirements
• We refer to complexity and Big O notation synonymously

Required space is proportional to units of storage
• Typically 8 bytes to store a floating point number

Required time is proportional to the number of 'basic operations'
• Arithmetic operations (+, −, ×, /), logical tests (<, >, ==)
Big O Notation

Notation: f(x) = O(g(x))
• Can describe an algorithm's time or space complexity
• Informal definition: f does not grow faster than g
• Formal definition: f(x) = O(g(x)) if |f(x)| ≤ C|g(x)| for all x > N

Ignores constants and lower-order terms
• For large enough x, these terms won't matter
• E.g., x² + 3x ≤ Cx² for all x > N
EXAMPLE 1:

O(1) complexity

Constant time algorithms perform same # of operations every time they’re called
• E.g., performing a fixed number of arithmetic operations
Constant space algorithms require fixed storage every time they're called
• E.g., storing the results of a fixed number of arithmetic operations
EXAMPLE 2:

O(n) complexity

Linear time algorithms perform a number of operations proportional to the number of inputs
• E.g., adding two n-dimensional vectors requires O(n) arithmetic operations
Linear space algorithms require storage proportional to the size of the inputs
• E.g., adding two n-dimensional vectors results in a new n-dimensional vector
which requires O(n) storage
EXAMPLE 3:

O(n²) complexity

Quadratic time algorithms perform a number of operations proportional to the square of the number of inputs
• E.g., outer product of two n-dimensional vectors requires O(n²) multiplication operations (one per each entry of the resulting matrix)

Quadratic space algorithms require storage proportional to the square of the size of the inputs
• E.g., outer product of two n-dimensional vectors requires O(n²) storage (one per each entry of the resulting matrix)
Time and space complexity can differ

Inner product of two n-dimensional vectors


• O(n) time complexity to multiply n pairs of numbers
• O(1) space complexity to store result (which is a scalar)

Matrix inversion of an n × n matrix


• O(n³) time complexity to perform inversion
• O(n²) space complexity to store result
EXAMPLE 1:

Matrix-vector multiply

Goal: multiply an n × k matrix with a k × 1 vector

Computing result takes O(nk) time


• There are n entries in the resulting vector
• Each entry computed via dot product between two k-dimensional vectors (a
row of input matrix and input vector)

Storing result takes O(n) space


• The result is an n-dimensional vector
EXAMPLE 2:

Matrix-matrix multiply
Goal: multiply an n × k matrix with a k × p matrix

Computing result takes O(npk) time


• There are np entries in the resulting matrix
• Each entry computed via dot product between two k-dimensional vectors

Storing result takes O(np) space


• The result is an n × p matrix
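To make the shape and cost bookkeeping concrete, here is a small NumPy sketch (my own illustration, not from the slides); the sizes n, k, p are arbitrary.

import numpy as np

n, k, p = 1000, 50, 20
A = np.random.randn(n, k)   # n x k matrix
v = np.random.randn(k)      # k-dimensional vector
B = np.random.randn(k, p)   # k x p matrix

Av = A @ v   # matrix-vector multiply: O(nk) time, result needs O(n) space
AB = A @ B   # matrix-matrix multiply: O(npk) time, result needs O(np) space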
EXAMPLE 1:

Regression
How much should you sell your house for?

data → regression → intelligence

[Plot: price ($) vs. house size, with a fitted line used to predict the price of a new house]

input: houses & features; learn: x → y relationship; predict: y (continuous)

Regression

Goal: Learn a mapping from observations (features) to continuous labels


given a training set (supervised learning)

Examples:
• Size, Location, Age → House price
• Audio features → Song year
• Processes, memory → Power consumption
• Historical financials → Future stock price
RECALL

Empirical risk minimization


f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• A popular approach in supervised ML


• Given a loss and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing
the empirical risk
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience

Question: how should we select our function class, F, and our loss function, ℓ?
COMPUTATIONAL CONSIDERATIONS

Empirical risk minimization


f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• Even when the class of functions, F, is simple (e.g., linear functions), the above
optimization problem might be non-convex and thus difficult to solve
• For example, this (non-convexity) is the case for the 0/1 loss
COMPUTATIONAL CONSIDERATIONS

Empirical risk minimization


f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• We will consider how to solve this objective when:


• the number of features (k) is large
• the number of observations (n) is large
• Techniques:
• parallel and distributed learning, large-scale optimization, efficient data
structures, dimensionality reduction, etc
• Will first focus on convex objectives (linear regression, logistic regression), and
then discuss non-convex objectives (deep neural networks) later in the course
Linear least squares regression

Example: Predicting house price from size, location, age

For each observation we have a feature vector, x, and label, y

x = [x1 x2 x3]

We assume a linear mapping between features and label:

y ≈ w0 + w1x1 + w2x2 + w3x3
Linear least squares regression

Example: Predicting house price from size, location, age

We can augment the feature vector to incorporate offset:

x = [1 x1 x2 x3]

We can then rewrite this linear mapping as a scalar product:

y ≈ ŷ = ∑ᵢ₌₀³ wᵢxᵢ = w⊤x
Why a linear mapping?

Simple

Often works well in practice

Can introduce complexity via feature extraction


1D example

Goal: find the line of best fit
• x coordinate: features
• y coordinate: labels

y ≈ ŷ = w0 + w1x
(w0: intercept / offset, w1: slope)

[Plot: data points in the (x, y) plane with the fitted line]
LINEAR REGRESSION VIA

Empirical risk minimization (with F = linear, actually affine, functions)

f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• A popular approach in supervised ML


• Given a loss and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing
the empirical risk
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience

Question: how should we select our function class, F, and our loss function, ℓ?
Evaluating predictions

Can measure ‘closeness’ between label and prediction


• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or ‘loss’ function?


• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)²
RECALL

Loss function: Regression


Square loss: ℓ(f(x), y) = (f(x) − y)²
Absolute loss: ℓ(f(x), y) = |f(x) − y|

Example (y = true price, f(x) = predicted price):
y = $300,000, f(x) = $300,000: square loss 0, absolute loss 0
y = $300,000, f(x) = $300,500: square loss 250,000, absolute loss 500
y = $300,000, f(x) = $400,000: square loss 1E+10, absolute loss 100,000
Evaluating predictions

Can measure ‘closeness’ between label and prediction


• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or ‘loss’ function?


• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)² ← has nice mathematical properties
LINEAR REGRESSION VIA

Empirical risk minimization (with F = linear, actually affine, functions and ℓ = square loss)

f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
• A popular approach in supervised ML
• Given a loss and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing
the empirical risk
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience

Question: how should we select our function class, F, and our loss function, ℓ?
How can we learn model (w)?

Assume we have n training points, where x⁽ⁱ⁾ denotes the ith point

Recall two earlier points:
• Linear assumption: ŷ = w⊤x
• We use square loss: (y − ŷ)²

Idea: Find w that minimizes square loss over training points:

min_w ∑ᵢ₌₁ⁿ (w⊤x⁽ⁱ⁾ − y⁽ⁱ⁾)²

Given n training points with k features, we define:
• X ∈ ℝⁿˣᵏ: matrix storing points
• y ∈ ℝⁿ: real-valued labels
• ŷ ∈ ℝⁿ: predicted labels, where ŷ = Xw
• w ∈ ℝᵏ: regression parameters / model to learn

Least Squares Regression: Learn mapping (w) from features to labels that minimizes residual sum of squares:

min_w ||Xw − y||₂²

Equivalently, min_w ∑ᵢ₌₁ⁿ (w⊤x⁽ⁱ⁾ − y⁽ⁱ⁾)², by definition of the Euclidean norm
Find solution by setting derivative to zero

1D case: f(w) = ||wx − y||₂² = ∑ᵢ₌₁ⁿ (wx⁽ⁱ⁾ − y⁽ⁱ⁾)²

(df/dw)(w) = 2 ∑ᵢ₌₁ⁿ x⁽ⁱ⁾(wx⁽ⁱ⁾ − y⁽ⁱ⁾) = 0  ⇒  w x⊤x − x⊤y = 0  ⇒  w = (x⊤x)⁻¹ x⊤y

Least Squares Regression: Learn mapping (w) from features to labels that minimizes residual sum of squares:

min_w ||Xw − y||₂²

Closed-form solution: w = (X⊤X)⁻¹X⊤y (if inverse exists)
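A minimal NumPy sketch of this closed-form solution (my own illustration, not from the slides); solving the normal equations with np.linalg.solve is preferred in practice over forming an explicit inverse.

import numpy as np

def least_squares(X, y):
    # Closed form: w = (X^T X)^{-1} X^T y, computed by solving X^T X w = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny synthetic check (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = least_squares(X, y)   # should be close to w_true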
Overfitting and generalization
We want good predictions on new data, i.e., ’generalization’

Least squares regression minimizes training error, and could overfit


• Simpler models are more likely to generalize (Occam’s razor)

Can we change the problem to penalize for model complexity?


• Intuitively, models with smaller weights are simpler

HOW TO PREVENT OVERFITTING?

Regularization
θ̂ = arg min_θ (1/n) ∑ᵢ₌₁ⁿ ℓ(f_θ(xᵢ), yᵢ) + λR(θ)

• Key idea: modify the ERM objective to penalize complex models
• Helps to prevent overfitting
• Note: we have re-parameterized our hypothesis, f, in terms of θ
• Larger λ:
  • more regularization, reduced likelihood of overfitting
  • more bias, less variance

Question: how to select our regularizer, R(θ)?
Given n training points with k features, we define:
• X ∈ ℝⁿˣᵏ: matrix storing points
• y ∈ ℝⁿ: real-valued labels
• ŷ ∈ ℝⁿ: predicted labels, where ŷ = Xw
• w ∈ ℝᵏ: regression parameters / model to learn

Ridge Regression: Learn mapping (w) that minimizes residual sum of squares (training error) along with a regularization term (model complexity):

min_w ||Xw − y||₂² + λ||w||₂²

λ is a free parameter that trades off between training error and model complexity

Closed-form solution: w = (X⊤X + λI_k)⁻¹X⊤y
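A corresponding ridge sketch (again mine, not the course code); lam plays the role of λ above, and lam = 0 recovers ordinary least squares.

import numpy as np

def ridge(X, y, lam):
    # Closed form: w = (X^T X + lam * I_k)^{-1} X^T y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)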
EXAMPLE

Millionsong Regression Pipeline


training
set model
full
dataset
new entity

test set accuracy prediction

Obtain Raw Data

Split Data

Feature Extraction

Example Supervised Learning Pipeline Supervised Learning

Evaluation

Predict
Goal: Predict song's release year from audio features

Raw Data: Millionsong Dataset from UCI ML Repository
• Western, commercial tracks from 1980-2014
• 12 timbre averages (features) and release year (label)
Split Data: Train on the training set, evaluate with the test set
• Test set simulates unobserved data
• Test error tells us whether we've generalized well
Feature Extraction: Quadratic features
• Compute pairwise feature interactions
• Captures covariance of initial timbre features
• Leads to a non-linear model relative to raw features
Given 2-dimensional data, quadratic features are:

x = [x1 x2]  ⇒  Φ(x) = [x1²  x1x2  x2x1  x2²]
z = [z1 z2]  ⇒  Φ(z) = [z1²  z1z2  z2z1  z2²]

More succinctly:

Φ′(x) = [x1²  √2·x1x2  x2²]    Φ′(z) = [z1²  √2·z1z2  z2²]

Equivalent inner products:

Φ(x)⊤Φ(z) = x1²z1² + 2x1x2z1z2 + x2²z2² = Φ′(x)⊤Φ′(z)
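A short sketch (mine) of the succinct 2-D quadratic feature map, with the √2 scaling that makes the inner products match:

import numpy as np

def quad_features(x):
    # Map [x1, x2] -> [x1^2, sqrt(2)*x1*x2, x2^2]
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# Inner product in feature space equals the squared inner product in input space
assert np.isclose(quad_features(x) @ quad_features(z), (x @ z) ** 2)

For k raw features (e.g., the 12 timbre averages), the full quadratic expansion np.outer(x, x).ravel() produces all k² pairwise products.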
Supervised Learning: Least Squares Regression
• Learn a mapping from entities to continuous labels given a training set
• Audio features → Song year
Ridge Regression: Learn mapping (w) that minimizes residual sum of squares (training error) along with a regularization term (model complexity):

min_w ||Xw − y||₂² + λ||w||₂²

λ is a free parameter that trades off between training error and model complexity

How do we choose a good value for this free parameter?
• Most methods have free parameters / 'hyperparameters' to tune
First thought: Search over multiple values, evaluate each on the test set
• But the goal of the test set is to simulate unobserved data
• We may overfit if we use it to choose hyperparameters
Second thought: Create another hold-out dataset (a validation set) for this search
[Pipeline diagram: the full dataset is now split into training, validation, and test sets]

Evaluation (Part 1): Hyperparameter tuning
• Training set: train various models
• Validation set: evaluate various models (e.g., grid search)
• Test set: evaluate final model's accuracy
[Figure: grid of candidate values over Hyperparameter 1 × Hyperparameter 2, e.g., regularization parameter λ ∈ {10⁻⁸, 10⁻⁶, 10⁻⁴, 10⁻², 1}]

We'll cover advanced hyperparameter tuning methods later in the course

Grid Search: Exhaustively search through hyperparameter space
• Define and discretize the search space (linear or log scale)
• Evaluate points via validation error
Evaluating predictions

How can we compare labels and predictions for n validation points?

Least squares optimization involves the squared loss, (y − ŷ)², so it seems reasonable to use mean squared error (MSE):

MSE = (1/n) ∑ᵢ₌₁ⁿ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²

But MSE's unit of measurement is the square of the quantity being measured, e.g., "squared years" for song year prediction
More natural to use root-mean-squared error (RMSE), i.e., √MSE
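A self-contained sketch (mine; the data is synthetic, not the Millionsong set) that scores a grid of λ values by validation RMSE using a ridge helper like the one sketched earlier:

import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))

# Synthetic train/validation split (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = X @ rng.normal(size=12) + 0.1 * rng.normal(size=200)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

# Grid search over the regularization parameter, scored by validation RMSE
grid = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
val_rmse = {lam: rmse(y_val, X_val @ ridge(X_train, y_train, lam)) for lam in grid}
best_lam = min(val_rmse, key=val_rmse.get)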
Evaluation (Part 2): Evaluate final model
• Training set: train various models
• Validation set: evaluate various models
• Test set: evaluate final model's accuracy
Predict: The final model can then be used to make predictions on future observations, e.g., new songs
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
RECALL

Least Squares Regression

Least Squares Regression: Learn mapping (w) from features to labels that minimizes residual sum of squares:

min_w ||Xw − y||₂²

Closed-form solution: w = (X⊤X)⁻¹X⊤y (if inverse exists)

How do we solve this computationally?
• Note: we will discuss least squares regression, as the computational profile is similar for ridge regression
LEAST SQUARES REGRESSION

Computing closed form solution


w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations

Consider the number of arithmetic operations (+, −, ×, /)

Computational bottlenecks:
• Matrix multiply X⊤X: O(nk²) operations
• Matrix inverse: O(k³) operations

Other methods (Cholesky, QR, SVD) have the same complexity


LEAST SQUARES REGRESSION

Storage requirements
w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

Consider storing values as floats (8 bytes)

Storage bottlenecks:
• X⊤X and its inverse: O(k²) floats
• X: O(nk) floats
LEAST SQUARES REGRESSION

Large n and small k


w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

Assume O(k³) computation and O(k²) storage are feasible on a single machine

Storing X and computing X⊤X are the bottlenecks
Can distribute storage and computation!
• Store data points (rows of X) across machines
• Compute X⊤X as a sum of outer products
Matrix multiplication via inner products

Each entry of the output matrix is the result of an inner product between a row of the first input matrix and a column of the second:

[9 3 5]   [1 2]   [28 48]
[4 1 2] × [3 5] = [11 19]
          [2 3]

e.g., 9·1 + 3·3 + 5·2 = 28

Matrix multiplication via outer products

The output matrix is the sum of outer products between corresponding columns of the first input matrix and rows of the second:

[9 3 5]   [1 2]   [9 18]   [9 15]   [10 15]   [28 48]
[4 1 2] × [3 5] = [4  8] + [3  5] + [ 4  6] = [11 19]
          [2 3]
X⊤X = ∑ᵢ₌₁ⁿ x⁽ⁱ⁾ x⁽ⁱ⁾⊤  (a sum of outer products of the rows of X)

Example: n = 6; 3 workers
• workers: store x⁽¹⁾, …, x⁽⁶⁾ across machines — O(nk) distributed storage
• map: compute each outer product x⁽ⁱ⁾x⁽ⁱ⁾⊤ — O(nk²) distributed computation, O(k²) local storage
• reduce: sum the outer products and invert, (∑ᵢ x⁽ⁱ⁾x⁽ⁱ⁾⊤)⁻¹ — O(k³) local computation, O(k²) local storage
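A sketch of this map/reduce pattern in PySpark (my own toy illustration with made-up data; HW2 and MLlib organize this differently):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[3]").appName("dist-ols").getOrCreate()
sc = spark.sparkContext

# Toy dataset: n = 6 points (x, y) with k = 3 features, partitioned across 3 workers
rng = np.random.default_rng(0)
points = [(rng.normal(size=3), float(rng.normal())) for _ in range(6)]
rdd = sc.parallelize(points, 3).cache()

# map: each point contributes a k x k outer product and a k-vector
# reduce: sum the contributions (only O(k^2) data crosses the network)
XtX = rdd.map(lambda p: np.outer(p[0], p[0])).reduce(lambda a, b: a + b)
Xty = rdd.map(lambda p: p[0] * p[1]).reduce(lambda a, b: a + b)

# The O(k^3) solve happens locally on the driver
w = np.linalg.solve(XtX, Xty)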
LEAST SQUARES REGRESSION

Large n and large k


w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

As before, storing X and computing X⊤X are bottlenecks

Now, storing and operating on X⊤X is also a bottleneck
• Can't easily distribute!
1st Rule of thumb


Computation and storage should be at most linear (in n, k)
Large n and large k
We need methods that are linear in time and space

One idea: Exploit sparsity or reduce dimension


• Explicit sparsity can provide orders of magnitude storage and computational gains

Sparse data is prevalent:
• Text processing: bag-of-words, n-grams
• Collaborative filtering: ratings matrix
• Graphs: adjacency matrix
• Categorical features: one-hot encoding

Example sparse representation:
dense:  [1. 0. 0. 0. 0. 0. 3.]
sparse: size: 7, indices: [0, 6], values: [1., 3.]
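A small sketch (mine) of the dense vs. sparse representation above using SciPy; Spark's pyspark.ml.linalg.Vectors.sparse takes the same (size, indices, values) form.

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([1., 0., 0., 0., 0., 0., 3.])
sparse = csr_matrix(dense)                         # stores only the non-zeros
print(sparse.shape, sparse.indices, sparse.data)   # (1, 7) [0 6] [1. 3.]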
• Latent sparsity assumption can be used to reduce dimension, e.g., PCA,
low-rank approximation (we’ll revisit this soon…)

[Diagram: an n×k matrix approximated by the product of an n×r and an r×k matrix ('low-rank')]
Another idea: Use different algorithms


• Gradient descent is an iterative algorithm that requires
O(nk) computation and O(k) local storage per iteration
Closed form solution for large n and large k

Example: n = 6; 3 workers
• workers: store the data points — O(nk) distributed storage
• map: compute each outer product x⁽ⁱ⁾x⁽ⁱ⁾⊤ — O(nk²) distributed computation, O(k²) local storage
• reduce: sum the outer products and invert — O(k³) local computation, O(k²) local storage
Gradient descent for large n and large k

Example: n = 6; 3 workers
• workers: store the data points — O(nk) distributed storage
• map: ? — goal: O(nk) distributed computation, O(k) local storage (vs. O(nk²) and O(k²) above)
• reduce: ? — goal: O(k) local computation, O(k) local storage (vs. O(k³) and O(k²) above)
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
Linear Regression Optimization

Goal: Find w that minimizes f(w) = ||Xw − y||₂²
• Closed-form solution exists
• Gradient descent is iterative (intuition: go downhill!)

[Figure: 1D objective f(w) with minimizer w*]

Scalar objective: f(w) = ||wx − y||₂² = ∑ⱼ₌₁ⁿ (wx⁽ʲ⁾ − y⁽ʲ⁾)²
Gradient descent

Start at a random point
Repeat:
  Determine a descent direction
  Choose a step size
  Update
Until stopping criterion is satisfied

[Figure: f(w) with iterates w0, w1, w2, … moving toward the minimizer w*]
Where will we converge?
[Figure: a convex objective f(w) vs. a non-convex objective g(w)]
• Convex: any local minimum is a global minimum
• Non-convex: multiple local minima may exist
Least Squares, Ridge Regression, and Logistic Regression are all convex!
(Neural networks are non-convex)

Choosing a descent direction (1D)
[Figure: 1D objective; if the slope at w0 is positive, go left; if negative, go right; if zero, done]

We can only move in two directions
Negative slope is the direction of descent!

Update Rule: wᵢ₊₁ = wᵢ − αᵢ (df/dw)(wᵢ), where αᵢ is the step size
Choosing a descent direction

2D Example:
• Function values are in black/white
and black represents higher values
• Arrows are gradients

"Gradient2" by Sarang. Licensed under CC BY-SA 2.5 via Wikimedia Commons


http://commons.wikimedia.org/wiki/File:Gradient2.svg#/media/File:Gradient2.svg

We can move anywhere in ℝᵏ
The negative gradient is the direction of steepest descent!

Update Rule: wᵢ₊₁ = wᵢ − αᵢ ∇f(wᵢ), where αᵢ is the step size
Gradient descent for least squares
Update Rule: wᵢ₊₁ = wᵢ − αᵢ (df/dw)(wᵢ)

Scalar objective: f(w) = ||wx − y||₂² = ∑ⱼ₌₁ⁿ (wx⁽ʲ⁾ − y⁽ʲ⁾)²

Derivative (chain rule): (df/dw)(w) = ∑ⱼ₌₁ⁿ 2(wx⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾

Scalar Update (2 absorbed into α): wᵢ₊₁ = wᵢ − αᵢ ∑ⱼ₌₁ⁿ (wᵢx⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾

Vector Update: wᵢ₊₁ = wᵢ − αᵢ ∑ⱼ₌₁ⁿ (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾
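A NumPy sketch of these updates (mine, not the HW2 code); the decreasing step-size schedule α/(n√i) follows the common choice discussed below.

import numpy as np

def gd_least_squares(X, y, alpha=1.0, num_iters=100):
    # Batch gradient descent for min_w ||Xw - y||_2^2
    n, k = X.shape
    w = np.zeros(k)
    for i in range(1, num_iters + 1):
        grad = X.T @ (X @ w - y)                     # sum_j (w.x^(j) - y^(j)) x^(j)
        w = w - (alpha / (n * np.sqrt(i))) * grad    # decreasing step size
    return w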
Choosing a step size
[Figure: three 1D objectives illustrating step-size choices]
• Too small: converge very slowly
• Too big: overshoot, can diverge
• Better: reduce the step size over time

Theoretical convergence results exist for various step sizes

A common step size is αᵢ = α / (n√i), where α is a constant, i is the iteration number, and n is the number of training points
Parallel gradient descent for least squares
Vector Update: wᵢ₊₁ = wᵢ − αᵢ ∑ⱼ₌₁ⁿ (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾

Compute the summands in parallel! (note: workers must all have wᵢ)

Example: n = 6; 3 workers
• workers: store the data points — O(nk) distributed storage
• map: compute each summand (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾ — O(nk) distributed computation, O(k) local storage
• reduce: sum the summands into the gradient and update wᵢ₊₁ — O(k) local computation, O(k) local storage
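A PySpark sketch of the parallel update (my own toy illustration; in practice one would broadcast w and tune α and the iteration count):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[3]").appName("dist-gd").getOrCreate()
sc = spark.sparkContext

rng = np.random.default_rng(0)
data = [(rng.normal(size=3), float(rng.normal())) for _ in range(6)]
rdd = sc.parallelize(data, 3).cache()   # persist training data across iterations

n, k = rdd.count(), 3
w = np.zeros(k)
for i in range(1, 51):
    # map: each worker computes its local summands (w.x^(j) - y^(j)) x^(j)
    # reduce: sum them into the full gradient; only O(k) data returns to the driver
    grad = rdd.map(lambda p: (w @ p[0] - p[1]) * p[0]).reduce(lambda a, b: a + b)
    w = w - (1.0 / (n * np.sqrt(i))) * grad   # updated w reaches workers via the closure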
Gradient descent summary

Pros:
• Easily parallelized
• Cheap at each iteration
• Stochastic variants can make things even cheaper

Cons:
• Slow convergence (especially compared with closed-form)
• Requires communication across nodes!
DISTRIBUTED MACHINE LEARNING

Communication Principles
RECALL

Communication hierarchy

Access rates fall sharply with distance
• >50× gap between memory and network!
• CPU ↔ RAM: 50 GB/s; RAM ↔ local disks: 1 GB/s; within a rack: 1 GB/s; across racks: 0.3 GB/s

Be mindful of this hierarchy when developing parallel/distributed algorithms!


RECALL

Communication hierarchy
Access rates fall sharply with distance
• Parallelism makes computation fast
• Network makes communication slow

Be mindful of this hierarchy when developing parallel/distributed algorithms!


2nd Rule of thumb
Perform parallel and in-memory computation

Persisting in memory reduces communication


• Especially for iterative computation (gradient descent)
Scale-up (powerful multicore machine)
✓ No network communication
✗ Expensive hardware, eventually hit a wall

[Diagram: a single machine with CPU, RAM, and disk]
Scale-out (distributed, e.g., cloud-based)
✓ Commodity hardware, scales to massive problems
✗ Need to deal with network communication

[Diagram: many machines, each with CPU, RAM, and disk, connected by a network]

Persist training data in memory across iterations (rather than re-reading from disk), especially for iterative computation such as gradient descent.
3rd Rule of thumb
Minimize network communication

Q: How should we leverage distributed computing while


mitigating network communication?

First Observation: We need to store and potentially communicate


Data, Model, and Intermediate objects
• A: Keep large objects local
3rd Rule of thumb
Minimize network communication — stay local

Example: Linear regression, big n and small k


• Solve via closed form (not iterative!)
• Communicate O(k²) intermediate data
• Compute locally on data (Data Parallel)

[Map/reduce diagram: workers compute local outer products x⁽ⁱ⁾x⁽ⁱ⁾⊤; the reduce step sums them and inverts]
3rd Rule of thumb
Minimize network communication — stay local

Example: Linear regression, big n and big k


• Gradient descent, communicate wi
• O(k) communication OK for fairly large k
• Compute locally on data (Data Parallel)

[Map/reduce diagram: workers compute local summands (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾; the reduce step sums them to update wᵢ₊₁]
3rd Rule of thumb
Minimize network communication — stay local

Example: Hyperparameter tuning for ridge regression with


small n and small k
• Data is small, so can communicate it
• ‘Model’ is collection of regression models
corresponding to different hyperparameters
• Train each model locally (Model Parallel)
3rd Rule of thumb
Minimize network communication — stay local

Example: Linear regression, big n and huge k


• Gradient descent
• O(k) communication slow with hundreds of millions
parameters
• Distribute data and model (Data and Model Parallel)
• Often rely on sparsity to reduce communication
3rd Rule of thumb
Minimize network communication

Q: How should we leverage distributed computing while mitigating


network communication?

First Observation: We need to store and potentially communicate Data,


Model and Intermediate objects
• A: Keep large objects local

Second Observation: ML methods are typically iterative


• A: Reduce # iterations
3rd Rule of thumb
Minimize network communication — reduce iterations

Distributed iterative algorithms must compute and communicate


• In Bulk Synchronous Parallel (BSP) systems, e.g., Apache Spark,
we strictly alternate between the two
Distributed Computing Properties
• Parallelism makes computation fast
• Network makes communication slow
Idea: Design algorithms that compute more, communicate less
• Do more computation at each iteration
• Reduce total number of iterations
3rd Rule of thumb
Minimize network communication — reduce iterations

Extreme: Divide-and-conquer
• Fully process each partition locally, communicate final result
• Single iteration; minimal communication
• Approximate results
3rd Rule of thumb
Minimize network communication — reduce iterations

Less extreme: Local-updating methods


• Do more work locally than gradient descent before communicating
• Will discuss more later in the course …
3rd Rule of thumb
Minimize network communication — reduce iterations

Throughput: How many bytes per second can be read


Latency: Cost to send message (independent of size)

Latency:
• Memory: 1e-4 ms
• Hard disk: 10 ms
• Network (same datacenter): 0.25 ms
• Network (US to Europe): >5 ms

We can amortize latency!
• Send larger messages
• Batch their communication
• E.g., train multiple models together
1st Rule of thumb
Computation and storage should be linear (in n, k)

2nd Rule of thumb


Perform parallel and in-memory computation

3rd Rule of thumb


Minimize network communication
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
RECALL

Toy example: Shoe sizes

Shoes are often labeled in terms of their European and American shoe sizes
• We expect sizes to be correlated (but maybe not perfectly, due to some errors/noise)

How can we find a simpler, more compact representation of this data?
• Pick a direction & project into one dimension

[Scatter plot: European size vs. American size]
Toy example: Shoe sizes

[Same scatter plot, with a bad direction to project onto highlighted]
Toy example: Shoe sizes

[Same scatter plot, with a better direction to project onto highlighted]
How to find the ‘best’ direction?

Goal: Minimize reconstruction error
• i.e., find the direction that minimizes the Euclidean distance between the original points and their projections

This is the key idea behind PCA

[Scatter plot: European size vs. American size, with points projected onto the chosen direction]
Linear Regression: predict y from x. Evaluate the accuracy of predictions (represented by the blue line) by vertical distances between the points and the line.

PCA: reconstruct 2D data via 2D data with a single degree of freedom. Evaluate reconstructions (represented by the blue line) by Euclidean distances.

[Side-by-side plots: (x, y) with a regression line vs. (American size, European size) with a PCA direction]
Computing PCA solution

Given: n × k matrix of uncentered raw data
Goal: Compute an r ≪ k dimensional representation

Step 1: Center data
Step 2: Compute the covariance or scatter matrix: C_X = (1/n) X⊤X versus X⊤X
Step 3: Eigendecomposition
Step 4: Compute PCA scores: Z = XP
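A local NumPy sketch of these four steps (mine, for the small-k case):

import numpy as np

def pca_scores(X_raw, r):
    X = X_raw - X_raw.mean(axis=0)          # Step 1: center data
    C = (X.T @ X) / X.shape[0]              # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # Step 3: eigendecomposition (ascending order)
    P = eigvecs[:, ::-1][:, :r]             # keep the top-r principal components
    return X @ P                            # Step 4: PCA scores Z = XP

Z = pca_scores(np.random.randn(100, 5), r=2)   # illustrative call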
PCA at scale

Case 1: Big n and small k
• O(k²) local storage, O(k³) local computation, O(kr) communication
• Similar strategy as closed-form linear regression

Case 2: Big n and big k
• O(k) local storage and computation on workers, O(kr) communication
• Iterative algorithm


Step 0: Data-parallel storage

Example: n = 6; 3 workers
• workers: store x⁽¹⁾, …, x⁽⁶⁾ — O(nk) distributed storage
Step 1: Center data
• Compute the k feature means, m = (1/n) ∑ᵢ x⁽ⁱ⁾ ∈ ℝᵏ, via a reduce step — O(k) local storage, O(k) local computation, O(k) communication
• Communicate m to all workers
• Subtract m from each data point via a map step, x⁽ⁱ⁾ − m — O(k) local computation


Step 2: Compute the scatter matrix, C_X = (1/n) X⊤X
• Compute the matrix product via a sum of outer products (just like we did for closed-form linear regression!)

[Same 2×3 by 3×2 outer-product example as shown earlier]
X⊤X = ∑ᵢ₌₁ⁿ x⁽ⁱ⁾ x⁽ⁱ⁾⊤

Example: n = 6; 3 workers
• workers: store the (centered) data — O(nk) distributed storage
• map: compute each outer product x⁽ⁱ⁾x⁽ⁱ⁾⊤ — O(k²) local storage, O(nk²) distributed computation
• reduce: sum the outer products — O(k²) local storage, O(k²) local computation


Step 3: Eigendecomposition
• Perform locally (eigh) since k is small — O(k²) local storage, O(k³) local computation
• Communicate the r principal components (P ∈ ℝᵏˣʳ) to the workers — O(kr) communication


Step 4: Compute PCA scores
• Multiply each point by the principal components, P, via a map step — O(kr) local computation
An iterative approach

We can use algorithms that rely on a sequence of matrix-vector products to compute the top r eigenvectors (P)
• E.g., Krylov subspace or random projection methods

Krylov subspace methods (used in MLlib) iteratively compute X⊤Xv for some v ∈ ℝᵏ provided by the method
• Requires O(r) passes over the data, O(k) local storage on workers
• We don't need to compute the covariance matrix!

Repeat for O(r) iterations:
1. Communicate vᵢ ∈ ℝᵏ to all workers
2. Compute qᵢ = X⊤Xvᵢ in a distributed fashion
3. Driver uses qᵢ to update its estimate of P
Repeat for O(r) iterations:
1. Communicate vᵢ ∈ ℝᵏ to all workers
2. Compute qᵢ = X⊤Xvᵢ in a distributed fashion
   • Step 1: bᵢ = Xvᵢ
   • Step 2: qᵢ = X⊤bᵢ
   • Perform in a single map-reduce!
3. Driver uses qᵢ to update its estimate of P

• bᵢⱼ = vᵢ⊤x⁽ʲ⁾: each component is a dot product
• qᵢ = X⊤bᵢ is a sum of rescaled data points, i.e., qᵢ = ∑ⱼ₌₁ⁿ bᵢⱼ x⁽ʲ⁾
Compute qᵢ = X⊤Xvᵢ in a distributed fashion
• bᵢⱼ = vᵢ⊤x⁽ʲ⁾ and qᵢ = ∑ⱼ₌₁ⁿ bᵢⱼ x⁽ʲ⁾
• Locally compute each dot product and rescale each point, then sum all rescaled points in the reduce step!

Example: n = 6; 3 workers
• workers: store the data — O(nk) distributed storage
• map: compute bᵢⱼ · x⁽ʲ⁾ for each local point — O(k) local storage, O(nk) distributed computation
• reduce: sum the rescaled points into qᵢ — O(k) local storage, O(k) local computation, O(k) communication
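A PySpark sketch of this single map-reduce, wrapped in a simple power-iteration loop for the top component (my own illustration; MLlib's Krylov-based solver is more sophisticated):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[3]").appName("dist-pca").getOrCreate()
sc = spark.sparkContext

rng = np.random.default_rng(0)
rows = [rng.normal(size=4) for _ in range(6)]   # centered data points x^(j)
rdd = sc.parallelize(rows, 3).cache()

def xtx_times(v):
    # map: b_j = v.x^(j), then rescale the point; reduce: q = sum_j b_j x^(j)
    return rdd.map(lambda x: (v @ x) * x).reduce(lambda a, b: a + b)

v = rng.normal(size=4)
for _ in range(20):                 # one distributed map-reduce per iteration
    q = xtx_times(v)
    v = q / np.linalg.norm(q)       # v converges to the top principal component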
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
[HW2 pipeline: Obtain Raw Data → Split Data → Feature Extraction → Supervised Learning → Evaluation → Predict]

Goal: Predict song's release year from audio features
Raw Data: Millionsong Dataset from UCI ML Repository
• Explore features
• Shift labels so that they start at 0 (for interpretability)
• Visualize data
Split Data: Create training, validation, and test sets
Feature Extraction:
• Initially use raw features
• Subsequently compare with quadratic features
Supervised Learning: Least Squares Regression
• First implement gradient descent from scratch
• Then use the MLlib implementation
• Visualize performance by iteration
Evaluation (Part 1): Hyperparameter tuning
• Use grid search to find good values for the regularization and step size hyperparameters
• Evaluate using RMSE
• Visualize the grid search
Evaluation (Part 2): Evaluate final model
• Evaluate using RMSE
• Compare to a baseline model that returns the average song year in the training data
Predict: The final model could be used to predict the song year for new songs (we won't do this though)
