Distributed Linear Regression Class Notes


10-605/10-805:

Machine Learning
with Large Datasets
Fall 2024

Distributed Linear Regression


Announcements

• HW2: due Friday, Sept 20 at 11:59pm ET


• please start early!
• use piazza, office hours, recitation for questions
• but don’t provide direct answers on these forums

• AWS / HW2 recitation


• This Friday!
• Will get you set up with AWS & walk you through HW2 data preprocessing
• We expect AWS credits to arrive before Friday’s recitation
RECALL

Machine learning pipeline


[Pipeline diagram: raw data → pre-processing (feature extraction) → data → ML method (optimization over model & parameters) → model → post-processing (evaluation) → deployed model]
PERFORMING THE ML PIPELINE AT SCALE

Key course topics

Data preparation
• Data cleaning
• Data summarization
• Visualization
• Dimensionality reduction

Training
• Distributed ML
• Large-scale optimization
• Scalable deep learning
• Efficient data structures
• Hyperparameter tuning

Inference
• Hardware for ML
• Techniques for low-latency inference (compression, pruning, distillation)

Infrastructure / Frameworks
• Apache Spark
• PyTorch
• AWS / Google Cloud / Azure

Advanced topics
• Data curation
• Efficient fine-tuning
• Scaling laws for FMs
• Safety at scale
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
REVIEW

Big O Notation

Describes how algorithms respond to changes in input size
• Both in terms of processing time and space requirements
• We refer to complexity and Big O notation synonymously

Required space is proportional to units of storage
• Typically 8 bytes to store a floating point number

Required time is proportional to the number of 'basic operations'
• Arithmetic operations (+, −, ×, /), logical tests (<, >, ==)
Big O Notation

Notation: f(x) = O(g(x))
• Can describe an algorithm's time or space complexity
• Informal definition: f does not grow faster than g
• Formal definition: f(x) = O(g(x)) if |f(x)| ≤ C|g(x)| for all x > N

Ignores constants and lower-order terms
• For large enough x, these terms won't matter
• E.g., x² + 3x ≤ Cx² for all x > N
EXAMPLE 1:

O(1) complexity

Constant time algorithms perform same # of operations every time they’re called
• E.g., performing a fixed number of arithmetic operations
Constant space algorithms require fixed storage every time they're called
• E.g., storing the results of a fixed number of arithmetic operations
EXAMPLE 2:

O(n) complexity

Linear time algorithms perform a number of operations proportional to the number of inputs
• E.g., adding two n-dimensional vectors requires O(n) arithmetic operations
Linear space algorithms require storage proportional to the size of the inputs
• E.g., adding two n-dimensional vectors results in a new n-dimensional vector
which requires O(n) storage
EXAMPLE 3:

O(n²) complexity

Quadratic time algorithms perform a number of operations proportional to the square of the number of inputs
• E.g., outer product of two n-dimensional vectors requires O(n²) multiplication operations (one per each entry of the resulting matrix)

Quadratic space algorithms require storage proportional to the square of the size of the inputs
• E.g., outer product of two n-dimensional vectors requires O(n²) storage (one per each entry of the resulting matrix)
Time and space complexity can differ

Inner product of two n-dimensional vectors


• O(n) time complexity to multiply n pairs of numbers
• O(1) space complexity to store result (which is a scalar)

Matrix inversion of an n × n matrix


• O(n³) time complexity to perform inversion
• O(n²) space complexity to store result
EXAMPLE 1:

Matrix-vector multiply

Goal: multiply an n × k matrix with a k × 1 vector

Computing result takes O(nk) time


• There are n entries in the resulting vector
• Each entry computed via dot product between two k-dimensional vectors (a
row of input matrix and input vector)

Storing result takes O(n) space


• The result is an n-dimensional vector
EXAMPLE 2:

Matrix-matrix multiply
Goal: multiply an n × k matrix with a k × p matrix

Computing result takes O(npk) time


• There are np entries in the resulting matrix
• Each entry computed via dot product between two k-dimensional vectors

Storing result takes O(np) space


• The result is an n × p matrix
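To make the shape and cost bookkeeping concrete, here is a small NumPy sketch (my own illustration, not from the slides); the sizes n, k, p are arbitrary.

import numpy as np

n, k, p = 1000, 50, 20
A = np.random.randn(n, k)   # n x k matrix
v = np.random.randn(k)      # k-dimensional vector
B = np.random.randn(k, p)   # k x p matrix

Av = A @ v   # matrix-vector multiply: O(nk) time, result needs O(n) space
AB = A @ B   # matrix-matrix multiply: O(npk) time, result needs O(np) space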
EXAMPLE 1:

Regression
How much should you sell your house for?

data → regression → intelligence

[Plot: price ($) vs. house size, with a fitted line used to predict the price of a new house]

input: houses & features; learn: x → y relationship; predict: y (continuous)

Regression

Goal: Learn a mapping from observations (features) to continuous labels


given a training set (supervised learning)

Examples:
• Size, Location, Age → House price
• Audio features → Song year
• Processes, memory → Power consumption
• Historical financials → Future stock price
RECALL

Empirical risk minimization


f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• A popular approach in supervised ML


• Given a loss and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing
the empirical risk
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience

Question: how should we select our function class, F, and our loss function, ℓ?
COMPUTATIONAL CONSIDERATIONS

Empirical risk minimization


f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• Even when the class of functions, F, is simple (e.g., linear functions), the above
optimization problem might be non-convex and thus difficult to solve
• For example, this (non-convexity) is the case for the 0/1 loss
COMPUTATIONAL CONSIDERATIONS

Empirical risk minimization


f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• We will consider how to solve this objective when:


• the number of features (k) is large
• the number of observations (n) is large
• Techniques:
• parallel and distributed learning, large-scale optimization, efficient data
structures, dimensionality reduction, etc
• Will first focus on convex objectives (linear regression, logistic regression), and
then discuss non-convex objectives (deep neural networks) later in the course
Linear least squares regression

Example: Predicting house price from size, location, age

For each observation we have a feature vector, x, and label, y

x = [x1 x2 x3]

We assume a linear mapping between features and label:

y ≈ w0 + w1x1 + w2x2 + w3x3
Linear least squares regression

Example: Predicting house price from size, location, age

We can augment the feature vector to incorporate offset:

x = [1 x1 x2 x3]

We can then rewrite this linear mapping as a scalar product:

y ≈ ŷ = ∑ᵢ₌₀³ wᵢxᵢ = w⊤x
Why a linear mapping?

Simple

Often works well in practice

Can introduce complexity via feature extraction


1D example

Goal: find the line of best fit
• x coordinate: features
• y coordinate: labels

y ≈ ŷ = w0 + w1x
(w0: intercept / offset, w1: slope)

[Plot: data points in the (x, y) plane with the fitted line]
LINEAR REGRESSION VIA

Empirical risk minimization (with F = linear, actually affine, functions)

f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)

• A popular approach in supervised ML


• Given a loss and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing
the empirical risk
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience

Question: how should we select our function class, F, and our loss function, ℓ?
Evaluating predictions

Can measure ‘closeness’ between label and prediction


• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or ‘loss’ function?


• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)²
RECALL

Loss function: Regression


Square loss: ℓ(f(x), y) = (f(x) − y)²
Absolute loss: ℓ(f(x), y) = |f(x) − y|

Example (y = true price, f(x) = predicted price):
y = $300,000, f(x) = $300,000: square loss 0, absolute loss 0
y = $300,000, f(x) = $300,500: square loss 250,000, absolute loss 500
y = $300,000, f(x) = $400,000: square loss 1E+10, absolute loss 100,000
Evaluating predictions

Can measure ‘closeness’ between label and prediction


• House price: better to be incorrect by $50 than $50,000
• Song year prediction: better to be off by a year than by 20 years

What is an appropriate evaluation metric or ‘loss’ function?


• Absolute loss: |y − ŷ|
• Square loss: (y − ŷ)² ← has nice mathematical properties
LINEAR REGRESSION VIA

Empirical risk minimization (with F = linear, actually affine, functions and ℓ = square loss)

f̂ = arg min_{f∈F} (1/n) ∑ᵢ₌₁ⁿ ℓ(f(xᵢ), yᵢ)
• A popular approach in supervised ML
• Given a loss and data (x1, y1), … (xn, yn), we estimate a predictor f by minimizing
the empirical risk
• We typically restrict this predictor to lie in some class, F
• Could reflect our prior knowledge about the task
• Or may be for computational convenience

Question: how should we select our function class, F, and our loss function, ℓ?
How can we learn model (w)?

Assume we have n training points, where x⁽ⁱ⁾ denotes the ith point

Recall two earlier points:
• Linear assumption: ŷ = w⊤x
• We use square loss: (y − ŷ)²

Idea: Find w that minimizes square loss over training points:

min_w ∑ᵢ₌₁ⁿ (w⊤x⁽ⁱ⁾ − y⁽ⁱ⁾)²

Given n training points with k features, we define:
• X ∈ ℝⁿˣᵏ: matrix storing points
• y ∈ ℝⁿ: real-valued labels
• ŷ ∈ ℝⁿ: predicted labels, where ŷ = Xw
• w ∈ ℝᵏ: regression parameters / model to learn

Least Squares Regression: Learn mapping (w) from features to labels that minimizes residual sum of squares:

min_w ||Xw − y||₂²

Equivalently, min_w ∑ᵢ₌₁ⁿ (w⊤x⁽ⁱ⁾ − y⁽ⁱ⁾)², by definition of the Euclidean norm
Find solution by setting derivative to zero

1D case: f(w) = ||wx − y||₂² = ∑ᵢ₌₁ⁿ (wx⁽ⁱ⁾ − y⁽ⁱ⁾)²

(df/dw)(w) = 2 ∑ᵢ₌₁ⁿ x⁽ⁱ⁾(wx⁽ⁱ⁾ − y⁽ⁱ⁾) = 0  ⇒  w x⊤x − x⊤y = 0  ⇒  w = (x⊤x)⁻¹ x⊤y

Least Squares Regression: Learn mapping (w) from features to labels that minimizes residual sum of squares:

min_w ||Xw − y||₂²

Closed-form solution: w = (X⊤X)⁻¹X⊤y (if inverse exists)
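A minimal NumPy sketch of this closed-form solution (my own illustration, not from the slides); solving the normal equations with np.linalg.solve is preferred in practice over forming an explicit inverse.

import numpy as np

def least_squares(X, y):
    # Closed form: w = (X^T X)^{-1} X^T y, computed by solving X^T X w = X^T y
    return np.linalg.solve(X.T @ X, X.T @ y)

# Tiny synthetic check (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = least_squares(X, y)   # should be close to w_true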
Overfitting and generalization
We want good predictions on new data, i.e., ’generalization’

Least squares regression minimizes training error, and could overfit


• Simpler models are more likely to generalize (Occam’s razor)

Can we change the problem to penalize for model complexity?


• Intuitively, models with smaller weights are simpler

HOW TO PREVENT OVERFITTING?

Regularization
θ̂ = arg min_θ (1/n) ∑ᵢ₌₁ⁿ ℓ(f_θ(xᵢ), yᵢ) + λR(θ)

• Key idea: modify the ERM objective to penalize complex models
• Helps to prevent overfitting
• Note: we have re-parameterized our hypothesis, f, in terms of θ
• Larger λ:
  • more regularization, reduced likelihood of overfitting
  • more bias, less variance

Question: how to select our regularizer, R(θ)?
Given n training points with k features, we define:
• X ∈ ℝⁿˣᵏ: matrix storing points
• y ∈ ℝⁿ: real-valued labels
• ŷ ∈ ℝⁿ: predicted labels, where ŷ = Xw
• w ∈ ℝᵏ: regression parameters / model to learn

Ridge Regression: Learn mapping (w) that minimizes residual sum of squares (training error) along with a regularization term (model complexity):

min_w ||Xw − y||₂² + λ||w||₂²

λ is a free parameter that trades off between training error and model complexity

Closed-form solution: w = (X⊤X + λI_k)⁻¹X⊤y
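A corresponding ridge sketch (again mine, not the course code); lam plays the role of λ above, and lam = 0 recovers ordinary least squares.

import numpy as np

def ridge(X, y, lam):
    # Closed form: w = (X^T X + lam * I_k)^{-1} X^T y
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)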
EXAMPLE

Millionsong Regression Pipeline


training
set model
full
dataset
new entity

test set accuracy prediction

Obtain Raw Data

Split Data

Feature Extraction

Example Supervised Learning Pipeline Supervised Learning

Evaluation

Predict
Goal: Predict song's release year from audio features

Raw Data: Millionsong Dataset from UCI ML Repository
• Western, commercial tracks from 1980-2014
• 12 timbre averages (features) and release year (label)
Split Data: Train on the training set, evaluate with the test set
• Test set simulates unobserved data
• Test error tells us whether we've generalized well
Feature Extraction: Quadratic features
• Compute pairwise feature interactions
• Captures covariance of initial timbre features
• Leads to a non-linear model relative to raw features
Given 2-dimensional data, quadratic features are:

x = [x1 x2]  ⇒  Φ(x) = [x1²  x1x2  x2x1  x2²]
z = [z1 z2]  ⇒  Φ(z) = [z1²  z1z2  z2z1  z2²]

More succinctly:

Φ′(x) = [x1²  √2·x1x2  x2²]    Φ′(z) = [z1²  √2·z1z2  z2²]

Equivalent inner products:

Φ(x)⊤Φ(z) = x1²z1² + 2x1x2z1z2 + x2²z2² = Φ′(x)⊤Φ′(z)
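A short sketch (mine) of the succinct 2-D quadratic feature map, with the √2 scaling that makes the inner products match:

import numpy as np

def quad_features(x):
    # Map [x1, x2] -> [x1^2, sqrt(2)*x1*x2, x2^2]
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
# Inner product in feature space equals the squared inner product in input space
assert np.isclose(quad_features(x) @ quad_features(z), (x @ z) ** 2)

For k raw features (e.g., the 12 timbre averages), the full quadratic expansion np.outer(x, x).ravel() produces all k² pairwise products.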
Supervised Learning: Least Squares Regression
• Learn a mapping from entities to continuous labels given a training set
• Audio features → Song year
Ridge Regression: Learn mapping (w) that minimizes residual sum of squares (training error) along with a regularization term (model complexity):

min_w ||Xw − y||₂² + λ||w||₂²

λ is a free parameter that trades off between training error and model complexity

How do we choose a good value for this free parameter?
• Most methods have free parameters / 'hyperparameters' to tune
First thought: Search over multiple values, evaluate each on the test set
• But the goal of the test set is to simulate unobserved data
• We may overfit if we use it to choose hyperparameters
Second thought: Create another hold-out dataset (a validation set) for this search
[Pipeline diagram: the full dataset is now split into training, validation, and test sets]

Evaluation (Part 1): Hyperparameter tuning
• Training set: train various models
• Validation set: evaluate various models (e.g., grid search)
• Test set: evaluate final model's accuracy
[Figure: grid of candidate values over Hyperparameter 1 × Hyperparameter 2, e.g., regularization parameter λ ∈ {10⁻⁸, 10⁻⁶, 10⁻⁴, 10⁻², 1}]

We'll cover advanced hyperparameter tuning methods later in the course

Grid Search: Exhaustively search through hyperparameter space
• Define and discretize the search space (linear or log scale)
• Evaluate points via validation error
Evaluating predictions

How can we compare labels and predictions for n validation points?

Least squares optimization involves the squared loss, (y − ŷ)², so it seems reasonable to use mean squared error (MSE):

MSE = (1/n) ∑ᵢ₌₁ⁿ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²

But MSE's unit of measurement is the square of the quantity being measured, e.g., "squared years" for song year prediction
More natural to use root-mean-squared error (RMSE), i.e., √MSE
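A self-contained sketch (mine; the data is synthetic, not the Millionsong set) that scores a grid of λ values by validation RMSE using a ridge helper like the one sketched earlier:

import numpy as np

def ridge(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def rmse(y, y_hat):
    return np.sqrt(np.mean((y_hat - y) ** 2))

# Synthetic train/validation split (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = X @ rng.normal(size=12) + 0.1 * rng.normal(size=200)
X_train, y_train, X_val, y_val = X[:150], y[:150], X[150:], y[150:]

# Grid search over the regularization parameter, scored by validation RMSE
grid = [1e-8, 1e-6, 1e-4, 1e-2, 1.0]
val_rmse = {lam: rmse(y_val, X_val @ ridge(X_train, y_train, lam)) for lam in grid}
best_lam = min(val_rmse, key=val_rmse.get)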
Evaluation (Part 2): Evaluate final model
• Training set: train various models
• Validation set: evaluate various models
• Test set: evaluate final model's accuracy
Predict: The final model can then be used to make predictions on future observations, e.g., new songs
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
RECALL

Least Squares Regression

Least Squares Regression: Learn mapping (w) from features to labels that minimizes residual sum of squares:

min_w ||Xw − y||₂²

Closed-form solution: w = (X⊤X)⁻¹X⊤y (if inverse exists)

How do we solve this computationally?
• Note: we will discuss least squares regression, as the computational profile is similar for ridge regression
LEAST SQUARES REGRESSION

Computing closed form solution


w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations

Consider the number of arithmetic operations (+, −, ×, /)

Computational bottlenecks:
• Matrix multiply X⊤X: O(nk²) operations
• Matrix inverse: O(k³) operations

Other methods (Cholesky, QR, SVD) have the same complexity


LEAST SQUARES REGRESSION

Storage requirements
w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

Consider storing values as floats (8 bytes)

Storage bottlenecks:
• X⊤X and its inverse: O(k²) floats
• X: O(nk) floats
LEAST SQUARES REGRESSION

Large n and small k


w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

Assume O(k³) computation and O(k²) storage are feasible on a single machine

Storing X and computing X⊤X are the bottlenecks
Can distribute storage and computation!
• Store data points (rows of X) across machines
• Compute X⊤X as a sum of outer products
Matrix multiplication via inner products

Each entry of the output matrix is the result of an inner product between a row of the first input matrix and a column of the second:

[9 3 5]   [1 2]   [28 48]
[4 1 2] × [3 5] = [11 19]
          [2 3]

e.g., 9·1 + 3·3 + 5·2 = 28

Matrix multiplication via outer products

The output matrix is the sum of outer products between corresponding columns of the first input matrix and rows of the second:

[9 3 5]   [1 2]   [9 18]   [9 15]   [10 15]   [28 48]
[4 1 2] × [3 5] = [4  8] + [3  5] + [ 4  6] = [11 19]
          [2 3]
X⊤X = ∑ᵢ₌₁ⁿ x⁽ⁱ⁾ x⁽ⁱ⁾⊤  (a sum of outer products of the rows of X)

Example: n = 6; 3 workers
• workers: store x⁽¹⁾, …, x⁽⁶⁾ across machines — O(nk) distributed storage
• map: compute each outer product x⁽ⁱ⁾x⁽ⁱ⁾⊤ — O(nk²) distributed computation, O(k²) local storage
• reduce: sum the outer products and invert, (∑ᵢ x⁽ⁱ⁾x⁽ⁱ⁾⊤)⁻¹ — O(k³) local computation, O(k²) local storage
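A sketch of this map/reduce pattern in PySpark (my own toy illustration with made-up data; HW2 and MLlib organize this differently):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[3]").appName("dist-ols").getOrCreate()
sc = spark.sparkContext

# Toy dataset: n = 6 points (x, y) with k = 3 features, partitioned across 3 workers
rng = np.random.default_rng(0)
points = [(rng.normal(size=3), float(rng.normal())) for _ in range(6)]
rdd = sc.parallelize(points, 3).cache()

# map: each point contributes a k x k outer product and a k-vector
# reduce: sum the contributions (only O(k^2) data crosses the network)
XtX = rdd.map(lambda p: np.outer(p[0], p[0])).reduce(lambda a, b: a + b)
Xty = rdd.map(lambda p: p[0] * p[1]).reduce(lambda a, b: a + b)

# The O(k^3) solve happens locally on the driver
w = np.linalg.solve(XtX, Xty)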
LEAST SQUARES REGRESSION

Large n and large k


w = (X⊤X)⁻¹X⊤y

Computation: O(nk² + k³) operations
Storage: O(nk + k²) floats

As before, storing X and computing X⊤X are bottlenecks

Now, storing and operating on X⊤X is also a bottleneck
• Can't easily distribute!
1st Rule of thumb


Computation and storage should be at most linear (in n, k)
Large n and large k
We need methods that are linear in time and space

One idea: Exploit sparsity or reduce dimension


• Explicit sparsity can provide orders of magnitude storage and computational gains

Sparse data is prevalent:
• Text processing: bag-of-words, n-grams
• Collaborative filtering: ratings matrix
• Graphs: adjacency matrix
• Categorical features: one-hot encoding

Example sparse representation:
dense:  [1. 0. 0. 0. 0. 0. 3.]
sparse: size: 7, indices: [0, 6], values: [1., 3.]
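A small sketch (mine) of the dense vs. sparse representation above using SciPy; Spark's pyspark.ml.linalg.Vectors.sparse takes the same (size, indices, values) form.

import numpy as np
from scipy.sparse import csr_matrix

dense = np.array([1., 0., 0., 0., 0., 0., 3.])
sparse = csr_matrix(dense)                         # stores only the non-zeros
print(sparse.shape, sparse.indices, sparse.data)   # (1, 7) [0 6] [1. 3.]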
• Latent sparsity assumption can be used to reduce dimension, e.g., PCA,
low-rank approximation (we’ll revisit this soon…)

[Diagram: an n×k matrix approximated by the product of an n×r and an r×k matrix ('low-rank')]
Another idea: Use different algorithms


• Gradient descent is an iterative algorithm that requires
O(nk) computation and O(k) local storage per iteration
Closed form solution for large n and large k

Example: n = 6; 3 workers
• workers: store the data points — O(nk) distributed storage
• map: compute each outer product x⁽ⁱ⁾x⁽ⁱ⁾⊤ — O(nk²) distributed computation, O(k²) local storage
• reduce: sum the outer products and invert — O(k³) local computation, O(k²) local storage
Gradient descent for large n and large k

Example: n = 6; 3 workers
• workers: store the data points — O(nk) distributed storage
• map: ? — goal: O(nk) distributed computation, O(k) local storage (vs. O(nk²) and O(k²) above)
• reduce: ? — goal: O(k) local computation, O(k) local storage (vs. O(k³) and O(k²) above)
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
Linear Regression Optimization

Goal: Find w that minimizes f(w) = ||Xw − y||₂²
• Closed-form solution exists
• Gradient descent is iterative (intuition: go downhill!)

[Figure: 1D objective f(w) with minimizer w*]

Scalar objective: f(w) = ||wx − y||₂² = ∑ⱼ₌₁ⁿ (wx⁽ʲ⁾ − y⁽ʲ⁾)²
Gradient descent

Start at a random point
Repeat:
  Determine a descent direction
  Choose a step size
  Update
Until stopping criterion is satisfied

[Figure: f(w) with iterates w0, w1, w2, … moving toward the minimizer w*]
Where will we converge?
[Figure: a convex objective f(w) vs. a non-convex objective g(w)]
• Convex: any local minimum is a global minimum
• Non-convex: multiple local minima may exist
Least Squares, Ridge Regression, and Logistic Regression are all convex!
(Neural networks are non-convex)

Choosing a descent direction (1D)
[Figure: 1D objective; if the slope at w0 is positive, go left; if negative, go right; if zero, done]

We can only move in two directions
Negative slope is the direction of descent!

Update Rule: wᵢ₊₁ = wᵢ − αᵢ (df/dw)(wᵢ), where αᵢ is the step size
Choosing a descent direction

2D Example:
• Function values are in black/white
and black represents higher values
• Arrows are gradients

"Gradient2" by Sarang. Licensed under CC BY-SA 2.5 via Wikimedia Commons


http://commons.wikimedia.org/wiki/File:Gradient2.svg#/media/File:Gradient2.svg

We can move anywhere in ℝᵏ
The negative gradient is the direction of steepest descent!

Update Rule: wᵢ₊₁ = wᵢ − αᵢ ∇f(wᵢ), where αᵢ is the step size
Gradient descent for least squares
Update Rule: wᵢ₊₁ = wᵢ − αᵢ (df/dw)(wᵢ)

Scalar objective: f(w) = ||wx − y||₂² = ∑ⱼ₌₁ⁿ (wx⁽ʲ⁾ − y⁽ʲ⁾)²

Derivative (chain rule): (df/dw)(w) = ∑ⱼ₌₁ⁿ 2(wx⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾

Scalar Update (2 absorbed into α): wᵢ₊₁ = wᵢ − αᵢ ∑ⱼ₌₁ⁿ (wᵢx⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾

Vector Update: wᵢ₊₁ = wᵢ − αᵢ ∑ⱼ₌₁ⁿ (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾
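A NumPy sketch of these updates (mine, not the HW2 code); the decreasing step-size schedule α/(n√i) follows the common choice discussed below.

import numpy as np

def gd_least_squares(X, y, alpha=1.0, num_iters=100):
    # Batch gradient descent for min_w ||Xw - y||_2^2
    n, k = X.shape
    w = np.zeros(k)
    for i in range(1, num_iters + 1):
        grad = X.T @ (X @ w - y)                     # sum_j (w.x^(j) - y^(j)) x^(j)
        w = w - (alpha / (n * np.sqrt(i))) * grad    # decreasing step size
    return w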
Choosing a step size
[Figure: three 1D objectives illustrating step-size choices]
• Too small: converge very slowly
• Too big: overshoot, can diverge
• Better: reduce the step size over time

Theoretical convergence results exist for various step sizes

A common step size is αᵢ = α / (n√i), where α is a constant, i is the iteration number, and n is the number of training points
Parallel gradient descent for least squares
Vector Update: wᵢ₊₁ = wᵢ − αᵢ ∑ⱼ₌₁ⁿ (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾

Compute the summands in parallel! (note: workers must all have wᵢ)

Example: n = 6; 3 workers
• workers: store the data points — O(nk) distributed storage
• map: compute each summand (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾ — O(nk) distributed computation, O(k) local storage
• reduce: sum the summands into the gradient and update wᵢ₊₁ — O(k) local computation, O(k) local storage
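A PySpark sketch of the parallel update (my own toy illustration; in practice one would broadcast w and tune α and the iteration count):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[3]").appName("dist-gd").getOrCreate()
sc = spark.sparkContext

rng = np.random.default_rng(0)
data = [(rng.normal(size=3), float(rng.normal())) for _ in range(6)]
rdd = sc.parallelize(data, 3).cache()   # persist training data across iterations

n, k = rdd.count(), 3
w = np.zeros(k)
for i in range(1, 51):
    # map: each worker computes its local summands (w.x^(j) - y^(j)) x^(j)
    # reduce: sum them into the full gradient; only O(k) data returns to the driver
    grad = rdd.map(lambda p: (w @ p[0] - p[1]) * p[0]).reduce(lambda a, b: a + b)
    w = w - (1.0 / (n * np.sqrt(i))) * grad   # updated w reaches workers via the closure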
Gradient descent summary

Pros:
• Easily parallelized
• Cheap at each iteration
• Stochastic variants can make things even cheaper

Cons:
• Slow convergence (especially compared with closed-form)
• Requires communication across nodes!
DISTRIBUTED MACHINE LEARNING

Communication Principles
RECALL

Communication hierarchy

Access rates fall sharply with distance
• >50× gap between memory and network!
• CPU ↔ RAM: 50 GB/s; RAM ↔ local disks: 1 GB/s; within a rack: 1 GB/s; across racks: 0.3 GB/s

Be mindful of this hierarchy when developing parallel/distributed algorithms!


RECALL

Communication hierarchy
Access rates fall sharply with distance
• Parallelism makes computation fast
• Network makes communication slow

Be mindful of this hierarchy when developing parallel/distributed algorithms!


2nd Rule of thumb
Perform parallel and in-memory computation

Persisting in memory reduces communication


• Especially for iterative computation (gradient descent)
Scale-up (powerful multicore machine)
✓ No network communication
✗ Expensive hardware, eventually hit a wall

[Diagram: a single machine with CPU, RAM, and disk]
Scale-out (distributed, e.g., cloud-based)
✓ Commodity hardware, scales to massive problems
✗ Need to deal with network communication

[Diagram: many machines, each with CPU, RAM, and disk, connected by a network]

Persist training data in memory across iterations (rather than re-reading from disk), especially for iterative computation such as gradient descent.
3rd Rule of thumb
Minimize network communication

Q: How should we leverage distributed computing while


mitigating network communication?

First Observation: We need to store and potentially communicate


Data, Model, and Intermediate objects
• A: Keep large objects local
3rd Rule of thumb
Minimize network communication — stay local

Example: Linear regression, big n and small k


• Solve via closed form (not iterative!)
• Communicate O(k²) intermediate data
• Compute locally on data (Data Parallel)

[Map/reduce diagram: workers compute local outer products x⁽ⁱ⁾x⁽ⁱ⁾⊤; the reduce step sums them and inverts]
3rd Rule of thumb
Minimize network communication — stay local

Example: Linear regression, big n and big k


• Gradient descent, communicate wi
• O(k) communication OK for fairly large k
• Compute locally on data (Data Parallel)

[Map/reduce diagram: workers compute local summands (wᵢ⊤x⁽ʲ⁾ − y⁽ʲ⁾)x⁽ʲ⁾; the reduce step sums them to update wᵢ₊₁]
3rd Rule of thumb
Minimize network communication — stay local

Example: Hyperparameter tuning for ridge regression with


small n and small k
• Data is small, so can communicate it
• ‘Model’ is collection of regression models
corresponding to different hyperparameters
• Train each model locally (Model Parallel)
3rd Rule of thumb
Minimize network communication — stay local

Example: Linear regression, big n and huge k


• Gradient descent
• O(k) communication slow with hundreds of millions
parameters
• Distribute data and model (Data and Model Parallel)
• Often rely on sparsity to reduce communication
3rd Rule of thumb
Minimize network communication

Q: How should we leverage distributed computing while mitigating


network communication?

First Observation: We need to store and potentially communicate Data,


Model and Intermediate objects
• A: Keep large objects local

Second Observation: ML methods are typically iterative


• A: Reduce # iterations
3rd Rule of thumb
Minimize network communication — reduce iterations

Distributed iterative algorithms must compute and communicate


• In Bulk Synchronous Parallel (BSP) systems, e.g., Apache Spark,
we strictly alternate between the two
Distributed Computing Properties
• Parallelism makes computation fast
• Network makes communication slow
Idea: Design algorithms that compute more, communicate less
• Do more computation at each iteration
• Reduce total number of iterations
3rd Rule of thumb
Minimize network communication — reduce iterations

Extreme: Divide-and-conquer
• Fully process each partition locally, communicate final result
• Single iteration; minimal communication
• Approximate results
3rd Rule of thumb
Minimize network communication — reduce iterations

Less extreme: Local-updating methods


• Do more work locally than gradient descent before communicating
• Will discuss more later in the course …
3rd Rule of thumb
Minimize network communication — reduce iterations

Throughput: How many bytes per second can be read


Latency: Cost to send message (independent of size)

Latency:
• Memory: 1e-4 ms
• Hard disk: 10 ms
• Network (same datacenter): 0.25 ms
• Network (US to Europe): >5 ms

We can amortize latency!
• Send larger messages
• Batch their communication
• E.g., train multiple models together
1st Rule of thumb
Computation and storage should be linear (in n, k)

2nd Rule of thumb


Perform parallel and in-memory computation

3rd Rule of thumb


Minimize network communication
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
RECALL

Toy example: Shoe sizes

Shoes are often labeled in terms of their European and American shoe sizes
• We expect sizes to be correlated (but maybe not perfectly, due to some errors/noise)

How can we find a simpler, more compact representation of this data?
• Pick a direction & project into one dimension

[Scatter plot: European size vs. American size]
Toy example: Shoe sizes

[Same scatter plot, with a bad direction to project onto highlighted]
Toy example: Shoe sizes

[Same scatter plot, with a better direction to project onto highlighted]
How to find the ‘best’ direction?

Goal: Minimize reconstruction error
• i.e., find the direction that minimizes the Euclidean distance between the original points and their projections

This is the key idea behind PCA

[Scatter plot: European size vs. American size, with points projected onto the chosen direction]
Linear Regression: predict y from x. Evaluate the accuracy of predictions (represented by the blue line) by vertical distances between the points and the line.

PCA: reconstruct 2D data via 2D data with a single degree of freedom. Evaluate reconstructions (represented by the blue line) by Euclidean distances.

[Side-by-side plots: (x, y) with a regression line vs. (American size, European size) with a PCA direction]
Computing PCA solution

Given: n × k matrix of uncentered raw data
Goal: Compute an r ≪ k dimensional representation

Step 1: Center data
Step 2: Compute the covariance or scatter matrix: C_X = (1/n) X⊤X versus X⊤X
Step 3: Eigendecomposition
Step 4: Compute PCA scores: Z = XP
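A local NumPy sketch of these four steps (mine, for the small-k case):

import numpy as np

def pca_scores(X_raw, r):
    X = X_raw - X_raw.mean(axis=0)          # Step 1: center data
    C = (X.T @ X) / X.shape[0]              # Step 2: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)    # Step 3: eigendecomposition (ascending order)
    P = eigvecs[:, ::-1][:, :r]             # keep the top-r principal components
    return X @ P                            # Step 4: PCA scores Z = XP

Z = pca_scores(np.random.randn(100, 5), r=2)   # illustrative call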
PCA at scale

Case 1: Big n and small k
• O(k²) local storage, O(k³) local computation, O(kr) communication
• Similar strategy as closed-form linear regression

Case 2: Big n and big k
• O(k) local storage and computation on workers, O(kr) communication
• Iterative algorithm


Step 0: Data-parallel storage

Example: n = 6; 3 workers
• workers: store x⁽¹⁾, …, x⁽⁶⁾ — O(nk) distributed storage
Step 1: Center data
• Compute the k feature means, m = (1/n) ∑ᵢ x⁽ⁱ⁾ ∈ ℝᵏ, via a reduce step — O(k) local storage, O(k) local computation, O(k) communication
• Communicate m to all workers
• Subtract m from each data point via a map step, x⁽ⁱ⁾ − m — O(k) local computation


Step 2: Compute the scatter matrix, C_X = (1/n) X⊤X
• Compute the matrix product via a sum of outer products (just like we did for closed-form linear regression!)

[Same 2×3 by 3×2 outer-product example as shown earlier]
X⊤X = ∑ᵢ₌₁ⁿ x⁽ⁱ⁾ x⁽ⁱ⁾⊤

Example: n = 6; 3 workers
• workers: store the (centered) data — O(nk) distributed storage
• map: compute each outer product x⁽ⁱ⁾x⁽ⁱ⁾⊤ — O(k²) local storage, O(nk²) distributed computation
• reduce: sum the outer products — O(k²) local storage, O(k²) local computation


Step 3: Eigendecomposition
• Perform locally (eigh) since k is small — O(k²) local storage, O(k³) local computation
• Communicate the r principal components (P ∈ ℝᵏˣʳ) to the workers — O(kr) communication


Step 4: Compute PCA scores
• Multiply each point by the principal components, P, via a map step — O(kr) local computation
An iterative approach

We can use algorithms that rely on a sequence of matrix-vector products to compute the top r eigenvectors (P)
• E.g., Krylov subspace or random projection methods

Krylov subspace methods (used in MLlib) iteratively compute X⊤Xv for some v ∈ ℝᵏ provided by the method
• Requires O(r) passes over the data, O(k) local storage on workers
• We don't need to compute the covariance matrix!

Repeat for O(r) iterations:
1. Communicate vᵢ ∈ ℝᵏ to all workers
2. Compute qᵢ = X⊤Xvᵢ in a distributed fashion
3. Driver uses qᵢ to update its estimate of P
Repeat for O(r) iterations:
1. Communicate vᵢ ∈ ℝᵏ to all workers
2. Compute qᵢ = X⊤Xvᵢ in a distributed fashion
   • Step 1: bᵢ = Xvᵢ
   • Step 2: qᵢ = X⊤bᵢ
   • Perform in a single map-reduce!
3. Driver uses qᵢ to update its estimate of P

• bᵢⱼ = vᵢ⊤x⁽ʲ⁾: each component is a dot product
• qᵢ = X⊤bᵢ is a sum of rescaled data points, i.e., qᵢ = ∑ⱼ₌₁ⁿ bᵢⱼ x⁽ʲ⁾
Compute qᵢ = X⊤Xvᵢ in a distributed fashion
• bᵢⱼ = vᵢ⊤x⁽ʲ⁾ and qᵢ = ∑ⱼ₌₁ⁿ bᵢⱼ x⁽ʲ⁾
• Locally compute each dot product and rescale each point, then sum all rescaled points in the reduce step!

Example: n = 6; 3 workers
• workers: store the data — O(nk) distributed storage
• map: compute bᵢⱼ · x⁽ʲ⁾ for each local point — O(k) local storage, O(nk) distributed computation
• reduce: sum the rescaled points into qᵢ — O(k) local storage, O(k) local computation, O(k) communication
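A PySpark sketch of this single map-reduce, wrapped in a simple power-iteration loop for the top component (my own illustration; MLlib's Krylov-based solver is more sophisticated):

import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[3]").appName("dist-pca").getOrCreate()
sc = spark.sparkContext

rng = np.random.default_rng(0)
rows = [rng.normal(size=4) for _ in range(6)]   # centered data points x^(j)
rdd = sc.parallelize(rows, 3).cache()

def xtx_times(v):
    # map: b_j = v.x^(j), then rescale the point; reduce: q = sum_j b_j x^(j)
    return rdd.map(lambda x: (v @ x) * x).reduce(lambda a, b: a + b)

v = rng.normal(size=4)
for _ in range(20):                 # one distributed map-reduce per iteration
    q = xtx_times(v)
    v = q / np.linalg.norm(q)       # v converges to the top principal component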
Outline

1. Background: linear regression & big O notation


2. Distributed linear regression
3. Gradient descent
4. Distributed PCA
5. HW2 preview
[HW2 pipeline: Obtain Raw Data → Split Data → Feature Extraction → Supervised Learning → Evaluation → Predict]

Goal: Predict song's release year from audio features
Raw Data: Millionsong Dataset from UCI ML Repository
• Explore features
• Shift labels so that they start at 0 (for interpretability)
• Visualize data
Split Data: Create training, validation, and test sets
Feature Extraction:
• Initially use raw features
• Subsequently compare with quadratic features
Supervised Learning: Least Squares Regression
• First implement gradient descent from scratch
• Then use the MLlib implementation
• Visualize performance by iteration
Evaluation (Part 1): Hyperparameter tuning
• Use grid search to find good values for the regularization and step size hyperparameters
• Evaluate using RMSE
• Visualize the grid search
Evaluation (Part 2): Evaluate final model
• Evaluate using RMSE
• Compare to a baseline model that returns the average song year in the training data
Predict: The final model could be used to predict the song year for new songs (we won't do this though)
