Map-Reduce for Machine Learning on Multicore
Abebe Zerihun
Introduction
• Multicore Era: Increasing number of processing cores per chip, but a lack of effective programming frameworks for machine learning on multicore.
• Goal: Develop a general, parallel programming
method to leverage multicore architectures for a
wide range of machine learning algorithms.
• Challenge: Traditional approaches focus on algorithm-specific optimizations, lacking scalability and generality across algorithms.
Map-Reduce
Contribution
• General Framework:
– Algorithms fitting the Statistical Query Model are
expressed in a summation form, enabling easy
parallelization.
– This method ensures exact, non-approximated
implementations of machine learning algorithms.
• Example: instead of directly accessing the dataset $\{(x_i, y_i)\}_{i=1}^{n}$, an algorithm may query for a statistic of the form
  $\frac{1}{n}\sum_{i=1}^{n} f(x_i, y_i)$,
  where $f(x, y)$ is a function of the data.
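As a minimal Python sketch (my own illustration, not from the paper), such a query reduces to partial sums over data chunks; the function f, the chunk count, and the synthetic data are arbitrary choices.

import numpy as np

def f(x, y):
    # Hypothetical query function f(x, y); here the per-example statistic x * y.
    return x * y

def statistical_query(X, Y, num_chunks=4):
    # Estimate (1/n) * sum_i f(x_i, y_i) via per-chunk partial sums.
    partial_sums = [
        sum(f(x, y) for x, y in zip(Xc, Yc))           # partial sum on one chunk
        for Xc, Yc in zip(np.array_split(X, num_chunks),
                          np.array_split(Y, num_chunks))
    ]
    return sum(partial_sums) / len(X)                  # aggregate and normalize

rng = np.random.default_rng(0)
X, Y = rng.normal(size=(500, 3)), rng.integers(0, 2, size=500)
print(statistical_query(X, Y))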
Cont..
• Integration with Map-Reduce:
– Adapts Google's Map-Reduce paradigm for multicore
systems, simplifying programming and achieving linear
speedups.
• Broad Applicability:
– Demonstrates parallelization for diverse algorithms: Logistic Regression, SVM, Neural Networks, PCA, ICA, k-means, EM, Naive Bayes, etc.
Workflow
• Summation Form:
– Reformulates computations as summations over data
subsets, ideal for parallel execution.
• Map-Reduce Framework:
  – Map Phase: Data is split among cores, and partial computations are performed.
  – Reduce Phase: Partial results are aggregated, and the final computation is completed.
• Parallel Gradient Descent:
  – Key algorithms such as Logistic Regression and Neural Networks use batch gradient descent for efficient parallel optimization (see the sketch below).
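A minimal sketch of this workflow (illustrative only, not the authors' code): each mapper emits a partial statistic for its data split, and the reducer aggregates them. Here the statistic is a (count, sum) pair used to compute a mean, and the names map_phase/reduce_phase are my own.

import numpy as np

def map_phase(chunk):
    # Map: compute a partial statistic on one data split (count and sum).
    return len(chunk), chunk.sum(axis=0)

def reduce_phase(partials):
    # Reduce: aggregate the partial statistics and finish the computation (the mean).
    total_count = sum(count for count, _ in partials)
    total_sum = sum(s for _, s in partials)
    return total_sum / total_count

data = np.arange(12.0).reshape(6, 2)        # toy dataset: 6 examples, 2 features
splits = np.array_split(data, 3)            # data is split among 3 "cores"
partials = [map_phase(c) for c in splits]   # map phase (independent per core)
print(reduce_phase(partials))               # reduce phase -> [5. 6.]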
Algorithms Demonstrated
• Locally Weighted Linear Regression (LWLR): parallelizes matrix multiplications and statistics aggregation.
• Logistic Regression: computes gradients and the Hessian in parallel for optimization.
• Neural Networks (Backpropagation): parallelizes error backpropagation and gradient computations for weight updates.
• Principal Component Analysis (PCA): parallelizes covariance matrix and mean computations.
• k-means Clustering: parallelizes distance computations to the centroids and the recomputation of centroids (see the sketch below).
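As an example of the pattern on one of these algorithms, here is a hedged sketch of a single parallel k-means iteration (illustrative; the splitting, variable names, and data are my own assumptions): each core computes distances for its split and emits per-cluster sums and counts, and the reduce step recomputes the centroids.

import numpy as np

def kmeans_map(chunk, centroids):
    # Map: assign each point in this split to its nearest centroid and
    # emit per-cluster (sum of points, count) partial statistics.
    d = np.linalg.norm(chunk[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    k = len(centroids)
    sums = np.array([chunk[labels == j].sum(axis=0) for j in range(k)])
    counts = np.array([(labels == j).sum() for j in range(k)])
    return sums, counts

def kmeans_reduce(partials, old_centroids):
    # Reduce: aggregate partial sums/counts and recompute the centroids.
    sums = sum(s for s, _ in partials)
    counts = sum(c for _, c in partials)
    new = old_centroids.copy()
    nonempty = counts > 0
    new[nonempty] = sums[nonempty] / counts[nonempty, None]
    return new

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
centroids = X[:3].copy()                    # 3 clusters, initialized from the data
for _ in range(10):                         # one map/reduce pair per iteration
    partials = [kmeans_map(c, centroids) for c in np.array_split(X, 4)]
    centroids = kmeans_reduce(partials, centroids)
print(centroids)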
Logistic Regression
• In logistic regression, we aim to optimize the parameters θ of the hypothesis:
  $h_\theta(x) = \frac{1}{1 + e^{-\theta^{T} x}}$
  – to predict probabilities for binary classification.
• The negative log-likelihood gives the log-loss (cost function):
  $J(\theta) = -\sum_{i=1}^{m}\left[y_i \log h_\theta(x_i) + (1 - y_i)\log\big(1 - h_\theta(x_i)\big)\right]$
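A quick sketch of these two formulas in code (assuming the standard sigmoid hypothesis and the unnormalized log-loss written above):

import numpy as np

def hypothesis(theta, X):
    # h_theta(x) = 1 / (1 + exp(-theta^T x)), evaluated for every row of X
    return 1.0 / (1.0 + np.exp(-X @ theta))

def log_loss(theta, X, y):
    # J(theta) = -sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    h = hypothesis(theta, X)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([1.0, 0.0])
print(log_loss(np.zeros(2), X, y))   # 2 * log(2), since h = 0.5 when theta = 0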
Gradient Computation
• The gradient of J(θ) with respect to θ:
  $\nabla_\theta J(\theta) = \sum_{i=1}^{m}\big(h_\theta(x_i) - y_i\big)\, x_i$
• This is already in summation form, since it is a sum over all data points.
• Each term can be computed independently for each data point $(x_i, y_i)$.
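A one-line vectorized form of this summation (a sketch building on the hypothesis above); because each term depends only on one example, the sum can be split across cores:

import numpy as np

def gradient(theta, X, y):
    # sum_i (h_theta(x_i) - y_i) * x_i, computed for all rows at once as X^T (h - y)
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return X.T @ (h - y)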
Parallelizing Logistic Regression
• Step 1: Partition the data
  – Divide the dataset into P disjoint subsets $D_p$ (one for each processor).
  – Each subset contains $m_p$ examples, with $\sum_{p=1}^{P} m_p = m$.
• Step 2: Compute Partial Gradients Locally
  – On each processor p, compute the partial gradient over its assigned subset:
    $g_p = \sum_{(x_i, y_i) \in D_p}\big(h_\theta(x_i) - y_i\big)\, x_i$
Cont..
• Step 3: Aggregate Gradients
  – Once all partial gradients are computed, sum them to obtain the full gradient:
    $\nabla_\theta J(\theta) = \sum_{p=1}^{P} g_p$
• Step 4: Update Parameters
  – Update θ using gradient descent:
    $\theta := \theta - \alpha\, \nabla_\theta J(\theta)$
  – where α is the learning rate (a runnable sketch of these steps follows).
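A runnable sketch of Steps 1–4, using Python's multiprocessing pool to stand in for the P cores (illustrative only; the data, step size, and iteration count are arbitrary choices):

from multiprocessing import Pool
import numpy as np

def partial_gradient(args):
    # Step 2 (map): g_p = sum over D_p of (h_theta(x_i) - y_i) x_i
    X_p, y_p, theta = args
    h = 1.0 / (1.0 + np.exp(-np.clip(X_p @ theta, -30, 30)))  # clipped sigmoid
    return X_p.T @ (h - y_p)

def parallel_gradient_descent(X, y, P=5, alpha=0.005, iters=100):
    theta = np.zeros(X.shape[1])
    # Step 1: partition the data into P disjoint subsets D_1, ..., D_P
    X_parts, y_parts = np.array_split(X, P), np.array_split(y, P)
    with Pool(P) as pool:
        for _ in range(iters):
            jobs = [(Xp, yp, theta) for Xp, yp in zip(X_parts, y_parts)]
            partials = pool.map(partial_gradient, jobs)  # Step 2 on each core
            grad = np.sum(partials, axis=0)              # Step 3: aggregate (reduce)
            theta = theta - alpha * grad                 # Step 4: gradient update
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 3))
    y = (X @ np.array([1.5, -2.0, 0.5]) > 0).astype(float)
    print(parallel_gradient_descent(X, y))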
Example
• Suppose there are 500 training samples, partitioned across 5 cores.
• Let $D_1, D_2, \ldots, D_5$ be the data subsets for each core, where each subset contains 100 samples:
• Core 1: $D_1 = \{(x_1, y_1), \ldots, (x_{100}, y_{100})\}$
• Core 2: $D_2 = \{(x_{101}, y_{101}), \ldots, (x_{200}, y_{200})\}$
• Core 3: $D_3 = \{(x_{201}, y_{201}), \ldots, (x_{300}, y_{300})\}$
• Core 4: $D_4 = \{(x_{301}, y_{301}), \ldots, (x_{400}, y_{400})\}$
• Core 5: $D_5 = \{(x_{401}, y_{401}), \ldots, (x_{500}, y_{500})\}$
• Each core computes the partial gradient for its assigned data subset:
• Core 1: $g_1 = \sum_{i=1}^{100}\big(h_\theta(x_i) - y_i\big)\, x_i$
• Core 2: $g_2 = \sum_{i=101}^{200}\big(h_\theta(x_i) - y_i\big)\, x_i$
• Core 3: $g_3 = \sum_{i=201}^{300}\big(h_\theta(x_i) - y_i\big)\, x_i$
• Core 4: $g_4 = \sum_{i=301}^{400}\big(h_\theta(x_i) - y_i\big)\, x_i$
• Core 5: $g_5 = \sum_{i=401}^{500}\big(h_\theta(x_i) - y_i\big)\, x_i$
Aggregating Gradients
• After all cores compute their partial gradients, they are aggregated to form the total gradient:
  $\nabla_\theta J(\theta) = g_1 + g_2 + g_3 + g_4 + g_5$
• Update:
  $\theta := \theta - \alpha\, \nabla_\theta J(\theta)$
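A quick numeric check of this example (with synthetic data, since the slides do not specify any): summing the five per-core gradients gives exactly the gradient computed over all 500 samples.

import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))                  # 500 synthetic training samples
y = rng.integers(0, 2, size=500).astype(float)
theta = rng.normal(size=3)

def grad(Xs, ys):
    h = 1.0 / (1.0 + np.exp(-Xs @ theta))
    return Xs.T @ (h - ys)

# Five per-core partial gradients g_1 ... g_5 over 100-sample subsets
partials = [grad(X[i:i + 100], y[i:i + 100]) for i in range(0, 500, 100)]
print(np.allclose(sum(partials), grad(X, y)))  # True: the aggregation is exact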
Time Complexity
• Logistic Regression (per iteration)
• Single machine / core:
  – $O(mn^2 + n^3)$
• Multicore (P cores):
  – $O\!\big(\tfrac{mn^2}{P} + n^3 + n^2 \log P\big)$
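For intuition, plugging illustrative values (my own, not from the slides) $m = 10^6$, $n = 100$, $P = 16$ into the costs above:

\[
mn^2 + n^3 \approx 10^{10} + 10^{6} \approx 10^{10},
\qquad
\frac{mn^2}{P} + n^3 + n^2 \log_2 P \approx 6.3\times 10^{8} + 10^{6} + 4\times 10^{4},
\]

so the dominant $mn^2$ term is reduced by roughly a factor of P, while the reduce cost $n^2 \log P$ stays negligible.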
Q and A ?