Week-8 - Lecture Notes
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Content of this Lecture:
In this lecture, we will discuss 'Parameter Servers' and also discuss its Stale Synchronous Parallel Model.
Scalable machine learning spans three layers: algorithms, abstractions, and scalable systems.
Algorithms: Naïve Bayes, Rocchio, graph algorithms, SGD, sampling, graphical models [NIPS'09, NIPS'13]
ML Systems Landscape — abstractions:
• Dataflow systems: Hadoop & Spark
• Graph systems: GraphLab, TensorFlow
• Shared-memory systems: Bosen, DMTK, ParameterServer.org
Parameter server: the model parameters of the ML system are stored in a distributed hash table that is accessible through the network [NIPS'09, NIPS'13].
• Pull: query parts of the model
• Push: update parts of the model

Machine learning update equations — (stochastic) gradient descent, collapsed Gibbs sampling for topic modeling — aggregate push updates via addition (+).
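To see why push updates can be aggregated by addition, consider SGD: each worker pushes a delta Δw = −η·∇, and the server only ever applies "+=", so deltas from many workers can be merged in any order. A minimal single-process sketch in plain Python (all names illustrative, not any real system's API):

```python
# Sketch: additive aggregation of SGD updates, parameter-server style.
# Each worker computes a delta against a pulled snapshot; the server
# applies deltas with "+=", so they commute and can arrive in any order.

def grad(w, x, y):
    """Gradient of squared error 0.5*(w*x - y)**2 for a scalar model w."""
    return (w * x - y) * x

def worker_delta(w, batch, lr=0.1):
    """Delta a worker would push: -lr * sum of gradients over its batch."""
    return -lr * sum(grad(w, x, y) for x, y in batch)

w = 0.0                                          # server-side model state
batches = [[(1.0, 2.0)], [(2.0, 4.0)], [(3.0, 6.0)]]  # data with y = 2x

for _ in range(50):
    w_snapshot = w                               # workers pull the model
    deltas = [worker_delta(w_snapshot, b) for b in batches]
    w += sum(deltas)                             # push: aggregate via (+)

print(round(w, 2))  # converges toward 2.0
```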
These updates are implemented with a parameter server: worker machines and server machines.
➢ Model parameters are stored on PS machines and accessed via key-
value interface (distributed shared memory)
➢ Extensions: multiple keys (for a matrix); multiple “channels” (for
multiple sparse vectors, multiple clients for same servers, …)
[Smola et al 2010, Ho et al 2013, Li et al 2014]
➢ Extensions: push/pull interface to send/receive most recent copy
of (subset of) parameters, blocking is optional
➢ Extension: can block until push/pulls with clock < (t – τ ) complete
[Smola et al 2010, Ho et al 2013, Li et al 2014]
Figure: the model parameters w1…w9 are partitioned across server machines, while the training data is partitioned across the workers.
1. Different parts of the model live on different servers.
2. Workers retrieve the parts they need, as needed.
The key-value API:
1. get(key) → value
2. add(key, delta)
Big Data Computing Vu Pham Parameter Servers
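A minimal in-memory stand-in for this key-value interface, in plain Python (a toy sketch of one server shard; names are illustrative, not the API of any real parameter-server library):

```python
from collections import defaultdict

class ParameterServer:
    """Toy single-process stand-in for a parameter-server shard:
    get(key) returns the current value, and add(key, delta) applies an
    additive update -- the only write primitive, so concurrent pushes
    from different workers merge cleanly."""

    def __init__(self):
        self.store = defaultdict(float)

    def get(self, key):
        return self.store[key]

    def add(self, key, delta):
        self.store[key] += delta

# A "worker" pulls the parameters it needs, computes, then pushes deltas.
ps = ParameterServer()
ps.add("w1", 0.5)
ps.add("w1", -0.2)   # updates from different workers merge via addition
ps.add("w2", 1.0)
print(ps.get("w1"), ps.get("w2"))  # 0.3 1.0
```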
Iteration in Map-Reduce (IPM)
Starting from the initial model w(0), each Map-Reduce pass over the training data produces an updated model — w(1), w(2), w(3), … — until the learned model is produced.
Cost of Iteration in Map-Reduce
Each iteration repeatedly loads the same training data.
Each iteration also redundantly saves the intermediate model between stages.
Parameter Servers: Stale Synchronous Parallel Model
Map-Reduce vs. Parameter Server:
• Programming abstraction: Map & Reduce vs. Key-Value Store (distributed shared memory)
• Execution semantics: Bulk Synchronous Parallel (BSP) vs. ?
The Problem: Networks Are Slow!
Every get(key) and add(key, delta) crosses the network between worker and server machines.
➢ The network is slow compared to local memory access.
➢ We want to explore options for handling this.
[Smola et al 2010, Ho et al 2013, Li et al 2014]
One solution: workers cache parameters locally and push only sparse changes to the model back to the server.
Bulk synchronous execution: all machines synchronize at a barrier after each iteration, so time spent waiting for stragglers is wasted.
Asynchronous execution: machines proceed through iterations without barriers [Smola et al 2010].
Asynchronous Execution
Problem: asynchronous execution lacks theoretical guarantees, since a distributed environment can have arbitrary delays from the network and stragglers.
But…
Map-Reduce vs. Parameter Server:
• Programming abstraction: Map & Reduce vs. Key-Value Store (distributed shared memory)
• Execution semantics: Bulk Synchronous Parallel (BSP) vs. Bounded Asynchronous
Parameter Server: Bounded Asynchrony
Stale synchronous parallel (SSP):
• There is a global clock time t.
• The parameters a worker "gets" can be out of date,
• but cannot be older than t − τ.
• τ controls the "staleness"; this scheme is also known as bounded asynchronous.
Figure: along the global clock (0–7), updates up to clock t − τ are guaranteed visible to worker 2; more recent updates (e.g., from worker 4) may not yet have been sent.
➢ SSP interpolates between BSP and Async, and subsumes both.
➢ Workers are usually allowed to run at their own pace.
➢ The fastest and slowest threads are not allowed to drift more than τ clocks apart.
➢ Efficiently implemented: cache parameters [Ho et al 2013].
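The staleness rule can be captured in a few lines: a worker may advance to clock t only if the slowest worker has reached at least t − τ. A toy single-threaded simulation in plain Python (illustrative names; a real implementation would block rather than return False):

```python
class SSPClock:
    """Toy simulation of the stale synchronous parallel rule:
    a worker may sit at clock t only if every worker is at >= t - tau."""

    def __init__(self, n_workers, tau):
        self.clocks = [0] * n_workers
        self.tau = tau

    def can_advance(self, worker):
        # Advancing to clocks[worker] + 1 is allowed only if the fastest
        # worker stays within tau clocks of the slowest.
        return self.clocks[worker] + 1 - min(self.clocks) <= self.tau

    def tick(self, worker):
        if not self.can_advance(worker):
            return False          # would block in a real implementation
        self.clocks[worker] += 1
        return True

ssp = SSPClock(n_workers=3, tau=2)
# A fast worker can run ahead, but only tau clocks past the slowest.
for _ in range(5):
    ssp.tick(0)
print(ssp.clocks)  # [2, 0, 0]: worker 0 is held at the staleness limit
```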
SSP spans the spectrum from strong consistency to relaxed consistency [Ho et al 2013].
Conclusion
In this lecture we discussed parameter servers and the stale synchronous parallel model.
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Computing Vu Pham Page Rank
Preface
Content of this Lecture:
In this lecture, we will discuss the PageRank algorithm on Big Data using different frameworks, in different ways, and at different scales.
Brain-scale graph: 100 billion vertices, 100 trillion edges (the human connectome; Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011).
What is PageRank?
Why is page importance rating important? It answers new challenges for information retrieval on the World Wide Web:
• Huge number of web pages: 150 million by 1998, 1000 billion by 2008.
• Diversity of web pages: different topics, different quality, etc.
What is PageRank?
• A method for rating the importance of web pages objectively and mechanically using the link structure of the web.
• History:
– PageRank was developed by Larry Page (hence the name Page-Rank) and Sergey Brin.
– It began as part of a research project about a new kind of search engine. That project started in 1995 and led to a functional prototype in 1998.
The PageRank of a page E is denoted PR(E). Links from a high-rank page confer high rank.
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
EL
PT
N
We have shown how to iteratively calculate PageRank for every vertex in the given graph.
We only need to shuffle the new rank contributions, not the graph structure. Further, we have to control the iteration outside of MapReduce.

Batch algorithms on large graphs
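The MapReduce formulation can be sketched as follows: the map phase emits each page's rank contribution to its out-neighbors, and the reduce phase sums contributions per page and applies the damping factor. Only the contributions are shuffled; the graph structure stays put, and the iteration loop runs outside "MapReduce". A plain-Python sketch (a damping factor of 0.85 is assumed for illustration):

```python
from collections import defaultdict

def pagerank_iteration(graph, ranks, damping=0.85):
    """One MapReduce-style PageRank pass.
    graph: {page: [out-neighbors]}, ranks: {page: rank}."""
    # Map: each page sends rank/out-degree to each out-neighbor.
    contribs = defaultdict(float)
    for page, links in graph.items():
        for dest in links:
            contribs[dest] += ranks[page] / len(links)
    # Reduce: sum contributions per page and apply the damping factor.
    return {page: (1 - damping) + damping * contribs[page]
            for page in graph}

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1.0 for page in graph}
for _ in range(30):            # iteration controlled outside "MapReduce"
    ranks = pagerank_iteration(graph, ranks)
print({p: round(r, 3) for p, r in sorted(ranks.items())})
```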
…(transportation networks), or corporations (financial transactions).

In this lecture, we have discussed PageRank algorithms for extracting information from graph data using MapReduce and Pregel.
Dr. Rajiv Misra
Dept. of Computer Science & Engg.
Indian Institute of Technology Patna
[email protected]
Big Data Computing Vu Pham GraphX
Preface
Content of this Lecture:
In this lecture, we will discuss GraphX, which unifies graph-parallel and data-parallel computation for Big Data Analytics, and also present a case study of graph analytics with GraphX.
GraphX computes high-level measures of a graph such as path length and triangle counts:
• It can count triangles in the graph and apply the PageRank algorithm to it.
• It can also join graphs together and transform graphs quickly.
• It supports the Pregel API (Google) for traversing a graph.
• It introduces VertexRDD and EdgeRDD, and the Edge data type.
Figure (graph analytics pipeline on Wikipedia): raw XML text is parsed into a Wikipedia table (title, body), from which a term-document graph feeds a topic model (LDA) producing word topics; in parallel, a discussion table yields an editor graph, community detection, and per-community topics (user community, community topic).
Example (influence ranking): the rank of user i is a weighted sum of its neighbors' ranks. Update ranks in parallel and iterate until convergence.
The Graph-Parallel Pattern (Gonzalez et al. [OSDI'12]): the model/algorithm state is placed on the vertices, and computation depends only on the neighbors:
• Gather information from neighboring vertices.
• Apply an update to the vertex property.
• Scatter information to neighboring vertices.
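The gather and apply steps can be sketched generically; here they compute, for each vertex, the maximum value among itself and its in-neighbors, on a small hypothetical graph (plain Python, illustrative names only):

```python
def gather_apply(vertices, edges, gather, apply_fn):
    """One superstep of the gather/apply part of the graph-parallel pattern.
    vertices: {id: value}; edges: list of (src, dst) pairs.
    gather(acc, neighbor_value) accumulates over in-neighbors;
    apply_fn(old_value, acc) produces the new vertex value."""
    acc = {}
    for src, dst in edges:                    # gather from in-neighbors
        if dst in acc:
            acc[dst] = gather(acc[dst], vertices[src])
        else:
            acc[dst] = vertices[src]
    return {v: apply_fn(val, acc[v]) if v in acc else val  # apply update
            for v, val in vertices.items()}

vertices = {"A": 3, "B": 7, "C": 1, "D": 5}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
updated = gather_apply(vertices, edges, gather=max,
                       apply_fn=lambda old, acc: max(old, acc))
print(updated)  # {'A': 3, 'B': 7, 'C': 7, 'D': 5}
```

A scatter step would then push each changed value back out along the out-edges for the next superstep.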
Many Graph-Parallel Algorithms:
• Collaborative filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
• Community detection: Triangle-Counting, K-core Decomposition, K-Truss
• Structured prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
• Graph analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring
• Semi-supervised ML: Graph SSL, CoEM
• Classification: Neural Networks
Specialized graph systems (e.g., Google's Pregel) expose specialized APIs to simplify graph programming; data movement is managed by the system, not the user.
Iterative Bulk Synchronous Execution
These systems exploit graph structure to achieve orders-of-magnitude performance gains over more general data-parallel systems: Spark is 4x faster than Hadoop, and GraphLab is 16x faster than Spark.
PageRank on the LiveJournal Graph (runtime in seconds):
• Mahout/Hadoop: 1340
• Naïve Spark: 354
• GraphLab: 22
Figure: the same data viewed as a table of rows and as a dependency graph. Maintaining a separate system for each view is difficult to use and inefficient, and leads to brittle and often complex interfaces.
Figure: a typical pipeline shuttles raw XML through HDFS between table and graph systems. The goal of GraphX is enabling users to easily and efficiently express the entire graph analytics pipeline.
Tables and graphs are composable views of the same physical data: GraphX maintains a unified representation exposing a table view and a graph view. Each view has its own operators that exploit the semantics of the view to achieve efficient execution.
Graphs → Relational Algebra
1. Encode graphs as distributed tables
2. Express graph computation in relational algebra
3. Recast graph-system optimizations as:
   1. Distributed join optimization
   2. Incremental materialized view maintenance
Goals: integrate graph and table processing systems, and achieve performance parity with specialized systems.
Contributors include J. E. Gonzalez (postdoc, Berkeley), M. Franklin (Prof., Berkeley), and I. Stoica (Berkeley).
Table (RDD) operators: groupBy, fold, first, sort, union, reduceByKey, groupByKey, partitionBy, mapWith, join, cogroup, pipe, leftOuterJoin, cross, save, rightOuterJoin, zip, …
Graph Operators
class Graph[V, E] {
  def Graph(vertices: Table[(Id, V)],
            edges: Table[(Id, Id, E)])
  // Table Views -----------------
  def vertices: Table[(Id, V)]
  def edges: Table[(Id, Id, E)]
  def triplets: Table[((Id, V), (Id, V), E)]
  // Transformations ------------------------------
  def reverse: Graph[V, E]
  def subgraph(pV: (Id, V) => Boolean,
               pE: Edge[V, E] => Boolean): Graph[V, E]
  def mapV(m: (Id, V) => T): Graph[T, E]
  def mapE(m: Edge[V, E] => T): Graph[V, T]
  // Joins ----------------------------------------
  def joinV(tbl: Table[(Id, T)]): Graph[(V, T), E]
  def joinE(tbl: Table[(Id, Id, T)]): Graph[V, (E, T)]
  // Computation ----------------------------------
  def mrTriplets(mapF: (Edge[V, E]) => List[(Id, T)],
                 reduceF: (T, T) => T): Graph[T, E]
}
Triplets Join Vertices and Edges
The vertex table (A, B, C, D) is joined with the edge table (A→B, A→C, B→C, C→D) to form triplets: ((A, B), (A, C), (B, C), (C, D)).
Example: compute, for each vertex, the maximum age among its in-neighbors. mapF is applied per edge (mapF(A→B) → A1, mapF(A→C) → A2) and reduceF combines the messages sent to the same vertex (reduceF(A1, A2) → A):

graph.mrTriplets(
  e => (e.dst.id, e.src.age), // Map
  (a, b) => max(a, b)         // Reduce
).vertices

(Vertex ages in the figure: A = 30, D = 19, E = 75, F = 16.)
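The mrTriplets computation can be emulated in a few lines of plain Python on a small toy graph (illustrative only, not GraphX's actual implementation): map each edge to a (destination, source-age) message, then reduce messages per vertex with max.

```python
def mr_triplets(vertices, edges, map_f, reduce_f):
    """Emulate the shape of GraphX's mrTriplets: map_f runs on each
    (src_attr, dst_id) edge triplet and emits a (vertex_id, message)
    pair; reduce_f merges messages sent to the same vertex."""
    out = {}
    for src, dst in edges:
        vid, msg = map_f(vertices[src], dst)
        out[vid] = reduce_f(out[vid], msg) if vid in out else msg
    return out

# Toy graph: vertex attribute is an age; edges are hypothetical.
ages = {"A": 30, "B": 19, "C": 75, "D": 16}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]

# Max in-neighbor age, mirroring mrTriplets(e => (e.dst.id, e.src.age), max)
oldest_neighbor = mr_triplets(ages, edges,
                              map_f=lambda src_age, dst: (dst, src_age),
                              reduce_f=max)
print(oldest_neighbor)  # {'B': 30, 'C': 30, 'D': 75}
```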
Figure (2D vertex-cut heuristic): edges are partitioned across machines (Part. 1 and Part. 2); vertices whose edges span both partitions (e.g., A, D, E) are replicated on each.
Caching for Iterative mrTriplets
The vertex table (RDD) is joined with the edge table (RDD); each edge partition keeps a mirror cache of the vertex attributes it needs, so vertex data is shipped once and reused across iterations.
Incremental Updates for Iterative mrTriplets
Only changed vertex values (e.g., A and E) are re-shipped to the mirror caches; the edge table is scanned and unchanged entries are reused.
Aggregation for Iterative mrTriplets
Changed vertex values are locally aggregated on each machine before being sent to the mirror caches, further reducing communication.
Reduction in Communication Due to Cached Updates
Figure: network communication (MB, log scale) drops sharply over iterations 0–16; most vertices are within 8 hops of all vertices in their component.
Figure: runtime per iteration (seconds) for scanning all edges vs. using an index of "active" edges; the indexed approach wins as fewer edges remain active across iterations 0–16.
Figure (PageRank on Twitter): join elimination substantially reduces communication (MB) compared with a three-way join.
To efficiently construct sub-graphs, GraphX relies on substantial index and data reuse:
• Reuse routing tables across graphs and sub-graphs
• Reuse edge adjacency information and indices
PageRank runtime (seconds): Mahout/Hadoop 1340, Naïve Spark 354, Giraph 207, GraphX 68, GraphLab 22.
Connected components runtime (seconds): Giraph 749, GraphX 451, GraphLab 203.
End-to-end pipeline runtime (seconds): Spark 1492, Giraph + Spark 605, GraphLab + Spark 375, GraphX 342.
Lines of code per algorithm in GraphX: PageRank (5), Connected Components (10), Shortest Path (10), SVD (40), ALS (40), K-core (51), Triangle Count (45), LDA (120) — built on the Pregel (28) and GraphLab (50) APIs, implemented in GraphX (3575 lines) on top of Spark.
GraphX enables users to easily and efficiently express the entire graph analytics pipeline.
By embedding graphs in relational algebra:
• Integration with tables and preprocessing
• Leverage advances in relational systems
• Graph optimizations recast to relational systems
A single system that efficiently spans the pipeline:
• Minimizes data movement and duplication
• Eliminates the need to learn and manage multiple systems
Graphs through the lens of database systems:
• Graph-parallel pattern → triplet joins in relational algebra
• Graph systems → distributed join optimizations
Future directions:
• Model and analyze relationships over time
• Serving graph-structured data: allow external systems to interact with GraphX
• Unify distributed graph databases with relational database technology
Figure (degree distribution): the number of vertices falls off steeply with the number of adjacent vertices; most vertices have only one neighbor.
Figure (update distribution, log scale): the number of vertices vs. the number of updates (0–70); most vertices receive few updates.
Graphs are Essential to Data Mining and Machine Learning
• Find communities
• Understand people's shared interests
• Model complex data dependencies
Figure (collaborative filtering): the users × movies rating matrix is approximated by latent factor vectors f(i) attached to user and movie vertices (ratings such as r24 and r25 connect them); iterate the factor updates until convergence.
Figure (belief propagation): inferring whether users are liberal or conservative from a graph of users and posts; unknown labels (?) are filled in by propagating beliefs along the edges.
Figure (triangle counting): the number of triangles through a vertex (e.g., 1, 3, or 4) measures the "cohesiveness" of its local community.
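Per-vertex triangle counting can be sketched directly: for each edge (u, v), every common neighbor w of u and v closes a triangle, which is credited to w. A plain-Python sketch on a small hypothetical undirected graph:

```python
def triangles_per_vertex(edges):
    """Count the triangles through each vertex of an undirected graph."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    counts = {v: 0 for v in nbrs}
    for u, v in edges:
        for w in nbrs[u] & nbrs[v]:   # common neighbors close a triangle
            counts[w] += 1            # each triangle is credited once per
    return counts                     # vertex: at the edge opposite to it

edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
print(triangles_per_vertex(edges))  # {'A': 1, 'B': 1, 'C': 1, 'D': 0}
```

Vertices A, B, and C each sit on one triangle; D sits on none, so its local community is the least cohesive.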
Figure (the GraphX pipeline): Raw Data → ETL → Initial Graph → Slice → Subgraph → Compute (PageRank) → Analyze → Top Users, with the whole loop repeating inside GraphX.
References
Xin, R., Crankshaw, D., Dave, A., Gonzalez, J., Franklin, M.J., & Stoica, I. (2014). GraphX: Unifying Data-Parallel and Graph-Parallel Analytics. CoRR, abs/1402.2394.
http://spark.apache.org/graphx
Conclusion
The growing scale and importance of graph data has driven the development of specialized graph computation engines capable of inferring complex recursive properties of graph-structured data.

In this lecture we have discussed GraphX, a distributed graph processing framework that unifies graph-parallel and data-parallel computation in a single system and is capable of succinctly expressing and efficiently executing the entire graph analytics pipeline.
Case study: flight data analysis. The data comes from the U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics site, which tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled, and diverted flights is published in DOT's monthly Air Travel Consumer Report and in this dataset of January 2014 flight delays and cancellations.
Analysis goals:
• Display the airports with the highest incoming flights
• Display the airports with the highest outgoing flights
• List the most important airports according to PageRank
• List the routes with the lowest flight costs
• Display the airport codes with the lowest flight costs
• List the routes with airport codes which have the lowest flight costs
Dataset fields:
dOfW — Day of Week
fNum — Flight Number Reporting Airline
origin_id — Origin Airport Id
origin — Origin Airport
crsarrtime — CRS Arrival Time (local time: hhmm)
arrtime — Actual Arrival Time (local time: hhmm)
arrdelaymins — Arrival delay (minutes)
graph.edges.filter {
  case Edge(org_id, dest_id, distance) => distance > 1000
}.take(5).foreach(println)
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
// Compute the max degrees
val maxInDegree: (VertexId, Int) = graph.inDegrees.reduce(max)
_._2).map(x => (airportMap(x._1), x._2)).take(10).foreach(println)
"There were " + triplet.attr.toString + " flights from " + triplet.srcAttr + " to " + triplet.dstAttr + ".").take(20).foreach(println)
(id, distCost, newDistCost) => math.min(distCost, newDistCost),
triplet => {
  // send message
  if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
    Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
  } else {
    Iterator.empty
  }
},
// Merge messages
(a, b) => math.min(a, b)
)
// Print routes with the lowest flight cost
print("routes with lowest flight cost")
println(sssp.edges.take(10).mkString("\n"))
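The Pregel-style shortest-path loop used here can be emulated in plain Python (an illustrative sketch only; the real GraphX Pregel runs distributed): each vertex keeps its best-known distance, edges that would improve a destination send src-distance + edge-weight messages, and messages to the same vertex merge with min.

```python
import math

def pregel_sssp(n_vertices, edges, source):
    """Pregel-style single-source shortest paths.
    edges: list of (src, dst, weight); vertices hold current distances."""
    dist = {v: math.inf for v in range(n_vertices)}
    dist[source] = 0.0
    messages = {source: 0.0}
    while messages:
        # Vertex program: keep the min of current value and incoming message.
        for v, m in messages.items():
            dist[v] = min(dist[v], m)
        # Send phase: propose improvements along out-edges.
        new_msgs = {}
        for src, dst, w in edges:
            cand = dist[src] + w
            if cand < dist[dst]:
                # Merge messages to the same vertex with min.
                new_msgs[dst] = min(new_msgs.get(dst, math.inf), cand)
        messages = new_msgs   # loop ends when no distance can improve
    return dist

# Hypothetical routes with costs (vertex ids 0-3 standing in for airports).
edges = [(0, 1, 100.0), (0, 2, 500.0), (1, 2, 150.0), (2, 3, 200.0)]
print(pregel_sssp(4, edges, source=0))  # {0: 0.0, 1: 100.0, 2: 250.0, 3: 450.0}
```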
val temp = rank.join(airports)
temp.take(10).foreach(println)

// Output the most influential airports, from most influential to least
val temp2 = temp.sortBy(_._2._1, false)