0% found this document useful (0 votes)
92 views

Scaling Distributed Machine Learning With The Parameter Server

This document summarizes a parameter server framework for distributed machine learning problems. The framework distributes data and workloads across worker nodes while server nodes maintain globally shared parameters. It supports asynchronous communication between nodes, flexible consistency models, elastic scalability, and continuous fault tolerance. The authors demonstrate scalability on real datasets with billions of examples and parameters using problems like sparse logistic regression and latent Dirichlet allocation.

Uploaded by

Simon lin
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
92 views

Scaling Distributed Machine Learning With The Parameter Server

This document summarizes a parameter server framework for distributed machine learning problems. The framework distributes data and workloads across worker nodes while server nodes maintain globally shared parameters. It supports asynchronous communication between nodes, flexible consistency models, elastic scalability, and continuous fault tolerance. The authors demonstrate scalability on real datasets with billions of examples and parameters using problems like sparse logistic regression and latent Dirichlet allocation.

Uploaded by

Simon lin
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 16

Scaling Distributed Machine Learning with the Parameter Server

Mu Li∗‡ , David G. Andersen∗ , Jun Woo Park∗ , Alexander J. Smola∗† , Amr Ahmed† ,
Vanja Josifovski† , James Long† , Eugene J. Shekita† , Bor-Yiing Su†
∗ ‡ †
Carnegie Mellon University Baidu Google
{muli, dga, junwoop}@cs.cmu.edu, [email protected], {amra, vanjaj, jamlong, shekita, boryiingsu}@google.com

Abstract ≈ #machine × time # of jobs failure rate


100 hours 13,187 7.8%
We propose a parameter server framework for distributed 1, 000 hours 1,366 13.7%
machine learning problems. Both data and workloads 10, 000 hours 77 24.7%
are distributed over worker nodes, while the server nodes
maintain globally shared parameters, represented as dense Table 1: Statistics of machine learning jobs for a three
or sparse vectors and matrices. The framework manages month period in a data center.
asynchronous data communication between nodes, and
supports flexible consistency models, elastic scalability,
and continuous fault tolerance. cost of synchronization and machine latency is high.
To demonstrate the scalability of the proposed frame- • At scale, fault tolerance is critical. Learning tasks are
work, we show experimental results on petabytes of real often performed in a cloud environment where ma-
data with billions of examples and parameters on prob- chines can be unreliable and jobs can be preempted.
lems ranging from Sparse Logistic Regression to Latent
Dirichlet Allocation and Distributed Sketching. To illustrate the last point, we collected all job logs for
a three month period from one cluster at a large internet
company. We show statistics of batch machine learning
1 Introduction tasks serving a production environment in Table 1. Here,
task failure is mostly due to being preempted or losing
Distributed optimization and inference is becoming a pre- machines without necessary fault tolerance mechanisms.
requisite for solving large scale machine learning prob- Unlike in many research settings where jobs run exclu-
lems. At scale, no single machine can solve these prob- sively on a cluster without contention, fault tolerance is a
lems sufficiently rapidly, due to the growth of data and necessity in real world deployments.
the resulting model complexity, often manifesting itself
in an increased number of parameters. Implementing an
efficient distributed algorithm, however, is not easy. Both 1.1 Contributions
intensive computational workloads and the volume of data
communication demand careful system design. Since its introduction, the parameter server frame-
Realistic quantities of training data can range between work [43] has proliferated in academia and industry. This
1TB and 1PB. This allows one to create powerful and paper describes a third generation open source implemen-
complex models with 109 to 1012 parameters [9]. These tation of a parameter server that focuses on the systems
models are often shared globally by all worker nodes, aspects of distributed inference. It confers two advan-
which must frequently accesses the shared parameters as tages to developers: First, by factoring out commonly
they perform computation to refine it. Sharing imposes required components of machine learning systems, it en-
three challenges: ables application-specific code to remain concise. At the
same time, as a shared platform to target for systems-
• Accessing the parameters requires an enormous level optimizations, it provides a robust, versatile, and
amount of network bandwidth. high-performance implementation capable of handling a
• Many machine learning algorithms are sequential. diverse array of algorithms from sparse logistic regression
The resulting barriers hurt performance when the to topic models and distributed sketching. Our design de-
11
Shared Data Consistency Fault Tolerance 10

number of shared parameters


Parameter server (Sparse LR)
Graphlab [34] graph eventual checkpoint 10
10 Parameter server (LDA)
Petuum [12] hash table delay bound none
9
REEF [10] array BSP checkpoint 10 Distbelief (DNN)
Naiad [37] (key,value) multiple checkpoint 8
10 Petuum (Lasso)
Mlbase [29] table BSP RDD 7 Naiad (LR) YahooLDA (LDA)
Parameter (sparse) 10
VW (LR)
various continuous 6
Server vector/matrix 10 Graphlab (LDA)
5 MLbase (LR)
10
Table 2: Attributes of distributed data analysis systems. REEF (LR)
4
10 1 2 3 4 5
10 10 10 10 10
number of cores
cisions were guided by the workloads found in real sys-
tems. Our parameter server provides five key features: Figure 1: Comparison of the public largest machine learn-
Efficient communication: The asynchronous commu- ing experiments each system performed. Problems are
nication model does not block computation (unless re- color-coded as follows: Blue circles — sparse logistic re-
quested). It is optimized for machine learning tasks to gression; red squares — latent variable graphical models;
reduce network traffic and overhead. grey pentagons — deep networks.
Flexible consistency models: Relaxed consistency fur-
ther hides synchronization cost and latency. We allow the
algorithm designer to balance algorithmic convergence tains only a part of the parameters, and each worker node
rate and system efficiency. The best trade-off depends on typically requires only a subset of these parameters when
data, algorithm, and hardware. operating. Two key challenges arise in constructing a high
Elastic Scalability: New nodes can be added without performance parameter server system:
restarting the running framework. Communication. While the parameters could be up-
Fault Tolerance and Durability: Recovery from and re- dated as key-value pairs in a conventional datastore, us-
pair of non-catastrophic machine failures within 1s, with- ing this abstraction naively is inefficient: values are typi-
out interrupting computation. Vector clocks ensure well- cally small (floats or integers), and the overhead of send-
defined behavior after network partition and failure. ing each update as a key value operation is high.
Ease of Use: The globally shared parameters are repre- Our insight to improve this situation comes from the
sented as (potentially sparse) vectors and matrices to facil- observation that many learning algorithms represent pa-
itate development of machine learning applications. The rameters as structured mathematical objects, such as vec-
linear algebra data types come with high-performance tors, matrices, or tensors. At each logical time (or an it-
multi-threaded libraries. eration), typically a part of the object is updated. That is,
The novelty of the proposed system lies in the synergy workers usually send a segment of a vector, or an entire
achieved by picking the right systems techniques, adapt- row of the matrix. This provides an opportunity to auto-
ing them to the machine learning algorithms, and modify- matically batch both the communication of updates and
ing the machine learning algorithms to be more systems- their processing on the parameter server, and allows the
friendly. In particular, we can relax a number of other- consistency tracking to be implemented efficiently.
wise hard systems constraints since the associated ma- Fault tolerance, as noted earlier, is critical at scale, and
chine learning algorithms are quite tolerant to perturba- for efficient operation, it must not require a full restart of a
tions. The consequence is the first general purpose ML long-running computation. Live replication of parameters
system capable of scaling to industrial scale sizes. between servers supports hot failover. Failover and self-
repair in turn support dynamic scaling by treating machine
1.2 Engineering Challenges removal or addition as failure or repair respectively.
Figure 1 provides an overview of the scale of the largest
When solving distributed data analysis problems, the is- supervised and unsupervised machine learning experi-
sue of reading and updating parameters shared between ments performed on a number of systems. When possi-
different worker nodes is ubiquitous. The parameter ble, we confirmed the scaling limits with the authors of
server framework provides an efficient mechanism for ag- each of these systems (data current as of 4/2014). As is
gregating and synchronizing model parameters and statis- evident, we are able to cover orders of magnitude more
tics between workers. Each parameter server node main- data on orders of magnitude more processors than any

2
other published system. Furthermore, Table 2 provides an dates to a server keeping the aggregate state. It thus imple-
overview of the main characteristics of several machine ments largely a subset of the functionality of our system,
learning systems. Our parameter server offers the greatest lacking the mechane learning specailized optimizations:
degree of flexibility in terms of consistency. It is the only message compression, replication, and variable consis-
system offering continuous fault tolerance. Its native data tency models expressed via dependency graphs.
types make it particularly friendly for data analysis.

2 Machine Learning
1.3 Related Work
Machine learning systems are widely used in Web search,
Related systems have been implemented at Amazon, spam detection, recommendation systems, computational
Baidu, Facebook, Google [13], Microsoft, and Yahoo [1]. advertising, and document analysis. These systems au-
Open source codes also exist, such as YahooLDA [1] and tomatically learn models from examples, termed training
Petuum [24]. Furthermore, Graphlab [34] supports pa- data, and typically consist of three components: feature
rameter synchronization on a best effort model. extraction, the objective function, and learning.
The first generation of such parameter servers, as in- Feature extraction processes the raw training data, such
troduced by [43], lacked flexibility and performance — it as documents, images and user query logs, to obtain fea-
repurposed memcached distributed (key,value) store as ture vectors, where each feature captures an attribute of
synchronization mechanism. YahooLDA improved this the training data. Preprocessing can be executed effi-
design by implementing a dedicated server with user- ciently by existing frameworks such as MapReduce, and
definable update primitives (set, get, update) and a more is therefore outside the scope of this paper.
principled load distribution algorithm [1]. This second
generation of application specific parameter servers can
also be found in Distbelief [13] and the synchronization 2.1 Goals
mechanism of [33]. A first step towards a general platform The goal of many machine learning algorithms can be ex-
was undertaken by Petuum [24]. It improves YahooLDA pressed via an “objective function.” This function cap-
with a bounded delay model while placing further con- tures the properties of the learned model, such as low er-
straints on the worker threading model. We describe a ror in the case of classifying e-mails into ham and spam,
third generation system overcoming these limitations. how well the data is explained in the context of estimating
Finally, it is useful to compare the parameter server topics in documents, or a concise summary of counts in
to more general-purpose distributed systems for machine the context of sketching data.
learning. Several of them mandate synchronous, itera- The learning algorithm typically minimizes this objec-
tive communication. They scale well to tens of nodes, tive function to obtain the model. In general, there is no
but at large scale, this synchrony creates challenges as the closed-form solution; instead, learning starts from an ini-
chance of a node operating slowly increases. Mahout [4], tial model. It iteratively refines this model by processing
based on Hadoop [18] and MLI [44], based on Spark [50], the training data, possibly multiple times, to approach the
both adopt the iterative MapReduce [14] framework. A solution. It stops when a (near) optimal solution is found
key insight of Spark and MLI is preserving state between or the model is considered to be converged.
iterations, which is a core goal of the parameter server. The training data may be extremely large. For instance,
Distributed GraphLab [34] instead asynchronously a large internet company using one year of an ad impres-
schedules communication using a graph abstraction. At sion log [27] to train an ad click predictor would have
present, GraphLab lacks the elastic scalability of the trillions of training examples. Each training example is
map/reduce-based frameworks, and it relies on coarse- typically represented as a possibly very high-dimensional
grained snapshots for recovery, both of which impede “feature vector” [9]. Therefore, the training data may con-
scalability. Its applicability for certain algorithms is lim- sist of trillions of trillion-length feature vectors. Itera-
ited by its lack of global variable synchronization as an tively processing such large scale data requires enormous
efficient first-class primitive. In a sense, a core goal of the computing and bandwidth resources. Moreover, billions
parameter server framework is to capture the benefits of of new ad impressions may arrive daily. Adding this data
GraphLab’s asynchrony without its structural limitations. into the system often improves both prediction accuracy
Piccolo [39] uses a strategy related to the parameter and coverage. But it also requires the learning algorithm
server to share and aggregate state between machines. In to run daily [35], possibly in real time. Efficient execution
it, workres pre-aggregate state locally and transmit the up- of these algorithms is the main focus of this paper.

3
To motivate the design decisions in our system, next worker 1
we briefly outline the two widely used machine learning 2. push g1
servers
technologies that we will use to demonstrate the efficacy
1. compute
of our parameter server. More detailed overviews can be g1 +... +gm w1
found in [36, 28, 42, 22, 6]. 3. update

w 4. pull
2.2 Risk Minimization

...
2. push
The most intuitive variant of machine learning problems 4. pull
worker m
is that of risk minimization. The “risk” is, roughly, a mea-
gm
sure of prediction error. For example, if we were to predict
tomorrow’s stock price, the risk might be the deviation be- training
1. compute
wm
tween the prediction and the actual value of the stock. data
The training data consists of n examples. xi is the ith
such example, and is often a vector of length d. As noted
earlier, both n and d may be on the order of billions to tril-
lions of examples and dimensions, respectively. In many
cases, each training example xi is associated with a label Figure 2: Steps required in performing distributed subgra-
yi . In ad click prediction, for example, yi might be 1 for dient descent, as described e.g. in [46]. Each worker only
“clicked” or -1 for “not clicked”. caches the working set of w rather than all parameters.
Risk minimization learns a model that can predict the
value y of a future example x. The model consists of pa- Algorithm 1 Distributed Subgradient Descent
rameters w. In the simplest example, the model param- Task Scheduler:
eters might be the “clickiness” of each feature in an ad
1: issue LoadData() to all workers
impression. To predict whether a new impression would
2: for iteration t = 0, . . . , T do
be clicked, the system might simply sum its “clickiness”
3: issue W ORKER I TERATE (t) to all workers.
based upon the features present in the impression, namely
Pd 4: end for
x> w := j=1 xj wj , and then decide based on the sign.
Worker r = 1, . . . , m:
In any learning algorithm, there is an important re-
lationship between the amount of training data and the 1: function L OAD DATA()
model size. A more detailed model typically improves 2: load a part of training data {yik , xik }nk=1r

(0)
accuracy, but only up to a point: If there is too little train- 3: pull the working set wr from servers
ing data, a highly-detailed model will overfit and become 4: end function
merely a system that uniquely memorizes every item in 5: function W ORKER I TERATE(t)
(t) Pnr (t)
the training set. On the other hand, a too-small model 6: gradient gr ← k=1 ∂`(xik , yik , wr )
will fail to capture interesting and relevant attributes of (t)
7: push gr to servers
the data that are important to making a correct decision. 8: pull wr
(t+1)
from servers
Regularized risk minimization [48, 19] is a method to 9: end function
find a model that balances model complexity and training
Servers:
error. It does so by minimizing the sum of two terms:
1: function S ERVER I TERATE(t)
a loss `(x, y, w) representing the prediction error on the Pm (t)
training data and a regularizer Ω[w] penalizing the model 2: aggregate g (t) ← r=1 gr 
complexity. A good model is one with low error and low 3: w(t+1) ← w(t) − η g (t) + ∂Ω(w(t)
complexity. Consequently we strive to minimize 4: end function

n
X
F (w) = `(xi , yi , w) + Ω(w). (1)
this paper: the algorithms we present can be used with all
i=1
of the most popular loss functions and regularizers.
The specific loss and regularizer functions used are impor- In Section 5.1 we use a high-performance distributed
tant to the prediction performance of the machine learning learning algorithm to evaluate the parameter server. For
algorithm, but relatively unimportant for the purpose of the sake of simplicity we describe a much simpler model

4
100 domly assigned data to workers and then counted the av-
parameters per worker (%) erage working set size per worker on the dataset that is
used in Section 5.1. Figure 3 shows that for 100 work-
ers, each worker only needs 7.8% of the total parameters.
10 With 10,000 workers this reduces to 0.15%.

2.3 Generative Models


1
In a second major class of machine learning algorithms,
the label to be applied to training examples is unknown.
Such settings call for unsupervised algorithms (for labeled
0.1 0 1 2 3 4
10 10 10 10 10 training data one can use supervised or semi-supervised
number of workers algorithms). They attempt to capture the underlying struc-
ture of the data. For example, a common problem in this
Figure 3: Each worker’s set of parameters shrinks as more area is topic modeling: Given a collection of documents,
workers are used, requiring less memory per machine. infer the topics contained in each document.
When run on, e.g., the SOSP’13 proceedings, an algo-
rithm might generate topics such as “distributed systems”,
[46] called distributed subgradient descent.1
“machine learning”, and “performance.” The algorithms
As shown in Figure 2 and Algorithm 1, the training
infer these topics from the content of the documents them-
data is partitioned among all of the workers, which jointly
selves, not an external topic list. In practical settings such
learn the parameter vector w. The algorithm operates iter-
as content personalization for recommendation systems
atively. In each iteration, every worker independently uses
[2], the scale of these problems is huge: hundreds of mil-
its own training data to determine what changes should be
lions of users and billions of documents, making it critical
made to w in order to get closer to an optimal value. Be-
to parallelize the algorithms across large clusters.
cause each worker’s updates reflect only its own training
data, the system needs a mechanism to allow these up- Because of their scale and data volumes, these al-
dates to mix. It does so by expressing the updates as a gorithms only became commercially applicable follow-
subgradient—a direction in which the parameter vector w ing the introduction of the first-generation parameter
should be shifted—and aggregates all subgradients before servers [43]. A key challenge in topic models is that the
applying them to w. These gradients are typically scaled parameters describing the current estimate of how docu-
down, with considerable attention paid in algorithm de- ments are supposed to be generated must be shared.
sign to the right learning rate η that should be applied in A popular topic modeling approach is Latent Dirichlet
order to ensure that the algorithm converges quickly. Allocation (LDA) [7]. While the statistical model is quite
The most expensive step in Algorithm 1 is computing different, the resulting algorithm for learning it is very
the subgradient to update w. This task is divided among similar to Algorithm 1.2 The key difference, however,
all of the workers, each of which execute W ORKER I T- is that the update step is not a gradient computation, but
ERATE . As part of this, workers compute w > xik , which an estimate of how well the document can be explained
could be infeasible for very high-dimensional w. Fortu- by the current model. This computation requires access
nately, a worker needs to know a coordinate of w if and to auxiliary metadata for each document that is updated
only if some of its training data references that entry. each time a document is accessed. Because of the number
For instance, in ad-click prediction one of the key fea- of documents, metadata is typically read from and written
tures are the words in the ad. If only very few advertise- back to disk whenever the document is processed.
ments contain the phrase OSDI 2014, then most workers This auxiliary data is the set of topics assigned to each
will not generate any updates to the corresponding entry word of a document, and the parameter w being learned
in w, and hence do not require this entry. While the total consists of the relative frequency of occurrence of a word.
size of w may exceed the capacity of a single machine, As before, each worker needs to store only the param-
the working set of entries needed by a particular worker eters for the words occurring in the documents it pro-
can be trivially cached locally. To illustrate this, we ran- cesses. Hence, distributing documents across workers has
1 The unfamiliar reader could read this as gradient descent; the sub- 2 The specific algorithm we use in the evaluation is a parallelized vari-

gradient aspect is simply a generalization to loss functions and regular- ant of a stochastic variational sampler [25] with an update strategy sim-
izers that need not be continuously differentiable, such as |w| at w = 0. ilar to that used in YahooLDA [1].

5
server a server
nodes, such as online services consuming this model. Si-
server group
manager node multaneously the model is updated by a different group of
resource
manager worker nodes as new training data arrives.
The parameter server is designed to simplify devel-
oping distributed machine learning applications such as
worker group those discussed in Section 2. The shared parameters are
presented as (key,value) vectors to facilitate linear algebra
task
scheduler
operations (Sec. 3.1). They are distributed across a group
of server nodes (Sec. 4.3). Any node can both push out its
local parameters and pull parameters from remote nodes
a worker
node (Sec. 3.2). By default, workloads, or tasks, are executed
by worker nodes; however, they can also be assigned to
server nodes via user defined functions (Sec. 3.3). Tasks
training data are asynchronous and run in parallel (Sec. 3.4). The pa-
rameter server provides the algorithm designer with flexi-
bility in choosing a consistency model via the task depen-
Figure 4: Architecture of a parameter server communicat- dency graph (Sec. 3.5) and predicates to communicate a
ing with several groups of workers. subset of parameters (Sec. 3.6).

the same effect as in the previous section: we can process 3.1 (Key,Value) Vectors
much bigger models than a single worker may hold.
The model shared among nodes can be represented as a set
of (key, value) pairs. For example, in a loss minimization
3 Architecture problem, the pair is a feature ID and its weight. For LDA,
the pair is a combination of the word ID and topic ID, and
An instance of the parameter server can run more than a count. Each entry of the model can be read and written
one algorithm simultaneously. Parameter server nodes are locally or remotely by its key. This (key,value) abstraction
grouped into a server group and several worker groups is widely adopted by existing approaches [37, 29, 12].
as shown in Figure 4. A server node in the server group Our parameter server improves upon this basic ap-
maintains a partition of the globally shared parameters. proach by acknowledging the underlying meaning of
Server nodes communicate with each other to replicate these key value items: machine learning algorithms typ-
and/or to migrate parameters for reliability and scaling. A ically treat the model as a linear algebra object. For in-
server manager node maintains a consistent view of the stance, w is used as a vector for both the objective function
metadata of the servers, such as node liveness and the as- (1) and the optimization in Algorithm 1 by risk minimiza-
signment of parameter partitions. tion. By treating these objects as sparse linear algebra
Each worker group runs an application. A worker typ- objects, the parameter server can provide the same func-
ically stores locally a portion of the training data to com- tionality as the (key,value) abstraction, but admits impor-
pute local statistics such as gradients. Workers communi- tant optimized operations such as vector addition w + u,
cate only with the server nodes (not among themselves), multiplication Xw, finding the 2-norm kwk2 , and other
updating and retrieving the shared parameters. There is a more sophisticated operations [16].
scheduler node for each worker group. It assigns tasks to To support these optimizations, we assume that the
workers and monitors their progress. If workers are added keys are ordered. This lets us treat the parameters as
or removed, it reschedules unfinished tasks. (key,value) pairs while endowing them with vector and
The parameter server supports independent parameter matrix semantics, where non-existing keys are associated
namespaces. This allows a worker group to isolate its set with zeros. This helps with linear algebra in machine
of shared parameters from others. Several worker groups learning. It reduces the programming effort to implement
may also share the same namespace: we may use more optimization algorithms. Beyond convenience, this inter-
than one worker group to solve the same deep learning face design leads to efficient code by leveraging CPU-
application [13] to increase parallelization. Another ex- efficient multithreaded self-tuning linear algebra libraries
ample is that of a model being actively queried by some such as BLAS [16], LAPACK [3], and ATLAS [49].

6
3.2 Range Push and Pull iter 10: push & pull
gradient
Data is sent between nodes using push and pull oper- iter 11: gradient push & pull
ations. In Algorithm 1 each worker pushes its entire lo-
iter 12: gradient pu
cal gradient into the servers, and then pulls the updated
weight back. The more advanced algorithm described
in Algorithm 3 uses the same pattern, except that only a Figure 5: Iteration 12 depends on 11, while 10 and 11 are
range of keys is communicated each time. independent, thus allowing asynchronous processing.
The parameter server optimizes these updates for
programmer convenience as well as computational and
network bandwidth efficiency by supporting range- The caller marks a task as finished only once it receives
based push and pull. If R is a key range, then the callee’s reply. A reply could be the function return
w.push(R, dest) sends all existing entries of w in key of a user-defined function, the (key,value) pairs requested
range R to the destination, which can be either a particular by the pull, or an empty acknowledgement. The callee
node, or a node group such as the server group. Similarly, marks a task as finished only if the call of the task is re-
w.pull(R, dest) reads all existing entries of w in key turned and all subtasks issued by this call are finished.
range R from the destination. If we set R to be the whole By default, callees execute tasks in parallel, for best
key range, then the whole vector w will be communicated. performance. A caller that wishes to serialize task exe-
If we set R to include a single key, then only an individual cution can place an execute-after-finished dependency be-
entry will be sent. tween tasks. Figure 5 depicts three example iterations of
This interface can be extended to communicate any lo- WorkerIterate. Iterations 10 and 11 are independent,
cal data structures that share the same keys as w. For ex- but 12 depends on 11. The callee therefore begins itera-
ample, in Algorithm 1, a worker pushes its temporary lo- tion 11 immediately after the local gradients are computed
cal gradient g to the parameter server for aggregation. One in iteration 10. Iteration 12, however, is postponed until
option is to make g globally shared. However, note that g the pull of 11 finishes.
shares the keys of the worker’s working set w. Hence the Task dependencies help implement algorithm logic.
programmer can use w.push(R, g, dest) for the local For example, the aggregation logic in ServerIterate
gradients to save memory and also enjoy the optimization of Algorithm 1 updates the weight w only after all worker
discussed in the following sections. gradients have been aggregated. This can be implemented
by having the updating task depend on the push tasks of
all workers. The second important use of dependencies is
3.3 User-Defined Functions on the Server to support the flexible consistency models described next.
Beyond aggregating data from workers, server nodes can
execute user-defined functions. It is beneficial because the 3.5 Flexible Consistency
server nodes often have more complete or up-to-date in-
formation about the shared parameters. In Algorithm 1, Independent tasks improve system efficiency via paral-
server nodes evaluate subgradients of the regularizer Ω lelizing the use of CPU, disk and network bandwidth.
in order to update w. At the same time a more compli- However, this may lead to data inconsistency between
cated proximal operator is solved by the servers to update nodes. In the diagram above, the worker r starts iteration
the model in Algorithm 3. In the context of sketching 11 before w(11) has been pulled back, so it uses the old
(10)
(Sec. 5.3), almost all operations occur on the server side. wr in this iteration and thus obtains the same gradient
(11) (10)
as in iteration 10, namely gr = gr . This inconsis-
3.4 Asynchronous Tasks and Dependency tency potentially slows down the convergence progress of
Algorithm 1. However, some algorithms may be less sen-
A tasks is issued by a remote procedure call. It can be a sitive to this type of inconsistency. For example, only a
push or a pull that a worker issues to servers. It can segment of w is updated each time in Algorithm 3. Hence,
also be a user-defined function that the scheduler issues starting iteration 11 without waiting for 10 causes only a
to any node. Tasks may include any number of subtasks. part of w to be inconsistent.
For example, the task WorkerIterate in Algorithm 1 The best trade-off between system efficiency and algo-
contains one push and one pull. rithm convergence rate usually depends on a variety of
Tasks are executed asynchronously: the caller can per- factors, including the algorithm’s sensitivity to data incon-
form further computation immediately after issuing a task. sistency, feature correlation in training data, and capacity

7
Algorithm 2 Set vector clock to t for range R and node i
0 1 2 0 1 2 0 1 2 3 4
1: for S ∈ {Si : Si ∩ R = 6 ∅, i = 1, . . . , n} do
(a) Sequential (b) Eventual (c) 1 Bounded delay 2: if S ⊆ R then vci (S) ← t else
3: a ← max(S b , Rb ) and b ← min(S e , Re )
Figure 6: Directed acyclic graphs for different consistency
4: split range S into [S b , a), [a, b), [b, S e )
models. The size of the DAG increases with the delay.
5: vci ([a, b)) ← t
6: end if
7: end for
difference of hardware components. Instead of forcing the
user to adopt one particular dependency that may be ill-
suited to the problem, the parameter server gives the algo-
useful for synchronization. One example is the signifi-
rithm designer flexibility in defining consistency models.
cantly modified filter, which only pushes entries that have
This is a substantial difference to other machine learning
changed by more than a threshold since their last synchro-
systems.
nization. In Section 5.1, we discuss another filter named
We show three different models that can be imple-
KKT which takes advantage of the optimality condition of
mented by task dependency. Their associated directed
the optimization problem: a worker only pushes gradients
acyclic graphs are given in Figure 6.
that are likely to affect the weights on the servers.
Sequential In sequential consistency, all tasks are exe-
cuted one by one. The next task can be started only 4 Implementation
if the previous one has finished. It produces results
identical to the single-thread implementation, and The servers store the parameters (key-value pairs) using
also named Bulk Synchronous Processing. consistent hashing [45] (Sec. 4.3). For fault tolerance, en-
Eventual Eventual consistency is the opposite: all tasks tries are replicated using chain replication [47] (Sec. 4.4).
may be started simultaneously. For instance, [43] Different from prior (key,value) systems, the parameter
describes such a system. However, this is only rec- server is optimized for range based communication with
ommendable if the underlying algorithms are robust compression on both data (Sec. 4.2) and range based vec-
with regard to delays. tor clocks (Sec. 4.1).
Bounded Delay When a maximal delay time τ is set, a
new task will be blocked until all previous tasks τ
times ago have been finished. Algorithm 3 uses such
4.1 Vector Clock
a model. This model provides more flexible controls Given the potentially complex task dependency graph and
than the previous two: τ = 0 is the sequential consis- the need for fast recovery, each (key,value) pair is associ-
tency model, and an infinite delay τ = ∞ becomes ated with a vector clock [30, 15], which records the time
the eventual consistency model. of each individual node on this (key,value) pair. Vector
clocks are convenient, e.g., for tracking aggregation sta-
Note that the dependency graphs may be dynamic. For tus or rejecting doubly sent data. However, a naive im-
instance the scheduler may increase or decrease the max- plementation of the vector clock requires O(nm) space
imal delay according to the runtime progress to balance to handle n nodes and m parameters. With thousands of
system efficiency and convergence of the underlying op- nodes and billions of parameters, this is infeasible in terms
timization algorithm. In this case the caller traverses the of memory and bandwidth.
DAG. If the graph is static, the caller can send all tasks Fortunately, many parameters hare the same timestamp
with the DAG to the callee to reduce synchronization cost. as a result of the range-based communication pattern of
the parameter server: If a node pushes the parameters in
3.6 User-defined Filters a range, then the timestamps of the parameters associated
with the node are likely the same. Therefore, they can be
Complementary to a scheduler-based flow control, the compressed into a single range vector clock. More specif-
parameter server supports user-defined filters to selec- ically, assume that vci (k) is the time of key k for node i.
tively synchronize individual (key,value) pairs, allowing Given a key range R, the ranged vector clock vci (R) = t
fine-grained control of data consistency within a task. means for any key k ∈ R, vci (k) = t.
The insight is that the optimization algorithm itself usu- Initially, there is only one range vector clock for each
ally possesses information on which parameters are most node i. It covers the entire parameter key space as its

8
range with 0 as its initial timestamp. Each range set may node IDs are both inserted into the hash ring (Figure 7).
Each server node manages the key range starting with its
split the range and create at most 3 new vector clocks (see
Algorithm 2). Let k be the total number of unique ranges insertion point to the next point by other nodes in the
communicated by the algorithm, then there are at most counter-clockwise direction. This node is called the mas-
O(mk) vector clocks, where m is the number of nodes. ter of this key range. A physical server is often repre-
k is typically much smaller than the total number of pa- sented in the ring via multiple “virtual” servers to improve
load balancing and recovery.
rameters. This significantly reduces the space required for
range vector clocks.3 We simplify the management by using a direct-mapped
DHT design. The server manager handles the ring man-
agement. All other nodes cache the key partition locally.
4.2 Messages
This way they can determine directly which server is re-
Nodes may send messages to individual nodes or node sponsible for a key range, and are notified of any changes.
groups. A message consists of a list of (key,value) pairs
in the key range R and the associated range vector clock: 4.4 Replication and Consistency

[vc(R), (k1 , v1 ), . . . , (kp , vp )] kj ∈ R and j ∈ {1, . . . p} Each server node stores a replica of the k counterclock-
wise neighbor key ranges relative to the one it owns. We
This is the basic communication format of the parameter refer to nodes holding copies as slaves of the appropriate
server not only for shared parameters but also for tasks. key range. The above diagram shows an example with
For the latter, a (key,value) pair might assume the form k = 2, where server 1 replicates the key ranges owned by
(task ID, arguments or return results). server 2 and server 3.
Messages may carry a subset of all available keys Worker nodes communicate with the master of a key
within range R. The missing keys are assigned the same range for both push and pull. Any modification on the
timestamp without changing their values. A message can master is copied with its timestamp to the slaves. Mod-
be split by the key range. This happens when a worker ifications to data are pushed synchronously to the slaves.
sends a message to the whole server group, or when the Figure 8 shows a case where worker 1 pushes x into server
key assignment of the receiver node has changed. By do- 1, which invokes a user defined function f to modify the
ing so, we partition the (key,value) lists and split the range shared data. The push task is completed only once the
vector clock similar to Algorithm 2. data modification f (x) is copied to the slave.
Because machine learning problems typically require Naive replication potentially increases the network traf-
high bandwidth, message compression is desirable. Train- fic by k times. This is undesirable for many machine
ing data often remains unchanged between iterations. A learning applications that depend on high network band-
worker might send the same key lists again. Hence it is de- width. The parameter server framework permits an impor-
sirable for the receiving node to cache the key lists. Later, tant optimization for many algorithms: replication after
the sender only needs to send a hash of the list rather than aggregation. Server nodes often aggregate data from the
the list itself. Values, in turn, may contain many zero worker nodes, such as summing local gradients. Servers
entries. For example, a large portion of parameters re- may therefore postpone replication until aggregation is
main unchanged in sparse logistic regression, as evalu- complete. In the righthand side of the diagram, two work-
ated in Section 5.1. Likewise, a user-defined filter may ers push x and y to the server, respectively. The server first
also zero out a large fraction of the values (see Figure 12). aggregates the push by x + y, then applies the modifica-
Hence we need only send nonzero (key,value) pairs. We tion f (x+y), and finally performs the replication. With n
use the fast Snappy compression library [21] to compress workers, replication uses only k/n bandwidth. Often k is
messages, effectively removing the zeros. Note that key- a small constant, while n is hundreds to thousands. While
caching and value-compression can be used jointly. aggregation increases the delay of the task reply, it can be
hidden by relaxed consistency conditions.
4.3 Consistent Hashing
The parameter server partitions keys much as a conven-
4.5 Server Management
tional distributed hash table does [8, 41]: keys and server To achieve fault tolerance and dynamic scaling we must
3 Ranges can be also merged to reduce the number of fragments. support addition and removal of nodes. For convenience
However, in practice both m and k are small enough to be easily han- we refer to virtual servers below. The following steps hap-
dled. We leave merging for future work. pen when a server joins.

9
S3 replicated
by S1
S4'
S2 push: ack: W1 5a 2: f(x+y)
2: f(x) 1a: 4
S2' x S1 S2
owned 5 4
key ring by S1 W1 S1 S2 5b 3: f(x+y)
1: x 3: f(x) W2 y
1b:
S1' S1
S4 S3'
Figure 8: Replica generation. Left: single worker. Right: multiple workers updating
Figure 7: Server node layout. values simultaneously.

1. The server manager assigns the new node a key range 4.6 Worker Management
to serve as master. This may cause another key range
Adding a new worker node W is similar but simpler than
to split or be removed from a terminated node.
adding a new server node:
2. The node fetches the range of data to maintains as
master and k additional ranges to keep as slave. 1. The task scheduler assigns W a range of data.
3. The server manager broadcasts the node changes. 2. This node loads the range of training data from a net-
The recipients of the message may shrink their own work file system or existing workers. Training data is
data based on key ranges they no longer hold and to often read-only, so there is no two-phase fetch. Next,
resubmit unfinished tasks to the new node. W pulls the shared parameters from servers.
3. The task scheduler broadcasts the change, possibly
Fetching the data in the range R from some node S causing other workers to free some training data.
proceeds in two stages, similar to the Ouroboros proto-
When a worker departs, the task scheduler may start a
col [38]. First S pre-copies all (key,value) pairs in the
replacement. We give the algorithm designer the option
range together with the associated vector clocks. This
to control recovery for two reasons: If the training data
may cause a range vector clock to split similar to Algo-
is huge, recovering a worker node be may more expen-
rithm 2. If the new node fails at this stage, S remains
sive than recovering a server node. Second, losing a small
unchanged. At the second stage S no longer accepts mes-
amount of training data during optimization typically af-
sages affecting the key range R by dropping the messages
fects the model only a little. Hence the algorithm designer
without executing and replying. At the same time, S sends
may prefer to continue without replacing a failed worker.
the new node all changes that occurred in R during the
It may even be desirable to terminate the slowest workers.
pre-copy stage.
On receiving the node change message a node N first
checks if it also maintains the key range R. If true and 5 Evaluation
if this key range is no longer to be maintained by N , it
deletes all associated (key,value) pairs and vector clocks We evaluate our parameter server based on the use cases
in R. Next, N scans all outgoing messages that have not of Section 2 — Sparse Logistic Regression and Latent
received replies yet. If a key range intersects with R, then Dirichlet Allocation. We also show results of sketching
the message will be split and resent. to illustrate the generality of our framework. The experi-
Due to delays, failures, and lost acknowledgements N ments were run on clusters in two (different) large inter-
may send messages twice. Due to the use of vector clocks net companies and a university research cluster to demon-
both the original recipient and the new node are able to strate the versatility of our approach.
reject this message and it does not affect correctness.
The departure of a server node (voluntary or due to fail- 5.1 Sparse Logistic Regression
ure) is similar to a join. The server manager tasks a new
Problem and Data: Sparse logistic regression is one
node with taking the key range of the leaving node. The
of the most popular algorithms for large scale risk min-
server manager detects node failure by a heartbeat sig-
imization [9]. It combines the logistic loss4 with the `1
nal. Integration with a cluster resource manager such as
Yarn [17] or Mesos [23] is left for future work. 4 `(x , y , w) = log(1 + exp(−y hx , wi))
i i i i

10
Algorithm 3 Delayed Block Proximal Gradient [31] 10
10.7
System−A
Scheduler:
System−B
1: Partition features into b ranges R1 , . . . , Rb
Parameter Server

objective value
2: for t = 0 to T do
3: Pick random range Rit and issue task to workers
4: end for
Worker r at iteration t 10.6
10
1: Wait until all iterations before t − τ are finished
(t)
2: Compute first-order gradient gr and diagonal
(t)
second-order gradient ur on range Rit
(t) (t)
3: Push gr and ur to servers with the KKT filter −1 0 1
(t+1) 10 10 10
4: Pull wr from servers time (hours)
Servers at iteration t
1: Aggregate gradients to obtain g (t) and u(t)
Figure 9: Convergence of sparse logistic regression. The
2: Solve the proximal operator
goal is to minimize the objective rapidly.
1
w(t+1) ← argmin Ω(u) + kw(t) − ηg (t) + uk2H , 5
u 2η
computing
where H = diag(h(t) ) and kxk2H = xT Hx
waiting
4

Method Consistency LOC time (hours)


System A L-BFGS Sequential 10,000 3
System B Block PG Sequential 30,000
Parameter Bounded Delay 2
Block PG 300
Server KKT Filter

Table 3: Systems evaluated. 1

0
regularizer5 of Section 2.2. The latter biases a compact System−A System−B Parameter Server
solution with a large portion of 0 value entries. The non-
smoothness of this regularizer, however, makes learning Figure 10: Time per worker spent on computation and
more difficult. waiting during sparse logistic regression.
We collected an ad click prediction dataset with 170 bil-
lion examples and 65 billion unique features. This dataset
putation: the servers update the model by solving a prox-
is 636 TB uncompressed (141 TB compressed). We ran
imal operator based on the aggregated local gradients.
the parameter server on 1000 machines, each with 16
Fourth, we use a bounded-delay model over iterations and
physical cores, 192GB DRAM, and connected by 10 Gb
use a “KKT” filter to suppress transmission of parts of the
Ethernet. 800 machines acted as workers, and 200 were
generated gradient update that are small enough that their
parameter servers. The cluster was in concurrent use by
effect is likely to be negligible.6
other (unrelated) tasks during operation.
To the best of our knowledge, no open source system
can scale sparse logistic regression to the scale described
Algorithm: We used a state-of-the-art distributed re- in this paper.7 We compare the parameter server with two
gression algorithm (Algorithm 3, [31, 32]). It differs from special-purpose systems, named System A and B, devel-
the simpler variant described earlier in four ways: First,
6 A user-defined Karush-Kuhn-Tucker (KKT) filter [26]. Feature k is
only a block of parameters is updated in an iteration. Sec-
filtered if wk = 0 and |ĝk | ≤ ∆. Here ĝk is an estimate of the global
ond, the workers compute both gradients and the diagonal
gradient based on the worker’s local information and ∆ > 0 is a user-
part of the second derivative on this block. Third, the pa- defined parameter.
rameter servers themselves must perform complex com- 7 Graphlab provides only a multi-threaded, single machine imple-

Pn mentation, while Petuum, Mlbase and REEF do not support sparse lo-
5 Ω(w) = |wi | gistic regression. We confirmed this with the authors as per 4/2014.
i=1

11
oped by a large internet company. waiting time decreases when the allowed delay increases.
Notably, both Systems A and B consist of more than Workers are 50% idle when using the sequential consis-
10K lines of code. The parameter server only requires tency model (τ = 0), while the idle rate is reduced to
300 lines of code for the same functionality as System 1.7% when τ is set to be 16. However, the computing time
B.8 The parameter server successfully moves most of the increases nearly linearly with τ . Because the data incon-
system complexity from the algorithmic implementation sistency slows convergence, more iterations are needed to
into a reusable generalized component. achieve the same convergence criteria. As a result, τ = 8
is the best trade-off between algorithm convergence and
Results: We first compare these three systems by run- system performance.
ning them to reach the same objective value. A better
system achieves a lower objective in less time. Figure 9 5.2 Latent Dirichlet Allocation
shows the results: System B outperforms system A be-
cause it uses a better algorithm. The parameter server, in Problem and Data: To demonstrate the versatility of
turn, outperforms System B while using the same algo- our approach, we applied the same parameter server ar-
rithm. It does so because of the efficacy of reducing the chitecture to the problem of modeling user interests based
network traffic and the relaxed consistency model. upon which domains appear in the URLs they click on in
Figure 10 shows that the relaxed consistency model search results. We collected search log data containing 5
substantially increases worker node utilization. Workers billion unique user identifiers and evaluated the model for
can begin processing the next block without waiting for the 5 million most frequently clicked domains in the re-
the previous one to finish, hiding the delay otherwise im- sult set. We ran the algorithm using 800 workers and 200
posed by barrier synchronization. Workers in System A servers and 5000 workers and 1000 servers respectively.
are 32% idle, and in system B, they are 53% idle, while The machines had 10 physical cores, 128GB DRAM, and
waiting for the barrier in each block. The parameter server at least 10 Gb/s of network connectivity. We again shared
reduces this cost to under 2%. This is not entirely free: the cluster with production jobs running concurrently.
the parameter server uses slightly more CPU than System
B for two reasons. First, and less fundamentally, System Algorithm: We performed LDA using a combination
B optimizes its gradient calculations by careful data pre- of Stochastic Variational Methods [25], Collapsed Gibbs
processing. Second, asynchronous updates with the pa- sampling [20] and distributed gradient descent. Here, gra-
rameter server require more iterations to achieve the same dients are aggregated asynchronously as they arrive from
objective value. Due to the significantly reduced commu- workers, along the lines of [1].
nication cost, the parameter server halves the total time. We divided the parameters in the model into local
Next we evaluate the reduction of network traffic by and global parameters. The local parameters (i.e. auxil-
each system components. Figure 11 shows the results for iary metadata) are pertinent to a given user and they are
servers and workers. As can be seen, allowing the senders streamed the from disk whenever we access a given user.
and receivers to cache the keys can save near 50% traffic. The global parameters are shared among users and they
This is because both key (int64) and value (double) are represented as (key,value) pairs to be stored using the
are of the same size, and the key set is not changed during parameter server. User data is sharded over workers. Each
optimization. In addition, data compression is effective of them runs a set of computation threads to perform in-
for compressing the values for both servers (>20x) and ference over its assigned users. We synchronize asyn-
workers when applying the KKT filter (>6x). The reason chronously to send and receive local updates to the server
is twofold. First, the `1 regularizer encourages a sparse and receive new values of the global parameters.
model (w), so that most of values pulled from servers are To our knowledge, no other system (e.g., YahooLDA,
0. Second, the KKT filter forces a large portion of gra- Graphlab or Petuum) can handle this amount of data and
dients sending to servers to be 0. This can be seen more model complexity for LDA, using up to 10 billion (5
clearly in Figure 12, which shows that more than 93% million tokens and 2000 topics) shared parameters. The
unique features are filtered by the KKT filter. largest previously reported experiments [2] had under 100
Finally, we analyze the bounded delay consistency million users active at any time, less than 100,000 tokens
model. The time decomposition of workers to achieve and under 1000 topics (2% the data, 1% the parameters).
the same convergence criteria under different maximum
allowed delay (τ ) is shown in Figure 13. As expected, the
Results: To evaluate the quality of the inference algo-
8 System B was developed by an author of this paper. rithm we monitor how rapidly the training log-likelihood

12
100 100
relative network traffic (%) non−compressed 1.1x non−compressed

relative network traffic (%)


80
compressed 80
compressed

60 60
2x 2x 2x 1.9x 1.9x
2.5x
40 40

20 20
12.3x
40.8x 40.3x
0 0
baseline +caching keys +KKT filter baseline +caching keys +KKT filter

Figure 11: The savings of outgoing network traffic by different components. Left: per server. Right: per worker.

97.5 2
computing
97 waiting
1.5
time (hours)
96.5
filtered (%)

96 1

95.5
0.5
95

94.5 0
0 0.5 1 0 1 2 4 8 16
time (hours) maximal delays
Figure 12: Unique features (keys) filtered by the Figure 13: Time a worker spent to achieve the same
KKT filter as optimization proceeds. convergence criteria by different maximal delays.

(measuring goodness of fit) converges. As can be seen 5.3 Sketches


in Figure 14, we observe an approximately 4x speedup
in convergence when increasing the number of machines Problem and Data: We include sketches as part of our
from 1000 to 6000. The stragglers observed in Figure 14 evaluation as a test of generality, because they operate
(leftmost) also illustrate the importance of having an ar- very differently from machine learning algorithms. They
chitecture that can cope with performance variation across typically observe a large number of writes of events com-
workers. ing from a streaming data source [11, 5].
Topic name | Top urls
Programming | stackoverflow.com w3schools.com cplusplus.com github.com tutorialspoint.com jquery.com codeproject.com oracle.com qt-project.org bytes.com android.com mysql.com
Music | ultimate-guitar.com guitaretab.com 911tabs.com e-chords.com songsterr.com chordify.net musicnotes.com ukulele-tabs.com
Baby Related | babycenter.com whattoexpect.com babycentre.co.uk circleofmoms.com thebump.com parents.com momtastic.com parenting.com americanpregnancy.org kidshealth.org
Strength Training | bodybuilding.com muscleandfitness.com mensfitness.com menshealth.com t-nation.com livestrong.com muscleandstrength.com myfitnesspal.com elitefitness.com crossfit.com steroid.com gnc.com askmen.com

Table 4: Example topics learned using LDA over the .5 billion dataset. Each topic represents a user interest.

Figure 14: Left: Distribution over worker log-likelihoods as a function of time for 1000 machines and 5 billion users.
Some of the low values are due to stragglers synchronizing slowly initially. Middle: the same distribution, stratified
by the number of iterations. Right: convergence (time in 1000s) using 1000 and 6000 machines on 500M users.

5.3 Sketches

Problem and Data: We include sketches as part of our evaluation as a test of generality, because they operate very differently from machine learning algorithms. They typically observe a large number of writes of events coming from a streaming data source [11, 5].

We evaluate the time required to insert a streaming log of pageviews into an approximate structure that can efficiently track pageview counts for a large collection of web pages. We use the Wikipedia (and other Wiki projects) page view statistics as a benchmark. Each entry is a unique key of a webpage with the corresponding number of requests served in an hour. From 12/2007 to 1/2014, there are 300 billion entries for more than 100 million unique keys. We run the parameter server with 90 virtual server nodes on 15 machines of a research cluster [40] (each has 64 cores and is connected by 40Gb Ethernet).

Algorithm: Sketching algorithms efficiently store summaries of huge volumes of data so that approximate queries can be answered quickly. These algorithms are particularly important in streaming applications where data and queries arrive in real time. Some of the highest-volume applications include Cloudflare's DDoS-prevention service, which must analyze page requests across its entire content delivery service architecture to identify likely DDoS targets and attackers. The volume of data logged in such applications considerably exceeds the capacity of a single machine. While a conventional approach might be to shard the workload across a key-value cluster such as Redis, these systems typically do not allow the user-defined aggregation semantics needed to implement approximate aggregation.

Algorithm 4 gives a brief overview of the CountMin sketch [11]. By design, the result of a query is an upper bound on the number of times a key x has been observed. Splitting keys into ranges automatically allows us to parallelize the sketch. Unlike the two previous applications, the workers simply dispatch updates to the appropriate servers.

Algorithm 4 CountMin Sketch
Init: M[i, j] = 0 for i ∈ {1, . . . , k} and j ∈ {1, . . . , n}
Insert(x)
1: for i = 1 to k do
2:   M[i, hash(i, x)] ← M[i, hash(i, x)] + 1
Query(x)
1: return min {M[i, hash(i, x)] for 1 ≤ i ≤ k}
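A compact, runnable rendering of Algorithm 4 is given below; the hash function and the parameters k and n are illustrative choices, not those used in the experiments.

import zlib

class CountMinSketch:
    # Algorithm 4 as code: k rows of n counters, one hash function per row;
    # query(x) returns an upper bound on the true count of x.
    def __init__(self, k=4, n=1 << 16):
        self.k, self.n = k, n
        self.M = [[0] * n for _ in range(k)]
    def _hash(self, i, x):
        # Any row-indexed hash family works; a salted CRC32 keeps the
        # sketch dependency-free (illustrative choice only).
        return zlib.crc32(f"{i}:{x}".encode()) % self.n
    def insert(self, x, count=1):
        for i in range(self.k):
            self.M[i][self._hash(i, x)] += count
    def query(self, x):
        return min(self.M[i][self._hash(i, x)] for i in range(self.k))

# Because the counters are plain (key,value) pairs, each row can be
# range-partitioned over server nodes, and workers simply send their
# increments to whichever server owns the hashed key.
cms = CountMinSketch()
for url in ["en.wikipedia.org/wiki/Cat"] * 3 + ["en.wikipedia.org/wiki/Dog"]:
    cms.insert(url)
print(cms.query("en.wikipedia.org/wiki/Cat"))   # >= 3 (upper bound on the true count)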
Results: The system achieves very high insert rates, which are shown in Table 5. It performs well for two reasons: First, bulk communication reduces the communication cost. Second, message compression reduces the average (key,value) size to around 50 bits. Importantly, when we terminated a server node during the insertion, the parameter server was able to recover the failed node within 1 second, making our system well equipped for realtime use.

Peak inserts per second          1.3 billion
Average inserts per second       1.1 billion
Peak net bandwidth per machine   4.37 GBit/s
Time to recover a failed node    0.8 second

Table 5: Results of distributed CountMin.

6 Summary and Discussion

We described a parameter server framework to solve distributed machine learning problems. This framework is easy to use: globally shared parameters can be used as local sparse vectors or matrices to perform linear algebra operations with local training data. It is efficient: all communication is asynchronous, and flexible consistency models are supported to balance the trade-off between system efficiency and fast algorithm convergence. Furthermore, it provides elastic scalability and fault tolerance, aiming for stable long-term deployment. Finally, we show experiments for several challenging tasks on real datasets with billions of variables to demonstrate its efficiency. We believe that this third-generation parameter server is an important building block for scalable machine learning. The code is available at parameterserver.org.

Acknowledgments: This work was supported in part by gifts and/or machine time from Google, Amazon, Baidu, PRObE, and Microsoft; by NSF award 1409802; and by the Intel Science and Technology Center for Cloud Computing. We are grateful to our reviewers and colleagues for their comments on earlier versions of this paper.

References

[1] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. Scalable inference in latent variable models. In Proceedings of The 5th ACM International Conference on Web Search and Data Mining (WSDM), 2012.
[2] A. Ahmed, Y. Low, M. Aly, V. Josifovski, and A. J. Smola. Scalable inference of dynamic user interests for behavioural targeting. In Knowledge Discovery and Data Mining, 2011.
[3] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, second edition, 1995.
[4] Apache Foundation. Mahout project, 2012. http://mahout.apache.org.
[5] R. Berinde, G. Cormode, P. Indyk, and M.J. Strauss. Space-optimal heavy hitters with strong error bounds. In J. Paredaens and J. Su, editors, Proceedings of the Twenty-Eighth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS, pages 157–166. ACM, 2009.
[6] C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[7] D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, January 2003.
[8] J. Byers, J. Considine, and M. Mitzenmacher. Simple load balancing for distributed hash tables. In Peer-to-peer systems II, pages 80–87. Springer, 2003.
[9] K. Canini. Sibyl: A system for large scale supervised machine learning. Technical Talk, 2012.
[10] B.-G. Chun, T. Condie, C. Curino, C. Douglas, S. Matusevych, B. Myers, S. Narayanamurthy, R. Ramakrishnan, S. Rao, J. Rosen, R. Sears, and M. Weimer. Reef: Retainable evaluator execution framework. Proceedings of the VLDB Endowment, 6(12):1370–1373, 2013.
[11] G. Cormode and S. Muthukrishnan. Summarizing and mining skewed data streams. In SDM, 2005.
[12] W. Dai, J. Wei, X. Zheng, J. K. Kim, S. Lee, J. Yin, Q. Ho, and E. P. Xing. Petuum: A framework for iterative-convergent distributed ML. arXiv preprint arXiv:1312.7651, 2013.
[13] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng. Large scale distributed deep networks. In Neural Information Processing Systems, 2012.
[14] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. CACM, 51(1):107–113, 2008.
[15] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In T. C. Bressoud and M. F. Kaashoek, editors, Symposium on Operating Systems Principles, pages 205–220. ACM, 2007.
[16] J. J. Dongarra, J. Du Croz, S. Hammarling, and R. J. Hanson. An extended set of Fortran basic linear algebra subprograms. ACM Transactions on Mathematical Software, 14:18–32, 1988.
[17] The Apache Software Foundation. Apache Hadoop NextGen MapReduce (YARN). http://hadoop.apache.org/.
[18] The Apache Software Foundation. Apache Hadoop, 2009. http://hadoop.apache.org/core/.
[19] F. Girosi, M. Jones, and T. Poggio. Priors, stabilizers and basis functions: From regularization to radial, tensor and additive splines. A.I. Memo 1430, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1993.
[20] T.L. Griffiths and M. Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences, 101:5228–5235, 2004.
[21] S. H. Gunderson. Snappy: A fast compressor/decompressor. https://code.google.com/p/snappy/.
[22] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2 edition, 2009.
[23] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX conference on Networked systems design and implementation, pages 22–22, 2011.
[24] Q. Ho, J. Cipar, H. Cui, S. Lee, J. Kim, P. Gibbons, G. Gibson, G. Ganger, and E. Xing. More effective distributed ML via a stale synchronous parallel parameter server. In NIPS, 2013.
[25] M. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference. In International Conference on Machine Learning, 2012.
[26] W. Karush. Minima of functions of several variables with inequalities as side constraints. Master's thesis, Dept. of Mathematics, Univ. of Chicago, 1939.
[27] L. Kim. How many ads does Google serve in a day?, 2012. http://goo.gl/oIidXO.
[28] D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
[29] T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan. MLbase: A distributed machine-learning system. In CIDR, 2013.
[30] L. Lamport. Paxos made simple. ACM Sigact News, 32(4):18–25, 2001.
[31] M. Li, D. G. Andersen, and A. J. Smola. Distributed delayed proximal gradient methods. In NIPS Workshop on Optimization for Machine Learning, 2013.

[32] M. Li, D. G. Andersen, and A. J. Smola. Communication efficient distributed machine learning with the parameter server. In Neural Information Processing Systems, 2014.
[33] M. Li, L. Zhou, Z. Yang, A. Li, F. Xia, D.G. Andersen, and A. J. Smola. Parameter server for distributed machine learning. In Big Learning NIPS Workshop, 2013.
[34] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. Distributed GraphLab: A framework for machine learning and data mining in the cloud. In PVLDB, 2012.
[35] H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, and D. Golovin. Ad click prediction: a view from the trenches. In KDD, 2013.
[36] K. P. Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
[37] D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439–455. ACM, 2013.
[38] A. Phanishayee, D. G. Andersen, H. Pucha, A. Povzner, and W. Belluomini. Flex-KV: Enabling high-performance and flexible KV systems. In Proceedings of the 2012 workshop on Management of big data systems, pages 19–24. ACM, 2012.
[39] R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In R. H. Arpaci-Dusseau and B. Chen, editors, Operating Systems Design and Implementation, OSDI, pages 293–306. USENIX Association, 2010.
[40] PRObE Project. Parallel Reconfigurable Observational Environment. https://www.nmc-probe.org/wiki/Machines:Susitna.
[41] A. Rowstron and P. Druschel. Pastry: Scalable, decentralized object location and routing for large-scale peer-to-peer systems. In IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), pages 329–350, Heidelberg, Germany, November 2001.
[42] B. Schölkopf and A. J. Smola. Learning with Kernels. MIT Press, Cambridge, MA, 2002.
[43] A. J. Smola and S. Narayanamurthy. An architecture for parallel topic models. In Very Large Databases (VLDB), 2010.
[44] E. Sparks, A. Talwalkar, V. Smith, J. Kottalam, X. Pan, J. Gonzalez, M. J. Franklin, M. I. Jordan, and T. Kraska. MLI: An API for distributed machine learning. 2013.
[45] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. ACM SIGCOMM Computer Communication Review, 31(4):149–160, 2001.
[46] C.H. Teo, Q. Le, A. J. Smola, and S. V. N. Vishwanathan. A scalable modular convex solver for regularized risk minimization. In Proc. ACM Conf. Knowledge Discovery and Data Mining (KDD). ACM, 2007.
[47] R. van Renesse and F. B. Schneider. Chain replication for supporting high throughput and availability. In OSDI, volume 4, pages 91–104, 2004.
[48] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[49] R.C. Whaley, A. Petitet, and J.J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1–2):3–35, 2001.
[50] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. M. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Fast and interactive analytics over Hadoop data with Spark. USENIX ;login:, 37(4):45–51, August 2012.

