A New Platform For Distributed
A New Platform For Distributed
2, APRIL-JUNE 2015 49
Abstract—What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using
Big Models (up to 100 s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ
fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or
even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull
systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of
ML programs at scale. We propose a general-purpose framework, Petuum, that systematically addresses data- and model-parallel
challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant,
iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error
network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system
designs versus well-known implementations of modern ML algorithms, showing that Petuum allows ML programs to run in much less
time and at considerably larger model sizes, even on modestly-sized compute clusters.
Index Terms—Machine learning, big data, big model, distributed systems, theory, data-parallelism, model-parallelism
1 INTRODUCTION
Hadoop—whilst preserving the easy-to-use MapReduce considered such a wide spectrum of ML algorithms, which
programming interface. However, Spark ML implementa- exhibit diverse representation abstractions, model and data
tions are often still slower than specially-designed ML access patterns, and synchronization and scheduling
implementations, in part because Spark does not offer requirements. So what are the shared properties across such
flexible and fine-grained scheduling of computation and a “zoo of ML algorithms”? We believe that the key lies in
communication, which has been shown to be hugely advan- the recognition of a clear dichotomy between data (which is
tageous, if not outright necessary, for fast and correct execu- conditionally independent and persistent throughout the
tion of advanced ML algorithms [14]. Graph-centric algorithm) and model (which is internally coupled, and is
platforms such as GraphLab [13] and Pregel [15] efficiently transient before converging to an optimum). This inspires a
partition graph-based models with built-in scheduling and simple yet statistically-rooted bimodal approach to parallel-
consistency mechanisms, but due to limited theoretical ism: data parallel and model parallel distribution and execu-
work, it is unclear whether asynchronous graph-based con- tion of a big ML program over a cluster of machines. This
sistency models and scheduling will always yield correct data parallel, model parallel approach keenly exploits the
execution of ML programs. Other systems provide low-level unique statistical nature of ML algorithms, particularly the
programming interfaces [16], [17], that, while powerful and following three properties: (1) Error tolerance—iterative-
versatile, do not yet offer higher-level general-purpose convergent algorithms are often robust against limited
building blocks such as scheduling, model partitioning errors in intermediate calculations; (2) Dynamic structural
strategies, and managed communication that are key to sim- dependency—during execution, the changing correlation
plifying the adoption of a wide range of ML methods. In strengths between model parameters are critical to efficient
summary, existing systems supporting distributed ML each parallelization; (3) Non-uniform convergence—the number
manifest a unique tradeoff on efficiency, correctness, of steps required for a parameter to converge can be highly
programmability, and generality. skewed across parameters. The core goal of Petuum is to
In this paper, we explore the problem of building a distrib- execute these iterative updates in a manner that quickly
uted machine learning framework with a new angle toward converges to an optimum of the ML program’s objective
the efficiency, correctness, programmability, and generality function, by exploiting these three statistical properties of
tradeoff. We observe that, a hallmark of most (if not all) ML ML, which we argue are fundamental to efficient large-scale
programs is that they are defined by an explicit objective func- ML in cluster environments.
tion over data (e.g., likelihood, error-loss, graph cut), and the This design principle contrasts that of several existing
goal is to attain optimality of this function, in the space programmable frameworks discussed earlier. For example,
defined by the model parameters and other intermediate vari- central to the Spark framework [12] is the principle of per-
ables. Moreover, these algorithms all bear a common style, in fect fault tolerance and recovery, supported by a persistent
that they resort to an iterative-convergent procedure (see memory architecture (Resilient Distributed Datasets);
Eq. (1)). It is noteworthy that iterative-convergent computing whereas central to the GraphLab framework is the principle
tasks are vastly different from conventional programmatic of local and global consistency, supported by a vertex pro-
computing tasks (such as database queries and keyword gramming model (the Gather-Apply-Scatter abstraction).
extraction), which reach correct solutions only if every deter- While these design principles reflect important aspects of
ministic operation is correctly executed, and strong consis- correct ML algorithm execution—e.g., atomic recoverability
tency is guaranteed on the intermediate program state—thus, of each computing step (Spark), or consistency satisfaction
operational objectives such as fault tolerance and strong con- for all subsets of model variables (GraphLab)—some other
sistency are absolutely necessary. However, an ML program’s important aspects, such as the three statistical properties
true goal is fast, efficient convergence to an optimal solution, discussed above, or perhaps ones that could be more funda-
and we argue that fine-grained fault tolerance and strong con- mental and general, and which could open more room for
sistency are but one vehicle to achieve this goal, and might not efficient system designs, remain unexplored.
even be the most efficient one. To exploit these properties, Petuum introduces three
We present a new distributed ML framework, Petuum, novel system objectives grounded in the aforementioned
built on an ML-centric optimization-theoretic principle, key properties of ML programs, in order to accelerate their
as opposed to various operational objectives explored ear- convergence at scale: (1) Petuum synchronizes the parame-
lier. We begin by formalizing ML algorithms as iterative- ter states with bounded staleness guarantees, thereby
convergent programs, which encompass a large space of achieves provably correct outcomes due to the error-toler-
modern ML, such as stochastic gradient descent (SGD)[18] ant nature of ML, but at a much cheaper communication
and coordinate descent (CD) [10] for convex optimization cost than conventional per-iteration bulk synchronization;
problems, proximal methods [19] for more complex con- (2) Petuum offers dynamic scheduling policies that take
strained optimization, as well as MCMC [20] and varia- into account the changing structural dependencies between
tional inference [7] for inference in probabilistic models. To model parameters, so as to minimize parallelization error
our knowledge, no existing programmable2 platform has and synchronization costs; and (3) Since parameters in ML
programs exhibit non-uniform convergence costs (i.e., dif-
ferent numbers of updates required), Petuum prioritizes
2. Our discussion is focused on platforms which provide libraries computation towards non-converged model parameters, so
and tools for writing new ML algorithms. Because programmability is as to achieve faster convergence.
an important criteria for writing new ML algorithms, we exclude ML
software that does not allow new algorithms to be implemented on top To demonstrate this approach, we show how data-paral-
of them, such as AzureML and Mahout. lel and model-parallel algorithms can be implemented on
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 51
Fig. 2. The difference between data and model parallelism: data samples
are always conditionally independent given the model, but some model
parameters are not independent of each other.
Fig. 1. The scale of Big ML efforts in recent literature. A key goal of Pet-
uum is to enable larger ML models to be run on fewer resources, even
relative to highly-specialized implementations.
intermediate results to be aggregated with the current esti-
mate of A by F ðÞ to produce the new estimate of A. For sim-
Petuum, allowing them to scale to large data/model sizes plicity, in the rest of the paper we omit L in the subscript
with improved algorithm convergence times. The experi- with the understanding that all ML programs of our interest
ments section provides detailed benchmarks on a range of here bear an explicit loss function that can be used to moni-
ML programs: topic modeling, matrix factorization, deep tor the quality of convergence to a solution, as opposed to
learning, Lasso regression, and distance metric learning heuristics or procedures not associated such a loss function.
(DML). These algorithms are only a subset of the full open- Also for simplicity, we focus on iterative-convergent equa-
source Petuum ML library3—the PMLlib, which we will tions with an additive form:
briefly discuss in this paper. As illustrated in Fig. 1, Petuum
PMLlib covers a rich collection of advanced ML methods AðtÞ ¼ Aðt1Þ þ DðAðt1Þ ; DÞ; (2)
not usually seen in existing ML libraries; the Petuum plat-
form enables PMLlib to solve a range of ML problems at i.e., the aggregation F ðÞ is replaced with a simple addition.
large scales—scales that have only been previously The approaches we propose can also be applied to this gen-
attempted in a case-specific manner with corporate-scale eral F ðÞ.
efforts and resources—but using relatively modest clusters In large-scale ML, both data D and model A can be very
(10-100 machines) that are within reach of most ML large. Data-parallelism, in which data is divided across
practitioners. machines, is a common strategy for solving Big Data prob-
lems, while model-parallelism, which divides the ML model,
2 PRELIMINARIES: ON DATA PARALLELISM AND is common for Big Models. Both strategies are not exclusive,
MODEL PARALLELISM and can be combined to tackle challenging problems with
large data D and large model A. Hence, every Petuum ML
We begin with a principled formulation of iterative- program is either data-parallel, model-parallel, or data-and-
convergent ML programs, which exposes a dichotomy of model-parallel, depending on problem needs. Below, we
data and model, that inspires the parallel system architec- discuss the (different) mathematical implications of each
ture (Section 3), algorithm design (Section 4), and theoretical parallelism (see Fig. 2).
analysis (Section 6) of Petuum. Consider the following pro-
grammatic view of ML as iterative-convergent programs, 2.1 Data Parallelism
driven by an objective function. In data-parallel ML, the data D is partitioned and assigned to
Iterative-convergent ML algorithm: Given data D and a computational workers (indexed by p ¼ 1::P ); we denote
model objective function L (e.g., mean-squared loss, likeli- the pth data partition by Dp . The function DðÞ can be applied
hood, margin), a typical ML problem can be grounded as to each data partition independently, and the results com-
executing the following update equation iteratively, until bined additively, yielding a data-parallel equation (left
the model state (i.e., parameters and/or latent variables) A panel of Fig. 2):
reaches some stopping criteria:
PP
AðtÞ ¼ Aðt1Þ þ p¼1 DðAðt1Þ ; Dp Þ: (3)
ðtÞ ðt1Þ ðt1Þ
A ¼ F ðA ; DL ðA ; DÞÞ; (1)
This form is commonly seen in stochastic gradient descent or
where superscript ðtÞ denotes the iteration counter. The sampling-based algorithms. For example, in distance metric
update function DL ðÞ (which improves the loss L) performs learning optimized via stochastic gradient descent, the data
computation on data D and model state A, and outputs pairs are partitioned over different workers, and the interme-
diate results (subgradients) are computed on each partition,
3. Petuum is available as open source at http://petuum.org. before being summed and applied to the model parameters.
52 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015
in the Big Data setting, and we are not aware of any distrib-
uted implementations of DML.
The most popular version of DML tries to learn a Maha-
lanobis distance matrix M (symmetric and positive-semide-
finite), which can then be used to measure the distance
between two samples Dðx; yÞ ¼ ðx yÞT Mðx yÞ. Given a
jSj
set of “similar” sample pairs S ¼ fðxi ; yi Þgi¼1 , and a set of
jDj
“dissimilar” pairs D ¼ fðxi ; yi Þgi¼1 , DML learns the Mahala-
nobis distance by optimizing
X
minM ðx yÞT Mðx yÞ Fig. 8. Petuum DML data-parallel pseudocode.
ðx;yÞ2S
(5)
s:t: ðx yÞT Mðx yÞ 1; 8ðx; yÞ 2 D ensures high-throughput execution, via a bounded-asyn-
M 0; chronous consistency model (Stale Synchronous Parallel
(SSP)) that can provide workers with stale local copies of
where M 0 denotes that M is required to be positive semi- the parameters L, instead of forcing workers to wait for
definite. This optimization problem minimizes the Mahala- network communication. Later, we will review the strong
nobis distances between all pairs labeled as similar, while consistency and convergence guarantees provided by the
separating dissimilar pairs with a margin of 1. SSP model.
In its original form, this optimization problem is diffi- Since DML is a data-parallel algorithm, only the paral-
cult to parallelize due to the constraint set. To create a lel update push() needs to be implemented (Fig. 8). The
data-parallel optimization algorithm and implement it on scheduling function schedule() is empty (because
Petuum, we shall relax the constraints via slack variables every worker touches every model parameter L), and
(similar to SVMs). First, we replace M with LT L, and we do not need aggregation push() for this SGD algo-
introduce slack variables to relax the hard constraint in rithm. In our next example, we will show how sched-
Eq.(5), yielding ule() and push() can be used to implement model-
X X parallel execution.
minL kLðx yÞk2 þ x;y
ðx;yÞ2S ðx;yÞ2D (6) 4.2 Model-Parallel Lasso
2
s:t: kLðx yÞk 1 x;y ; x;y 0; 8ðx; yÞ 2 D: Lasso is a widely used model to select features in high-
dimensional problems, such as gene-disease association
Using hinge loss, the constraint in Eq.(6) can be eliminated, studies, or in online advertising via ‘1 -penalized regres-
yielding an unconstrained optimization problem: sion [27]. Lasso takes the form of an optimization problem:
X P
minL kLðx yÞk2 minb ‘ðX; y; b Þ þ j jbj j; (9)
ðx;yÞ2S
X (7)
þ maxð0; 1 kLðx yÞk2 Þ: where denotes a regularization parameter that deter-
ðx;yÞ2D mines the sparsity of b , and ‘ðÞ is a non-negative convex
loss function such as squared-loss or logistic-loss; we
Unlike the original constrained DML problem, this relaxa- assume that X and y are standardized and consider (9)
tion is fully data-parallel, because it now treats the dis- without an intercept. For simplicity but without loss of
similar pairs as iid data to the loss function (just like the bk22 ; other loss func-
generality, we let ‘ðX; y; b Þ ¼ 12 ky Xb
similar pairs); hence, it can be solved via data-parallel tions (e.g., logistic) are straightforward and can be solved
Stochastic Gradient Descent. SGD can be naturally paral- using the same approach [10]. We shall solve this via a
lelized over data, and we partition the data pairs onto P coordinate descent model-parallel approach, similar but
machines. Every iteration, each machine p randomly sam- not identical to [10], [22].
ples a minibatch of similar pairs S p and dissimilar pairs The simplest parallel CD Lasso , shotgun [10], selects a
Dp from its data shard, and computes the following random subset of parameters to be updated in parallel. We
update to L: now present a scheduled model-parallel Lasso that
X improves upon shotgun: the Petuum scheduler chooses
4Lp ¼ 2Lðx yÞðx yÞT parameters that are nearly independent with each other,4
ðx;yÞ2S p
X (8) thus guaranteeing convergence of the Lasso objective. In
2Lða bÞða bÞT IðkLða bÞk2 1Þ; addition, it prioritizes these parameters based on their dis-
ða;bÞ2Dp tance to convergence, thus speeding up optimization.
Why is it important to choose independent parameters
where IðÞ is the indicator function. via scheduling? Parameter dependencies affect the CD
Fig. 8 shows pseudocode for Petuum DML, which is
simple to implement because the parameter server system
4. In the context of Lasso, this means the data columns xj corre-
PS abstracts away complex networking code under a sim- sponding to the chosen parameters j have very small pair-wise dot
ple get()/read() API. Moreover, the PS automatically product, below a threshold t.
56 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015
bþVwij ;k
P ðzij ¼ k j U; V Þ / PK
aþUik
þ PM ; (12)
Fig. 9. Petuum Lasso model-parallel pseudocode. Kaþ Ui‘ Mbþ V
‘¼1 m¼1 mk
update equation in the following manner: by taking the gra- where zij are topic assignments to each word “token” wij in
dient of (9), we obtain the CD update for bj : document i. The document word tokens wij , topic assign-
ðtÞ P ðt1Þ
ments zij and doc-topic table rows Ui are partitioned across
dij xij yi k6¼j xij xik bk ; (10) worker machines and kept fixed, as is common practice
P with Big Data. Although there are multiple parameters, the
ðtÞ N ðtÞ
bj S d
i¼1 ij ; ; (11) only one that is read and updated by all parallel worker
(and hence needs to be globally-stored) is the word-topic
where Sð; Þ is a soft-thresholding operator, defined by table V .
Sðbj ; Þ signðbÞðjbj Þ. In (11), if xTj xk 6¼ 0 (i.e., nonzero We adopt a model-parallel approach to LDA, and use a
ðt1Þ ðt1Þ
correlation) and bj 6¼ 0 and bk 6¼ 0, then a coupling schedule() (Algorithm 10) that cycles rows of the word-
effect is created between the two features bj and bk . Hence, topic table (rows correspond to different words, e.g.,
“machine” or “learning”) across machines, to be updated
they are no longer conditionally independent given the
via push() and pull(); data is kept fixed at each machine.
data: bj 6? bk jX; y. If the jth and the kth coefficients are
This schedule() ensures that no two workers will ever
updated concurrently, parallelization error may occur,
touch the same rows of V in the same iteration,5 unlike pre-
causing the Lasso problem to converge slowly (or even
vious LDA implementations [28] which allow workers to
diverge outright).
update any parameter, resulting in dependency violations.
Petuum’s schedule(), push() and pull() interface
Note that the function LDAGibbsSample() in push()
is readily suited to implementing scheduled model-parallel
can be replaced with any recent state-of-the art Gibbs
Lasso. We use schedule() to choose parameters with low
dependency, and to prioritize non-converged parameters.
Petuum pipelines schedule() and push(); thus sched- 5. Petuum LDA’s “cyclic” schedule differs from the model streaming
in [3]; the latter has workers touch the same set of parameters, one set at
ule() does not slow down workers running push(). Fur- a time. Model streaming can easily be implemented in Petuum, by
thermore, by separating the scheduling code schedule() changing schedule() to output the same word range for every jp .
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 57
TABLE 1
Petuum ML Library (PMLlib): ML Applications and Achievable Problem Scale for a Given Cluster Size
Petuum’s goal is to solve large model and data problems using medium-sized clusters with only 10 s of machines (100-1,000 cores, 1 TB+ memory). Running time
varies between 10 s of minutes to several days, depending on the application.
4.5.2 Logstic Regression (LR) and Support Vector all Petuum PMLlib applications (and new user-written
Machines (SVM) applications) can be run in stand-alone mode or deployed
Petuum implements LR and SVM using the same depen- as YARN jobs to be scheduled alongside other MapReduce
dency-checking, prioritized model-parallel strategy as the jobs, and PMLlib applications can also read/write input
Lasso Algorithm 9. Dependency checking and prioritization data and output results from both the local machine filesys-
are not easily implemented on MapReduce and Spark. tem as well as HDFS. More specifically, Petuum provides a
While GraphLab has support for these features; the key dif- YARN launcher that will deploy any Petuum application
ference with Petuum is that Petuum’s scheduler performs (including user-written ones) onto a Hadoop cluster; the
dependency checking on small subsets of parameters at a YARN launcher will automatically restart failed Petuum
time, whereas GraphLab performs graph partitioning on all jobs and ensure that they always complete. Petuum also
parameters at once (which can be costly). provides a data access library with C++ iostreams (or Java
file streams) for HDFS access, which allows users to write
4.5.3 Maximum Entropy Discrimination LDA (MedLDA) generic file stream code that works on both HDFS files the
local filesystem. The data access library also provides pre-
MedLDA [32] is a constrained variant of LDA, that uses side implemented routines to load common data formats, such
information to constrain and improve the recovered topics. as CSV, libSVM, and sparse matrix.
Petuum implements MedLDA using a data-parallel strategy While Petuum PMLlib applications are written in C++
schedule_fix(), where each worker uses push() to alter- for maximum performance, new Petuum applications can
nate between Gibbs sampling (like regular LDA) and solv- be written in either Java or C++; Java has the advantages of
ing for Lagrange multiplers associated with the constraints. easier deployment and a wider user base.
5 BIG DATA ECOSYSTEM SUPPORT 6 PRINCIPLES AND THEORY
To support ML at scale in production, academic, or cloud- Our iterative-convergent formulation of ML programs, and
compute clusters, Petuum provides a ready-to-run ML the explicit notion of data and model parallelism, make it
library, called PMLlib; Table 1 shows the current list of ML convenient to explore three key properties of ML pro-
applications, with more applications are actively being grams—error-tolerant convergence, non-uniform conver-
developed for future releases. Petuum also integrates with gence, dependency structures (Fig. 12)—and to analyze
Hadoop ecosystem modules, thus reducing the engineering how Petuum exploits these properties in a theoretically-
effort required to deploy Petuum in academic and real-
world settings.
Many industrial and academic clusters run Hadoop,
which provides, in addition to the MapReduce program-
ming interface, a job scheduler that allows multiple pro-
grams to run on the same cluster (YARN) and a distributed
filesystem for storing Big Data (HDFS). However, programs
that are written for stand-alone clusters are not compatible
with YARN/HDFS, and vice versa, applications written for
YARN/HDFS are not compatible with stand alone clusters.
Fig. 12. Key properties of ML algorithms: (a) Non-uniform convergence;
Petuum solves this issue by providing common libraries (b) Error-tolerant convergence; (c) Dependency structures amongst
that work on either Hadoop or non-Hadoop clusters. Hence, variables.
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 59
sound manner to speed up ML program completion at Big (2) keeping staleness (i.e., asynchrony) as low as possible
Learning scales. improves per-iteration convergence—specifically, the
Some of these properties have previously been success- bound becomes tighter with lower maximum staleness s,
fully exploited by a number of bespoke, large-scale imple- and lower average mg and variance s g of the staleness expe-
mentations of popular ML algorithms: e.g., topic models [3], rienced by the workers. Conversely, naive asynchronous
[17], matrix factorization [33], [34], and deep learning [1]. It systems (e.g., Hogwild! [35] and YahooLDA [28]) may expe-
is notable that MapReduce-style systems (such as rience poor convergence, particularly in production envi-
Hadoop [11] and Spark [12]) often do not fare competitively ronments where machines may slow down due to other
against these custom-built ML implementations, and one of tasks or users. Such slowdown can cause the maximum
the reasons is that these key ML properties are difficult to staleness s and staleness variance s g to become arbitrarily
exploit under a MapReduce-like abstraction. Other abstrac- large, leading to poor convergence rates. In addition to the
tions may offer a limited degree of opportunity—for exam- above theorem (which bounds the distribution of x), Dai
ple, vertex programming [13] permits graph dependencies et al. also showed that the variance of x can be bounded,
to influence model-parallel execution. ensuring reliability and stability near an optimum [14].
This theorem has two implications: (1) learning under the In the regression setting, fðwÞ represents a least-squares
SSP model is correct (like Bulk Synchronous Parallel (BSP) loss, rðwÞ represents a separable regularizer (e.g., ‘1 pen-
learning), because R½X
T —which is the difference between the alty), and xi represents the ith feature column of the design
SSP parameter estimate and the true optimum—converges (data) matrix, each element in xi is a separate data sample.
to OðT 1=2 Þ in probability with an exponential tail-bound; In particular, jx>
i xj j is the correlation between the ith and
60 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015
jth feature columns. The parameters w are simply the ðtÞ ðtÞ 2dPm
E½jwideal wRRP j
L2 X> XC; (20)
regression coefficients. ðt þ 1Þ2 P^
In the context of the model-parallel equation (4), we can for constants C; m; L; P^. The proof for this theorem can be
map the model A ¼ w, the data D ¼ X, and the update found in the Appendix.
equation DðA; Sp ðAÞÞ to
This theorem says that the difference between the SRRP ðÞ
b 1 parameter estimate wRRP and the ideal oracle estimate
wþ arg min ½z ðwjp gjp Þ
2 þ rðzÞ; (18)
jp z2R 2 b wideal rapidly vanishes, at a fast 1=ðt þ 1Þ2 rate. In other
words, one cannot do much better than SRRP ðÞ schedul-
where SpðtÞ ðAÞ has selected a single coordinate jp to be ing—it is near-optimal.
updated by worker p—thus, P coordinates are updated in We close this section by noting that SRRP ðÞ is different
every iteration. The aggregation function F ðÞ simply allows from Scherrer et al. [22], who pre-cluster all M features
each update wjp to pass through without change. before starting coordinate descent, in order to find “blocks”
The effectiveness of parallel coordinate descent depends of nearly-independent parameters. In the Big Data and
on how the schedule SpðtÞ ðÞ selects the coordinates jp . In par- especially Big Model setting, feature clustering can be pro-
ticular, naive random selection can lead to poor conver- hibitive—fundamentally, it requires OðM 2 Þ evaluations of
gence rate or even divergence, with error proportional to jx> 2
i xj j for all M feature combinations ði; jÞ, and although
the correlation jx> ja xjb j between the randomly-selected coor- greedy clustering algorithms can mitigate this to some
dinates ja ; jb [10], [22]. An effective and cheaply-computable extent, feature clustering is still impractical when M is very
ðtÞ
schedule SRRP;p ðÞ involves randomly proposing a small set large, as seen in some regression problems [27]. The pro-
of Q > P features fj1 ; . . . ; jQ g, and then finding P features posed SRRP ðÞ only needs to evaluate a small number of
in this set such that jx> jx>
i xj j every iteration. Furthermore, the random selection in
ja xjb j u for some threshold u, where
ja ; jb are any two features in the set of P . This requires at SRRP ðÞ can be replaced with prioritization to exploit non-uni-
form convergence in ML problems, as explained next.
most OðB2 Þ evaluations of jx> ja xjb j u (if we cannot find P
features that meet the criteria, we simply reduce the 6.3 Non-Uniform Convergence
degree of parallelism). We have the following convergence In model-parallel ML programs, it has been empirically
theorem: observed that some parameters Aj can converge in much
2
Theorem 2 (SRRP ðÞ convergence). Let :¼ dðE½P
=E½P d2
1
Þðr1Þ
fewer/more updates than other parameters [21]. For
ðE½P 1
Þðr1Þ instance, this happens in Lasso regression because the
d < 1, where r is a constant that depends on the
model enforces sparsity, so most parameters remain at zero
input data x and the scheduler SRRP ðÞ. After t steps, we have throughout the algorithm, with low probability of becoming
Cdb 1 non-zero again. Prioritizing Lasso parameters according to
$
E½F ðwðtÞ Þ F ðw Þ
; (19) their magnitude greatly improves convergence per itera-
E½P ð1 Þ
t
tion, by avoiding frequent (and wasteful) updates to zero
$
where F ðwÞ :¼ fðwÞ þ rðwÞ and w is a minimizer of F . parameters [21].
E½P
is the average degree of parallelization over all itera- We call this non-uniform ML convergence, which can be
tions—we say “average” to account for situations where the exploited via a dynamic scheduling function SpðtÞ ðAðtÞ Þ
scheduler cannot select P nearly-independent parameters (due whose output changes according to the iteration t—for
to high correlation in the data). The proof for this theorem can instance, we can write a scheduler Smag ðÞ that proposes
be found in the Appendix. For most real-world data sets, this is parameters with probability proportional to their current
ðtÞ
not a problem, and E½P
is equal to the number of workers. magnitude ðAj Þ2 . This Smag ðÞ can be combined with the
earlier dependency structure checking, leading to a depen-
This theorem states that SRRP ðÞ-scheduling (which is dency-aware, prioritizing scheduler. Unlike the dependency
used by Petuum Lasso) achieves close to P -fold (linear) structure issue, prioritization has not received as much
improvement in per-iteration convergence (where P is the attention in the ML literature, though it has been used to
number of workers). This comes from the 1=E½P ð1 Þ
fac- speed up the PageRank algorithm, which is iterative-
tor on the RHS of Eq. (19); for input data x that is sparse and convergent [37].
high-dimensional, the SRRP ðÞ scheduler will cause r 1 to The prioritizing schedule Smag ðÞ can be analyzed in the
become close to zero, and therefore will also be close to context of the Lasso problem. First, we rewrite it by dupli-
zero—thus the per-iteration convergence rate is improved cating the original J features with opposite sign, as in [10]:
by nearly P -fold. We contrast this against a naive system P
F ðb bk22 þ 2J
bÞ :¼ minb 12 ky Xb j¼1 bj : Here, X contains 2J
that selects coordinates at random; such a system will have
features and bj 0, for all j ¼ 1; . . . ; 2J.
far larger r 1, thus degrading per-iteration convergence.
In addition to asymptotic convergence, we show that Theorem 4 (Adapted from [21] Optimality of Lasso prior-
SRRP ’s trajectory is close to ideal parallel execution: ity scheduler). Suppose B is the set of indices of coefficients
Theorem 3 (SRRP ðÞ is close to ideal execution). Let Sideal ðÞ be updated in parallel at the tth iteration, and r is sufficiently
ðtÞ ðtÞ
an oracle schedule that always proposes P random features small constant such that rdbj dbk 0, for all j 6¼ k 2 B.
ðtÞ ðtÞ
with zero correlation. Let wideal be its parameter trajectory, Then, the sampling distribution pðjÞ / ðdbj Þ2 approximately
ðtÞ
and let wRRP be the parameter trajectory of SRRP ðÞ. Then, bðtÞ Þ F ðb
maximizes a lower bound on EB ½F ðb bðtÞ þ db
bðtÞ Þ
.
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 61
7 PERFORMANCE
Petuum’s ML-centric system design supports a variety of
ML programs, and improves their performance on Big Data
in the following senses: (1) Petuum ML implementations
achieve significantly faster convergence rate than well-opti-
mized single-machine baselines (i.e., DML implemented on
single machine, and Shotgun [10]); (2) Petuum ML imple-
mentations can run faster than other programmable plat-
forms (e.g., Spark, GraphLab7), because Petuum can exploit
model dependencies, uneven convergence and error toler-
ance; (3) Petuum ML implementations can reach larger
model sizes than other programmable platforms, because
Petuum stores ML program variables in a lightweight fash-
ion (on the parameter server and scheduler); (4) for ML
programs without distributed implementations, we can
implement them on Petuum and show good scaling with an
increasing number of machines. We emphasize that Petuum Fig. 13. Performance increase in ML applications due to the Petuum
is, for the moment, primarily about allowing ML practi- Parameter Server (PS) and Scheduler. The Eager Stale Synchro-
tioners to implement and experiment with new data/ nous Parallel consistency model (on the PS) improves the number
of iterations executed per second (throughput) while maintaining
model-parallel ML algorithms on small-to-medium clusters. per-iteration quality. Prioritized, dependency-aware scheduling
Our experiments are therefore focused on clusters with 10- allows the Scheduler to improve the quality of each iteration, while
100 machines, in accordance with our target users. maintaining iteration throughput. In both cases, overall real-time con-
vergence rate is improved—30 percent improvement for the PS
Matrix Factorization example, and several orders of magnitude for
7.1 Hardware Configuration the Scheduler Lasso example.
To demonstrate that Petuum is adaptable to different hard-
ware generations, our experiments used three clusters with
varying specifications: Cluster-1 has up to 128 machines using PLMlib’s Matrix Factorization with the schedule
with two AMD cores, 8 GB RAM, 1 Gbps Ethernet; Cluster-2 () function disabled (in order to remove the beneficial
has up to 16 machines with 64 AMD cores, 128 GB RAM, effect of scheduling, so we may focus on the PS). This
40 Gbps Infiniband; Cluster-3 has up to 64 machines with experiment was conducted using 64 Cluster-3 machines
16 Intel cores, 128 GB RAM, 10 Gbps Ethernet. on a 332 GB sparse matrix (7.7 m by 288 k entries, 26 b
nonzeros, created by duplicating the Netflix dataset 16
7.2 Parameter Server and Scheduler Performance times horizontally and 16 times vertically). We compare
Petuum’s Parameter Server (PS) and Scheduler speed up the performance of MF running under Petuum PS’s
existing ML algorithms by improving iteration throughput and Eager SSP mode (using staleness s ¼ 2; higher staleness
iteration quality respectively. We measure iteration throughput values did not produce additional benefit), versus run-
as “iterations executed per second”, and we quantify iteration ning under MapReduce-style Bulk Synchronous Parallel
quality by plotting the ML objective function L against itera- mode. Fig. 13 shows that ESSP provides a 30 percent
tion number—“objective progress per iteration”. In either improvement to iteration throughput (top left), without a
case, the goal is to improve the ML algorithm’s real-time con- significantly affecting iteration quality (top right). Conse-
vergence rate, quantified by plotting the objective function L quently, the MF application converges 30 percent faster
against real time (“objective progress per second”). in real time (middle left).
The iteration throughput improvement occurs because
7.2.1 Parameter Server ESSP allows both gradient computation and inter-worker
We consider how the PS improves iteration throughput communication to occur at the same time, whereas classic
(through the Eager SSP consistency model), evaluated BSP execution requires computation and communication to
alternate in a mutually exclusive manner. Because the maxi-
6. Without this approximation, pipelining is impossible because mum staleness s ¼ 2 is small, and because ESSP eagerly
ðtÞ ðtÞ
dbj is unavailable until all computations on bj have finished. pushes parameter updates as soon as they are available,
7. We omit Hadoop and Mahout, as it is already well-established there is almost no penalty to iteration quality despite allow-
that Spark and GraphLab significantly outperform it [12], [13]. ing staleness.
62 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015
Fig. 15. Left: LDA convergence time: Petuum versus YahooLDA (lower
is better). Petuum’s data-and-model-parallel LDA converges faster than
YahooLDA’s data-parallel-only implementation, and scales to more LDA
Fig. 14. Left: Petuum relative speedup versus popular platforms (larger parameters (larger vocab size, number of topics). Right panels: Matrix
is better). Across ML programs, Petuum is at least 2-10 times faster Factorization convergence time: Petuum vs GraphLab vs Spark. Petuum
than popular implementations. Right: Petuum allows single-machine is fastest and the most memory-efficient, and is the only platform that
algorithms to be transformed into cluster versions, while still achieving could handle Big MF models with rank K 1;000 on the given hardware
near-linear speedup with increasing number of machines (Caffe CNN budget.
and DML).
7.2.2 Scheduler it performs twice as fast as Spark, while GraphLab ran out
We examine how the Scheduler improves iteration quality of memory (due to the need to construct an explicit graph
(through a well-designed schedule() function), evaluated representation, which consumes significant memory). On
using PMLlib’s Lasso application. This experiment was con- the other hand, Petuum LDA is nearly six times faster than
ducted using 16 Cluster-2 on a simulated 150 GB sparse data- YahooLDA; the speedup mostly comes from the Petuum
set (50 m features); adjacent features in the dataset are highly LDA schedule() (Fig. 10), which performs correct model-
correlated in order to simulate the effects of realistic feature parallel execution by only allowing each worker to operate
engineering. We compare the original PMLlib Lasso (whose on disjoint parts of the vocabulary. This is similar to Graph-
schedule() performs prioritization and dependency check- Lab’s implementation, but is far more memory-efficient
ing) to a simpler version whose schedule() selects parame- because Petuum does not need to construct a full graph
ters at random (the shotgun algorithm [10]). Fig. 13 shows representation of the problem.
that PMLlib Lasso’s schedule() slightly decreases iteration
7.3.2 Model Size
throughput (middle right), but greatly improves iteration
quality (bottom left), resulting in several orders of magnitude We show that Petuum supports larger ML models for the
improvement to real-time convergence (bottom right). same amount of cluster memory. Fig. 15 shows ML program
The iteration quality improvement is mostly due to prior- running time versus model size, given a fixed number of
itization; we note that without prioritization, 85 percent of machines—the left panel compares Petuum LDA and
the parameters would converge within five iterations, but YahooLDA; PetuumLDA converges faster and supports
the remaining 15 percent would take over 100 iterations. LDA models that are > 10 times larger,8 allowing long-tail
Moreover, prioritization alone is not enough to achieve fast topics to be captured. The right panels compare Petuum MF
convergence speed—when we repeated the experiment versus Spark and GraphLab; again Petuum is faster and
with a prioritization-only schedule() (not shown), the supports much larger MF models (higher rank) than either
parameters became unstable, which caused the objective baseline. Petuum’s model scalability comes from two fac-
function to diverged. This is because dependency checking tors: (1) model-parallelism, which divides the model across
is necessary to avoid correlation effects in Lasso (discussed machines; (2) a lightweight parameter server system with
in the proof to Theorem 2), which we observed were greatly minimal storage overhead. In contrast, Spark and GraphLab
amplified under the prioritization-only schedule(). have additional overheads that may not be necessary in an
ML context—Spark needs to construct a “lineage graph” in
order to preserve its strict fault recovery guarantees, while
7.3 Comparison to Programmable Platforms
GraphLab needs to represent the ML problem in graph
Fig. 14(left) compares Petuum to popular platforms for writ-
form. Because ML applications are error-tolerant, fault
ing new ML programs (Spark v1.2 and GraphLab), as well as
recovery can be performed with lower overhead through
a well-known cluster implementation of LDA (YahooLDA).
periodic checkpointing.
We compared Petuum to Spark, GraphLab and YahooLDA
on two applications: LDA and MF. We ran LDA on 128 Clus- 7.4 Fast Cluster Implementations of New ML
ter-1 machines, using 3.9 m English Wikipedia abstracts with Programs
unigram (V ¼ 2:5 m) and bigram (V ¼ 21:8 m) vocabularies; Petuum facilitates the development of new ML programs
the bigram vocabulary is an example of feature engineering without existing cluster implementations; here we present
to improve performance at the cost of additional computa- two case studies. The first is a cluster version of the open-
tion. The MF comparison was performed on 10 Cluster-2 source Caffe CNN toolkit, created by adding 600 lines of
machines using the original Netflix dataset. Petuum code. The basic data-parallel strategy in Caffe was
left unchanged, so the Petuum port directly tests Petuum’s
7.3.1 Speed efficiency. We tested on four Cluster-3 machines, using a
For MF and LDA, Petuum is between 2-6 times faster than 250k subset of Imagenet with 200 classes, and 1.3 m model
other platforms (Figs. 14, 15). For MF, Petuum uses the
same model-parallel approach as Spark and GraphLab, but 8. LDA model size is equal to vocab size times number of topics.
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 63
APPENDIX
PROOF OF THEOREM 2
We prove that the Petuum SRRP ðÞ scheduler makes the Reg-
Fig. 16. Petuum DML objective versus time convergence curves, from
one to four machines. ularized Regression Problem converge. We note that SRRP ðÞ
has the following properties: (1) the scheduler uniformly ran-
parameters. Compared to the original single-machine domly selects Q out of d coordinates (where d is the number
Caffe (which does not have the overhead of network com- of features); (2) the scheduler performs dependency checking
munication), Petuum approaches linear speedup (3:1-times and retains P out of Q coordinates; (3) in parallel, each of
speedup on 4 machines, Fig. 14 (right plot)) due to the the P workers is assigned one coordinate, and performs
parameter server’s ESSP consistency for managing network coordinate descent on it:
communication.
2
Second, we compare the Petuum DML program against b 1
wþ arg min z wjp gjp þrðzÞ; (21)
the original DML algorithm proposed in [25] (denoted by
jp z2R 2 b
Xing2002), implemented using SGD on a single machine
(with parallelization over matrix operations). The intent is to where gj ¼ rj fðwÞ is the j-th partial derivative, and the
show that, even for ML algorithms that have received less coordinate jp is assigned to the p-th worker. Note that (21) is
research attention towards scalability (such as DML), one simply the gradient update: w w b1 g, followed by apply-
can still implement a reasonably simple data-parallel SGD ing the proximity operator of r.
algorithm on Petuum, and enjoy the benefits of paralleliza- As we just noted, SRRP ðÞ scheduling selects P coordi-
tion over a cluster. The DML experiment was run on four nates out of Q by performing dependency checking: effec-
Cluster-2 machines, using the 1-million-sample Imagenet tively, the scheduler will put coordinates i and j into the
[38] dataset with 1,000 classes (21.5k-by-21.5k matrix with same “block” iff jx>i xj j u for some “correlation threshold”
220 m model parameters), and 200 m similar/dissimilar u 2 ð0; 1Þ. The idea is that coordinates in the same block will
statements. The Petuum DML implementation converges 3.8 never be updated in parallel; the algorithm must choose the
times faster than Xing2002 on four machines (Fig. 14, right P coordinates from P distinct blocks. In order to analyze
plot). We also evaluated Petuum DML’s convergence speed the effectiveness of this procedure, we will consider the fol-
on 1-4 machines (Fig. 16)—compared to using 1 machine, lowing matrix:
Petuum DML achieves 3.8 times speedup with four
>
machines and 1.9 times speedup with two machines. xi xj ; if jx> i xj j u :
8i; Aii ¼ 1; 8i 6¼ j; Aij ¼ (22)
0; otherwise
8 SUMMARY AND FUTURE WORK This matrix A captures the impact of grouping coordinates
Petuum provides ML practitioners with an ML library and into blocks, and its spectral radius r ¼ rðAÞ will be used to
ML programming platform, capable of handling Big Data show that scheduling entails a nearly P -fold improvement
and Big ML Models with performance that is competitive in convergence with P processors. A simple bound for the
with specialized implementations, while running on reason- spectral radius rðAÞ is:
able cluster sizes (10-100 machines). This is made possible X
by systematically exploiting the unique properties of itera- jr 1j jAij j ðd 1Þu: (23)
tive-convergent ML algorithms—error tolerance, depen- j6¼i
number of coordinates selected for parallel update by the the variance E½P 2
, the faster the algorithm converges (since
scheduler, is a random variable (because it may not always is proportional to it).
be possible to select P independent coordinates). Our analy-
Remark. We compare Theorem 2 with Shotgun [10] and the
sis therefore considers the expected value E½P
. We are now
Block greedy algorithm in [22]. The convergence rate we
ready to prove Theorem 2:
2
=E½P 1
Þðr1Þ
get is similar to shotgun, but with a significant difference:
Theorem 2. Let :¼ dðE½P N ðE½P 1
Þðr1Þ
d < 1, then Our spectral radius r ¼ rðAÞ is potentially much smaller
after t steps, we have than shotgun’s rðX> XÞ, since by partitioning we zero
out all entries in the correlation matrix X> X that are big-
$ Cdb 1
E½F ðwðtÞ Þ F ðw Þ
; (24) ger than the threshold u. In other words, we get to control
E½P ð1 Þ
t the spectral radius while shotgun is totally passive.
CB 1
$
where F ðwÞ :¼ fðwÞ þ rðwÞ and w denotes a (global) mini- The convergence rate in [22] is P ð1 0 Þ t , where
0
mizer of F (whose existence is assumed for simplicity). 0 ¼ ðP 1Þðr
B1
1Þ
. Compared with ours, we have a bigger
(hence worse) numerator (d versus B) but the denomina-
Proof of Theorem 2. We first bound the algorithm’s prog-
tor (0 versus ) are not directly comparable: we have a
ress at step t. To avoid cumbersome double indices, let
bigger spectral radius r and bigger d while [22] has a
w ¼ wt and z ¼ wtþ1 . Then, by applying (17), we have
smaller spectral radius r0 (essentially taking a submatrix
E½F ðzÞ F ðwÞ
of our A) and smaller B 1. Nevertheless, we note that
X P [22] may have a higher per-step complexity: each worker
E gjp ðwþ þ
jp wjp Þ þ rðwjp Þ rðwjp Þ
needs to check all of its assigned t coordinates just to
p¼1 update one “optimal” coordinate. In contrast, we simply
E½P
> þ þ b þ 2
¼ g ðw wÞ þ rðw Þ rðwÞ þ kw wk2 PROOF OF THEOREM 3
d 2
bE½P ðP 1Þ
þ For the Regularized Regression Problem, we prove that the
þ ðw wÞ> ðA IÞðwþ wÞ
2N Petuum SRRP ðÞ scheduler produces a solution trajectory
bE½P
bE½P ðP 1Þ
ðr 1Þ þ ðtÞ
kwþ wk22 þ kw wk22 wRRP that is close to ideal execution:
2d 2N
bE½P ð1 Þ
Theorem 3. (SRRP ðÞ is close to ideal execution). Let Sideal ðÞ
kwþ wk22 ; be an oracle schedule that always proposes P random features
2d
2
ðtÞ
with zero correlation. Let wideal be its parameter trajectory,
where we define ¼ dðE½P
=E½PN
1
Þðr1Þ
, and the second ðtÞ
inequality follows from the optimality of wþ as defined and let wRRP be the parameter trajectory of SRRP ðÞ. Then,
in (21). Therefore as long as < 1, the algorithm is
ðtÞ ðtÞ 2JPm
decreasing the objective. This in turn puts a limit on the E½jwideal wRRP j
L2 XT XC; (27)
maximum number of parallel workers, which is inversely ðT þ 1Þ2 P^
proportional to the spectral radius r.
The rest of the proof follows the same line as the C is a data dependent constant, m is the strong convexity con-
shotgun paper [10]. Briefly, consider the case where stant, L is the domain width of Aj , and P^ is the expected num-
0 2 @rðwt Þ, then ber of indexes that SRRP ðÞ can actually parallelize in each
iteration (since it may not be possible to find P nearly-indepen-
F ðwtþ1 Þ F ðw Þ ðwtþ1 w Þ> g kwtþ1 w k2 kgk2 ;
$ $ $
dent parameters).
Proof of Theorem 3. By using Lemma 1, and telescoping Finally, we apply the strong convexity assumption
sum: to get
ðT Þ ð0Þ
F ðwideal Þ F ðwideal Þ ðtÞ ðtÞ 2dPm
E½jwideal wRRP j
L2 X > XC; (34)
X
T
ðtÞ ðtÞ 1 ðtÞ ðtÞ
ðt þ 1Þ2 P^
¼ ðDwideal Þ> Dwideal þ ðDwideal Þ> X > XDwideal :
t¼1
2
where m is the strong convexity constant. u
t
(29)
Since Sideal chooses P features with 0 correlation,
ACKNOWLEDGMENTS
ðT Þ ð0Þ
X
T
ðtÞ ðtÞ
F ðwideal Þ F ðwideal Þ ¼ ðDwideal Þ> Dwideal : This work is supported in part by the US Defense Advanced
t¼1 Research Projects Agency (DARPA) FA87501220324, and
the US National Science Foundation (NSF) IIS1447676
Again using Lemma 1, and telescoping sum: grants to Eric P. Xing.
ðT Þ ð0Þ
F ðwRRP Þ F ðwRRP Þ REFERENCES
X
T
ðtÞ ðtÞ 1 ðtÞ ðtÞ [1] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado,
¼ ðDwRRP Þ> DwRRP þ ðDwRRP Þ> X> XDwRRP : J. Dean, and A. Ng, “Building high-level features using large scale
t¼1
2 unsupervised learning,” in Proc. 29th Int. Conf. Mach. Learn., 2012,
(30) pp. 81–88.
[2] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang,
Y. Gao, J. Zeng, Q. Yang, et al., “Towards topic modeling for big
Taking the difference of the two sequences, we have: data,” arXiv preprint arXiv:1405.4402, 2014.
[3] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing,
ðT Þ ðT Þ
F ðwideal Þ F ðwRRP Þ T.-Y. Liu, and W.-Y. Ma, “Lightlda: Big topic models on mod-
! est compute clusters,” in Proc. 24th Int. World Wide Web Conf.,
X
T
ðtÞ ðtÞ 2015, pp. 1351–1361.
¼ ðDwideal Þ> Dwideal [4] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Large-scale par-
t¼1 allel collaborative filtering for the Netflix prize,” in Proc. 4th Int.
!
X
T
ðtÞ ðtÞ 1 ðtÞ ðtÞ
Conf. Algorithmic Aspects Inf. Manage., 2008, pp. 337–348.
ðDwRRP Þ> DwRRP þ ðDwRRP Þ> X> XDwRRP : [5] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao,
t¼1
2 M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale
distributed deep networks,” in Proc. Adv. Neural Inf. Process. Syst.,
(31) 2012, pp. 1232–1240.
[6] S. A. Williamson, A. Dubey, and E. P. Xing, “Parallel Markov
Taking expectations w.r.t. the randomness in iteration, chain Monte Carlo for nonparametric mixture models,” in Proc.
indices chosen at each iteration, and the inherent ran- Int. Conf. Mach. Learn., 2013, pp. 98–106.
domness in the two sequences, we have: [7] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic
variational inference,” J. Mach. Learn. Res., vol. 14, pp. 1303–1347,
ðT Þ ðT Þ
2013.
E½jF ðwideal Þ F ðwRRP Þj
[8] M. Zinkevich, J. Langford, and A. J. Smola, “Slow learners are
! fast,” in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 2331–2339.
X
T
ðtÞ ðtÞ
¼ E½j ðDwideal ÞT Dwideal [9] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic
optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2011,
t¼1
! pp. 873–881.
XT ðtÞ
T ðtÞ 1 ðtÞ
> > ðtÞ [10] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel
DwRRP DwRRP þ DwRRP X XDwRRP j
coordinate descent for l1-regularized loss minimization,” in Proc.
t¼1
2
Int. Conf. Mach. Learn., 2011, pp. 321–328.
X [11] T. White, Hadoop: The Definitive Guide. Sebastopol, CA, USA:
1 T
ðtÞ ðtÞ
¼ Cdata þ E½j ðDwRRP Þ> X > XDwRRP Þj
; O’Reilly Media, 2012.
2 t¼1 [12] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I.
Stoica, “Spark: Cluster computing with working sets,” in Proc. 2nd
(32) USENIX Conf. Hot Topics Cloud Comput., 2010, p. 10.
[13] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M.
where Cdata is a data dependent constant. Here, the dif- Hellerstein, “Distributed GraphLab: A framework for machine
ðtÞ ðtÞ ðtÞ ðtÞ
ference between ðDwideal Þ> Dwideal and ðDwRRP Þ> DwRRP learning and data mining in the cloud,” in Proc. VLDB Endowment,
ðtÞ ðtÞ vol. 5, pp. 716–727, 2012.
can only be possible due to ðDwRRP Þ> X> XDwRRP . [14] W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing, “High-
Following the proof in the shotgun paper [10], we get performance distributed ML at scale through parameter server
consistency models,” in Proc. 29th Nat. Conf. Artif. Intell., 2015,
pp. 79–87.
ðtÞ ðtÞ 2dP
E½jF ðwideal Þ F ðwRRP Þj
L2 X> XC; (33) [15] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N.
ðt þ 1Þ2 P^ Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph
processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010,
where d is the length of w (number of features), C is a pp. 135–146.
[16] R. Power and J. Li, “Piccolo: Building fast, distributed programs
data dependent constant, L is the domain width of wj with partitioned tables,” in Proc. USENIX Conf. Operating Syst.
(i.e., the difference between its maximum and minimum Des. Implementation, article 10, 2010, pp. 1–14.
possible values), and P^ is the expected number of [17] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V.
Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed
indexes that SRRP ðÞ can actually parallelize in each machine learning with the parameter server,” in Proc. 11th USE-
iteration. NIX Conf. Operating Syst. Des. Implementation, 2014, pp. 583–598.
66 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015
[14] W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing, "High-performance distributed ML at scale through parameter server consistency models," in Proc. 29th Nat. Conf. Artif. Intell., 2015, pp. 79–87.
[15] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, "Pregel: A system for large-scale graph processing," in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010, pp. 135–146.
[16] R. Power and J. Li, "Piccolo: Building fast, distributed programs with partitioned tables," in Proc. USENIX Conf. Operating Syst. Des. Implementation, article 10, 2010, pp. 1–14.
[17] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, "Scaling distributed machine learning with the parameter server," in Proc. 11th USENIX Conf. Operating Syst. Des. Implementation, 2014, pp. 583–598.
[18] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, "Large-scale matrix factorization with distributed stochastic gradient descent," in Proc. 17th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2011, pp. 69–77.
[19] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing, "Smoothing proximal gradient method for general structured sparse learning," in Proc. 27th Annu. Conf. Uncertainty Artif. Intell., 2011, pp. 105–114.
[20] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proc. Nat. Acad. Sci. USA, vol. 101, no. suppl. 1, pp. 5228–5235, 2004.
[21] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. Gibson, and E. P. Xing, "On model parallelism and scheduling strategies for distributed machine learning," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2834–2842.
[22] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin, "Feature clustering for accelerating parallel coordinate descent," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 28–36.
[23] Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, G. R. Ganger, and E. P. Xing, "More effective distributed ML via a stale synchronous parallel parameter server," in Proc. Adv. Neural Inf. Process. Syst., 2013, pp. 1223–1231.
[24] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "PrIter: A distributed framework for prioritized iterative computations," in Proc. 2nd ACM Symp. Cloud Comput., article 13, 2011, pp. 1–14.
[25] E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng, "Distance metric learning with application to clustering with side-information," in Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 505–512.
[26] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 209–216.
[27] H. B. McMahan, et al., "Ad click prediction: A view from the trenches," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 1222–1230.
[28] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola, "Scalable inference in latent variable models," in Proc. 5th ACM Int. Conf. Web Search Data Mining, 2012, pp. 123–132.
[29] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cambridge, MA, USA: MIT Press, 2012.
[30] L. Yao, D. Mimno, and A. McCallum, "Efficient methods for topic model inference on streaming document collections," in Proc. 15th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, pp. 937–946.
[31] A. Kumar, A. Beutel, Q. Ho, and E. P. Xing, "Fugue: Slow-worker-agnostic distributed learning for big models on big data," in Proc. 17th Int. Conf. Artif. Intell. Statist., 2014, pp. 531–539.
[32] J. Zhu, X. Zheng, L. Zhou, and B. Zhang, "Scalable inference in max-margin topic models," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 964–972.
[33] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, "Scalable coordinate descent approaches to parallel matrix factorization for recommender systems," in Proc. IEEE 12th Int. Conf. Data Mining, 2012, pp. 765–774.
[34] A. Kumar, A. Beutel, Q. Ho, and E. P. Xing, "Fugue: Slow-worker-agnostic distributed learning for big models on big data," in Proc. 17th Int. Conf. Artif. Intell. Statist., 2014, pp. 531–539.
[35] F. Niu, B. Recht, C. Re, and S. J. Wright, "Hogwild!: A lock-free approach to parallelizing stochastic gradient descent," in Proc. Adv. Neural Inf. Process. Syst., 2011, pp. 693–701.
[36] P. Richtarik and M. Takac, "Parallel coordinate descent methods for big data optimization," arXiv preprint arXiv:1212.0873, 2012.
[37] Y. Zhang, Q. Gao, L. Gao, and C. Wang, "PrIter: A distributed framework for prioritizing iterative computations," IEEE Trans. Parallel Distrib. Syst., vol. 24, no. 9, pp. 1884–1893, Sep. 2013.
[38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255.

Eric P. Xing received the PhD degree in molecular biology from Rutgers University, and another PhD degree in computer science from UC Berkeley. He is a professor of machine learning in the School of Computer Science, Carnegie Mellon University, and the director of the CMU Center for Machine Learning and Health. His principal research interests lie in the development of machine learning and statistical methodology, especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in social and biological systems. His current work involves: 1) foundations of statistical learning, including theory and algorithms for estimating time/space varying-coefficient models, sparse structured input/output models, and nonparametric Bayesian models; 2) frameworks for parallel machine learning on big data with big models in distributed systems or in the cloud; 3) computational and statistical analysis of gene regulation, genetic variation, and disease associations; and 4) applications of machine learning in social networks, natural language processing, and computer vision. He is an associate editor of the Annals of Applied Statistics (AOAS), the Journal of the American Statistical Association (JASA), the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), and the PLoS Journal of Computational Biology, and an action editor of the Machine Learning Journal (MLJ) and the Journal of Machine Learning Research (JMLR). He is a member of the US Defense Advanced Research Projects Agency (DARPA) Information Science and Technology (ISAT) Advisory Group, and a program chair of ICML 2014.

Qirong Ho received the PhD degree in 2014, under Eric P. Xing at Carnegie Mellon University's Machine Learning Department. He is a scientist at the Institute for Infocomm Research, A*STAR, Singapore, and an adjunct assistant professor at the Singapore Management University School of Information Systems. His primary research focus is distributed cluster software systems for machine learning at Big Data scales, with a view toward correctness and performance guarantees. In addition, he has performed research on statistical models for large-scale network analysis, particularly latent space models for visualization, community detection, user personalization, and interest prediction, as well as social media analysis on hyperlinked documents with text and network data. He received the Singapore A*STAR National Science Search Undergraduate and PhD fellowships.

Wei Dai is a PhD student in the Machine Learning Department at Carnegie Mellon University, advised by Prof. Eric Xing. His research focuses on large-scale machine learning systems and algorithms. He has designed system abstractions for machine learning programs that run efficiently in distributed settings, and provided analysis for the correctness of algorithms under such abstractions. He also works on databases to manage big data, with a particular focus on feature engineering geared toward machine learning problems.

Jin Kyu Kim is a PhD candidate at Carnegie Mellon University, advised by professors Garth Gibson and Eric P. Xing. He received the BS degree in computer science from Sogang University, Korea, in 2001 and the MS degree in computer science from the University of Michigan, Ann Arbor, in 2003. He joined Samsung Electronics, Ltd, Korea, in 2003, where he was involved in the design of various NAND flash based storage systems such as SD cards, mobile phone storage, and solid state drives. Since joining CMU in 2010 for PhD study, he has been engaged in large-scale machine learning research. His research interests include distributed frameworks for parallel machine learning, collaborative filtering, topic modeling, and sparse linear optimization solvers.
Jinliang Wei is a PhD candidate in the Computer Science Department at Carnegie Mellon University, under the supervision of Prof. Garth A. Gibson and Prof. Eric P. Xing. His doctoral research focuses on large-scale and distributed systems for ML computation, with an emphasis on systems research. Before coming to CMU, he obtained his BS in computer engineering from Purdue University, West Lafayette.

Seunghak Lee is a research scientist at Human Longevity, Inc. His research interests include machine learning and computational biology. Specifically, he has performed research on genome-wide association mapping, structural variant detection, distributed optimization, and large-scale machine learning algorithms and systems. In particular, he has focused on high-dimensional problems with applications in computational biology. Dr. Lee received his PhD in 2015, under Dr. Eric P. Xing in the Computer Science Department at Carnegie Mellon University.

Xun Zheng is a Masters student in the Machine Learning Department at Carnegie Mellon University, advised by Eric P. Xing. His research focuses on efficient Bayesian inference methods, especially bringing classical ideas in Markov chain Monte Carlo (MCMC) to boost inference for latent variable models. He has also performed research in large-scale machine learning. In particular, he is working on designing algorithms that are easier to parallelize and building suitable distributed systems for different types of learning algorithms.

Pengtao Xie is a PhD candidate working with Prof. Eric Xing in Carnegie Mellon University's SAILING lab. His research interests lie in the diversity, regularization, and scalability of latent variable models. Before coming to CMU, he obtained an ME from Tsinghua University in 2013 and a BE from Sichuan University in 2010. He is the recipient of the Siebel Scholarship, the Goldman Sachs Global Leader Scholarship, and the National Scholarship of China.

Abhimanu Kumar is a senior data scientist at Groupon Inc, Palo Alto, California. His principal research interests lie in statistical machine learning and large-scale computational systems and architecture. Abhimanu received a masters degree in computer science from the University of Texas at Austin, and another masters degree in natural language processing and statistical learning from Carnegie Mellon University. His current work involves: 1) theory for statistical learning; 2) distributed machine learning; and 3) applications of statistical learning in natural language, user-interest mining, text, and social networks.

Yaoliang Yu is currently a research scientist in the CMU Center for Machine Learning and Health. His primary research interests include optimization theory and algorithms, nonparametric regression, kernel methods, and robust statistics. On the application side, Dr. Yu has worked on dimensionality reduction, face recognition, multimedia event detection, and topic models. Dr. Yu obtained his PhD from the Computing Science Department of the University of Alberta in 2013.