0% found this document useful (0 votes)
122 views19 pages

A New Platform For Distributed

The document introduces Petuum, a new distributed machine learning platform for applying advanced machine learning programs to large-scale problems using big models on big data. Existing parallelization strategies have limitations and it is difficult to find a universal platform for a wide range of machine learning programs at scale. Petuum addresses this by observing that many machine learning programs are fundamentally optimization-centric and iterative-convergent in nature. This allows for an integrative system design using bounded-error network synchronization and dynamic scheduling based on machine learning program structure. The authors demonstrate Petuum allows machine learning programs to run much faster and at larger model sizes compared to well-known implementations, even on modest compute clusters.

Uploaded by

NITIN
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
122 views19 pages

A New Platform For Distributed

The document introduces Petuum, a new distributed machine learning platform for applying advanced machine learning programs to large-scale problems using big models on big data. Existing parallelization strategies have limitations and it is difficult to find a universal platform for a wide range of machine learning programs at scale. Petuum addresses this by observing that many machine learning programs are fundamentally optimization-centric and iterative-convergent in nature. This allows for an integrative system design using bounded-error network synchronization and dynamic scheduling based on machine learning program structure. The authors demonstrate Petuum allows machine learning programs to run much faster and at larger model sizes compared to well-known implementations, even on modest compute clusters.

Uploaded by

NITIN
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 19

IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO.

2, APRIL-JUNE 2015 49

Petuum: A New Platform for Distributed


Machine Learning on Big Data
Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng,
Pengtao Xie, Abhimanu Kumar, and Yaoliang Yu

Abstract—What is a systematic way to efficiently apply a wide spectrum of advanced ML programs to industrial scale problems, using
Big Models (up to 100 s of billions of parameters) on Big Data (up to terabytes or petabytes)? Modern parallelization strategies employ
fine-grained operations and scheduling beyond the classic bulk-synchronous processing paradigm popularized by MapReduce, or
even specialized graph-based execution that relies on graph representations of ML programs. The variety of approaches tends to pull
systems and algorithms design in different directions, and it remains difficult to find a universal platform applicable to a wide range of
ML programs at scale. We propose a general-purpose framework, Petuum, that systematically addresses data- and model-parallel
challenges in large-scale ML, by observing that many ML programs are fundamentally optimization-centric and admit error-tolerant,
iterative-convergent algorithmic solutions. This presents unique opportunities for an integrative system design, such as bounded-error
network synchronization and dynamic scheduling based on ML program structure. We demonstrate the efficacy of these system
designs versus well-known implementations of modern ML algorithms, showing that Petuum allows ML programs to run in much less
time and at considerably larger model sizes, even on modestly-sized compute clusters.

Index Terms—Machine learning, big data, big model, distributed systems, theory, data-parallelism, model-parallelism

1 INTRODUCTION

M ACHINE Learning (ML) is becoming a primary mecha-


nism for extracting information from data. However,
the surging volume of Big Data from Internet activities and
such big models with a single machine can be prohibitively
slow, if not impossible. While careful model design and fea-
ture engineering can certainly reduce the size of the model,
sensory advancements, and the increasing needs for Big they require domain-specific expertise and are fairly labor-
Models for ultra high-dimensional problems have put tre- intensive, hence the recent appeal (as seen in the above
mendous pressure on ML methods to scale beyond a single papers) of building high-capacity Big Models in order to
machine, due to both space and time bottlenecks. For exam- substitute computation cost for labor cost.
ple, on the Big Data front, the Clueweb 2012 web crawl1 con- Despite the recent rapid development of many new ML
tains over 700 million web pages as 27 TB of text data; while models and algorithms aiming at scalable applications [5],
photo-sharing sites such as Flickr, Instagram and Facebook [6], [7], [8], [9], [10], adoption of these technologies remains
are anecdotally known to possess 10 s of billions of images, generally unseen in the wider data mining, NLP, vision,
again taking up TBs of storage. It is highly inefficient, if and other application communities for big problems, espe-
possible, to use such big data sequentially in a batch or cially those built on advanced probabilistic or optimization
scholastic fashion in a typical iterative ML algorithm. On programs. A likely reason for such a gap, at least from
the Big Model front, state-of-the-art image recognition sys- the scalable execution point of view, that prevents many
tems have now embraced large-scale deep learning (DL) state-of-the-art ML models and algorithms from being more
models with billions of parameters [1]; topic models with widely applied at Big-Learning scales is the difficult migra-
up to 106 topics can cover long-tail semantic word sets for tion from an academic implementation, often specialized
substantially improved online advertising [2], [3]; and very- for a small, well-controlled computer platform such as desk-
high-rank matrix factorization (MF) yields improved pre- top PCs and small lab-clusters, to a big, less predictable
diction on collaborative filtering problems [4]. Training platform such as a corporate cluster or the cloud, where
correct execution of the original programs require careful
control and mastery of low-level details of the distributed
1. http://www.lemurproject.org/clueweb12.php/
environment and resources through highly nontrivial dis-
 E.P. Xing, W. Dai, J.K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, tributed programming.
and Y. Yu are with the School of Computer Science, Carnegie Mellon Many programmable platforms have provided partial sol-
University. E-mail: {epxing, wdai, jinkyuk, jinlianw, seunghak, xunzheng, utions to bridge this research-to-production gap: while
pengtaox, yaoliang}@cs.cmu.edu, [email protected].
 Q. Ho is with the Institute of Infocomm Research, A*STAR Singapore.
Hadoop [11] is a popular and easy to program platform, its
E-mail: [email protected]. implementation of MapReduce requires the program state
Manuscript received 31 Mar. 2015; revised 18 Aug. 2015; accepted 19 Aug. to be written to disk every iteration, thus its performance on
2015. Date of publication 2 Sept. 2015; date of current version 9 Dec. 2015. many ML programs has been surpassed by in-memory
Recommended for acceptance by Q. Yang. alternatives [12], [13]. One such alternative is Spark [12],
For information on obtaining reprints of this article, please send e-mail to:
which improves upon Hadoop by keeping ML program
[email protected], and reference the Digital Object Identifier below.
Digital Object Identifier no. 10.1109/TBDATA.2015.2472014 state in memory—resulting in large performance gains over
2332-7790 ß 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
50 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

Hadoop—whilst preserving the easy-to-use MapReduce considered such a wide spectrum of ML algorithms, which
programming interface. However, Spark ML implementa- exhibit diverse representation abstractions, model and data
tions are often still slower than specially-designed ML access patterns, and synchronization and scheduling
implementations, in part because Spark does not offer requirements. So what are the shared properties across such
flexible and fine-grained scheduling of computation and a “zoo of ML algorithms”? We believe that the key lies in
communication, which has been shown to be hugely advan- the recognition of a clear dichotomy between data (which is
tageous, if not outright necessary, for fast and correct execu- conditionally independent and persistent throughout the
tion of advanced ML algorithms [14]. Graph-centric algorithm) and model (which is internally coupled, and is
platforms such as GraphLab [13] and Pregel [15] efficiently transient before converging to an optimum). This inspires a
partition graph-based models with built-in scheduling and simple yet statistically-rooted bimodal approach to parallel-
consistency mechanisms, but due to limited theoretical ism: data parallel and model parallel distribution and execu-
work, it is unclear whether asynchronous graph-based con- tion of a big ML program over a cluster of machines. This
sistency models and scheduling will always yield correct data parallel, model parallel approach keenly exploits the
execution of ML programs. Other systems provide low-level unique statistical nature of ML algorithms, particularly the
programming interfaces [16], [17], that, while powerful and following three properties: (1) Error tolerance—iterative-
versatile, do not yet offer higher-level general-purpose convergent algorithms are often robust against limited
building blocks such as scheduling, model partitioning errors in intermediate calculations; (2) Dynamic structural
strategies, and managed communication that are key to sim- dependency—during execution, the changing correlation
plifying the adoption of a wide range of ML methods. In strengths between model parameters are critical to efficient
summary, existing systems supporting distributed ML each parallelization; (3) Non-uniform convergence—the number
manifest a unique tradeoff on efficiency, correctness, of steps required for a parameter to converge can be highly
programmability, and generality. skewed across parameters. The core goal of Petuum is to
In this paper, we explore the problem of building a distrib- execute these iterative updates in a manner that quickly
uted machine learning framework with a new angle toward converges to an optimum of the ML program’s objective
the efficiency, correctness, programmability, and generality function, by exploiting these three statistical properties of
tradeoff. We observe that, a hallmark of most (if not all) ML ML, which we argue are fundamental to efficient large-scale
programs is that they are defined by an explicit objective func- ML in cluster environments.
tion over data (e.g., likelihood, error-loss, graph cut), and the This design principle contrasts that of several existing
goal is to attain optimality of this function, in the space programmable frameworks discussed earlier. For example,
defined by the model parameters and other intermediate vari- central to the Spark framework [12] is the principle of per-
ables. Moreover, these algorithms all bear a common style, in fect fault tolerance and recovery, supported by a persistent
that they resort to an iterative-convergent procedure (see memory architecture (Resilient Distributed Datasets);
Eq. (1)). It is noteworthy that iterative-convergent computing whereas central to the GraphLab framework is the principle
tasks are vastly different from conventional programmatic of local and global consistency, supported by a vertex pro-
computing tasks (such as database queries and keyword gramming model (the Gather-Apply-Scatter abstraction).
extraction), which reach correct solutions only if every deter- While these design principles reflect important aspects of
ministic operation is correctly executed, and strong consis- correct ML algorithm execution—e.g., atomic recoverability
tency is guaranteed on the intermediate program state—thus, of each computing step (Spark), or consistency satisfaction
operational objectives such as fault tolerance and strong con- for all subsets of model variables (GraphLab)—some other
sistency are absolutely necessary. However, an ML program’s important aspects, such as the three statistical properties
true goal is fast, efficient convergence to an optimal solution, discussed above, or perhaps ones that could be more funda-
and we argue that fine-grained fault tolerance and strong con- mental and general, and which could open more room for
sistency are but one vehicle to achieve this goal, and might not efficient system designs, remain unexplored.
even be the most efficient one. To exploit these properties, Petuum introduces three
We present a new distributed ML framework, Petuum, novel system objectives grounded in the aforementioned
built on an ML-centric optimization-theoretic principle, key properties of ML programs, in order to accelerate their
as opposed to various operational objectives explored ear- convergence at scale: (1) Petuum synchronizes the parame-
lier. We begin by formalizing ML algorithms as iterative- ter states with bounded staleness guarantees, thereby
convergent programs, which encompass a large space of achieves provably correct outcomes due to the error-toler-
modern ML, such as stochastic gradient descent (SGD)[18] ant nature of ML, but at a much cheaper communication
and coordinate descent (CD) [10] for convex optimization cost than conventional per-iteration bulk synchronization;
problems, proximal methods [19] for more complex con- (2) Petuum offers dynamic scheduling policies that take
strained optimization, as well as MCMC [20] and varia- into account the changing structural dependencies between
tional inference [7] for inference in probabilistic models. To model parameters, so as to minimize parallelization error
our knowledge, no existing programmable2 platform has and synchronization costs; and (3) Since parameters in ML
programs exhibit non-uniform convergence costs (i.e., dif-
ferent numbers of updates required), Petuum prioritizes
2. Our discussion is focused on platforms which provide libraries computation towards non-converged model parameters, so
and tools for writing new ML algorithms. Because programmability is as to achieve faster convergence.
an important criteria for writing new ML algorithms, we exclude ML
software that does not allow new algorithms to be implemented on top To demonstrate this approach, we show how data-paral-
of them, such as AzureML and Mahout. lel and model-parallel algorithms can be implemented on
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 51

Fig. 2. The difference between data and model parallelism: data samples
are always conditionally independent given the model, but some model
parameters are not independent of each other.
Fig. 1. The scale of Big ML efforts in recent literature. A key goal of Pet-
uum is to enable larger ML models to be run on fewer resources, even
relative to highly-specialized implementations.
intermediate results to be aggregated with the current esti-
mate of A by F ðÞ to produce the new estimate of A. For sim-
Petuum, allowing them to scale to large data/model sizes plicity, in the rest of the paper we omit L in the subscript
with improved algorithm convergence times. The experi- with the understanding that all ML programs of our interest
ments section provides detailed benchmarks on a range of here bear an explicit loss function that can be used to moni-
ML programs: topic modeling, matrix factorization, deep tor the quality of convergence to a solution, as opposed to
learning, Lasso regression, and distance metric learning heuristics or procedures not associated such a loss function.
(DML). These algorithms are only a subset of the full open- Also for simplicity, we focus on iterative-convergent equa-
source Petuum ML library3—the PMLlib, which we will tions with an additive form:
briefly discuss in this paper. As illustrated in Fig. 1, Petuum
PMLlib covers a rich collection of advanced ML methods AðtÞ ¼ Aðt1Þ þ DðAðt1Þ ; DÞ; (2)
not usually seen in existing ML libraries; the Petuum plat-
form enables PMLlib to solve a range of ML problems at i.e., the aggregation F ðÞ is replaced with a simple addition.
large scales—scales that have only been previously The approaches we propose can also be applied to this gen-
attempted in a case-specific manner with corporate-scale eral F ðÞ.
efforts and resources—but using relatively modest clusters In large-scale ML, both data D and model A can be very
(10-100 machines) that are within reach of most ML large. Data-parallelism, in which data is divided across
practitioners. machines, is a common strategy for solving Big Data prob-
lems, while model-parallelism, which divides the ML model,
2 PRELIMINARIES: ON DATA PARALLELISM AND is common for Big Models. Both strategies are not exclusive,
MODEL PARALLELISM and can be combined to tackle challenging problems with
large data D and large model A. Hence, every Petuum ML
We begin with a principled formulation of iterative- program is either data-parallel, model-parallel, or data-and-
convergent ML programs, which exposes a dichotomy of model-parallel, depending on problem needs. Below, we
data and model, that inspires the parallel system architec- discuss the (different) mathematical implications of each
ture (Section 3), algorithm design (Section 4), and theoretical parallelism (see Fig. 2).
analysis (Section 6) of Petuum. Consider the following pro-
grammatic view of ML as iterative-convergent programs, 2.1 Data Parallelism
driven by an objective function. In data-parallel ML, the data D is partitioned and assigned to
Iterative-convergent ML algorithm: Given data D and a computational workers (indexed by p ¼ 1::P ); we denote
model objective function L (e.g., mean-squared loss, likeli- the pth data partition by Dp . The function DðÞ can be applied
hood, margin), a typical ML problem can be grounded as to each data partition independently, and the results com-
executing the following update equation iteratively, until bined additively, yielding a data-parallel equation (left
the model state (i.e., parameters and/or latent variables) A panel of Fig. 2):
reaches some stopping criteria:
PP
AðtÞ ¼ Aðt1Þ þ p¼1 DðAðt1Þ ; Dp Þ: (3)
ðtÞ ðt1Þ ðt1Þ
A ¼ F ðA ; DL ðA ; DÞÞ; (1)
This form is commonly seen in stochastic gradient descent or
where superscript ðtÞ denotes the iteration counter. The sampling-based algorithms. For example, in distance metric
update function DL ðÞ (which improves the loss L) performs learning optimized via stochastic gradient descent, the data
computation on data D and model state A, and outputs pairs are partitioned over different workers, and the interme-
diate results (subgradients) are computed on each partition,
3. Petuum is available as open source at http://petuum.org. before being summed and applied to the model parameters.
52 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

Unlike data-parallelism which enjoys iid data properties,


the model parameters Aj are not, in general, independent of
each other (Fig. 2), and it has been established that model-
parallel algorithms can only be effective if the parallel
updates are restricted to independent (or weakly-corre-
lated) parameters [10], [13], [21], [22]. Hence, our definition
of model-parallelism includes the global scheduling mecha-
nism Sp ðÞ that can select carefully-chosen parameters for
parallel updating.
The scheduling function SðÞ opens up a large design
space, such as fixed, randomized, or even dynamically-
Fig. 3. Conceptual illustration of data and model parallelism. In data-par-
changing scheduling on the whole space, or a subset of, the
allelism, workers are responsible for generating updates DðÞ on different model parameters. SðÞ not only can provide safety and cor-
data partitions, in order to updated the (shared) model state. In model- rectness (e.g., by selecting independent parameters and thus
parallelism, workers generate D on different model partitions, possibly minimize parallelization error), but can offer substantial
using all of the data.
speed-up (e.g., by prioritizing computation onto non-
converged parameters). In the Lasso example, Petuum uses
P
A slightly modified form, AðtÞ ¼ Pp¼1 DðAðt1Þ ; Dp Þ, is used SðÞ to select coefficients that are weakly correlated (thus
by some algorithms, such as variational EM. preventing divergence), while at the same time prioritizing
Importantly, this additive updates property allows the coefficients far from zero (which are more likely to be non-
updates DðÞ to be computed at each local worker before converged).
transmission over the network, which is crucial because
CPUs can produce updates DðÞ much faster than they can 2.3 Implementing Data- and Model-Parallel
be (individually) transmitted over the network. Additive Programs
updates are the foundation for a host of techniques to Data- and model-parallel programs (Fig. 3) exhibit a certain
speed up data-parallel execution, such as minibatch, programming and systems desiderata: they are stateful, in
asynchronous and bounded-asynchronous execution, and that they continually update shared model parameters A.
parameter servers. Key to the validity of additivity of Thus, an ML platform needs to synchronize A across all run-
updates from different workers is the notion of indepen- ning threads and processes, and this should be done via a
dent and identically distributed (iid) data, which is assumed high-performance, non-blocking asynchronous strategy that
for many ML programs, and implies that each parallel still guarantees convergence. If the program is model-paral-
worker contributes “equally” (in a statistical sense) to the lel, it may require fine control over the order of parameter
ML algorithm’s progress via DðÞ, no matter which data updates, in order to avoid non-convergence due to depen-
subset Dp it uses. dency violations—thus, the ML platform needs to provide
fine-grained scheduling capability. We discuss some of the
2.2 Model Parallelism difficulties associated with achieving these desiderata.
In model-parallel ML, the model A is partitioned and Data- and model-parallel programs can certainly be writ-
assigned to workers p ¼ 1::P and updated therein in paral- ten in a message-passing style, in which the programmer
lel, running update functions DðÞ. Because the outputs from explicitly writes code to send and receive parameters over
each DðÞ affect different elements of A (hence denoted now the network. However, we believe it is more desirable to
by Dp ðÞ to make explicit the parameter subset affected at provide a Distributed Shared Memory (DSM) abstraction,
worker p), they are first concatenated into a full vector of in which the programmer simply treats A like a global pro-
updates (i.e., the full DðÞ), before aggregated with model gram variable, accessible from any thread/process in a
parameter vector A (see right panel of Fig. 2): manner similar to single-machine programming—no explict
network code is required from the user, and all synchroni-
 P  zation is done in the background. While DSM-like interfaces
AðtÞ ¼ Aðt1Þ þ Con Dp ðAðt1Þ ; Spðt1Þ ðAðt1Þ ÞÞ p¼1 ; (4) could be added to alternative ML platforms like Hadoop,
Spark and GraphLab, these systems usually operate in
where we have omitted the data D for brevity and clarity. either a bulk synchronous (prone to stragglers and blocking
due to the high rate of update DðÞ generation) or asynchro-
Coordinate descent algorithms for regression and matrix
nous (having no parameter consistency guarantee, and
factorization are a canonical example of model-parallelism.
hence no convergence guarantee) fashion.
Each update function Dp ðÞ also takes a scheduling function
Model-parallel programs pose an additional challenge, in
Spðt1Þ ðAÞ, which restricts Dp ðÞ to modify only a carefully-
that they require fine-grained control over the parallel order-
chosen subset of the model parameters A. Spðt1Þ ðAÞ outputs ing of variable updates. Again, while it is completely
a set of indices fj1 ; j2 ; . . . ; g, so that Dp ðÞ only performs possible to achieve such control via message-passing pro-
updates on Aj1 ; Aj2 ; . . . —we refer to such selection of model gramming style, there is nevertheless an opportunity to pro-
parameters as scheduling. In some cases, the additive update vide a simpler abstraction, in which the user merely has to
formula above can be replaced by a replace operator that define the model scheduling function Spðt1Þ ðAÞ. In such an
directly replaces corresponding elements in A with ones in abstraction, networking and synchronization code is again
the concatenated update vector. hidden from the user. While Hadoop and Spark provide
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 53

Fig. 5. Petuum system: scheduler, workers, parameter servers.

update function push() (analogous to Map in MapRe-


duce), and a central aggregation function pull() (analo-
gous to Reduce). Data-parallel programs can be written
with just push(), while model-parallel programs are writ-
ten with all three functions schedule(), push() and
pull().
The model variables A are held in the parameter server,
which can be accessed at any time from any function via the
PS object. The PS object can be accessed from any function,
and has 3 functions: PS.get() to read a parameter, PS.
Fig. 4. Petuum program structure. inc() to add to a parameter, and PS.put() to overwrite a
parameter. With just these operations, the parameter server
automatically ensures parameter consistency between all
easy-to-use abstractions, their design does not give users
Petuum components; no additional user programming is
fine-grained control over the ordering of updates—for exam-
necessary. In the example pseudocode, DATA is a place-
ple, MapReduce provides no control over the order in which
holder for data D; this can be any 3rd-party data structure,
mappers or reducers are executed. We note that GraphLab
database, or distributed file system.
has a priority-based scheduler specialized for some model-
parallel applications, but still does not expose a dedicated
scheduling function Spðt1Þ ðAÞ. One could certainly modify 3.2 Petuum System Design
Hadoop’s or Spark’s built-in schedulers to expose the ML algorithms exhibit several principles that can be
required level of control, but we do not consider this exploited to speed up distributed ML programs: depen-
reasonable for the average ML practitioner without strong dency structures between parameters, non-uniform conver-
systems expertise. gence of parameters, and a limited degree of error tolerance
These considerations make effective data- and model- [13], [14], [17], [21], [23], [24]. Through schedule(), push
parallel programming challenging, and there is an opportu- () and pull(), Petuum allows practitioners to write data-
nity to abstract them away via a platform that is specifically parallel and model-parallel ML programs that exploit these
focused on data/model-parallel ML. principles, and can be scaled to Big Data and Big Model
applications. The Petuum system comprises three compo-
3 PETUUM – A PLATFORM FOR DISTRIBUTED ML nents (Fig. 5): parameter server, scheduler, and workers.

A core goal of Petuum is to allow practitioners to easily


implement data-parallel and model-parallel ML algorithms. 3.2.1 Parameter Server
Petuum provides APIs to key systems that make data- and The parameter server (PS) enables data-parallelism, by pro-
model-parallel programming easier: (1) a parameter server viding users with global read/write access to model param-
system, which allows programmers to access global model eters A, via a convenient distributed shared memory API
state A from any machine via a convenient distributed that is similar to table-based or key-value stores. The PS API
shared-memory interface that resembles single-machine pro- consists of three functions: PS.get(), PS.inc() and PS.
gramming, and adopts a bounded-asychronous consistency put(). As the names suggest, the first function reads a part
model that preserves data-parallel convergence guarantees, of the global A into local memory, while the latter two add
thus freeing users from explicit network synchronization; or overwrite local changes into the global A.
(2) a scheduler, which allows fine-grained control over To take advantage of ML error tolerance, the PS
the parallel ordering of model-parallel updates DðÞ—in implements the Eager Stale Synchronous Parallel (ESSP)
essence, the scheduler allows users to define their own ML consistency model [14], [23], which reduces network syn-
application consistency rules. chronization and communication costs, while maintaining
bounded-staleness convergence guarantees implied by
3.1 Programming Interface ESSP. The ESSP consistency model ensures that, if a worker
Fig. 4 shows pseudocode for a generic Petuum program, reads from parameter server at iteration c, it will definitely
consisting of three user-written functions (in either C++ or receive all updates from all workers computed at and before
Java): a central scheduler function schedule(), a parallel iteration c  s  1, where s is a staleness threshold—see
54 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

Fig. 6. ESSP consistency model, used by the Parameter Server. Work-


ers are allowed to run at different speeds, but are prevented from being
more than s iterations apart. Updates from the most recent s iterations
are “eagerly” pushed out to workers, but are not guaranteed to be visible.

Fig. 7. Scheduler system. Using algorithm or model-specific criteria,


Fig. 6 for an illustration. In Section 6, we will cover theoreti- the Petuum scheduler prioritizes a small subset of parameters from the
cal guarantees enjoyed by ESSP consistency. full model A. This is followed by a dependency analysis on the prioritized
subset: parameters are further subdivided into groups, where a parame-
ter Ai in group gu is must be independent of all other parameters Aj in all
3.2.2 Scheduler other groups gv . This is illustrated as a graph partitioning, although the
The scheduler system (Fig. 7) enables model-parallelism, by implementation need not actually construct a graph.
allowing users to control which model parameters are
updated by worker machines. This is performed through a
workers might sample one data point at a time, while in
user-defined scheduling function schedule() (corre-
batch algorithms, workers might instead pass through all
sponding to Spðt1Þ ðÞ), which outputs a set of parameters for
data points in one iteration.
each worker. The scheduler sends the identities of these
parameters to workers via the scheduling control channel
(Fig. 5), while the actual parameter values are delivered 4 PETUUM PARALLEL ALGORITHMS
through the parameter server system. In Section 6, we will Now we turn to development of parallel algorithms for
discuss the theoretical guarantees enjoyed by model-paral- large-scale distributed ML problems, in light of the data
lel schedules. and model parallel principles underlying Petuum. As exam-
Several common patterns for schedule design are worth ples, we explain how to use Petuum’s programming inter-
highlighting: fixed-scheduling (schedule_fix()) dispatches face to implement novel or state-of-the-art versions of the
model parameters A in a pre-determined order; static, following four algorithms: (1) data-parallel Distance Metric
round-robin schedules (e.g., repeatedly loop over all param- Learning, (2) model-parallel Lasso regression, (3) model-
eters) fit the schedule_fix() model. Dependency-aware parallel topic modeling [ Latent Dirichlet Allocation (LDA)],
(schedule_dep()) scheduling allows re-ordering of vari- and (4) model-parallel Matrix Factorization. These algo-
able/parameter updates to accelerate model-parallel ML rithms all enjoy significant performance advantages over
algorithms, e.g., in Lasso regression, by analyzing the the previous state-of-the-art and existing open-source soft-
dependency structure over model parameters A. Finally, ware, as we will show.
prioritized scheduling (schedule_pri()) exploits uneven Through pseudocode, it can be seen that Petuum
convergence in ML, by prioritizing subsets of variables allows these algorithms to be easily realized on distrib-
U sub  A according to algorithm-specific criteria, such as uted clusters, without dwelling on low level system pro-
the magnitude of each parameter, or boundary conditions gramming, or non-trivial recasting of our ML problems
such as KKT. These common schedules are provided as pre- into representations such as RDDs or vertex programs.
implemented software libraries, or users can opt to write Instead our ML problems can be coded at a high level,
their own schedule(). more akin to Matlab or R. We round off with a brief
description of how we used Petuum implement a couple
3.2.3 Workers of other ML algorithms.
Each worker p receives parameters to be updated from
schedule(), and then runs parallel update functions 4.1 Data-Parallel Distance Metric Learning
push() (corresponding to DðÞ) on data D. While push() is Let us first consider a large-scale Distance Metric Learning
being executed, the model state A can be easily synchro- problem. DML improves the performance of other ML pro-
nized with the parameter server, using the PS.get() and grams such as clustering, by allowing domain experts to
PS.inc() API. After the workers finish push(), the incorporate prior knowledge of the form “data points x, y
scheduler may use the new model state to generate future are similar (or dissimilar)” [25]—for example, we could
scheduling decisions. enforce that “books about science are different from books
Petuum intentionally does not enforce a data abstraction, about art”. The output is a distance function dðx; yÞ that cap-
so that any data storage system may be used—workers may tures the aforementioned prior knowledge. Learning a
read from data loaded into memory, or from disk, or over a proper distance metric [25], [26] is essential for many dis-
distributed file system or database such as HDFS. Further- tance based data mining and machine learning algorithms,
more, workers may touch the data in any order desired by such as retrieval, k-means clustering and k-nearest neighbor
the programmer: in data-parallel stochastic algorithms, (k-NN) classification. DML has not received much attention
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 55

in the Big Data setting, and we are not aware of any distrib-
uted implementations of DML.
The most popular version of DML tries to learn a Maha-
lanobis distance matrix M (symmetric and positive-semide-
finite), which can then be used to measure the distance
between two samples Dðx; yÞ ¼ ðx  yÞT Mðx  yÞ. Given a
jSj
set of “similar” sample pairs S ¼ fðxi ; yi Þgi¼1 , and a set of
jDj
“dissimilar” pairs D ¼ fðxi ; yi Þgi¼1 , DML learns the Mahala-
nobis distance by optimizing
X
minM ðx  yÞT Mðx  yÞ Fig. 8. Petuum DML data-parallel pseudocode.
ðx;yÞ2S
(5)
s:t: ðx  yÞT Mðx  yÞ  1; 8ðx; yÞ 2 D ensures high-throughput execution, via a bounded-asyn-
M  0; chronous consistency model (Stale Synchronous Parallel
(SSP)) that can provide workers with stale local copies of
where M  0 denotes that M is required to be positive semi- the parameters L, instead of forcing workers to wait for
definite. This optimization problem minimizes the Mahala- network communication. Later, we will review the strong
nobis distances between all pairs labeled as similar, while consistency and convergence guarantees provided by the
separating dissimilar pairs with a margin of 1. SSP model.
In its original form, this optimization problem is diffi- Since DML is a data-parallel algorithm, only the paral-
cult to parallelize due to the constraint set. To create a lel update push() needs to be implemented (Fig. 8). The
data-parallel optimization algorithm and implement it on scheduling function schedule() is empty (because
Petuum, we shall relax the constraints via slack variables every worker touches every model parameter L), and
(similar to SVMs). First, we replace M with LT L, and we do not need aggregation push() for this SGD algo-
introduce slack variables  to relax the hard constraint in rithm. In our next example, we will show how sched-
Eq.(5), yielding ule() and push() can be used to implement model-
X X parallel execution.
minL kLðx  yÞk2 þ  x;y
ðx;yÞ2S ðx;yÞ2D (6) 4.2 Model-Parallel Lasso
2
s:t: kLðx  yÞk  1  x;y ; x;y  0; 8ðx; yÞ 2 D: Lasso is a widely used model to select features in high-
dimensional problems, such as gene-disease association
Using hinge loss, the constraint in Eq.(6) can be eliminated, studies, or in online advertising via ‘1 -penalized regres-
yielding an unconstrained optimization problem: sion [27]. Lasso takes the form of an optimization problem:
X P
minL kLðx  yÞk2 minb ‘ðX; y; b Þ þ  j jbj j; (9)
ðx;yÞ2S
X (7)
þ maxð0; 1  kLðx  yÞk2 Þ: where  denotes a regularization parameter that deter-
ðx;yÞ2D mines the sparsity of b , and ‘ðÞ is a non-negative convex
loss function such as squared-loss or logistic-loss; we
Unlike the original constrained DML problem, this relaxa- assume that X and y are standardized and consider (9)
tion is fully data-parallel, because it now treats the dis- without an intercept. For simplicity but without loss of
similar pairs as iid data to the loss function (just like the bk22 ; other loss func-
generality, we let ‘ðX; y; b Þ ¼ 12 ky  Xb
similar pairs); hence, it can be solved via data-parallel tions (e.g., logistic) are straightforward and can be solved
Stochastic Gradient Descent. SGD can be naturally paral- using the same approach [10]. We shall solve this via a
lelized over data, and we partition the data pairs onto P coordinate descent model-parallel approach, similar but
machines. Every iteration, each machine p randomly sam- not identical to [10], [22].
ples a minibatch of similar pairs S p and dissimilar pairs The simplest parallel CD Lasso , shotgun [10], selects a
Dp from its data shard, and computes the following random subset of parameters to be updated in parallel. We
update to L: now present a scheduled model-parallel Lasso that
X improves upon shotgun: the Petuum scheduler chooses
4Lp ¼ 2Lðx  yÞðx  yÞT parameters that are nearly independent with each other,4
ðx;yÞ2S p
X (8) thus guaranteeing convergence of the Lasso objective. In
 2Lða  bÞða  bÞT  IðkLða  bÞk2  1Þ; addition, it prioritizes these parameters based on their dis-
ða;bÞ2Dp tance to convergence, thus speeding up optimization.
Why is it important to choose independent parameters
where IðÞ is the indicator function. via scheduling? Parameter dependencies affect the CD
Fig. 8 shows pseudocode for Petuum DML, which is
simple to implement because the parameter server system
4. In the context of Lasso, this means the data columns xj corre-
PS abstracts away complex networking code under a sim- sponding to the chosen parameters j have very small pair-wise dot
ple get()/read() API. Moreover, the PS automatically product, below a threshold t.
56 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

from the core optimization code push() and pull(), Pet-


uum makes it easy to experiment with complex scheduling
policies that involve prioritization and dependency check-
ing, thus facilitating the implementation of new model-par-
allel algorithms—for example, one could use schedule()
to prioritize according to KKT conditions in a constrained
optimization problem, or to perform graph-based depen-
dency checking like in Graphlab [13]. Later, we will show
that the above Lasso schedule schedule() is guaranteed
to converge, and gives us near optimal solutions by control-
ling errors from parallel execution. The pseudocode
for scheduled model parallel Lasso under Petuum is shown
in Fig. 9.

4.3 Topic Model (LDA)


Topic Modeling uncovers semantically-coherent topics from
unstructured document corpora, and is widely used in
industry—e.g., Yahoo’s YahooLDA [28], and Google’s
Rephil [29]. The most well-known member of the topic
modeling family is Latent Dirichlet Allocation: given a cor-
pus of N documents and a pre-specified K for number of
topics, the objective of LDA inference is to output K
“topics” (discrete distributions over V unique words in the
corpus), as well as the topic distribution of each document
(a discrete distribution over topics).
One popular LDA inference technique is collapsed Gibbs
sampling, a Markov Chain Monte Carlo algorithm that sam-
ples the topic assignments for each “token” (word position)
in each document until convergence. This is an iterative-
convergent algorithm that repeatedly updates three types of
model state parameters: an M-by-K “word-topic table” V ,
an N-by-K “doc-topic” table U, and the token topic assign-
ments zij . The LDA Gibbs sampler update is

bþVwij ;k
P ðzij ¼ k j U; V Þ / PK
aþUik
þ PM ; (12)
Fig. 9. Petuum Lasso model-parallel pseudocode. Kaþ Ui‘ Mbþ V
‘¼1 m¼1 mk

update equation in the following manner: by taking the gra- where zij are topic assignments to each word “token” wij in
dient of (9), we obtain the CD update for bj : document i. The document word tokens wij , topic assign-
ðtÞ P ðt1Þ
ments zij and doc-topic table rows Ui are partitioned across
dij xij yi  k6¼j xij xik bk ; (10) worker machines and kept fixed, as is common practice
P  with Big Data. Although there are multiple parameters, the
ðtÞ N ðtÞ
bj S d
i¼1 ij ;  ; (11) only one that is read and updated by all parallel worker
(and hence needs to be globally-stored) is the word-topic
where Sð; Þ is a soft-thresholding operator, defined by table V .
Sðbj ; Þ signðbÞðjbj  Þ. In (11), if xTj xk 6¼ 0 (i.e., nonzero We adopt a model-parallel approach to LDA, and use a
ðt1Þ ðt1Þ
correlation) and bj 6¼ 0 and bk 6¼ 0, then a coupling schedule() (Algorithm 10) that cycles rows of the word-
effect is created between the two features bj and bk . Hence, topic table (rows correspond to different words, e.g.,
“machine” or “learning”) across machines, to be updated
they are no longer conditionally independent given the
via push() and pull(); data is kept fixed at each machine.
data: bj 6? bk jX; y. If the jth and the kth coefficients are
This schedule() ensures that no two workers will ever
updated concurrently, parallelization error may occur,
touch the same rows of V in the same iteration,5 unlike pre-
causing the Lasso problem to converge slowly (or even
vious LDA implementations [28] which allow workers to
diverge outright).
update any parameter, resulting in dependency violations.
Petuum’s schedule(), push() and pull() interface
Note that the function LDAGibbsSample() in push()
is readily suited to implementing scheduled model-parallel
can be replaced with any recent state-of-the art Gibbs
Lasso. We use schedule() to choose parameters with low
dependency, and to prioritize non-converged parameters.
Petuum pipelines schedule() and push(); thus sched- 5. Petuum LDA’s “cyclic” schedule differs from the model streaming
in [3]; the latter has workers touch the same set of parameters, one set at
ule() does not slow down workers running push(). Fur- a time. Model streaming can easily be implemented in Petuum, by
thermore, by separating the scheduling code schedule() changing schedule() to output the same word range for every jp .
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 57

Fig. 10. Petuum Topic Model (LDA) model-parallel pseudocode.

sampling algorithm, such as the fast Metropolis-Hastings


algorithm in LightLDA [3]. Our experiments use the Spar-
seLDA algorithm [30], to ensure a fair comparison with
YahooLDA [28] (which also uses SparseLDA).

4.4 Matrix Factorization


MF is used in recommendation, where users’ item preferen-
ces are predicted based on other users’ partially observed
ratings. The MF algorithm decomposes an incomplete obser- Fig. 11. Petuum MF model-parallel pseudocode.
vation matrix XN
M into two smaller matrices W 2 RK
N
and H 2 RK
M such that W T H X, where K minfM; Ng which essentially repeats the available data xij at each
is a user-specified rank: worker to restore load balance. Petuum can implement
X Fugue SGD MF as Algorithm 11; we also provide an Alter-
minW;H jjXij  wTi h j jj2 þ RegðW; HÞ; (13) nating Least Squares implementation for comparison against
ði;jÞ2V other ALS-using systems like Spark and GraphLab.

where RegðW; HÞ is a regularizer such as the Frobenius 4.5 Other Algorithms


norm, and V indexes the observed entries in X. High-rank We have implemented other data- and model-parallel
decompositions of large matrices provide improved accu- algorithms on Petuum as well. Here, we briefly mention a
racy [4], and can be solved by a model-parallel stochastic few algorithms whose data/model-parallel implementa-
gradient approach (Fig. 11) that ensures workers never tion on Petuum substantially differs from existing soft-
touch the same elements of W; H in the same iteration. ware. Many other ML algorithms are included in the
There are two update equations, for W; H respectively: Petuum open-source code.
X
dWik ¼ Iða ¼ iÞ½2Xab Hkb þ 2Wa Hb Hkb ; (14) 4.5.1 Deep Learning
ða;bÞ2V We implemented two types on Petuum: a general-purpose
fully-connected Deep Neural Network (DNN) using the
X
dHkj ¼ Iðb ¼ jÞ½2Xab Wak þ 2Wa Hb Wak ; (15) cross-entropy loss, and a Convolutional Neural Network
ða;bÞ2V (CNN) for image classification based off the open-source
Caffe project. We adopt a data-parallel strategy schedu-
where IðÞ is the indicator function. le_fix(), where each worker uses its data subset to per-
Previous systems using this approach [18] exhibited a form updates push() to the full model A. While this data-
load-balancing issue, because the rows of X exhibit a power- parallel strategy could be amenable to MapReduce, Spark
law distribution of non-zero entries; this was theoretically and GraphLab, we are not aware of DL implementations on
solved by the Fugue algorithm implemented on Hadoop [31], those platforms.
58 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

TABLE 1
Petuum ML Library (PMLlib): ML Applications and Achievable Problem Scale for a Given Cluster Size

ML Application Problem scale achieved on given cluster size


Topic Model (LDA) 220 b params (22 m unique words, 10 k topics) on 256 cores and 1 TB memory
Constrained Topic Model (MedLDA) 610 m params (61k unique words, 10 k topics, 20 classes) on 16 cores and
128 GB memory
Convolutional Neural Networks 1b params on 1,024 CPU cores and 2 TB memory
Fully-connected Deep Neural Networks 24 m params on 96 CPU cores and 768 GB memory
Matrix Factorization (MF) 20 m-by-20k input matrix, rank 400 (8 b params) on 128 cores and 1 TB memory
Non-negative Matrix Factorization (NMF) 20 m-by-20k input matrix, rank 50 (1 b params) on 128 cores and 1 TB memory
Sparse Coding (SC) 1 b params on 128 cores and 1TB memory
Logistic Regression (LR) 100 m params (50 k samples, 2 b nonzeros) on 512 cores and 1 TB memory
Multi-class Logistic Regression (MLR) 10 m params (10 classes, 1 m features) on 512 cores and 1 TB memory
Lasso Regression 100 m params (50 k samples, 2 b nonzeros) on 512 cores and 1 TB memory
Support Vector Machines 100 m params (50 k samples, 2 b nonzeros) on 512 cores and 1 TB memory
Distance Metric Learning (DML) 441 m params (63 k samples, 21 k feature dimension) on 64 cores and 512 GB memory
K-means clustering 1 m params (10 m samples, 1 k feature dimension, 1 k clusters) on 32 cores and
256 GB memory
Random Forest 8,000 trees (2 m samples) on 80 cores and 640 GB memory

Petuum’s goal is to solve large model and data problems using medium-sized clusters with only 10 s of machines (100-1,000 cores, 1 TB+ memory). Running time
varies between 10 s of minutes to several days, depending on the application.

4.5.2 Logstic Regression (LR) and Support Vector all Petuum PMLlib applications (and new user-written
Machines (SVM) applications) can be run in stand-alone mode or deployed
Petuum implements LR and SVM using the same depen- as YARN jobs to be scheduled alongside other MapReduce
dency-checking, prioritized model-parallel strategy as the jobs, and PMLlib applications can also read/write input
Lasso Algorithm 9. Dependency checking and prioritization data and output results from both the local machine filesys-
are not easily implemented on MapReduce and Spark. tem as well as HDFS. More specifically, Petuum provides a
While GraphLab has support for these features; the key dif- YARN launcher that will deploy any Petuum application
ference with Petuum is that Petuum’s scheduler performs (including user-written ones) onto a Hadoop cluster; the
dependency checking on small subsets of parameters at a YARN launcher will automatically restart failed Petuum
time, whereas GraphLab performs graph partitioning on all jobs and ensure that they always complete. Petuum also
parameters at once (which can be costly). provides a data access library with C++ iostreams (or Java
file streams) for HDFS access, which allows users to write
4.5.3 Maximum Entropy Discrimination LDA (MedLDA) generic file stream code that works on both HDFS files the
local filesystem. The data access library also provides pre-
MedLDA [32] is a constrained variant of LDA, that uses side implemented routines to load common data formats, such
information to constrain and improve the recovered topics. as CSV, libSVM, and sparse matrix.
Petuum implements MedLDA using a data-parallel strategy While Petuum PMLlib applications are written in C++
schedule_fix(), where each worker uses push() to alter- for maximum performance, new Petuum applications can
nate between Gibbs sampling (like regular LDA) and solv- be written in either Java or C++; Java has the advantages of
ing for Lagrange multiplers associated with the constraints. easier deployment and a wider user base.
5 BIG DATA ECOSYSTEM SUPPORT 6 PRINCIPLES AND THEORY
To support ML at scale in production, academic, or cloud- Our iterative-convergent formulation of ML programs, and
compute clusters, Petuum provides a ready-to-run ML the explicit notion of data and model parallelism, make it
library, called PMLlib; Table 1 shows the current list of ML convenient to explore three key properties of ML pro-
applications, with more applications are actively being grams—error-tolerant convergence, non-uniform conver-
developed for future releases. Petuum also integrates with gence, dependency structures (Fig. 12)—and to analyze
Hadoop ecosystem modules, thus reducing the engineering how Petuum exploits these properties in a theoretically-
effort required to deploy Petuum in academic and real-
world settings.
Many industrial and academic clusters run Hadoop,
which provides, in addition to the MapReduce program-
ming interface, a job scheduler that allows multiple pro-
grams to run on the same cluster (YARN) and a distributed
filesystem for storing Big Data (HDFS). However, programs
that are written for stand-alone clusters are not compatible
with YARN/HDFS, and vice versa, applications written for
YARN/HDFS are not compatible with stand alone clusters.
Fig. 12. Key properties of ML algorithms: (a) Non-uniform convergence;
Petuum solves this issue by providing common libraries (b) Error-tolerant convergence; (c) Dependency structures amongst
that work on either Hadoop or non-Hadoop clusters. Hence, variables.
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 59

sound manner to speed up ML program completion at Big (2) keeping staleness (i.e., asynchrony) as low as possible
Learning scales. improves per-iteration convergence—specifically, the
Some of these properties have previously been success- bound becomes tighter with lower maximum staleness s,
fully exploited by a number of bespoke, large-scale imple- and lower average mg and variance s g of the staleness expe-
mentations of popular ML algorithms: e.g., topic models [3], rienced by the workers. Conversely, naive asynchronous
[17], matrix factorization [33], [34], and deep learning [1]. It systems (e.g., Hogwild! [35] and YahooLDA [28]) may expe-
is notable that MapReduce-style systems (such as rience poor convergence, particularly in production envi-
Hadoop [11] and Spark [12]) often do not fare competitively ronments where machines may slow down due to other
against these custom-built ML implementations, and one of tasks or users. Such slowdown can cause the maximum
the reasons is that these key ML properties are difficult to staleness s and staleness variance s g to become arbitrarily
exploit under a MapReduce-like abstraction. Other abstrac- large, leading to poor convergence rates. In addition to the
tions may offer a limited degree of opportunity—for exam- above theorem (which bounds the distribution of x), Dai
ple, vertex programming [13] permits graph dependencies et al. also showed that the variance of x can be bounded,
to influence model-parallel execution. ensuring reliability and stability near an optimum [14].

6.1 Error Tolerant Convergence 6.2 Dependency Structures


Data-parallel ML algorithms are often robust against minor Naive parallelization of model-parallel algorithms (e.g.,
errors in intermediate calculations; as a consequence, they coordinate descent) may lead to uncontrolled parallelization
still execute correctly even when their model parameters A error and non-convergence, caused by inter-parameter
experience synchronization delays (i.e., the P workers only dependencies in the model. The mathematical definition of
see old or stale parameters), provided those delays are dependency differs between algorithms and models; exam-
strictly bounded [8], [9], [14], [23], [31], [35]. Petuum ples include the Markov Blanket structure of graphical
exploits this error-tolerance to reduce network communica- models (explored in GraphLab [13]) and deep neural net-
tion/synchronization overheads substantially, by imple- works (partially considered in [5]), or the correlation
menting the Stale Synchronous Parallel consistency between data dimensions in regression problems (explored
model [14], [23] on top of the parameter server system, in [22]).
which provides all machines with access to parameters A. Such dependencies have been thoroughly analyzed under
The SSP consistency model guarantees that if a worker fixed execution schedules (where each worker updates the
reads from parameter server at iteration c, it is guaranteed same set of parameters every iteration) [10], [22], [36], but
to receive all updates from all workers computed at and there has been little research on dynamic schedules that can
before iteration c  s  1, where s is the staleness threshold. react to changing model dependencies or model state A.
If this is impossible because some straggling worker is more Petuum’s scheduler allows users to write dynamic schedul-
than s iterations behind, the reader will stop until the strag- ing functions SpðtÞ ðAðtÞ Þ—whose output is a set of model indi-
gler catches up and sends its updates. For stochastic gradi- ces fj1 ; j2 0; . . .g, telling worker p to update Aj1 ; Aj2 ; . . .—as
ent descent algorithms (such as the DML program), SSP has per their application’s needs. This enables ML programs to
very attractive theoretical properties [14], which we par- analyze dependencies at run time (implemented via sched-
tially re-state here: ule()), and select subsets of independent (or nearly-inde-
Theorem 1 (adapted from [14] SGDPunder SSP, conver- pendent) parameters for parallel updates.
gence in probability). Let fðxÞ ¼ Tt¼1 ft ðxÞ be a convex To motivate this, we consider a generic optimization
function, where the ft are also convex. We search for a mini- problem, which many regularized regression problems
mizer x via stochastic gradient descent on each component (RRPs)—including the Petuum Lasso example—fit into:
rft under SSP, with staleness parameter s and P workers. Let Definition (Regularized Regression Problem).
ut :¼ ht rt ft ð~xt Þ with ht ¼ phffit. Under suitable conditions
(ft are L-Lipschitz and bounded divergence Dðxjjx0 Þ  F 2 ), min fðwÞ þ rðwÞ; (16)
w2Rd
we have
P
 
where w is the parameter vector, rðwÞ ¼ i rðwi Þ is separable
R½X 1 2 F2 2 and f has b-Lipschitz continuous gradient in the following
P  pffiffiffiffi hL þ þ 2hL mg  t
T T h sense:
( )
T t 2 b
 exp ; fðw þ zÞ  fðwÞ þ z> rfðwÞ þ z> X> Xz; (17)
hT s g þ 23 hL2 ð2s þ 1ÞP t
2 2

PT 2 L4 ðln T þ1Þ where X ¼ ½x1 ; . . . ; xd are d feature vectors. W.l.o.g., we


where R½X :¼ t¼1 ft ð~
xt Þ  fðx Þ, and h
T ¼ h T ¼ assume that each feature vector xi is normalized, i.e.,
oð1Þ as T ! 1. kxi k2 ¼ 1; i ¼ 1; . . . ; d. Therefore jx>
i xj j  1 for all i; j.

This theorem has two implications: (1) learning under the In the regression setting, fðwÞ represents a least-squares
SSP model is correct (like Bulk Synchronous Parallel (BSP) loss, rðwÞ represents a separable regularizer (e.g., ‘1 pen-
learning), because R½X
T —which is the difference between the alty), and xi represents the ith feature column of the design
SSP parameter estimate and the true optimum—converges (data) matrix, each element in xi is a separate data sample.
to OðT 1=2 Þ in probability with an exponential tail-bound; In particular, jx>
i xj j is the correlation between the ith and
60 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

jth feature columns. The parameters w are simply the ðtÞ ðtÞ 2dPm
E½jwideal  wRRP j  L2 X> XC; (20)
regression coefficients. ðt þ 1Þ2 P^
In the context of the model-parallel equation (4), we can for constants C; m; L; P^. The proof for this theorem can be
map the model A ¼ w, the data D ¼ X, and the update found in the Appendix.
equation DðA; Sp ðAÞÞ to
This theorem says that the difference between the SRRP ðÞ
b 1 parameter estimate wRRP and the ideal oracle estimate
wþ arg min ½z  ðwjp  gjp Þ 2 þ rðzÞ; (18)
jp z2R 2 b wideal rapidly vanishes, at a fast 1=ðt þ 1Þ2 rate. In other
words, one cannot do much better than SRRP ðÞ schedul-
where SpðtÞ ðAÞ has selected a single coordinate jp to be ing—it is near-optimal.
updated by worker p—thus, P coordinates are updated in We close this section by noting that SRRP ðÞ is different
every iteration. The aggregation function F ðÞ simply allows from Scherrer et al. [22], who pre-cluster all M features
each update wjp to pass through without change. before starting coordinate descent, in order to find “blocks”
The effectiveness of parallel coordinate descent depends of nearly-independent parameters. In the Big Data and
on how the schedule SpðtÞ ðÞ selects the coordinates jp . In par- especially Big Model setting, feature clustering can be pro-
ticular, naive random selection can lead to poor conver- hibitive—fundamentally, it requires OðM 2 Þ evaluations of
gence rate or even divergence, with error proportional to jx> 2
i xj j for all M feature combinations ði; jÞ, and although
the correlation jx> ja xjb j between the randomly-selected coor- greedy clustering algorithms can mitigate this to some
dinates ja ; jb [10], [22]. An effective and cheaply-computable extent, feature clustering is still impractical when M is very
ðtÞ
schedule SRRP;p ðÞ involves randomly proposing a small set large, as seen in some regression problems [27]. The pro-
of Q > P features fj1 ; . . . ; jQ g, and then finding P features posed SRRP ðÞ only needs to evaluate a small number of
in this set such that jx> jx>
i xj j every iteration. Furthermore, the random selection in
ja xjb j  u for some threshold u, where
ja ; jb are any two features in the set of P . This requires at SRRP ðÞ can be replaced with prioritization to exploit non-uni-
form convergence in ML problems, as explained next.
most OðB2 Þ evaluations of jx> ja xjb j  u (if we cannot find P
features that meet the criteria, we simply reduce the 6.3 Non-Uniform Convergence
degree of parallelism). We have the following convergence In model-parallel ML programs, it has been empirically
theorem: observed that some parameters Aj can converge in much
2
Theorem 2 (SRRP ðÞ convergence). Let  :¼ dðE½P =E½P d2
1 Þðr1Þ
fewer/more updates than other parameters [21]. For
ðE½P 1 Þðr1Þ instance, this happens in Lasso regression because the
d < 1, where r is a constant that depends on the
model enforces sparsity, so most parameters remain at zero
input data x and the scheduler SRRP ðÞ. After t steps, we have throughout the algorithm, with low probability of becoming
Cdb 1 non-zero again. Prioritizing Lasso parameters according to
$
E½F ðwðtÞ Þ  F ðw Þ  ; (19) their magnitude greatly improves convergence per itera-
E½P ð1  Þ t
tion, by avoiding frequent (and wasteful) updates to zero
$
where F ðwÞ :¼ fðwÞ þ rðwÞ and w is a minimizer of F . parameters [21].
E½P is the average degree of parallelization over all itera- We call this non-uniform ML convergence, which can be
tions—we say “average” to account for situations where the exploited via a dynamic scheduling function SpðtÞ ðAðtÞ Þ
scheduler cannot select P nearly-independent parameters (due whose output changes according to the iteration t—for
to high correlation in the data). The proof for this theorem can instance, we can write a scheduler Smag ðÞ that proposes
be found in the Appendix. For most real-world data sets, this is parameters with probability proportional to their current
ðtÞ
not a problem, and E½P is equal to the number of workers. magnitude ðAj Þ2 . This Smag ðÞ can be combined with the
earlier dependency structure checking, leading to a depen-
This theorem states that SRRP ðÞ-scheduling (which is dency-aware, prioritizing scheduler. Unlike the dependency
used by Petuum Lasso) achieves close to P -fold (linear) structure issue, prioritization has not received as much
improvement in per-iteration convergence (where P is the attention in the ML literature, though it has been used to
number of workers). This comes from the 1=E½P ð1  Þ fac- speed up the PageRank algorithm, which is iterative-
tor on the RHS of Eq. (19); for input data x that is sparse and convergent [37].
high-dimensional, the SRRP ðÞ scheduler will cause r  1 to The prioritizing schedule Smag ðÞ can be analyzed in the
become close to zero, and therefore  will also be close to context of the Lasso problem. First, we rewrite it by dupli-
zero—thus the per-iteration convergence rate is improved cating the original J features with opposite sign, as in [10]:
by nearly P -fold. We contrast this against a naive system P
F ðb bk22 þ 2J
bÞ :¼ minb 12 ky  Xb j¼1 bj : Here, X contains 2J
that selects coordinates at random; such a system will have
features and bj  0, for all j ¼ 1; . . . ; 2J.
far larger r  1, thus degrading per-iteration convergence.
In addition to asymptotic convergence, we show that Theorem 4 (Adapted from [21] Optimality of Lasso prior-
SRRP ’s trajectory is close to ideal parallel execution: ity scheduler). Suppose B is the set of indices of coefficients
Theorem 3 (SRRP ðÞ is close to ideal execution). Let Sideal ðÞ be updated in parallel at the tth iteration, and r is sufficiently
ðtÞ ðtÞ
an oracle schedule that always proposes P random features small constant such that rdbj dbk 0, for all j 6¼ k 2 B.
ðtÞ ðtÞ
with zero correlation. Let wideal be its parameter trajectory, Then, the sampling distribution pðjÞ / ðdbj Þ2 approximately
ðtÞ
and let wRRP be the parameter trajectory of SRRP ðÞ. Then, bðtÞ Þ  F ðb
maximizes a lower bound on EB ½F ðb bðtÞ þ db
bðtÞ Þ .
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 61

This theorem shows that a prioritizing scheduler will


speed up Lasso convergence, by decreasing the objective as
much as is theoretically possible every iteration.
In practice, the Petuum scheduler system approximates
ðtÞ ðt1Þ
pðjÞ / ðdbj Þ2 with p0 ðjÞ / ðbj Þ2 þ h, in order to allow
pipelining of multiple iterations for faster real-time conver-
gence.6 The constant h ensures that all bj ’s have a non-zero
probability of being updated.

7 PERFORMANCE
Petuum’s ML-centric system design supports a variety of
ML programs, and improves their performance on Big Data
in the following senses: (1) Petuum ML implementations
achieve significantly faster convergence rate than well-opti-
mized single-machine baselines (i.e., DML implemented on
single machine, and Shotgun [10]); (2) Petuum ML imple-
mentations can run faster than other programmable plat-
forms (e.g., Spark, GraphLab7), because Petuum can exploit
model dependencies, uneven convergence and error toler-
ance; (3) Petuum ML implementations can reach larger
model sizes than other programmable platforms, because
Petuum stores ML program variables in a lightweight fash-
ion (on the parameter server and scheduler); (4) for ML
programs without distributed implementations, we can
implement them on Petuum and show good scaling with an
increasing number of machines. We emphasize that Petuum Fig. 13. Performance increase in ML applications due to the Petuum
is, for the moment, primarily about allowing ML practi- Parameter Server (PS) and Scheduler. The Eager Stale Synchro-
tioners to implement and experiment with new data/ nous Parallel consistency model (on the PS) improves the number
of iterations executed per second (throughput) while maintaining
model-parallel ML algorithms on small-to-medium clusters. per-iteration quality. Prioritized, dependency-aware scheduling
Our experiments are therefore focused on clusters with 10- allows the Scheduler to improve the quality of each iteration, while
100 machines, in accordance with our target users. maintaining iteration throughput. In both cases, overall real-time con-
vergence rate is improved—30 percent improvement for the PS
Matrix Factorization example, and several orders of magnitude for
7.1 Hardware Configuration the Scheduler Lasso example.
To demonstrate that Petuum is adaptable to different hard-
ware generations, our experiments used three clusters with
varying specifications: Cluster-1 has up to 128 machines using PLMlib’s Matrix Factorization with the schedule
with two AMD cores, 8 GB RAM, 1 Gbps Ethernet; Cluster-2 () function disabled (in order to remove the beneficial
has up to 16 machines with 64 AMD cores, 128 GB RAM, effect of scheduling, so we may focus on the PS). This
40 Gbps Infiniband; Cluster-3 has up to 64 machines with experiment was conducted using 64 Cluster-3 machines
16 Intel cores, 128 GB RAM, 10 Gbps Ethernet. on a 332 GB sparse matrix (7.7 m by 288 k entries, 26 b
nonzeros, created by duplicating the Netflix dataset 16
7.2 Parameter Server and Scheduler Performance times horizontally and 16 times vertically). We compare
Petuum’s Parameter Server (PS) and Scheduler speed up the performance of MF running under Petuum PS’s
existing ML algorithms by improving iteration throughput and Eager SSP mode (using staleness s ¼ 2; higher staleness
iteration quality respectively. We measure iteration throughput values did not produce additional benefit), versus run-
as “iterations executed per second”, and we quantify iteration ning under MapReduce-style Bulk Synchronous Parallel
quality by plotting the ML objective function L against itera- mode. Fig. 13 shows that ESSP provides a 30 percent
tion number—“objective progress per iteration”. In either improvement to iteration throughput (top left), without a
case, the goal is to improve the ML algorithm’s real-time con- significantly affecting iteration quality (top right). Conse-
vergence rate, quantified by plotting the objective function L quently, the MF application converges 30 percent faster
against real time (“objective progress per second”). in real time (middle left).
The iteration throughput improvement occurs because
7.2.1 Parameter Server ESSP allows both gradient computation and inter-worker
We consider how the PS improves iteration throughput communication to occur at the same time, whereas classic
(through the Eager SSP consistency model), evaluated BSP execution requires computation and communication to
alternate in a mutually exclusive manner. Because the maxi-
6. Without this approximation, pipelining is impossible because mum staleness s ¼ 2 is small, and because ESSP eagerly
ðtÞ ðtÞ
dbj is unavailable until all computations on bj have finished. pushes parameter updates as soon as they are available,
7. We omit Hadoop and Mahout, as it is already well-established there is almost no penalty to iteration quality despite allow-
that Spark and GraphLab significantly outperform it [12], [13]. ing staleness.
62 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

Fig. 15. Left: LDA convergence time: Petuum versus YahooLDA (lower
is better). Petuum’s data-and-model-parallel LDA converges faster than
YahooLDA’s data-parallel-only implementation, and scales to more LDA
Fig. 14. Left: Petuum relative speedup versus popular platforms (larger parameters (larger vocab size, number of topics). Right panels: Matrix
is better). Across ML programs, Petuum is at least 2-10 times faster Factorization convergence time: Petuum vs GraphLab vs Spark. Petuum
than popular implementations. Right: Petuum allows single-machine is fastest and the most memory-efficient, and is the only platform that
algorithms to be transformed into cluster versions, while still achieving could handle Big MF models with rank K  1;000 on the given hardware
near-linear speedup with increasing number of machines (Caffe CNN budget.
and DML).

7.2.2 Scheduler it performs twice as fast as Spark, while GraphLab ran out
We examine how the Scheduler improves iteration quality of memory (due to the need to construct an explicit graph
(through a well-designed schedule() function), evaluated representation, which consumes significant memory). On
using PMLlib’s Lasso application. This experiment was con- the other hand, Petuum LDA is nearly six times faster than
ducted using 16 Cluster-2 on a simulated 150 GB sparse data- YahooLDA; the speedup mostly comes from the Petuum
set (50 m features); adjacent features in the dataset are highly LDA schedule() (Fig. 10), which performs correct model-
correlated in order to simulate the effects of realistic feature parallel execution by only allowing each worker to operate
engineering. We compare the original PMLlib Lasso (whose on disjoint parts of the vocabulary. This is similar to Graph-
schedule() performs prioritization and dependency check- Lab’s implementation, but is far more memory-efficient
ing) to a simpler version whose schedule() selects parame- because Petuum does not need to construct a full graph
ters at random (the shotgun algorithm [10]). Fig. 13 shows representation of the problem.
that PMLlib Lasso’s schedule() slightly decreases iteration
7.3.2 Model Size
throughput (middle right), but greatly improves iteration
quality (bottom left), resulting in several orders of magnitude We show that Petuum supports larger ML models for the
improvement to real-time convergence (bottom right). same amount of cluster memory. Fig. 15 shows ML program
The iteration quality improvement is mostly due to prior- running time versus model size, given a fixed number of
itization; we note that without prioritization, 85 percent of machines—the left panel compares Petuum LDA and
the parameters would converge within five iterations, but YahooLDA; PetuumLDA converges faster and supports
the remaining 15 percent would take over 100 iterations. LDA models that are > 10 times larger,8 allowing long-tail
Moreover, prioritization alone is not enough to achieve fast topics to be captured. The right panels compare Petuum MF
convergence speed—when we repeated the experiment versus Spark and GraphLab; again Petuum is faster and
with a prioritization-only schedule() (not shown), the supports much larger MF models (higher rank) than either
parameters became unstable, which caused the objective baseline. Petuum’s model scalability comes from two fac-
function to diverged. This is because dependency checking tors: (1) model-parallelism, which divides the model across
is necessary to avoid correlation effects in Lasso (discussed machines; (2) a lightweight parameter server system with
in the proof to Theorem 2), which we observed were greatly minimal storage overhead. In contrast, Spark and GraphLab
amplified under the prioritization-only schedule(). have additional overheads that may not be necessary in an
ML context—Spark needs to construct a “lineage graph” in
order to preserve its strict fault recovery guarantees, while
7.3 Comparison to Programmable Platforms
GraphLab needs to represent the ML problem in graph
Fig. 14(left) compares Petuum to popular platforms for writ-
form. Because ML applications are error-tolerant, fault
ing new ML programs (Spark v1.2 and GraphLab), as well as
recovery can be performed with lower overhead through
a well-known cluster implementation of LDA (YahooLDA).
periodic checkpointing.
We compared Petuum to Spark, GraphLab and YahooLDA
on two applications: LDA and MF. We ran LDA on 128 Clus- 7.4 Fast Cluster Implementations of New ML
ter-1 machines, using 3.9 m English Wikipedia abstracts with Programs
unigram (V ¼ 2:5 m) and bigram (V ¼ 21:8 m) vocabularies; Petuum facilitates the development of new ML programs
the bigram vocabulary is an example of feature engineering without existing cluster implementations; here we present
to improve performance at the cost of additional computa- two case studies. The first is a cluster version of the open-
tion. The MF comparison was performed on 10 Cluster-2 source Caffe CNN toolkit, created by adding  600 lines of
machines using the original Netflix dataset. Petuum code. The basic data-parallel strategy in Caffe was
left unchanged, so the Petuum port directly tests Petuum’s
7.3.1 Speed efficiency. We tested on four Cluster-3 machines, using a
For MF and LDA, Petuum is between 2-6 times faster than 250k subset of Imagenet with 200 classes, and 1.3 m model
other platforms (Figs. 14, 15). For MF, Petuum uses the
same model-parallel approach as Spark and GraphLab, but 8. LDA model size is equal to vocab size times number of topics.
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 63

focus on iterative-convergent ML properties. For example,


fault recovery in ML does not require perfect, synchronous
checkpoints (used in Hadoop and Spark); instead, check-
points with ESSP-style bounded error consistency can be
used. This in turn opens up new ways to achieve on-the-fly
resource adjustment and multi-tenancy.

APPENDIX
PROOF OF THEOREM 2
We prove that the Petuum SRRP ðÞ scheduler makes the Reg-
Fig. 16. Petuum DML objective versus time convergence curves, from
one to four machines. ularized Regression Problem converge. We note that SRRP ðÞ
has the following properties: (1) the scheduler uniformly ran-
parameters. Compared to the original single-machine domly selects Q out of d coordinates (where d is the number
Caffe (which does not have the overhead of network com- of features); (2) the scheduler performs dependency checking
munication), Petuum approaches linear speedup (3:1-times and retains P out of Q coordinates; (3) in parallel, each of
speedup on 4 machines, Fig. 14 (right plot)) due to the the P workers is assigned one coordinate, and performs
parameter server’s ESSP consistency for managing network coordinate descent on it:
communication.  
2
Second, we compare the Petuum DML program against b 1
wþ arg min z  wjp  gjp þrðzÞ; (21)
the original DML algorithm proposed in [25] (denoted by
jp z2R 2 b
Xing2002), implemented using SGD on a single machine
(with parallelization over matrix operations). The intent is to where gj ¼ rj fðwÞ is the j-th partial derivative, and the
show that, even for ML algorithms that have received less coordinate jp is assigned to the p-th worker. Note that (21) is
research attention towards scalability (such as DML), one simply the gradient update: w w  b1 g, followed by apply-
can still implement a reasonably simple data-parallel SGD ing the proximity operator of r.
algorithm on Petuum, and enjoy the benefits of paralleliza- As we just noted, SRRP ðÞ scheduling selects P coordi-
tion over a cluster. The DML experiment was run on four nates out of Q by performing dependency checking: effec-
Cluster-2 machines, using the 1-million-sample Imagenet tively, the scheduler will put coordinates i and j into the
[38] dataset with 1,000 classes (21.5k-by-21.5k matrix with same “block” iff jx>i xj j  u for some “correlation threshold”
220 m model parameters), and 200 m similar/dissimilar u 2 ð0; 1Þ. The idea is that coordinates in the same block will
statements. The Petuum DML implementation converges 3.8 never be updated in parallel; the algorithm must choose the
times faster than Xing2002 on four machines (Fig. 14, right P coordinates from P distinct blocks. In order to analyze
plot). We also evaluated Petuum DML’s convergence speed the effectiveness of this procedure, we will consider the fol-
on 1-4 machines (Fig. 16)—compared to using 1 machine, lowing matrix:
Petuum DML achieves 3.8 times speedup with four
>
machines and 1.9 times speedup with two machines. xi xj ; if jx> i xj j  u :
8i; Aii ¼ 1; 8i 6¼ j; Aij ¼ (22)
0; otherwise
8 SUMMARY AND FUTURE WORK This matrix A captures the impact of grouping coordinates
Petuum provides ML practitioners with an ML library and into blocks, and its spectral radius r ¼ rðAÞ will be used to
ML programming platform, capable of handling Big Data show that scheduling entails a nearly P -fold improvement
and Big ML Models with performance that is competitive in convergence with P processors. A simple bound for the
with specialized implementations, while running on reason- spectral radius rðAÞ is:
able cluster sizes (10-100 machines). This is made possible X
by systematically exploiting the unique properties of itera- jr  1j  jAij j  ðd  1Þu: (23)
tive-convergent ML algorithms—error tolerance, depen- j6¼i

dency structures and uneven convergence; these properties


SRRP ðÞ scheduling sets the correlation threshold u to a
have yet to be thoroughly explored in general-purpose Big
small constant, causing the spectral radius r to also be
Data platforms such as Hadoop and Spark.
small (which will lead to a nearly P -fold improvement in
In terms of feature set, Petuum is still relatively imma-
per-iteration convergence rate). We contrast SRRP ðÞ with
ture compared to Hadoop and Spark, and lacks the follow-
ing: fault recovery from partial program state (critical for random shotgun-style [10] scheduling, which is equivalent
scaling to 1,000+ machines), ability to adjust resource usage to setting u ¼ 1; this causes r to become large, which will
on-the-fly in running jobs, scheduling jobs for multiple degrade the per-iteration convergence rate.
users (multi-tenancy), a unified data interface that closely Finally, let N denote the number of pairs ði; jÞ that pass
integrates with databases and distributed file systems, and the dependency check jx> i xj j  u. In high-dimensional prob-
support for interactive scripting languages such as Python lems with over 100 million dimensions, it is often the case
and R. The lack of these features imposes a barrier to entry that N d2 , because each coordinate i is unlikely to be cor-
for new users, and future work on Petuum will address related with more than a few other coordinates j. We there-
these issues—but in a manner consistent with Petuum’s fore assume N d2 for our analysis. We note that P , the
64 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

number of coordinates selected for parallel update by the the variance E½P 2 , the faster the algorithm converges (since
scheduler, is a random variable (because it may not always  is proportional to it).
be possible to select P independent coordinates). Our analy-
Remark. We compare Theorem 2 with Shotgun [10] and the
sis therefore considers the expected value E½P . We are now
Block greedy algorithm in [22]. The convergence rate we
ready to prove Theorem 2:
2 =E½P 1 Þðr1Þ
get is similar to shotgun, but with a significant difference:
Theorem 2. Let  :¼ dðE½P N ðE½P 1 Þðr1Þ
d < 1, then Our spectral radius r ¼ rðAÞ is potentially much smaller
after t steps, we have than shotgun’s rðX> XÞ, since by partitioning we zero
out all entries in the correlation matrix X> X that are big-
$ Cdb 1
E½F ðwðtÞ Þ  F ðw Þ  ; (24) ger than the threshold u. In other words, we get to control
E½P ð1  Þ t the spectral radius while shotgun is totally passive.
CB 1
$
where F ðwÞ :¼ fðwÞ þ rðwÞ and w denotes a (global) mini- The convergence rate in [22] is P ð1 0 Þ t , where
0
mizer of F (whose existence is assumed for simplicity). 0 ¼ ðP 1Þðr
B1
1Þ
. Compared with ours, we have a bigger
(hence worse) numerator (d versus B) but the denomina-
Proof of Theorem 2. We first bound the algorithm’s prog-
tor (0 versus ) are not directly comparable: we have a
ress at step t. To avoid cumbersome double indices, let
bigger spectral radius r and bigger d while [22] has a
w ¼ wt and z ¼ wtþ1 . Then, by applying (17), we have
smaller spectral radius r0 (essentially taking a submatrix
E½F ðzÞ  F ðwÞ of our A) and smaller B  1. Nevertheless, we note that
X P [22] may have a higher per-step complexity: each worker
E gjp ðwþ þ
jp  wjp Þ þ rðwjp Þ  rðwjp Þ
needs to check all of its assigned t coordinates just to
p¼1 update one “optimal” coordinate. In contrast, we simply

b 2 bX þ pick a random coordinate, and hence can be much


þ ðwþ  wjp Þ þ ðw  w jp Þðwþ
 w Þx>
x
jq jp jq cheaper per-step.
2 jp 2 p6¼q jp jq

E½P > þ þ b þ 2
¼ g ðw  wÞ þ rðw Þ  rðwÞ þ kw  wk2 PROOF OF THEOREM 3
d 2
bE½P ðP  1Þ þ For the Regularized Regression Problem, we prove that the
þ ðw  wÞ> ðA  IÞðwþ  wÞ
2N Petuum SRRP ðÞ scheduler produces a solution trajectory
bE½P bE½P ðP  1Þ ðr  1Þ þ ðtÞ
 kwþ  wk22 þ kw  wk22 wRRP that is close to ideal execution:
2d 2N
bE½P ð1  Þ Theorem 3. (SRRP ðÞ is close to ideal execution). Let Sideal ðÞ
 kwþ  wk22 ; be an oracle schedule that always proposes P random features
2d
2
ðtÞ
with zero correlation. Let wideal be its parameter trajectory,
where we define  ¼ dðE½P =E½PN
1 Þðr1Þ
, and the second ðtÞ
inequality follows from the optimality of wþ as defined and let wRRP be the parameter trajectory of SRRP ðÞ. Then,
in (21). Therefore as long as  < 1, the algorithm is
ðtÞ ðtÞ 2JPm
decreasing the objective. This in turn puts a limit on the E½jwideal  wRRP j  L2 XT XC; (27)
maximum number of parallel workers, which is inversely ðT þ 1Þ2 P^
proportional to the spectral radius r.
The rest of the proof follows the same line as the C is a data dependent constant, m is the strong convexity con-
shotgun paper [10]. Briefly, consider the case where stant, L is the domain width of Aj , and P^ is the expected num-
0 2 @rðwt Þ, then ber of indexes that SRRP ðÞ can actually parallelize in each
iteration (since it may not be possible to find P nearly-indepen-
F ðwtþ1 Þ  F ðw Þ  ðwtþ1  w Þ> g  kwtþ1  w k2  kgk2 ;
$ $ $
dent parameters).

and kwtþ1  wt k22 ¼ kgk22 =b2 . Thus, defining dt ¼ F ðwt Þ


$
F ðw Þ, we have We assume that the objective function F ðwÞ ¼ fðwÞ þ
rðwÞ is strongly convex — for certain problems, this can be
E½P ð1  Þ achieved through parameter replication, e.g., minw 12 jjy
Eðdtþ1  dt Þ   Eðd2tþ1 Þ (25) P
2dbkwtþ1  w$ k22 Xwjj22 þ  2Mj¼1 wj is the replicated form of Lasso regression
seen in Shotgun [10].
E½P ð1  Þ
 ½Eðdtþ1 Þ 2 : (26) Lemma 1. The difference between successive updates is:
2dbkwtþ1  w$ k22
1 1
Using induction it follows that Eðdt Þ  E½PCdb
ð1Þ t for some F ðw þ DwÞ  F ðwÞ  ðDwÞT Dw þ ðDwÞT X T XDw:
universal constant C. u
t 2
(28)
The theorem confirms two intuitions: The larger the
number of selected coordinates E½P (i.e., more parallel Proof of Lemma 1. The Taylor expansion of F ðw þ DwÞ
000
workers), the faster the algorithm converges per-iteration; around w coupled with the fact that F ðwÞ (3rd-order)
however, this also increases , demonstrating a tradeoff and higher order derivatives are zero leads to the above
between parallelization and correctness. Also, the smaller result. u
t
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 65

Proof of Theorem 3. By using Lemma 1, and telescoping Finally, we apply the strong convexity assumption
sum: to get

ðT Þ ð0Þ
F ðwideal Þ  F ðwideal Þ ðtÞ ðtÞ 2dPm
E½jwideal  wRRP j  L2 X > XC; (34)
X
T
ðtÞ ðtÞ 1 ðtÞ ðtÞ
ðt þ 1Þ2 P^
¼ ðDwideal Þ> Dwideal þ ðDwideal Þ> X > XDwideal :
t¼1
2
where m is the strong convexity constant. u
t
(29)
Since Sideal chooses P features with 0 correlation,
ACKNOWLEDGMENTS
ðT Þ ð0Þ
X
T
ðtÞ ðtÞ
F ðwideal Þ  F ðwideal Þ ¼ ðDwideal Þ> Dwideal : This work is supported in part by the US Defense Advanced
t¼1 Research Projects Agency (DARPA) FA87501220324, and
the US National Science Foundation (NSF) IIS1447676
Again using Lemma 1, and telescoping sum: grants to Eric P. Xing.
ðT Þ ð0Þ
F ðwRRP Þ  F ðwRRP Þ REFERENCES
X
T
ðtÞ ðtÞ 1 ðtÞ ðtÞ [1] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado,
¼ ðDwRRP Þ> DwRRP þ ðDwRRP Þ> X> XDwRRP : J. Dean, and A. Ng, “Building high-level features using large scale
t¼1
2 unsupervised learning,” in Proc. 29th Int. Conf. Mach. Learn., 2012,
(30) pp. 81–88.
[2] Y. Wang, X. Zhao, Z. Sun, H. Yan, L. Wang, Z. Jin, L. Wang,
Y. Gao, J. Zeng, Q. Yang, et al., “Towards topic modeling for big
Taking the difference of the two sequences, we have: data,” arXiv preprint arXiv:1405.4402, 2014.
[3] J. Yuan, F. Gao, Q. Ho, W. Dai, J. Wei, X. Zheng, E. P. Xing,
ðT Þ ðT Þ
F ðwideal Þ  F ðwRRP Þ T.-Y. Liu, and W.-Y. Ma, “Lightlda: Big topic models on mod-
! est compute clusters,” in Proc. 24th Int. World Wide Web Conf.,
X
T
ðtÞ ðtÞ 2015, pp. 1351–1361.
¼ ðDwideal Þ> Dwideal [4] Y. Zhou, D. Wilkinson, R. Schreiber, and R. Pan, “Large-scale par-
t¼1 allel collaborative filtering for the Netflix prize,” in Proc. 4th Int.
!
X
T
ðtÞ ðtÞ 1 ðtÞ ðtÞ
Conf. Algorithmic Aspects Inf. Manage., 2008, pp. 337–348.
 ðDwRRP Þ> DwRRP þ ðDwRRP Þ> X> XDwRRP : [5] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, Q. Le, M. Mao,
t¼1
2 M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Ng, “Large scale
distributed deep networks,” in Proc. Adv. Neural Inf. Process. Syst.,
(31) 2012, pp. 1232–1240.
[6] S. A. Williamson, A. Dubey, and E. P. Xing, “Parallel Markov
Taking expectations w.r.t. the randomness in iteration, chain Monte Carlo for nonparametric mixture models,” in Proc.
indices chosen at each iteration, and the inherent ran- Int. Conf. Mach. Learn., 2013, pp. 98–106.
domness in the two sequences, we have: [7] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic
variational inference,” J. Mach. Learn. Res., vol. 14, pp. 1303–1347,
ðT Þ ðT Þ
2013.
E½jF ðwideal Þ  F ðwRRP Þj [8] M. Zinkevich, J. Langford, and A. J. Smola, “Slow learners are
! fast,” in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 2331–2339.
X
T
ðtÞ ðtÞ
¼ E½j ðDwideal ÞT Dwideal [9] A. Agarwal and J. C. Duchi, “Distributed delayed stochastic
optimization,” in Proc. Adv. Neural Inf. Process. Syst., 2011,
t¼1
! pp. 873–881.
XT ðtÞ T ðtÞ 1 ðtÞ > > ðtÞ [10] J. K. Bradley, A. Kyrola, D. Bickson, and C. Guestrin, “Parallel
  DwRRP DwRRP þ DwRRP X XDwRRP j coordinate descent for l1-regularized loss minimization,” in Proc.
t¼1
2
Int. Conf. Mach. Learn., 2011, pp. 321–328.
  X [11] T. White, Hadoop: The Definitive Guide. Sebastopol, CA, USA:
1 T
ðtÞ ðtÞ
¼ Cdata þ E½j ðDwRRP Þ> X > XDwRRP Þj ; O’Reilly Media, 2012.
2 t¼1 [12] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I.
Stoica, “Spark: Cluster computing with working sets,” in Proc. 2nd
(32) USENIX Conf. Hot Topics Cloud Comput., 2010, p. 10.
[13] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M.
where Cdata is a data dependent constant. Here, the dif- Hellerstein, “Distributed GraphLab: A framework for machine
ðtÞ ðtÞ ðtÞ ðtÞ
ference between ðDwideal Þ> Dwideal and ðDwRRP Þ> DwRRP learning and data mining in the cloud,” in Proc. VLDB Endowment,
ðtÞ ðtÞ vol. 5, pp. 716–727, 2012.
can only be possible due to ðDwRRP Þ> X> XDwRRP . [14] W. Dai, A. Kumar, J. Wei, Q. Ho, G. Gibson, and E. P. Xing, “High-
Following the proof in the shotgun paper [10], we get performance distributed ML at scale through parameter server
consistency models,” in Proc. 29th Nat. Conf. Artif. Intell., 2015,
pp. 79–87.
ðtÞ ðtÞ 2dP
E½jF ðwideal Þ  F ðwRRP Þj  L2 X> XC; (33) [15] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N.
ðt þ 1Þ2 P^ Leiser, and G. Czajkowski, “Pregel: A system for large-scale graph
processing,” in Proc. ACM SIGMOD Int. Conf. Manage. Data, 2010,
where d is the length of w (number of features), C is a pp. 135–146.
[16] R. Power and J. Li, “Piccolo: Building fast, distributed programs
data dependent constant, L is the domain width of wj with partitioned tables,” in Proc. USENIX Conf. Operating Syst.
(i.e., the difference between its maximum and minimum Des. Implementation, article 10, 2010, pp. 1–14.
possible values), and P^ is the expected number of [17] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V.
Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed
indexes that SRRP ðÞ can actually parallelize in each machine learning with the parameter server,” in Proc. 11th USE-
iteration. NIX Conf. Operating Syst. Des. Implementation, 2014, pp. 583–598.
66 IEEE TRANSACTIONS ON BIG DATA, VOL. 1, NO. 2, APRIL-JUNE 2015

[18] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale Eric P. Xing received the PhD degree in molecu-
matrix factorization with distributed stochastic gradient descent,” lar biology from Rutgers University, and the
in Proc. ACM SIGKDD 17th Int. Conf. Knowl. Discovery Data Min- another PhD degree in computer science from
ing, 2011, pp. 69–77. UC Berkeley. He is a professor of machine learn-
[19] X. Chen, Q. Lin, S. Kim, J. Carbonell, and E. Xing, “Smoothing ing in the School of Computer Science,Carnegie
proximal gradient method for general structured sparse Mellon University, and the director in the CMU
learning,” in Proc. 27th Conf. Annu. Conf. Uncertainty Artif. Intell., Center for Machine Learning and Health. His
2011, pp. 105–114. principal research interests lie in the development
[20] T. L. Griffiths and M. Steyvers, “Finding scientific topics,” Proc. of machine learning and statistical methodology;
Nat. Acad. Sci. USA, vol. 101, no. suppl. 1, pp. 5228–5235, 2004. especially for solving problems involving auto-
[21] S. Lee, J. K. Kim, X. Zheng, Q. Ho, G. Gibson, and E. P. Xing, “On mated learning, reasoning, and decision-making
model parallelism and scheduling strategies for distributed in high-dimensional, multimodal, and dynamic possible worlds in social
machine learning,” in Proc. Adv. Neural Inf. Process. Syst., 2014, and biological systems. His current work involves, 1) foundations of
pp. 2834–2842. statistical learning, including theory and algorithms for estimating time/
[22] C. Scherrer, A. Tewari, M. Halappanavar, and D. Haglin, “Feature space varying-coefficient models, sparse structured input/output mod-
clustering for accelerating parallel coordinate descent,” in Proc. els, and nonparametric Bayesian models; 2) framework for parallel
Adv. Neural Inf. Process. Syst., 2012, pp. 28–36. machine learning on big data with big model in distributed systems or in
[23] Q. Ho, J. Cipar, H. Cui, J.-K. Kim, S. Lee, P. B. Gibbons, G. Gibson, the cloud; 3) computational and statistical analysis of gene regulation,
G. R. Ganger, and E. P. Xing, “More effective distributed ml via a genetic variation, and disease associations; and 4) application of
stale synchronous parallel parameter server,” in Proc. Adv. Neural machine learning in social networks, natural language processing, and
Inf. Process. Syst., 2013, pp. 1223–1231. computer vision. He is an associate editor of the Annals of Applied Sta-
[24] Y. Zhang, Q. Gao, L. Gao, and C. Wang, “Priter: A distributed tistics (AOAS), the Journal of American Statistical Association (JASA),
framework for prioritized iterative computations,” in Proc. 2nd the IEEE Transactions of Pattern Analysis and Machine Intelligence
ACM Symp. Cloud Comput., article 13, 2011, pp. 1–14. (PAMI), the PLoS Journal of Computational Biology, and an action editor
[25] E. P. Xing, M. I. Jordan, S. Russell, and A. Y. Ng, “Distance metric of the Machine Learning Journal (MLJ), the Journal of Machine Learning
learning with application to clustering with side-information,” in Research (JMLR). He is a member of the US Defense Advanced
Proc. Adv. Neural Inf. Process. Syst., 2002, pp. 505–512. Research Projects Agency (DARPA) Information Science and Technol-
[26] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, “Information- ogy (ISAT) Advisory Group, and a Program chair of ICML 2014.
theoretic metric learning,” in Proc. 24th Int. Conf. Mach. Learn.,
2007, pp. 209–216. Qirong Ho received the PhD degree in 2014,
[27] H. B. MacMahan, et al., “Ad click prediction: A view from the under Eric P. Xing at Carnegie Mellon University’s
trenches,” in Proc. ACM SIGKDD 19th Int. Conf. Knowl. Discovery Machine Learning Department. He is a scientist at
Data Mining, 2013, pp. 1222–1230. the Institute for Infocomm Research, A*STAR,
[28] A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Singapore, and an adjunct assistant professor at
Smola, “Scalable inference in latent variable models,” in Proc. the Singapore Management University School
ACM 5th Int. Conf. Web Search Data Mining, 2012, pp. 123–132. of Information Systems. His primary research
[29] K. P. Murphy, Machine Learning: A Probabilistic Perspective. Cam- focus is distributed cluster software systems for
bridge, MA, USA: MIT Press, 2012. machine learning at Big Data scales, with a view
[30] L. Yao, D. Mimno, and A. McCallum, “Efficient methods for topic toward correctness and performance guarantees.
model inference on streaming document collections,” in Proc. 15th In addition, he has performed research on statisti-
ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2009, cal models for large-scale network analysis—particularly latent space
pp. 937–946. models for visualization, community detection, user personalization and
[31] A. Kumar, A. Beutel, Q. Ho, and E. P. Xing, “Fugue: Slow-worker- interest prediction—as well as social media analysis on hyperlinked docu-
agnostic distributed learning for big models on big data,” in Proc. ments with text and network data. He received the Singapore A*STAR
Int. Conf. Artif. Intell. Statist., 2014, pp. 531–539. National Science Search Undergraduate and PhD fellowships.
[32] J. Zhu, X. Zheng, L. Zhou, and B. Zhang, “Scalable inference in
max-margin topic models,” in Proc. 19th ACM SIGKDD Int. Conf. Wei Dai is a PhD student in the Machine Learning
Knowl. Discovery Data Mining, 2013, pp. 964–972. Department at Carnegie Mellon University,
[33] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate
advised by Prof. Eric Xing. His research focuses
descent approaches to parallel matrix factorization for recom-
on large scale machine learning systems and
mender systems,” in Proc. IEEE 12th Int. Conf. Data Mining, 2012, algorithms. He has designed system abstractions
pp. 765–774. for machine learning programs that runs efficiently
[34] A. Kumar, A. Beutel, Q. Ho, and E. P. Xing, “Fugue: Slow-worker- in distributed settings, and provided analysis for
agnostic distributed learning for big models on big data,” in Proc. the correctness of algorithms under such abstrac-
17th Int. Conf. Artif. Intell. Statist., 2014, pp. 531–539. tions. He also works on databases to manage big
[35] F. Niu, B. Recht, C. Re, and S. J. Wright, “Hogwild!: A lock-free
data, with a particular focus on feature engineer-
approach to parallelizing stochastic gradient descent,” in Proc. ing geared toward machine learning problems.
Adv. Neural Inf. Process. Syst., 2011, pp. 693–701.
[36] P. Richtarik and M. Takac, “Parallel coordinate descent methods
for big data optimization,” arXiv preprint arXiv:1212.0873, 2012. Jin Kyu Kim is a PhD candidate at Carnegie Mel-
[37] Y. Zhang, Q. Gao, L. Gao, and C. Wang, “Priter: A distributed lon University, advised by professors Garth Gib-
framework for prioritizing iterative computations,” IEEE Trans. son and Eric P. Xing. He received the BS in
Parallel Distrib. Syst., vol. 24, no. 9, pp. 1884–1893, Sep. 2013. computer science from Sogang University, Korea
[38] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, in 2001 and MS in computer science, University
“Imagenet: A large-scale hierarchical image database,” in Proc. of Michigan, Ann-Arbor in 2003. He joined Sam-
IEEE Conf. Comput. Vis. Pattern Recog., 2009, pp. 248–255. sung Electronics, Ltd, Korea in 2003, where he
has been involved in the design of various NAND
flash based storage systems such as SD card,
mobile phone storage, and Solid State Drives.
Since he joined CMU in 2010 for Ph.D study, he
has been engaged in large-scale machine learning research. His
research interests include distributed frameworks for parallel machine
learning, collaborative filtering, topic modeling, and sparse linear optimi-
zation solvers.
XING ET AL.: PETUUM: A NEW PLATFORM FOR DISTRIBUTED MACHINE LEARNING ON BIG DATA 67

Jinliang Wei is a PhD candidate at Carnegie Pengtao Xie is a PhD candidate working with
Mellon University Computer Science Depart- Prof. Eric Xing in Carnegie Mellon University’s
ment, under supervision of Prof. Garth A. Gibson SAILING lab. His research interests lie in the
and Prof. Eric P. Xing. He’s doctoral research is diversity, regularization and scalability of latent
focused on large-scale and distributed systems variable models. Before coming to CMU, he
for ML computation, with an emphasis on sys- obtained a ME from Tsinghua University in 2013
tems research. Before coming to CMU, he and a BE from Sichuan University in 2010. He is
obtained his BS in computer engineering from the recipient of the Siebel Scholarship, Goldman
Purdue University, West Lafayette. Sachs Global Leader Scholarship and National
Scholarship of China.

Seunghak Lee is a research scientist at Human Abhimanu Kumar is a Senior Data Scientist at
Longevity, Inc. His research interests include Groupon Inc, Palo Alto, California. His principal
machine learning and computational biology. Spe- research interests lie in statistical machine learn-
cifically, he has performed research on genome- ing and large-scale computational system and
wide association mapping, structural variants architecture. Abhimanu received a masters in
detection, distributed optimization, and large-scale Computer Science from University of Texas at
machine learning algorithms and systems. In par- Austin, and another masters in Natural Language
ticular, he has focused on high-dimensional prob- Processing and Statistical Learning from Carnegie
lems with applications in computational biology. Mellon University. His current work involves, 1) the-
Dr. Lee received his PhD in 2015, under Dr. Eric ory for statistical learning, 2) distributed machine
P. Xing in Computer Science Department at Car- learning 3) application of statistical learning in nat-
negie Mellon University. ural language, user-interest mining, text and social networks.

Xun Zheng is a Masters student in the Machine Yaoliang Yu is currently a research scientist in
Learning Department at Carnegie Mellon Univer- the CMU Center for machine learning and Health.
sity, advised by Eric P. Xing. His research focuses His primary research interests include optimiza-
on efficient Bayesian inference methods, espe- tion theory and algorithms, nonparametric regres-
cially bringing classical ideas in Markov chain sion, kernel methods, and robust statistics. On the
Monte Carlo (MCMC) to boost inference for latent application side, Dr. Yu has worked on dimension-
variable models. He has performed research in ality reduction, face recognition, multimedia event
large scale machine learning as well. In particular, detection and topic models. Dr. Yu obtained his
he is working on designing algorithms that are eas- PhD from the Computing Science Department of
ier to parallelize and building suitable distributed University of Alberta in 2013.
systems for different type of learning algorithms.

" For more information on this or any other computing topic,


please visit our Digital Library at www.computer.org/publications/dlib.

You might also like