
Distributed Optimization and Data Market Design

Thesis by
Palma London

In Partial Fulfillment of the Requirements for the


degree of
Master of Science

CALIFORNIA INSTITUTE OF TECHNOLOGY


Pasadena, California

2017
Submitted May 19, 2017
© 2017
Palma London
All rights reserved
ACKNOWLEDGEMENTS

I thank Prof. Adam Wierman for advising me throughout my Master's degree.


I also thank Shai Vardi, Xiaoqi Ren, Niangjun Chen, and Juba Ziani for their
help during collaborations.
ABSTRACT

In this thesis we propose a new approach for distributed optimization based


on an emerging area of theoretical computer science – local computation algo-
rithms. The approach is fundamentally different from existing methodologies
and provides a number of benefits, such as robustness to link failure and adap-
tivity to dynamic settings. Specifically, we develop an algorithm, LOCO, that
given a convex optimization problem P with n variables and a “sparse” linear
constraint matrix with m constraints, provably finds a solution as good as
that of the best online algorithm for P using only O(log(n + m)) messages
with high probability. The approach is not iterative and communication is re-
stricted to a localized neighborhood. In addition to analytic results, we show
numerically that the performance improvements over classical approaches for
distributed optimization are significant, e.g., it uses orders of magnitude less
communication than ADMM.

We also consider the operations of a geographically distributed cloud data


market. We consider design decisions that include which data to purchase
(data purchasing) and where to place or replicate the data for delivery (data
placement). We show that a joint approach to data purchasing and data
placement within a cloud data market improves operating costs. This problem
can be viewed as a facility location problem, and is thus NP-hard. However,
we give a provably optimal algorithm for the case of a data market consisting
of a single data center, and then generalize the result from the single data
center setting in order to develop a near-optimal, polynomial-time algorithm
for a geo-distributed data market. The resulting design, Datum, decomposes
the joint purchasing and placement problem into two subproblems, one for
data purchasing and one for data placement, using a transformation of the
underlying bandwidth costs. We show, via a case study, that Datum is near-
optimal in practical settings.
PUBLISHED CONTENT AND CONTRIBUTIONS

Palma London, Niangjun Chen, Shai Vardi, and Adam Wierman. Distributed
Optimization via Local Computation Algorithms. http://users.cms.caltech.edu/~plondon/loco.pdf. Under submission. 2017.

P. London and S. Vardi came up with the results and proofs in this paper, and
P. London coded and ran all experiments. Article adapted and extended for
this thesis.

Xiaoqi Ren, Palma London, Juba Ziani, and Adam Wierman. Joint Data Pur-
chasing and Data Placement in a Geo-Distributed Data Market. Proceedings
of the 2016 ACM SIGMETRICS International Conference on Measurement
and Modeling of Computer Science. 2016.

X. Ren and P. London came up with the results and proofs in this paper, and
P. London coded and ran all experiments. Article adapted and extended for
this thesis.
TABLE OF CONTENTS

Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi

I Introduction and Motivation 1


Chapter I: Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 2

II Distributed Optimization via Local Computation Algorithms 4
Chapter II: Introduction to Distributed Optimization . . . . . . . . . . 5
2.1 Contributions of this work . . . . . . . . . . . . . . . . . . . . 6
2.2 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . 7
Chapter III: Network Utility Maximization . . . . . . . . . . . . . . . . 9
3.1 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.2 Distributed Algorithms for Network Utility Maximization . . . 10
3.3 Performance metrics . . . . . . . . . . . . . . . . . . . . . . . . 10
Chapter IV: Local Convex Optimization . . . . . . . . . . . . . . . . . 12
4.1 An overview of LOCO . . . . . . . . . . . . . . . . . . . . . . . 12
4.2 Analysis of LOCO . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.3 Contrasting LOCO and ADMM . . . . . . . . . . . . . . . . . 17
Chapter V: Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . 18
5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . 21
Chapter VI: Concluding Remarks . . . . . . . . . . . . . . . . . . . . . 23

III Data Purchasing and Data Placement in a Geo-Distributed Data Market 24
Chapter VII: Introduction to Distributed Data Markets . . . . . . . . . 25
Chapter VIII: Opportunities and challenges . . . . . . . . . . . . . . . 29
8.1 The potential of data markets . . . . . . . . . . . . . . . . . . 29
8.2 Operational challenges for data markets . . . . . . . . . . . . . 30
Chapter IX: A Geo-Distributed Data Cloud . . . . . . . . . . . . . . . 33
9.1 Modeling Data Providers . . . . . . . . . . . . . . . . . . . . . 33
9.2 Modeling Clients . . . . . . . . . . . . . . . . . . . . . . . . . . 34
9.3 Modeling a Geo-Distributed Data Cloud . . . . . . . . . . . . . 35
Chapter X: Optimal data purchasing & data placement . . . . . . . . . 40
10.1 An exact solution for a single data center . . . . . . . . . . . . 41
10.2 The design of Datum . . . . . . . . . . . . . . . . . . . . . . 44
Chapter XI: Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . 49
11.1 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . 49
11.2 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . 51
Chapter XII: Related work . . . . . . . . . . . . . . . . . . . . . . . . . 55
Chapter XIII: Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . 57
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
Appendix A: Pseudocode for General Online Fractional Packing . . . . 66
A.1 ADMM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Appendix B: Proof of Lemma 3 . . . . . . . . . . . . . . . . . . . . . . 68
Appendix C: Proof of Theorem 8 . . . . . . . . . . . . . . . . . . . . . 71
C.1 Proof of Theorem 9 . . . . . . . . . . . . . . . . . . . . . . . . 72
C.2 Proof of Step 2 in §10.2 . . . . . . . . . . . . . . . . . . . . . . 76
C.3 Bulk Data Contracting . . . . . . . . . . . . . . . . . . . . . . 77
Part I

Introduction and Motivation

Chapter 1

INTRODUCTION

We consider algorithms for distributed optimization and their applications. In


this thesis we propose two new approaches to distributed optimization and
consider an exciting application in distributed data markets.

The first algorithm, LOCO, is a fundamentally new approach to distributed


optimization. There are a wide variety of approaches for distributed opti-
mization, which fall into the categories of dual decomposition and subgradient
methods, and consensus-based schemes. We propose a new approach which
utilizes local computation algorithms, a rising field in theoretical computer
science. A local algorithm is one where a query about part of a solution to a
problem can be answered by communicating with only a small number of com-
putation units in the distributed setting. Neither iterative descent methods
nor consensus methods are local: answering a query about a part of the solu-
tion requires global communication. The advantage offered by LOCO is that
significantly less communication is required to solve the optimization problem
[60].

Secondly, we consider the management of a geographically distributed data center. For example, imagine we are operating a data service like Yelp. Clients
submit queries to Yelp on their personal devices. Then, Yelp contacts various
data providers that sell data and may have to pay a data purchasing fee. Yelp
routes data through its data centers, and delivers it to clients. We believe that
data as a service is a growing market. In the near future people might want to
buy data as a service just as computing infrastructure is bought today. Given
prices offered by data providers, we, the data center designers, need to decide
which data providers to buy data from to satisfy client queries at minimal cost.
We also decide how data should be stored and replicated throughout the geo-
distributed data center to minimize bandwidth and latency costs. We design a
data center that jointly optimizes data purchasing costs and bandwidth costs.
Today many data center designs minimize bandwidth costs. However, when
data purchasing costs are also considered, the structure of the optimization
problem changes. We model the problem as a facility location problem and
thus it is NP-hard. We propose a near-optimal, polynomial-time algorithm
and in a simulation study we show that our algorithm is near optimal [77].
Part II

Distributed Optimization via Local Computation Algorithms
Chapter 2

INTRODUCTION TO DISTRIBUTED OPTIMIZATION

The goal of this work is to introduce a new, fundamentally different approach


to distributed optimization based on an emerging area of theoretical computer
science – local computation algorithms.

Distributed optimization is an area of crucial importance to networked control. Settings where multiple, distributed, cooperative agents need to solve an
optimization problem to control a networked system are numerous and varied.
Examples include management of content distribution networks and data cen-
ters [13, 70], communication network protocol design [47, 62, 84], trajectory
optimization [39, 53], formation control of vehicles [86, 75], sensor networks
[69, 59], control of power systems [27, 72], and management of electric vehicles
and distributed storage devices [16, 35].

Distributed optimization is a field with a long history. Beginning in the 1960s


approaches emerged for solving large scale linear programs via decomposition
into pieces that could be solved in a distributed manner. For example, two
early approaches are Bender’s decomposition [9] and the Dantzig-Wolfe de-
composition [24, 23], which can both be generalized to nonlinear objectives
via the subgradient method [10, 67, 83].

Today, there is a wide variety of approaches for distributed optimization, e.g.,


primal decomposition [54, 10] and dual decomposition [28, 67, 61, 84]. See
[71] for a survey. Broadly, these approaches tend to fall into two categories.
The first category uses dual decomposition and subgradient methods [47, 61,
84]; the second involves consensus-based schemes which enable decentralized
information aggregation, which forms the basis for many first order and second
order distributed optimization algorithms [12, 66].

While the algorithms described above are distributed, they are not local. A
local algorithm is one where a query about a small part of a solution to a
problem can be answered by communicating with only a small neighborhood
around the part queried¹ (see Subsection 2.2 for a more comprehensive definition and example). Clearly, neither iterative descent methods nor consensus methods are local: answering a query about a piece of the solution requires global communication.

¹ ‘Local’ is an overloaded term in the literature. We mean local in the sense of [78].

Local computation is well suited for distributed optimization. For example,


any failure in the system only has local effects: if a node in a distributed
system goes offline while an iterative distributed algorithm is executing, the
whole process is brought to a halt (or at least the system needs to be carefully
designed to be able to accommodate such failures); if the computations are all
local, the failure will only affect a small number of nodes in the neighborhood
of the failure. Similarly, lag in a single edge affects the computation of the
entire solution in the iterative setting, while most computations will not be
affected at all when the computations are local. Another advantage of local
computation is that it allows the system to be more dynamic: an arrival of
another node requires recomputing the entire solution if the algorithm is not
local, but requires only a few local messages and computations if the algorithm
is local.

Despite the benefits of local algorithms for distributed optimization, the prob-
lem of designing a local, distributed optimization algorithm is open.

2.1 Contributions of this work


This paper introduces an algorithm, LOCO (LOcal Convex Optimization),
that is both distributed and local. It is not an iterative method and uses
far less communication to compute small parts of the solution than iterative
descent and consensus methods, e.g., ADMM and dual decomposition, while
matching the total communication if the whole solution is queried.

While the technique we propose is general, in this work we focus on a canonical optimization problem: network utility maximization. Due to space restric-
tions, we only consider the variant of maximizing throughput, which amounts
to solving a distributed linear program. We focus on this case because it is par-
ticularly well-studied and, in addition, the objective function is linear, which
in many cases is known to produce the worst performance guarantee for online
convex optimization problems [5, 41].

In Section 4, we provide worst-case guarantees on the performance of LOCO with


respect to the relative error and the number of messages it requires. In Sec-
tion 5, we compare the performance of LOCO with ADMM, and show that
LOCO uses orders of magnitude less communication than ADMM if only part
of the solution is required, and the same order of magnitude if the entire
solution is required. Furthermore, in terms of both the amount of communica-
tion required and the relative error, LOCO vastly outperforms its theoretical
guarantees.

The key idea behind LOCO is an extension of recent results from the emerging
field of local computation algorithms (LCA) in theoretical computer science
(e.g., [63, 76, 56]). In particular, a key insight of the field is that online
algorithms can be converted into local algorithms in graph problems with
bounded degree [63]. However, much of the focus of local algorithms has,
to this point, been on graph problems (see related literature below). The
technical contribution of this work is the extension of these ideas to convex
programs.

2.2 Related literature


This work, for the first time, brings techniques from the field of local com-
putation algorithms into the domain of networked control. The LCA model
was formally introduced by Rubinfeld et al. [78], after many algorithms fitting
within the framework had recently appeared in distinct areas, e.g., [79, 4, 46].
LCAs have received increasing attention in the years that followed as the im-
portance of local, distributed computing has grown with the increasing scale
of problems in distributed systems, the internet of things, etc.

The main idea of LCAs is to compute a piece of the solution to some algorith-
mic problem using only information that is close to that piece of the problem,
as opposed to a global solution, by exchanging information across distributed
agents. More concretely, an LCA receives a query and is expected to output
the part of the solution associated with that query. For example, an LCA for
maximal matching would receive as a query an edge, and its output would
be “yes/no”, corresponding to whether or not the edge is part of the required
matching. The two requirements are (i) the replies to all queries are consistent
with the same solution, and (ii) the reply to each query is “efficient”, for some
natural notion of efficient.

Most of the work on LCAs has focused on graph problems such as matching,
maximal independent set, and coloring (e.g., [3, 56, 76, 31]) and the efficiency
criteria were the number of probes to the graph, the running time and the
amount of memory required. This paper extends the LCA literature by mov-
ing from graph problems to optimization problems, which have not been stud-
ied in the LCA community previously. Mansour et al. [63] showed a general
reduction from LCAs to online algorithms on graphs with bounded degree.
The key technical contribution of our work is extending that technique to
design LCAs for convex programs. In contrast to previous work whose primary
focus was probe, time and space complexities, the efficiency criterion we use
is the number of messages required as this is usually the expensive resource in
networked control.
Chapter 3

NETWORK UTILITY MAXIMIZATION

In order to illustrate the application of local computation algorithms to distributed optimization, we focus on the classic setting of network utility maxi-
mization (NUM). The NUM framework is a general class of optimization prob-
lems that has seen wide-spread application to distributed control in domains
from the design of TCP congestion control [47, 61, 62, 84] to understanding
of protocol layering as optimization decomposition [18, 71] and power system
demand response [80, 58]. For a recent survey, see [96].

3.1 Model
The NUM framework considers a network containing a set of links L = {1, . . . , m}, where link j ∈ L has capacity c_j. A set N = {1, . . . , n} of sources shares the network; source i ∈ N is characterized by (L(i), f_i, x̲_i, x̄_i): a path L(i) ⊆ L in the network; a (usually) concave utility function f_i : R_+ → R; and the minimum and maximum transmission rates of i.

The goal in NUM is to maximize the sources' aggregate utility. Source i attains a concave utility f_i(x_i) when it transmits at rate x_i satisfying x̲_i ≤ x_i ≤ x̄_i; the optimization of aggregate utility can be formulated as follows:

    maximize_x   Σ_{i=1}^{n} f_i(x_i)
    subject to   Ax ≤ c,
                 x̲ ≤ x ≤ x̄,

where A ∈ R_+^{m×n} is defined by A_{ji} = 1 if j ∈ L(i), and A_{ji} = 0 otherwise.

The NUM framework is general in that the choice of f_i allows for the representation of different goals of the network operator. For example, using f_i(x_i) = x_i maximizes throughput; setting f_i(x_i) = log(x_i) achieves proportional fairness among the sources; and setting f_i(x_i) = −1/x_i minimizes potential delay. These are common goals in communication network applications [61, 64].
In this paper we focus on the throughput maximization case, i.e., f_i(x_i) = x_i; in
this case NUM is an LP. Note that the classical dual decomposition approach
does not work for throughput maximization since it requires the objective
function to be strictly concave. However, ADMM can be applied.

Our complexity results hinge on the assumption that the constraint matrix A
is sparse. The sparsity of A is defined as max{α, β}, where α and β denote the
maximum number of non-zero entries in a row and column of A respectively.
Formally, we say that A is sparse if the sparsity of A is bounded by a constant.
This assumption usually holds in network control applications since α is the
maximum number of sources sharing a link, which is typically small compared
to n, and β is the maximum number of links each source uses, which is typically
small compared to m.1
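To make this setup concrete, the following is a minimal sketch (in Python/NumPy, with variable and function names of our own choosing) of how the throughput-maximization instance and the sparsity measure max{α, β} can be assembled from a list of source paths; the example instance reproduces the toy network of Figure 3.1.

```python
import numpy as np

def build_num_lp(paths, capacities, x_min, x_max):
    """Assemble the throughput-maximization NUM instance
    max sum_i x_i  s.t.  A x <= c,  x_min <= x <= x_max,
    where A[j, i] = 1 iff link j lies on the path L(i) of source i."""
    m, n = len(capacities), len(paths)
    A = np.zeros((m, n))
    for i, links in enumerate(paths):          # paths[i] is the list of link indices L(i)
        for j in links:
            A[j, i] = 1.0
    return A, np.asarray(capacities), np.asarray(x_min), np.asarray(x_max)

def sparsity(A):
    """Return max{alpha, beta}: alpha = most sources sharing a link (row),
    beta = most links used by a source (column)."""
    alpha = int((A != 0).sum(axis=1).max())
    beta = int((A != 0).sum(axis=0).max())
    return max(alpha, beta)

# Toy network of Figure 3.1: sources s1, s2, s3 over links e1, ..., e4.
paths = [[0, 1], [1, 2], [2, 3]]
A, c, lo, hi = build_num_lp(paths, capacities=[1.0] * 4,
                            x_min=[0.0] * 3, x_max=[1.0] * 3)
print(A)            # matches the constraint matrix shown in Figure 3.1(b)
print(sparsity(A))  # 2 for this toy instance
```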

3.2 Distributed Algorithms for Network Utility Maximization


Given the NUM formulation above, the algorithmic goal is to design a protocol
that efficiently finds an (approximately) optimal solution. If the network is
huge, it is often beneficial to distribute the solution, as performing the entire
computation on a single machine is too costly [14, 84].

There is a large literature across the networked control and communication


networks literatures that seeks to design such distributed optimization al-
gorithms, e.g., [18, 47, 62]. Dual decomposition algorithms are particularly
prominent for use in this setting. However, many such methods cannot be
applied to the case of throughput maximization, i.e., linear fi . One extremely
prominent algorithm that does apply in the case of throughput maximization
is the Alternating Direction Method of Multipliers (ADMM), which was introduced by [34] and has found broad applications in, e.g., image denoising [85], support vector machines [33], and signal processing [20, 19, 81]. As a result, we use ADMM as
a benchmark for comparison in this paper. For completeness, the application
of ADMM to NUM is described in Appendix A.1.

3.3 Performance metrics


Distributed algorithms for NUM should perform well on two measures.

The first is message complexity: the number of messages that are sent across links of the network in order to compute the solution. When the algorithm uses randomization, we want the message complexity to hold with probability at least 1 − 1/n^α, where n is the number of vertices in the network and α > 0 can be an arbitrarily large constant; we denote this by 1 − 1/poly(n). We do not bound the size of the messages, but note that in both our algorithm and ADMM the message length will be of order O(log n).

¹ When α is large, many links will be congested and all sources will experience greater delay, so the routing protocol (IP) will start using different links; also, due to the small diameter of the Internet graph [2], β is small compared to m.

Figure 3.1: An illustration of LOCO on a toy graph with five nodes and four edges, e1, . . . , e4. There are three sources, s1, s2, s3, with paths ending in destinations t1, t2, t3 respectively. The graph is depicted in (a); the constraint matrix for NUM is given in (b); the bipartite graph representation of the matrix in (c); and the dependency graph in (d). The rank of each constraint (edge) is written in the node representing the constraint in the dependency graph. The shaded nodes represent the query set for source s1.

The second is the approximation ratio, which measures the quality of the
solution provided by the algorithm. Specifically, an algorithm is said to α-
approximate a maximization problem if its solution is guaranteed to be at
least OPT/α, where OPT is the value of the optimal solution. If the algorithm
is randomized, the approximation ratio is with respect to the expected size
of the solution. We will compare the performance of LOCO with iterative
algorithms such as ADMM, for which approximation ratio is not a standard
measure. Thus in our empirical results, comparison with the optimal solution
is made using relative error, defined in Section 5.1, which is related to, but
slightly different from the approximation ratio.
Chapter 4

LOCAL CONVEX OPTIMIZATION

In this section, we introduce a local algorithm for distributed convex optimization, LOcal Convex Optimization (LOCO). In LOCO, every source in the
network computes its portion of a near optimal solution using a small number
of messages, without needing global communication or iteration. This is in
contrast to iterative descent methods, e.g. ADMM, which are global, i.e., they
spread the information necessary to find an optimal solution throughout the
whole network over a series of rounds. LOCO has provable worst-case guaran-
tees on both its approximation ratio and message complexity, and improves on
the communication overhead of iterative descent methods by orders of magni-
tude in practice when asked to compute a piece of the optimal solution.

4.1 An overview of LOCO


The key insight in the design and analysis of LOCO is that any natural1 online
optimization algorithm can be converted into a local, distributed optimization
algorithm. Note that the resulting distributed algorithm is for a static problem,
not an online one. Further, after this conversion, the distributed optimization
algorithm has the same approximation ratio as the original online optimization
algorithm. Thus, given an optimization problem for which there exist effective
online algorithms, these online algorithms can be converted into effective local,
distributed algorithms.

¹ Strictly speaking, we require that the online algorithm have the following characteristic: knowing the output of the algorithm for the “neighbors” of a query q that arrived before q is sufficient to determine the output for q. We omit this technicality from the theorem statements as the online algorithm we use, and indeed all online algorithms for convex optimization that we are aware of, have this property. For a more in-depth discussion, we refer the reader to [76].

More formally, to reduce a static optimization problem to an online optimization problem, we do the following. Let Y be the set of constraints of an opti-
mization problem P . Let r : Y → [0, 1] be a ranking function that assigns each
constraint yj a real number between 0 and 1, uniformly at random. We call
r(yj ) yj ’s rank. Suppose that there is some online algorithm ALG that receives
the constraints sequentially and must augment the variables immediately and
irrevocably so as to satisfy each arriving constraint. Suppose furthermore that
for each constraint yj , we can pinpoint a small set of constraints S(yj ) (which
we call yj ’s query set) that arrived before it so that restricting the set of con-
straints of P to S(yj ) results in ALG producing (exactly) the same solution for
the variables that are present in yj . Then simulating ALG only on S(yj ) would
suffice to obtain the solution for the variables in yj . This is precisely what our
algorithm does: it generates a random order of arrival for the constraints, and
for each constraint yj , it constructs such a set S(yj ) and simulates the online
algorithm on it. An arbitrary ordering could mean that these dependency sets
are very large for some constraints; to bound the size of these sets, we require
that (i) the constraint matrix of P is sparse and (ii) the order generated is
random.2

Concretely, there are two main steps in LOCO. In the first, LOCO generates
a localized neighborhood for each vertex. In the second, LOCO simulates an
online algorithm on the localized neighborhood. Importantly, the first step
is independent of the precise nature of the online algorithm, and the second
is independent of the method used to generate the localized neighborhoods.
Therefore, we can think of LOCO as a general methodology that can yield a
variety of algorithms. For example, we can use different online algorithms for
the second step of LOCO depending on whether we consider a linear NUM
problem or a strictly convex NUM problem. More specifically, the two steps
work as follows.

Step 1, Generating a localized neighborhood For clarity, we break the


first step into three sub-steps, see also Figure 3.1.

Step 1a, Representing the constraint matrix as a bipartite graph A boolean matrix A can be represented as a bipartite graph G = (L, R, E′) as follows. Each column of A is represented by a vertex v_ℓ ∈ L and each row by a vertex v_r ∈ R. The edge (v_ℓ, v_r) is in E′ if and only if the corresponding entry of A is 1. A more intuitive way to interpret G is the following: L represents the variables, R the constraints. Edges represent which variables appear in which constraints. Note that the maximum degree of G is exactly the sparsity of A.
² Pseudo-random orders suffice; see [76].
Step 1b, Constructing the dependency graph We construct the depen-
dency graph H = (V, E) as follows. The vertices of the dependency graph
are the vertices of R; an edge exists between two vertices in H if the corre-
sponding vertices in G share a neighbor. Intuitively, H represents the “direct
dependencies” between the constraints: changing the value of any variable im-
mediately affects all constraints in which it appears, hence these constraints
can be thought of as directly dependent. The maximum degree of H is upper
bounded by the square of the sparsity of A.
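As an illustration of Steps 1a and 1b, here is a minimal sketch (with helper names of our own; A is a 0/1 NumPy matrix whose rows are constraints and whose columns are variables) that builds the dependency graph H as an adjacency structure. A real implementation would only compare constraints that share a column rather than doing the O(m²) pairwise scan shown here.

```python
import numpy as np

def dependency_graph(A):
    """Build the dependency graph H on the constraints (rows of A):
    two constraints are neighbors in H iff they share a variable,
    i.e., their rows have a common non-zero column."""
    m = A.shape[0]
    vars_of = [set(np.flatnonzero(A[j])) for j in range(m)]  # variables per constraint
    H = {j: set() for j in range(m)}
    for j in range(m):
        for k in range(j + 1, m):
            if vars_of[j] & vars_of[k]:   # shared variable => direct dependency
                H[j].add(k)
                H[k].add(j)
    return H
```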

Step 1c, Constructing the query set In order to build the query set, we
generate a random ranking function on the vertices of H, r : V → [0, 1]. Given
the dependency graph H, an initial node y ∈ V and the ranking function r,
we build the query set of y, denoted S(y), using a variation of BFS, as follows.
Initialize S(y) to contain y. For every vertex v ∈ S(y), scan all of v’s neighbors,
denoted N (v). For each u ∈ N (v), if r(u) ≤ r(v), add u to S(y). Continue
iteratively until no more vertices can be added to S(y) (that is, for every vertex v ∈ S(y), all of its neighbors that are not themselves in S(y) have higher rank than v). If there are ties (i.e., two neighbors u, v such that r(u) = r(v)), we tie-break by ID.³

³ Any consistent tie-breaking rule suffices.
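A sketch of this Step 1c construction, built on the structure above (H as returned by dependency_graph, rank a dict giving each constraint its value of r; the names are ours):

```python
from collections import deque

def query_set(H, rank, y):
    """Grow S(y) outward from y: add any neighbor u of a vertex v already in
    S(y) whose rank is at most r(v), breaking ties by ID, until no more
    vertices can be added."""
    S = {y}
    frontier = deque([y])
    while frontier:
        v = frontier.popleft()
        for u in H[v]:
            if u not in S and (rank[u], u) <= (rank[v], v):
                S.add(u)
                frontier.append(u)
    return S
```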

Step 2, Simulating the online algorithm Assume that we have an online


algorithm for the problem that we would like LOCO to solve (in this paper we
use the online packing algorithm of Buchbinder and Naor [15, Chapter 14]; we provide the pseudocode in Appendix A for completeness). The specific setting
that the online algorithm must apply to is the following: the variables of the
convex program are known in advance, as are the univariate constraints. The
(rest of the) constraints arrive one at a time; the online algorithm is expected
to satisfy each constraint as it arrives, by increasing the value of some of the
variables. It is never allowed to decrease the value of any variable. We simulate
the online algorithm as follows:

In order to compute its own value in the solution, source i applies r to the
set of constraints in which it is contained, Y (i). For y = arg maxz∈Y (i) {r(z)},
it simulates the online algorithm on S(y). That is, it executes the online
algorithm on the neighborhood constructed in Step 1 for the “last arriving”
constraint that contains i. i’s value is the value of i at the end of the simula-
tion. Claim 4 below shows that i’s value is identical to its value if the online
algorithm was executed on the entire program, with the constraints arriving
in the order defined by r.
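Putting the pieces together, the Step 2 simulation can be sketched as below; online_update stands in for the update rule of the online packing algorithm of [15] (given in Appendix A), which we do not reproduce here, and constraints_of, constraint_vars, and the other names are our own.

```python
def loco_value(i, constraints_of, constraint_vars, H, rank, online_update, x_init):
    """Compute source i's value x_i under LOCO.  constraints_of[i] is the set
    Y(i) of constraints containing i.  The constraints in S(y) are replayed in
    increasing rank order, each handled by the online algorithm's update rule,
    exactly as if they had arrived online in the order defined by r."""
    y = max(constraints_of[i], key=lambda z: (rank[z], z))   # last arriving constraint in Y(i)
    S = query_set(H, rank, y)                                # Step 1c neighborhood
    x = dict(x_init)                                         # local copy of the variables
    for z in sorted(S, key=lambda z: (rank[z], z)):          # simulate the arrivals
        online_update(z, constraint_vars[z], x)              # may only increase variables
    return x[i]
```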

4.2 Analysis of LOCO


Our main theoretical result shows that LOCO can compute solutions to convex
optimization problems that are as good as those of the best online algorithms
for the problems using very little communication. We then specialize this
case to throughput maximization in NUM. While we focus on NUM in this
paper, the theorem (and its proof) apply to a wider family of problems as
well. Specifically, the conversion from online to local outlined below can be
used more broadly for any class of optimization problems for which effective
online algorithms exist. Thus, improvements to online optimization problems
immediately yield improved local optimization algorithms.

Theorem 1 Let P be a problem with a concave objective function and linear inequality constraints, with n variables and m constraints, whose constraint matrix has sparsity σ. Given an online algorithm⁴ for P with competitive ratio h(n, m), there exists a local computation algorithm for P with approximation ratio h(n, m) that uses 2^{O(σ²)} log(n + m) messages with probability 1 − 1/poly(n, m).

⁴ See footnote 1.

In particular, we have the following result, for NUM with a linear objective
function.

Theorem 2 Let P be a throughput maximization problem with n variables, m constraints, and a sparse constraint matrix. LOCO computes an O(log m)-approximation to the optimal solution of P using O(log(n + m)) messages with probability 1 − 1/poly(n, m).

The approximation ratio in Theorem 2 comes from the online algorithm pre-
sented and analyzed in [15] (see Lemma 6). The analysis of the online al-
gorithm is for adversarial input; therefore it is natural to expect LOCO to
achieve a much better approximation ratio in practice, as LOCO randomizes
the order in which the constraints “arrive”. It is an open question to give bet-
ter theoretical bounds for stochastic inputs, and if such results are obtained
they would immediately improve the bounds in Theorem 2.
The core technical lemma required for the proof of Theorem 1 is the following.

Lemma 3 Let G = (V, E) be a graph whose degree is bounded by d and let


r : V → [0, 1] be a function that assigns to each vertex v ∈ V a number between
0 and 1 independently and uniformly at random. Let Tmax be the size of the
largest query set of G: T_max = max{|T_v| : v ∈ V}. Then, for λ = 4(d + 1), Pr[|T_max| > 2λ · 15λ log n] ≤ 1/n².
The proof of Lemma 3 uses ideas from a proof in [76], and employs a quantiza-
tion of the rank function. Its proof is deferred to Appendix B. The following
simple claim implies that the approximation ratio of LOCO is the same as
that of the online algorithm.
In addition to Lemma 3, the following claim and technical lemma are needed
to complete the proof of Theorem 2.

Claim 4 For any source i, the value of xi in the output of LOCO is identical
to its value in the output of the online algorithm.

Proof 5 Let constraint yj be the last constraint containing i that arrives in


the order defined by r; its arrival is the last time i will be updated. Therefore it
is sufficient to only consider constraints arriving before yj . Further, by design,
S(yj ) is the set of constraints at whose arrival there is possibly some change
that may affect the value of i.

The following lemma is a restatement of Theorem 14.1 in [15], adapted to


throughput maximization. See Appendix A for the pseudocode of the algo-
rithm.

Lemma 6 For any B > 0, there exists a B-competitive online algorithm for linearly-constrained NUM with m constraints; each constraint is violated by a factor of at most 2 log(1 + m)/B.

Proof 7 (Proof of Theorem 2) Theorem 1, Claim 4 and Lemma 6, setting


B = 2 log(1 + m), imply Theorem 2.
4.3 Contrasting LOCO and ADMM
LOCO fundamentally differs from iterative descent and consensus style ap-
proaches to distributed optimization. While iterative descent and consensus
style approaches are inherently iterative, LOCO is not. Under LOCO, a node
can compute its value in one shot, once it gets information about its query set.

Additionally, while iterative descent and consensus style approaches are global,
LOCO is local. Under LOCO, communication stays within the query set and so
the computation only needs to be updated if changes happen within the query
set. This means that LOCO is robust to churn, failures, and communication
problems outside of that set of nodes.

Another important difference is that LOCO does not compute the optimal so-
lution, while iterative descent and consensus style approaches will eventually
converge to the true optimal. The proven analytical bounds for LOCO are
based on worst-case adversarial input. We show in Section 5.2 that our em-
pirical results outperform the theoretical guarantees by a considerable margin.
This is in part because the ranking is done randomly rather than in an adver-
sarial fashion (we elaborate on this in Section 5).

Finally, note that there is an important difference in the form of the theoretical
guarantees for LOCO and iterative descent and consensus style algorithms.
LOCO has guarantees in terms of the approximation ratio, while iterative
descent and consensus style algorithms have convergence rate guarantees. For
example, ADMM has guarantees on convergence of the norms of the primal
and dual residuals [14, Chapter 3.3].
Chapter 5

CASE STUDY

Here we present the results of a simulation study demonstrating the empirical


performance of LOCO on both synthetic and real networks. The results high-
light that an orders-of-magnitude reduction in communication is possible with
LOCO as compared to ADMM, which we choose as a prominent example of
current approaches for distributed optimization. For concreteness, we focus our numeric results on distributed linear programming, i.e., the
case of linear NUM. This is the NUM setting where one could expect LOCO
to perform the worst, given that linear functions are typically the worst-case
examples for online convex optimization algorithms [5, 41].

5.1 Experimental setup


Problem Instances

For our first set of experiments, we generate random synthetic instances of linear NUM. Let n = m and define the constraint matrix A ∈ R^{m×n} as follows. Set Ã_{j,i} = 1 with probability p and Ã_{j,i} = 0 otherwise, and let A = Ã + I_n to ensure that each row of A has at least one non-zero entry.¹ The vector c ∈ R^n is drawn i.i.d. from Unif[0, 1]. We set the minimum and maximum transmission rates to be x̲_i = 0 and x̄_i = 1. Finally, for the rank function used by LOCO we use a random permutation of the vertex IDs.²

¹ Note that this matrix does not have constant sparsity; however, this can only increase the message complexity. Regardless, it is possible to adapt the theoretical results to hold for this data as well, using techniques from [76].

² For the purposes of our simulations, such a permutation can be efficiently sampled, and it guarantees perfect randomness. For larger n and m, it is possible to use pseudo-randomness with almost no loss in message complexity [76].
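A sketch of this instance generator (NumPy; the helper name and the seeding are ours):

```python
import numpy as np

def synthetic_num_instance(n, p, seed=0):
    """Random linear NUM instance with m = n: A = A_tilde + I_n, where each
    entry of A_tilde is 1 with probability p; c ~ Unif[0, 1]^n; x_min = 0 and
    x_max = 1; the LOCO rank function is a random permutation of the IDs."""
    rng = np.random.default_rng(seed)
    A_tilde = (rng.random((n, n)) < p).astype(float)
    A = np.minimum(A_tilde + np.eye(n), 1.0)   # guarantees a non-zero entry in every row
    c = rng.uniform(0.0, 1.0, size=n)
    x_min, x_max = np.zeros(n), np.ones(n)
    rank = rng.permutation(n)                  # rank[j]: arrival position of constraint j
    return A, c, x_min, x_max, rank
```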

For our second set of experiments, we use the real network from the graph
of Autonomous System (AS) relationships in [87]. The graph has 8020 nodes
and 36406 edges. In order to interpret the graph in a NUM framework, we
associate each source with a path of links, ending at a destination node. To
do this, for each node i in the graph, we randomly select a destination node t_i which is at distance ℓ_i, sampled i.i.d. from Unif[ℓ − 0.5ℓ, ℓ + 0.5ℓ]. We repeat this for several values of ℓ. (The distance between two nodes is the length of the shortest path between them.) Then, we designate the path L(i) to be the set of links comprising the shortest path between the source and the destination. The vectors c, x̲, and x̄ are chosen in the same manner as for the synthetic networks.

Figure 5.1: Illustration of the number of messages required by ADMM and LOCO for the synthetic data set, with results averaged over 50 trials. Plots (a) and (b) vary n while fixing sparsity p = 10⁻⁴, showing the results in linear scale and log scale respectively. Plots (c) and (d) fix n = 10³ and vary the sparsity p, showing the results in linear scale and log scale respectively.
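The path construction for the AS graph can be sketched with networkx as follows; the exact way the destination at the sampled distance is chosen is our own approximation of the description above.

```python
import random
import networkx as nx

def build_paths(G, mean_len, seed=0):
    """For each node i, pick a destination t_i whose hop distance is roughly
    Unif[0.5 * mean_len, 1.5 * mean_len] away, and let L(i) be the links on a
    shortest path from i to t_i."""
    random.seed(seed)
    paths = {}
    for i in G.nodes():
        dists = nx.single_source_shortest_path_length(G, i)
        dists.pop(i, None)
        if not dists:
            continue                                  # isolated node: nothing to assign
        target = random.uniform(0.5 * mean_len, 1.5 * mean_len)
        t = min(dists, key=lambda v: abs(dists[v] - target))
        nodes = nx.shortest_path(G, i, t)
        paths[i] = list(zip(nodes[:-1], nodes[1:]))   # L(i) as a list of links (edges)
    return paths
```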

Algorithm tuning

Our results focus on comparing LOCO and ADMM. Running ADMM requires
tuning four parameters [14]. Unless otherwise specified, we set the relative and
absolute tolerances to be ε_rel = 10⁻⁴ and ε_abs = 10⁻², the penalty parameter to be ρ = 1, and the maximum number of allowed iterations to be t_max = 10000.
This is done to provide the best performance for ADMM: the parameters are
tuned in the typical fashion to optimize ADMM [14]. Running LOCO requires
tuning only one parameter: B, which governs the worst-case guarantee for
the online algorithm used in step 2. A smaller B gives a “better guarantee”,
however some constraints may be violated. Setting B = 2 ln(1 + m) provides
the best worst-case guarantee, and is our choice in the experiments unless
stated otherwise. In fact, it is possible to tune B (akin to tuning ADMM)
to specific data, as the constraints are often still satisfied for smaller B. In
Figure 5.3 (c), we show the improvement in performance guarantee by tuning
B, while keeping the dual solution feasible.

Metrics

For our numeric results, we evaluate ADMM and LOCO with respect to the
quality of the solution provided and the number of messages sent.
Figure 5.2: Illustration of the number of messages required by ADMM and LOCO for the real network data, with n = 8020 and various average path lengths L(i).

Figure 5.3: Comparison of the relative error and the number of messages required by LOCO and ADMM. Plots (a) and (b) show the Pareto optimal curve for ADMM with a range of relative tolerances ε_rel ∈ [10⁻⁴, 10⁻¹]. Plot (c) depicts how tuning B affects the relative error. The rightmost point corresponds to B = 2 ln(1 + m).

To assess the quality of the solution we measure the relative error, which is
defined as |p∗ − p^LOCO|/|p∗|, where p∗ is the optimal solution. For problem instances
of small dimension, one can run an interior point method to check the optimal
solution, but this is too tedious for large problem sizes. In the large dimension
cases we consider, we regard p∗ to be ADMM’s solution with small tolerances,
such that the maximum number of allowed iterations is never needed. Note
that the relative error is an empirical, normalized version of the approximation
ratio for a given instance.

To measure the number of messages used by each of the algorithms, we consider the following. For a distributed implementation of ADMM, two sets of
n variables are updated on separate processors and reported to a central con-
troller which updates another variable (see [14, Chapter 7.1]). The number of
messages for a run of ADMM is twice the number of sources in the NUM prob-
lem, multiplied by the number of iterations required by ADMM. LOCO needs
to use communication only to construct the query set; running the online algo-
rithm does not require any communication. Therefore, the number of messages
is proportional to the number of edges with at least one endpoint in the query
set (this is the number of edges we need to send information over in order
to construct the query set, see e.g., [76] for more details). We note that the
number of messages depends both on the network topology and the realization
of the ranking function.

5.2 Experimental Results


We now describe our empirical comparison of the performance of LOCO with
ADMM.
Our first set of experiments investigates the communication used by ADMM
and LOCO, i.e., the number of messages required. Figure 5.1 highlights that
LOCO requires considerably fewer messages than ADMM, across both small
and large n and varying levels of sparsity. More specifically, the figure shows
that both the average and maximum amount of communication needed to
answer a query about a specific piece of the solution under LOCO (LOCO
Avg and LOCO Max respectively) are substantially lower than for ADMM.
Further, even answering every query (LOCO Tot) requires only the same order
of magnitude as ADMM. The figure includes ADMM with a tolerance ε_rel of 10⁻⁴ (ADMM 1) and 10⁻³ (ADMM 2). Even with suboptimal tolerance,
which results in fewer iterations, ADMM still requires orders of magnitude
more communication than LOCO.
Figure 5.2 shows the same qualitative behavior in the case of the real network
data. In particular, the number of messages used is shown as a function of
the average length of paths in the AS topology. We see that LOCO greatly
outperforms ADMM for all tested average path lengths.
The improvement achieved by LOCO is possible because the size of the query
sets used by LOCO are small compared to the number of sources. When
n = 10³, as in Figure 5.1, the number of nodes in the largest query set (over
all trials) was 60.
We note that the improvement in the amount of communication is achieved at
a cost: LOCO does not precisely solve the optimization, it only approximates
the solution. When B is set to its worst-case guarantee (Figure 5.1), the
relative error of LOCO ranges from 0.29 to 0.34.
It may seem somewhat unfair to compare the message complexity of LOCO
and ADMM when they have differing relative error; we tune the parameters of
ADMM and LOCO such that the algorithms have comparable relative error,
while LOCO Tot and ADMM require about the same number of messages.
Figures 5.3 (a) and (b) illustrate the Pareto optimal frontier for ADMM: the
minimal messages needed in order to obtain a particular relative error. Unlike
ADMM, LOCO cannot trade off the number of messages used with the relative
error, thus LOCO corresponds to a single point on the figures. This point is
outside the Pareto frontier of ADMM. Figure 5.3 (c) illustrates the impact of
tuning B. Similarly to ADMM, tuning B can significantly improve the relative
error; unlike ADMM, tuning B does not affect the communication complexity.
Chapter 6

CONCLUDING REMARKS

We introduced a new, fundamentally different approach for distributed optimization based on techniques from the field of local computation algorithms.
In particular, we designed a generic algorithm, LOCO, that constructs small
neighborhoods and simulates an online algorithm on them. Due to the fact
that LOCO is local, it has several advantages over existing methods for dis-
tributed optimization. In particular, it is more robust to network failures,
communication lag, and changes in the system. To illustrate the benefits of
LOCO we considered throughput maximization. The improvements of LOCO
over ADMM in terms of communication in this setting are significant.

We view this work as a first step toward the investigation of local computation
algorithms for distributed optimization. In future work, we intend to continue
to investigate the performance of LOCO in more general network optimization
problems. Further, it would be interesting to apply other techniques from the
field of local computation algorithms to develop algorithms for other settings
in which distributed computing is useful, such as power systems and machine
learning.
Part III

Data Purchasing and Data Placement in a Geo-Distributed Data Market
Chapter 7

INTRODUCTION TO DISTRIBUTED DATA MARKETS

Ten years ago computing infrastructure was a commodity – the key bottleneck
for new tech startups was the cost of acquiring and scaling computational
power as they grew. Now, computing power and memory are services that
can be cheaply subscribed to and scaled as needed via cloud providers like
Amazon EC2, Microsoft Azure, etc.

We are beginning the same transition with respect to data. Data is broadly
being gathered, bought, and sold in various marketplaces. However, it is still a
commodity, often obtained through offline negotiations between providers and
companies. Thus, acquiring data is one of the key bottlenecks for new tech
startups nowadays.

This is beginning to change with the emergence of cloud data markets, which
offer a single, logically centralized point for buying and selling data. Multiple
data markets have recently emerged in the cloud, e.g., Microsoft Azure Data-
Market [65], Factual [29], InfoChimps [44], Xignite [95], IUPHAR [88], etc.
These marketplaces enable data providers to sell and upload data and clients
to request data from multiple providers (often for a fee) through a unified
query interface. They provide a variety of services: (i) aggregation of data
from multiple sources, (ii) cleaning of data to ensure quality across sources,
(iii) ease of use, through a unified API, and (iv) low-latency delivery through
a geographically distributed content distribution network.

Given the recent emergence of data markets, there are widely differing designs
in the marketplace today, especially with respect to pricing. For example,
The Azure DataMarket [65] sets prices with a subscription model that allows
a maximum number of queries (API calls) per month and limits the size of
records that can be returned for a single query. Other data markets, e.g.,
Infochimps [44], allow payments per query or per data set. In nearly all cases,
the data provider and the data market operator each then get a share of the fees
paid by the clients, though how this share is arrived at can differ dramatically
across data markets. The task of pricing is made even more challenging when
one considers that clients may be interested in data with differing levels of
precision/quality and privacy may be a concern.

Not surprisingly, the design of pricing (both on the client side and the data
provider side) has received significant attention in recent years, including pric-
ing of per-query access [49, 51] and pricing of private data [32, 57].

In contrast, the focus of this paper is not on the design of pricing strategies for
data markets. Instead, we focus on the engineering side of the design
of a data market, which has been ignored to this point. Supposing that
prices are given, there are important challenges that remain for the operation
of a data market. Specifically, two crucial challenges relate to data purchasing
and data placement.

Data purchasing: Given prices and contracts offered by data providers,


which providers should a data market purchase from to satisfy a set of client
queries with minimal cost?

Data placement: How should purchased data be stored and replicated through-
out a geo-distributed data market in order to minimize bandwidth and latency
costs? And which clients should be served from which replicas given the loca-
tions and data requirements of the clients?

Clearly, these two challenges are highly related: data placement decisions de-
pend on which data is purchased from where, so the bandwidth and latency
costs incurred because of data placement must be balanced against the pur-
chasing costs. Concretely, less expensive data that results in larger bandwidth
and latency costs is not desirable.

The goal of this work is to present a design for a geo-distributed data


market that jointly optimizes data purchasing and data placement
costs. The combination of data purchasing and data placement decisions
makes the task of operating a geo-distributed data market more complex than
the task of operating a geo-distributed data analytics system, which has re-
ceived considerable attention in recent years e.g., [73, 93, 92]. Geo-analytics
systems minimize the cost (in terms of latency and bandwidth) of moving the
data needed to answer client queries, replacing the traditional operation mode
where data from multiple data centers was moved to a central data center
for processing queries. However, crucially, such systems do not consider the
cost of obtaining the data (including purchasing and transferring) from data
providers.
Thus, the design of a geo-distributed data market necessitates integrating data
purchasing decisions into a geo-distributed data analytics system. To that end,
our design builds on the model used in [92] by adding data providers that offer a
menu of data quality levels for differing fees. The data placement/replication
problem in [92] is already an integer linear program (ILP), and so it is no
surprise that the addition of data providers makes the task of jointly optimizing
data purchasing and data placement NP-hard (see Theorem 8).

Consequently, we focus on identifying structure in the problem that can allow


for a practical and near-optimal system design. To that end, we show that the
task of jointly optimizing data purchasing and data placement is equivalent
to the uncapacitated facility location problem (UFLP) [52]. However, while
constant-factor polynomial running time approximation algorithms are known
for the metric uncapacitated facility location problem (see [17, 38, 45]), our
problem is a non-metric facility location problem, and the best known polyno-
mial running time algorithms achieve a O(log C) approximation via the greedy
algorithm in [42] or the randomized rounding algorithm in [90], where C is
the number of clients. Note that without any additional information on the
costs, this approximation ratio is the smallest achievable for the non-metric
uncapacitated facility location unless NP has slightly superpolynomial time
algorithms [30]. While this is the best theoretical guarantee possible in the
worst-case, some promising heuristics have been proposed for the non-metric
case, e.g., [26, 8, 1, 48, 89, 36].
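For intuition about this regime, here is a sketch of the classical greedy for non-metric uncapacitated facility location, of the flavor referenced above; it is a generic textbook greedy (with names of our own), not the Datum algorithm developed later in this thesis.

```python
def greedy_ufl(open_cost, connect_cost):
    """Greedy for (non-metric) uncapacitated facility location.
    open_cost[f]: cost of opening facility f.
    connect_cost[f][c]: cost of connecting client c to facility f.
    Repeatedly pick the (facility, client-subset) pair minimizing cost per
    newly connected client; this yields an O(log C) approximation."""
    num_clients = len(next(iter(connect_cost.values())))
    unserved = set(range(num_clients))
    opened, assignment = set(), {}
    while unserved:
        best = None                                    # (ratio, facility, clients to connect)
        for f, costs in connect_cost.items():
            extra_open = 0.0 if f in opened else open_cost[f]
            ranked = sorted(unserved, key=lambda c: costs[c])
            total = extra_open
            for k, c in enumerate(ranked, start=1):
                total += costs[c]
                if best is None or total / k < best[0]:
                    best = (total / k, f, ranked[:k])
        _, f, chosen = best
        opened.add(f)
        for c in chosen:
            assignment[c] = f
        unserved -= set(chosen)
    return opened, assignment
```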

Though the task of jointly optimizing data purchasing and data placement is
computationally hard in the worst case, in practical settings there is structure
that can be exploited. In particular, we provide an algorithm with polynomial
running time that gives an exact solution in the case of a data market with a
single data center (§10.1). Then, using this structure, we generalize to the case
of a geo-distributed data cloud and provide an algorithm, named Datum (§10.2),
that is near optimal in practical settings.

Datum first optimizes data purchasing as if the data market was made up of
a single data center (given carefully designed “transformed” costs) and then,
given the data purchasing decisions, optimizes data placement/replication.
The “transformed” costs are designed to allow an architectural decomposition
of the joint problem into subproblems that manage data purchasing (external
operations of the data market) and data placement (internal operations of
the data market). This decomposition is of crucial operational importance
because it means that internal placement and routing decisions can proceed
without factoring in data purchasing costs, mimicking operational structures
of geo-distributed analytics systems today.

We provide a case study in §11 which highlights that Datum is near-optimal


(within 1.6%) in practical settings. Further, the performance of Datum im-
proves upon approaches that neglect data purchasing decisions by > 45%.

To summarize, this paper makes the following main contributions:

1. We initiate the study of jointly optimizing data purchasing and data place-
ment decisions in geo-distributed data markets.
2. We prove that the task of jointly optimizing data purchasing and data
placement decisions is NP-hard and can be equivalently viewed as a facility
location problem.
3. We provide an exact algorithm with polynomial running time for the case
of a data market with a single data center.
4. We provide an algorithm, Datum, for jointly optimizing data purchasing and
data placement in a geo-distributed data market that is within 1.6% of op-
timal in practical settings and improves by > 45% over designs that neglect
data purchasing costs. Importantly, Datum decomposes into subproblems
that manage data purchasing and data placement decisions separately.
Chapter 8

OPPORTUNITIES AND CHALLENGES

Data is now a traded commodity. It is being bought and sold every day, but
most of these transactions still happen offline through direct negotiations for
bulk purchases. This is beginning to change with the emergence of cloud
data markets such as Microsoft Azure DataMarket in [65], Factual [29], In-
foChimps [44], Xignite [95]. As cloud data markets become more prominent,
data will become a service that can be acquired and scaled seamlessly, on
demand, similarly to computing resources available today in the cloud.

8.1 The potential of data markets


The emergence of cloud data markets has the potential to be a significant
disruptor for the tech industry, and beyond. Today, since computing resources
can be easily obtained and scaled through cloud services, data acquisition has
become the bottleneck for new tech startups.

For example, consider an emerging potential competitor for Yelp. The biggest
development challenge is not algorithmic or computational. Instead, it is ob-
taining and managing high quality data at scale. The existence of a data
market, e.g., Azure DataMarket, with detailed local information about restaurants,
attractions, etc., would eliminate this bottleneck entirely. In fact,
data markets such as Factual [29] are emerging to target exactly this need.

Another example highlighted in [50, 51] is language translation. Emerging


data markets such as the Azure DataMarket and Infochimps sell access to
data on word translation, word frequency, etc. across languages. This access
is a crucial tool for easing the transition tech startups face when moving into
different cultural markets.

A final example considers computer vision. When tech startups need to de-
velop computer vision tools in house, a significant bottleneck (in terms of time
and cost) is obtaining labeled images with which to train new algorithms.
Emerging data markets have the potential to eliminate this bottleneck too.
For example, the emerging Visipedia project [91] (while free for now) provides
an example of the potential of such a data market.
Thus, like in the case of cloud computing, ease of access and scaling, combined
with the cost efficiency that comes with size, implies that cloud data markets
have the potential to eliminate one of the major bottlenecks for tech startups
today – data acquisition.

8.2 Operational challenges for data markets


The task of designing a cloud data market is complex, and requires balancing
economic and engineering issues. It must carefully consider purchasing and
pricing decisions in its interactions with both data providers and clients and
minimize its operational cost, e.g., from bandwidth. We discuss both the
economic and engineering design challenges below, though this paper focuses
only on the engineering challenges.

Pricing

While there is a large body of literature on selling physical goods, the problem
of pricing digital goods, such as data, is very different. Producing physical
goods usually has a moderate fixed cost, for example, for buying the space and
production machines needed, but this cost is partly recoverable: it is possible,
if the company cannot manage to sell its product, to resell the machinery and
buildings they have been using. However, the cost of producing and acquiring
data is high and irrecoverable: if the data turns out to be worthless and nobody
wants it, then the whole procedure is wasted. Another major difference comes
from the fact that variable costs for data are low: once it has been produced,
data can be cheaply copied and replicated.

These differences lead to “versioning” as the most typical approach for selling
digital goods [7]. Versioning refers to selling different versions of the same
digital good at different prices in order to target different types of buyers. This
pricing model is common in the tech industry, e.g., companies like Dropbox sell
digital space at different prices depending on how much space customers need
and streaming websites such as Amazon often charge differently for streaming
movies at different quality levels.

In the context of data markets, versioning is also common. For example, in


Infochimps and the Azure DataMarket data consumers may pay a monthly
subscription fee that varies according to the maximum number of queries they
are allowed to run. Additionally, when charging per query, proposals have
suggested it is desirable to charge based on the complexity of the query, e.g.,
[6, 7]. Another form of versioning that has been proposed in data markets
deals with privacy – data with more personal information should be charged
more, e.g., [57, 22].

There is a growing literature focused on the design of pricing strategies for


cloud data markets in the above, and other contexts, e.g., [6, 7, 49, 51, 57, 22].

Data purchasing and data placement

While data pricing within cloud data markets has received increasing attention,
the engineering of the system itself has been ignored. The engineering of such
a geo-distributed “data cloud” is complex. In particular, the system must
jointly make both data purchasing decisions and data placement, replication
and delivery decisions, as described in the introduction.

Even considered independently, the task of optimizing data placement/replication


within a geo-distributed data analytics system is challenging. Such systems
aim to allow queries on databases that are stored across data centers, as op-
posed to traditional databases that are stored within a single data center.
Examples include Google Spanner [21], Mesa [40], JetStream [74], Geode [92],
and Iridium [73]. The aim in designing a geo-distributed data analytics sys-
tem is to distribute the computation needed to answer queries across data
centers; thus avoiding the need to transfer all the data to a single data center
to respond to queries. This distribution of computation is crucial for min-
imizing bandwidth and latency costs, but leads to considerable engineering
challenges, e.g., handling replication constraints needed for fault tolerance and
regulatory constraints on data placement due to data privacy. See [92, 73] for
a longer discussion of these challenges and for examples illustrating the benefit
of distributed query computation in geo-distributed data analytics systems.

Importantly, all previous work on geo-distributed analytics systems assumes


that the system already owns the data. Thus, on top of the complexity in geo-
distributed analytics systems, a geo-distributed cloud data market must bal-
ance the cost of data purchasing with the impact on data placement/replication
costs as well as the decisions for data delivery. For example, if clients who are
interested in some data are located close to data center A, while the data
provider is located close to data center B (far from data center A), it may be
worth it to place that data in data center A rather than data center B. In
practice, the problem is more complex since clients are usually geographically
distributed rather than centralized and one client may require data from sev-
eral different data providers.

Additional complexity is created by versioning the data, i.e., the fact that
clients have differing quality requirements for the data requested. For example,
if some clients are interested in high quality data and others are interested in
low quality data, then it may be worth it to provide high quality level data
to some clients that only need low quality data (thus incurring a higher price)
because of the savings in bandwidth and replication costs that result from
being able to serve multiple clients with the same data.
Chapter 9

A GEO-DISTRIBUTED DATA CLOUD

This paper presents a design for a geo-distributed cloud data market, which
we refer to as a “data cloud.” This data cloud serves as an intermediary be-
tween data providers, which gather data and offer it for sale, and clients, which
interact with the data cloud through queries for particular subsets/qualities
of data. More concretely, the data cloud purchases data from multiple data
providers, aggregates it, cleans it, stores it (across multiple geographically dis-
tributed data centers), and delivers it (with low-latency) to clients in response
to queries, while aiming to minimize the operational cost, which consists of both
bandwidth and data purchasing costs.

Our design builds on and extends the contributions of recent papers – specif-
ically [92, 73] – that have focused on building geo-distributed data analytic
systems but assume the data is already owned by the system and focus solely
on the interaction between a data cloud and its clients. Unfortunately, as we
highlight in §10, the inclusion of data providers means that the data cloud’s
goal of cost minimization can be viewed as a non-metric uncapacitated facility
location problem, which is NP-hard.

For reference, Figure 9.1 provides an overview of the interaction between these
three parties as well as some basic notations.

9.1 Modeling Data Providers


The interaction between the data cloud and data providers is a key distinction
between the setting we consider and previous work on geo-distributed data
analytics systems such as [73, 92]. We assume that each data provider offers
distinct data to the data cloud, and that the data cloud is a price-taker, i.e.,
cannot impact the prices offered by data providers. Thus, we can summarize
the interaction of a data provider with the data cloud through an exogenous
menu of data qualities and corresponding prices.

We interpret the quality of data as a general concept that can be instantiated


in multiple ways. For categorical data, quality may represent the resolution of
the information provided, e.g., for geographical attributes the resolution may
be {street address, zip code, city, county, state}. For numerical data, quality
could take many forms, e.g., the numerical precision, the statistical precision
(e.g., the confidence of an estimator), or the level of noise added to the data (see Footnote 1).

Concretely, we consider a setting where there are P data providers selling


different data, p ∈ P = {1, 2, . . . , P} (see Footnote 2). Each data provider offers a set of
quality levels, indexed by l ∈ L = {1, 2, . . . , Lp }, where Lp is the number of
levels that data provider p offers. We use q(l, p) to denote the data quality
level l, offered by data provider p. Similarly, we use f (l, p) to denote the fee
charged by data provider p for data of quality level l. Importantly, the prices
vary across providers p since different providers have different procurement
costs for different qualities and different data.

The data purchasing contract between data providers and the data cloud may take
a variety of forms. For example, a data cloud may pay a data provider
based on usage, i.e., per query, or a data cloud may buy the data in bulk in
advance. In this paper, we discuss both per-query data contracting and bulk
data contracting. See §9.3 for details.

9.2 Modeling Clients


Clients interact with the data cloud through queries, which may require data
(with varying quality levels) from multiple data providers.

Concretely, we consider a setting where there are C clients, c ∈ C = {1, 2, . . . , C}.


A client c sends a query to the data center, requesting particular data from
multiple data providers (see Footnote 3). Denote the set of data providers required by the re-
quest from client query c by G(c). The client query also specifies a minimum
desired quality level, wc (p), for each data provider p it requests, i.e., ∀p ∈ G(c).
We assume that the client is satisfied with data at a quality level higher than
or equal to the level requested.
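
To make this notation concrete, the following is a minimal Python sketch of how a
provider's menu and a client query might be represented; the class names, fields, and
the small example instance are ours and are not part of any existing data market.

    from dataclasses import dataclass
    from typing import Dict, List


    @dataclass
    class Provider:
        """Data provider p: a menu of quality levels l = 1, ..., L_p."""
        quality: Dict[int, float]  # q(l, p): quality value offered at level l
        fee: Dict[int, float]      # f(l, p): fee charged for level l


    @dataclass
    class ClientQuery:
        """Client query c: minimum quality w_c(p) for each requested provider p in G(c)."""
        min_quality: Dict[int, float]

        def providers(self) -> List[int]:
            """The set G(c) of providers this query needs."""
            return list(self.min_quality.keys())

        def satisfied_by(self, p: int, level_quality: float) -> bool:
            """A client accepts any quality at or above the requested minimum."""
            return level_quality >= self.min_quality[p]


    # Example: provider 0 offers three levels; the client needs quality >= 2.0 from it.
    provider0 = Provider(quality={1: 1.0, 2: 2.0, 3: 3.0}, fee={1: 5.0, 2: 8.0, 3: 12.0})
    query0 = ClientQuery(min_quality={0: 2.0})
    assert query0.satisfied_by(0, provider0.quality[3])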

More general models of queries are possible, e.g., by including a DAG modeling
the structure of the query and query execution planning (see [92] for details).
For ease of exposition, we do not include such detailed structure here, but it
can be added at the expense of more complicated notation.

Footnote 1: A common suggestion for guaranteeing privacy is to add Laplace noise to data
provided to data markets; see, e.g., [25, 57].
Footnote 2: We distinguish data providers based on data, i.e., one data provider selling
multiple data sets is treated as multiple data providers.
Footnote 3: We distinguish clients based on queries, i.e., one client sending multiple queries
is treated as multiple clients.

Depending on the situation, the client may or may not be expected to pay the
data cloud for access. If the clients are internal to the company running the
data cloud, client payments are unnecessary. However, in many situations the
client is expected to pay the data cloud for access to the data. There are many
different types of payment structures that could be considered. Broadly, these
fall into two categories: (i) subscription-based (e.g., Azure DataMarket [65])
or (ii) per-query-based (e.g. Infochimps [44]).

In this paper, we do not focus on (or model) the design of payment structure
between the clients and the data cloud. Instead, we focus on the operational
task of minimizing the cost of the data cloud operation (i.e., bandwidth and
data purchasing costs). This focus is motivated by the fact that minimizing
the operation costs improves the profit of the data cloud regardless of how
clients are charged. Interested readers can find analyses of the design of client
pricing strategies in [49, 51, 57].

9.3 Modeling a Geo-Distributed Data Cloud


The role of the data cloud in this marketplace is as an aggregator and in-
termediary. We model the data cloud as a geographically distributed cloud
consisting of D data centers, d ∈ D = {1, 2, . . . , D}. Each data center aggre-
gates data from geographically separate local data providers, and data from
data providers may be (and often is) replicated across multiple data centers
within the data cloud.

Note that, even for the same data with the same quality, data transfer from
the data providers to the data cloud is not a one-time event due to the need of
the data providers to update the data over time. We target the modeling and
optimization of the data cloud within a fixed time horizon, given the assumption
that queries from clients are known beforehand or can be predicted accurately.
This assumption is consistent with previous work [92, 73] and reports from
other organizations [94, 55]. Online versions of the problem are also of interest,
but are not the focus of this paper.

Modeling costs

Our goal is to provide a design that minimizes the operational costs of a data
cloud. These costs include both data purchasing and bandwidth costs. In order
to describe these costs, we use the following notation, which is summarized in
Figure 9.1 (see Footnote 4).

Figure 9.1: An overview of the interaction between data providers, the data
cloud, and clients. The dotted line encircling the data centers (DC) represents
the geo-distributed data cloud. Data providers and clients interact only with
the cloud. Data provider p sends data of quality q(l, p) to data center d,
and the corresponding operation cost is \beta_{p,d}(l) y_{p,d}(l). Similarly, data center d
sends data of quality q(l, p) to client c, and the corresponding execution cost is
\alpha_{d,c}(l, p) x_{d,c}(l, p). In bulk data contracting, the corresponding purchasing cost
is f(l, p) z(l, p). In per-query data contracting, the corresponding purchasing
cost is f(l, p) x_{d,c}(l, p).

x_{d,c}(l, p) ∈ {0, 1}: x_{d,c}(l, p) = 1 if and only if data of quality q(l, p), originating
from data provider p, is transferred from data center d to client c.

\alpha_{d,c}(l, p): cost (including bandwidth and/or latency) to transfer data of quality
q(l, p), originating from data provider p, from data center d to client c.

y_{p,d}(l) ∈ {0, 1}: y_{p,d}(l) = 1 if and only if data of quality q(l, p) is transferred
from data provider p to data center d.

\beta_{p,d}(l): cost (including bandwidth and/or latency) to transfer data of quality
q(l, p) from data provider p to data center d.

z(l, p) ∈ {0, 1}: z(l, p) = 1 if and only if data of quality q(l, p), originating
from data provider p, is transferred to the data cloud.

f(l, p): purchasing cost of data with quality q(l, p), originating from data
provider p.

Footnote 4: Throughout, subscript indices refer to data transfer "from, to" a location, and
parenthesized indices refer to data characteristics (e.g., quality, from which data provider).

Given the above notations, the costs of the data cloud can be broken into three
categories:

(i) The operation cost due to transferring data of all quality levels from
data providers to data centers is
    OperCost = \sum_{p=1}^{P} \sum_{l=1}^{L_p} \sum_{d=1}^{D} \beta_{p,d}(l) y_{p,d}(l).    (9.1)

(ii) The execution cost due to transferring data of all quality levels from
data centers to clients is
    ExecCost = \sum_{c=1}^{C} \sum_{p \in G(c)} \sum_{l=1}^{L_p} \sum_{d=1}^{D} \alpha_{d,c}(l, p) x_{d,c}(l, p).    (9.2)

(iii) The purchasing cost (PurchCost) due to buying data from the data
provider could result from a variety of differing contract styles. In this
paper we consider two extreme options: per-query and bulk data con-
tracting. These are the most commonly adopted strategies for data
purchasing today.
In per-query data contracting, the data provider charges the data cloud
a fixed rate for each query that uses the data provided by the data
provider. So, if the same data is used for two different queries, then the
data cloud pays the data provider twice. Given a per-query fee f (l, p)
for data q(l, p), the total purchasing cost is
    PurchCost(query) = \sum_{c=1}^{C} \sum_{p \in G(c)} \sum_{l=1}^{L_p} \sum_{d=1}^{D} f(l, p) x_{d,c}(l, p).    (9.3)

In bulk data contracting, the data cloud purchases the data in bulk
and then can distribute it without owing future payments to the data
provider. Given a one-time fee f (l, p) for data q(l, p), the total purchas-
ing cost is
    PurchCost(bulk) = \sum_{p=1}^{P} \sum_{l=1}^{L_p} f(l, p) z(l, p).    (9.4)
To keep the presentation of the paper simple, we focus on the per-query data
contracting model throughout the body of the paper and discuss the bulk data
contracting model (which is simpler) in Appendix C.3.

Cost Optimization

Given the cost models described above, we can now represent the goal of the
data cloud via the following integer linear program (ILP), where OperCost,
ExecCost, and PurchCost are as described in equations (9.1), (9.2) and (9.3),
respectively.

    \min_{x, y}  OperCost + ExecCost + PurchCost                                         (9.5)

    subject to
        x_{d,c}(l, p) \le y_{p,d}(l),                                    \forall c, p, l, d     (9.5a)
        \sum_{l=1}^{L_p} \sum_{d=1}^{D} x_{d,c}(l, p) = 1,               \forall c, p \in G(c)  (9.5b)
        \sum_{l=1}^{L_p} \sum_{d=1}^{D} x_{d,c}(l, p) q(l, p) \ge w_c(p), \forall c, p \in G(c)  (9.5c)
        x_{d,c}(l, p) \ge 0,                                             \forall c, p, l, d     (9.5d)
        y_{p,d}(l) \ge 0,                                                \forall p, l, d        (9.5e)
        x_{d,c}(l, p), y_{p,d}(l) \in \{0, 1\},                          \forall c, p, l, d     (9.5f)

The constraints in this formulation warrant some discussion. Constraint (9.5a)


states that any data transferred to some client must already have been trans-
ferred from its data provider to the data cloud (see Footnote 5). Constraint (9.5b) ensures
that each client must get the data it requested, and constraint (9.5c) ensures
that the minimum quality requirement of each client must be satisfied. The
remaining constraints state that the decision variables are binary and nonneg-
ative.
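
For concreteness, the following is a sketch of the per-query formulation (9.5)
written with the PuLP modeling library; the tiny cost tables and dimensions are
invented purely to illustrate the structure of the program, and any off-the-shelf
ILP solver could be substituted.

    from pulp import LpProblem, LpVariable, LpMinimize, LpBinary, lpSum

    D, C, P, L = 2, 2, 1, 2                        # data centers, clients, providers, quality levels
    G = {c: [0] for c in range(C)}                 # providers requested by each client, G(c)
    alpha = {(d, c, l, p): 1.0 + d + c             # execution cost alpha_{d,c}(l, p)
             for d in range(D) for c in range(C) for l in range(L) for p in range(P)}
    beta = {(p, d, l): 2.0 + d                     # operation cost beta_{p,d}(l)
            for p in range(P) for d in range(D) for l in range(L)}
    fee = {(l, p): 5.0 * (l + 1) for l in range(L) for p in range(P)}   # per-query fee f(l, p)
    q = {(l, p): float(l + 1) for l in range(L) for p in range(P)}      # quality q(l, p)
    w = {(c, p): 1.0 for c in range(C) for p in range(P)}               # minimum quality w_c(p)

    prob = LpProblem("data_cloud_cost", LpMinimize)
    x = {(d, c, l, p): LpVariable(f"x_{d}_{c}_{l}_{p}", cat=LpBinary)
         for c in range(C) for p in G[c] for l in range(L) for d in range(D)}
    y = {(p, d, l): LpVariable(f"y_{p}_{d}_{l}", cat=LpBinary)
         for p in range(P) for d in range(D) for l in range(L)}

    # Objective (9.5): operation cost + execution cost + per-query purchasing cost.
    prob += (lpSum(beta[p, d, l] * y[p, d, l] for (p, d, l) in y)
             + lpSum((alpha[d, c, l, p] + fee[l, p]) * x[d, c, l, p] for (d, c, l, p) in x))

    for c in range(C):
        for p in G[c]:
            for l in range(L):
                for d in range(D):
                    prob += x[d, c, l, p] <= y[p, d, l]                                # (9.5a)
            prob += lpSum(x[d, c, l, p] for l in range(L) for d in range(D)) == 1      # (9.5b)
            prob += lpSum(q[l, p] * x[d, c, l, p]
                          for l in range(L) for d in range(D)) >= w[c, p]              # (9.5c)

    prob.solve()
    print("total cost:", prob.objective.value())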

An important observation about the formulation above is that data purchas-


ing/placement decisions are decoupled across data providers, i.e., the data
purchasing/placement decision for data from one data provider does not im-
pact the data purchasing/placement decision for any other data providers.
Thus, we frequently drop the index p.
Footnote 5: For the bulk data contracting model, one more constraint, y_{p,d}(l) \le z(l, p),
\forall p, l, d, is required. This constraint states that any data placed in the data cloud must
be purchased by the data cloud.
Note that there are a variety of practical issues that we have not incorporated
into the formulation in (9.5) in order to minimize notational complexity, but
which can be included without affecting the results described in the following.
A first example is that a minimal level of data replication is often desired
for fault tolerance and disaster recovery reasons. This can be added to (9.5)
by additionally considering constraints of the form \sum_{d=1}^{D} y_{p,d}(l) \ge k z(l, p),
where k denotes the minimum required number of copies. Similarly, privacy
concerns often lead to regulatory constraints on data movement. As a result,
regulatory restrictions may prohibit some data from being copied to certain
data centers, thus constraining data placement and replication. This can be
included by adding constraints of the form y_{p,d}(l) = 0 to (9.5), where p and d
denote the corresponding data provider and data center, respectively. Finally,
in some cases it is desirable to enforce SLA constraints on the latency of
delivery to clients. Such constraints can be added by including constraints of
the form \sum_{p \in G(c)} \sum_{l=1}^{L_p} \sum_{d=1}^{D} \alpha_{d,c}(l, p) x_{d,c}(l, p) \le r_c, where r_c denotes the SLA
requirement of client c.
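
The following sketch indicates how these three families of constraints could be
attached to a model like the one in the previous snippet; the helper name and its
arguments are illustrative, and z refers to the purchase indicators of the bulk
contracting variant.

    from pulp import lpSum

    def add_side_constraints(prob, x, y, z, alpha, G, L, D, k=2, blocked=(), sla=None):
        """Add replication, regulatory, and SLA constraints to the data cloud ILP.

        x, y, z are variable dictionaries keyed as x[d, c, l, p], y[p, d, l], z[l, p];
        alpha holds execution costs; blocked is a list of (provider, data center) pairs;
        sla maps a client c to its bound r_c. All of these are illustrative names.
        """
        # Fault tolerance: each purchased quality level is stored in at least k data centers.
        for (l, p) in z:
            prob += lpSum(y[p, d, l] for d in range(D)) >= k * z[l, p]
        # Regulation: data from provider p may not be placed in data center d.
        for (p, d) in blocked:
            for l in range(L):
                prob += y[p, d, l] == 0
        # SLA: total delivery cost/latency to client c is bounded by r_c.
        if sla is not None:
            for c, r_c in sla.items():
                prob += lpSum(alpha[d, c, l, p] * x[d, c, l, p]
                              for p in G[c] for l in range(L) for d in range(D)) <= r_c
        return prob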

We refer the reader to [92, 93, 73] for more discussions of these additional
practical constraints. Each paper includes a subset of these factors in the
design of geo-distributed data analytics systems, but does not model data
purchasing decisions.
Chapter 10

OPTIMAL DATA PURCHASING & DATA PLACEMENT

Given the model of a geo-distributed data cloud described in the previous sec-
tion, the design task is now to provide an algorithm for computing the optimal
data purchasing and data placement/replication decisions, i.e., to solve the data
cloud cost minimization problem in (9.5). Unfortunately, this cost minimization
problem is an ILP, which is computationally difficult in general (see Footnote 1).

A classic NP-hard ILP is the uncapacitated facility location problem (UFLP) [52].
In the uncapacitated facility location problem, there is a set of I clients and
J potential facilities. Facility j ∈ J costs f_j to open and can serve clients
i ∈ I with cost c_{i,j}. The task is to determine the set of facilities that serves
the clients with minimal cost.

Our first result, stated below, highlights that cost minimization for a geo-
distributed data cloud can be reduced to the uncapacitated facility location
problem, and vice-versa. Thus, the task of operating a data cloud can then
be viewed as a facility location problem, where opening a facility parallels
purchasing a specific quality level from a data provider and placing it in a
particular data center in the data cloud.

Theorem 8 The cost minimization problem for a geo-distributed data cloud


given in (9.5) is NP-hard.

The proof of Theorem 8 (given in Appendix C) provides a reduction both to


and from the uncapacitated facility location problem. Importantly, the proof
of Theorem 8 serves a dual purpose: it both characterizes the hardness of
the data cloud cost minimization problem and highlights that algorithms for
the facility location problem can be applied in this context. Given the large
literature on facility location, this is important.

More specifically, the reduction leading to Theorem 8 highlights that the data
cloud optimization problem is equivalent to the non-metric uncapacitated facility
location problem – every instance of either problem can be
written as an instance of the other. While constant-factor polynomial running
time approximation algorithms are given for the metric uncapacitated facility
location problem in [17, 38, 45], in the more general non-metric case the best
known polynomial running time algorithm achieves a log(C)-approximation
via a greedy algorithm, where C is the number of clients [42]. This is the best
worst-case guarantee possible (unless NP has slightly superpolynomial time
algorithms, as proven in [30]); however, some promising heuristics have been
proposed for the non-metric case, e.g., [26, 8, 1, 48, 89, 36].

Footnote 1: Note that previous work on geo-distributed data analytics, where data providers
and data purchasing were not considered, already leads to an ILP with limited structure. For
example, [92] suggests only heuristic algorithms with no analytic guarantees.

Nevertheless, even though our problem can, in general, be viewed as the non-
metric uncapacitated facility location problem, it does have structure in real-world
situations that we can exploit to develop practical algorithms.

In particular, in this section we begin with the case of a data cloud made up
of a single data center. We show that, in this case, there is a structure that
allows us to design an algorithm with polynomial running time that gives an
exact solution (§10.1). Then, we move to the case of a data cloud made up
of geo-distributed data centers and highlight how to build on the algorithm
for the single data center case to provide an algorithm, Datum, for the general
case (§10.2). Importantly, Datum allows decomposition of the management of
data purchasing (operations outside of the data cloud) and data placement
(operations inside the data cloud). This feature of Datum is crucial in practice
because it means that the algorithm allows a data cloud to manage internal
operations without factoring in data purchasing costs, mimicking operations
today. While we do not provide analytic guarantees for Datum (as expected
given the reduction to/from the non-metric facility location problem), we show
that the heuristic performs well in practical settings using a case study in §11.

10.1 An exact solution for a single data center


We begin our analysis by focusing on the case of a single data center, which
interacts with multiple data providers and multiple clients. The key observa-
tion is that, if the execution costs associated with transferring different quality
levels of the same data are the same, i.e., \forall l, \alpha_c(l) = \alpha_c, then the execution
cost becomes a constant which is independent of the data purchasing and data
placement decisions, as shown in (10.1).

    ExecCost = \sum_{c=1}^{C} \sum_{l=1}^{L} \alpha_c x_c(l) = \sum_{c=1}^{C} \alpha_c \Big( \sum_{l=1}^{L} x_c(l) \Big) = \sum_{c=1}^{C} \alpha_c    (10.1)

The assumption that the execution costs are the same across quality levels is
natural in many cases. For example, if quality levels correspond to the level of
noise added to numerical data, then the size of the data sets will be the same.
We adopt this assumption in what follows.

This assumption allows the elimination of the execution cost term from the
objective. Additionally, we can simplify notation by removing the index d for
the data center. Thus, in per-query data contracting, the data cloud optimiza-
tion problem can be simplified to (10.2). (We discuss the case of bulk data
contracting in Appendix C.3.)

    minimize   \sum_{l=1}^{L} \beta(l) y(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} f(l) x_c(l)          (10.2)

    subject to
        x_c(l) \le y(l),                      \forall c, l
        \sum_{l=w_c}^{L} x_c(l) = 1,          \forall c                                          (10.2a)
        x_c(l) \ge 0,                         \forall c, l
        y(l) \ge 0,                           \forall l
        x_c(l), y(l) \in \{0, 1\},            \forall c, l

Note that constraint (10.2a) is a contraction of (9.5b) and (9.5c), and simply
means that any client c must be given exactly one quality level at or above w_c,
the minimum required quality level (see Footnote 2). While this problem is still an ILP, in
this case there is a structure that can be exploited to provide a polynomial
time algorithm that can find an exact solution. In particular, we prove in
Appendix C.1 that the solution to (10.2) can be found by solving the linear
program (LP) given in (10.3).

Footnote 2: While the two constraints are equivalent for an ILP, they lead to different feasible
sets when considering its LP-relaxation; in particular, facility location algorithms based on
LP-relaxations, such as randomized rounding algorithms, need to use the contracted version of
the constraints to preserve the O(log C)-approximation ratio for non-metric facility location.
It is equivalent to the reformulation given in Appendix C and does not introduce infinite
costs that may lead to numerical errors.

    minimize   \sum_{l=1}^{L} \beta(l) y(l) + \sum_{i=1}^{L} \sum_{l=i}^{L} S_i f(l) \chi_i(l)   (10.3)

    subject to
        \chi_i(l) \le y(l),                   \forall i, l
        \sum_{l=i}^{L} \chi_i(l) = 1,         \forall i
        \chi_i(l) \ge 0,                      \forall i, l
        y(l) \ge 0,                           \forall l

In (10.3), S_i is the number of clients who require a minimum quality level
of i, and \chi_i(l) = 1 indicates that the clients whose minimum required quality level
is i are served data purchased at quality level l.

Note that this LP is not directly obtained by relaxing the integer constraints
in (10.2), but is obtained from relaxing the integer constraints in a reformu-
lation of (10.2) described in Appendix C.1. The theorem below provides a
tractable, exact algorithm for cost minimization in a data cloud made up of a
single data center. (A proof is given in Appendix C.1).

Theorem 9 There exists a binary optimal solution to the linear relaxation
program in (10.3) which is an optimal solution of the integer program in (10.2)
and can be found in polynomial time.

In summary, the following gives a polynomial time algorithm which yields the
optimal solution of (10.2).
Step 1: Rewrite (10.2) in the form given by (C.4).
Step 2: Solve the linear relaxation of (C.4), i.e., (10.3). If it gives an inte-
gral solution, this solution is an optimal solution of (10.2), and the algorithm
finishes. Otherwise, denote the fractional solution of the previous step by
{\chi^r(l), y^r(l)} and continue to the next step.
Step 3: Find m_i \in \{i, . . . , n\} such that \sum_{l=i}^{m_i - 1} y^r(l) < 1 and \sum_{l=i}^{m_i} y^r(l) \ge 1.
(See Appendix C.1 for the existence of \{m_i\}.) Then express \{\chi_i(l)\} as a function
of \{y(l)\} based on (C.6). Substitute the expressions of \{\chi_i(l)\} in terms of \{y(l)\}
into (10.3) to obtain an instance of (C.7). Solve the linear programming problem
(C.7) and find an optimal solution that is also an extreme point of (C.7) (see Footnote 3).
This yields a binary optimal solution of (C.7). Use transformation (C.6) to get
a binary optimal solution of (10.3), which can be reformulated as an optimal
solution of (10.2) from the definition of \{\chi_i(l)\}.
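
To illustrate Step 2, the sketch below builds the LP (10.3) with PuLP for a small
synthetic instance and checks whether the relaxation already returns a binary
solution; the instance data is made up, quality levels are indexed from 0 for
convenience, and the rounding of Step 3, which relies on the reformulation in
Appendix C, is not reproduced here.

    from pulp import LpProblem, LpVariable, LpMinimize, lpSum, value

    L = 3
    beta = {l: 2.0 * (l + 1) for l in range(L)}     # operation cost of quality level l
    fee = {l: 4.0 + l for l in range(L)}            # per-query fee of quality level l
    S = {i: [5, 3, 2][i] for i in range(L)}         # number of clients whose minimum level is i

    lp = LpProblem("single_dc_lp", LpMinimize)
    y = {l: LpVariable(f"y_{l}", lowBound=0) for l in range(L)}
    chi = {(i, l): LpVariable(f"chi_{i}_{l}", lowBound=0)
           for i in range(L) for l in range(i, L)}

    lp += (lpSum(beta[l] * y[l] for l in range(L))
           + lpSum(S[i] * fee[l] * chi[i, l] for i in range(L) for l in range(i, L)))
    for i in range(L):
        lp += lpSum(chi[i, l] for l in range(i, L)) == 1     # every client class is served
        for l in range(i, L):
            lp += chi[i, l] <= y[l]                          # only purchased levels can serve

    lp.solve()
    integral = all(abs(value(v) - round(value(v))) < 1e-6
                   for v in list(y.values()) + list(chi.values()))
    print("objective:", value(lp.objective), "binary solution:", integral)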

10.2 The design of Datum


Unlike the data cloud cost minimization problem for a single data center, the
general data cloud cost minimization is NP-hard. In this section, we build on
the exact algorithm for cost minimization in a data cloud made up of a single
data center (§10.1) to provide an algorithm, Datum, for cost minimization in
a geo-distributed data cloud.

The idea underlying Datum is to, first, optimize data purchasing decisions as if
the data market was made up of a single data center (given carefully designed
“transformed” costs), which can be done tractably as a result of Theorem 9.
Then, second, Datum optimizes data placement/replication decisions given the
data purchasing decisions.

Before presenting Datum, we need to reformulate the general cost minimiza-


tion ILP in (9.5). Recall that (9.5) is separable across providers, thus we
can consider independent optimizations for each provider, and drop the in-
dex p throughout. Second, we denote the set of all possible subsets of data
centers, e.g., {{d_1}, {d_2}, . . . , {d_1, d_2}, {d_1, d_3}, . . .}, by V (see Footnote 4). Further, define
\beta_v(l) = \sum_{d \in v} \beta_d(l) and \alpha_{v,c}(l) = \min_{d \in v} \{\alpha_{d,c}(l)\}. Given this change, we
define y_v(l) = 1 if and only if data with quality level l is placed in (and only
in) data centers d \in v, and x_{v,c}(l) = 1 if and only if data with quality level l
is transferred to client c from some data center d \in v. These reformulations
allow us to convert (9.5) to (10.4) as follows.

Footnote 3: This step can be finished in polynomial time [11].
Footnote 4: Note that, in practice, the number of data centers is usually small, e.g., 10–20
worldwide. Further, to avoid exponential explosion of V, the subsets included in V can be
limited to only have a constant number of data centers, where the constant is determined by
the maximal number of replicas to be stored.

    minimize   \sum_{l=1}^{L} \sum_{v=1}^{V} \beta_v(l) y_v(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} \alpha_{v,c}(l) x_{v,c}(l)
               + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} f(l) x_{v,c}(l)                    (10.4)

    subject to
        x_{v,c}(l) \le y_v(l),                                \forall c, l       (10.4a)
        \sum_{l=w_c}^{L} \sum_{v=1}^{V} x_{v,c}(l) = 1,       \forall c          (10.4b)
        \sum_{v=1}^{V} y_v(l) \le 1,                          \forall l          (10.4c)
        \sum_{v=1}^{V} x_{v,c}(l) \le 1,                      \forall c, l       (10.4d)
        x_{v,c}(l) \ge 0,                                     \forall v, c, l    (10.4e)
        y_v(l) \ge 0,                                         \forall v, l       (10.4f)
        x_{v,c}(l), y_v(l) \in \{0, 1\},                      \forall v, c, l    (10.4g)

Compared to (9.5), the main difference is that (10.4) has two extra con-
straints (10.4c) and (10.4d). Constraint (10.4c) ensures that data can only be
placed in at most one subset of data centers across V, and constraint (10.4d)
follows from constraint (10.4b). Using this reformulation, Datum can now be
explained in two steps.
Step 1: Solve (10.5) while treating the geo-distributed data cloud as a single
data center. Specifically, define Y(l) = \sum_{v=1}^{V} y_v(l) and X_c(l) = \sum_{v=1}^{V} x_{v,c}(l).
Note that Y(l) and X_c(l) are 0-1 variables by constraints (10.4c) and (10.4d).
Further, ignore the middle term in the objective, i.e., the ExecCost. Finally,
for each quality level l, consider a “transformed” cost β ∗ (l). We discuss how
to define β ∗ (l) below. This leaves the “single data center” problem (10.5).
Crucially, this formulation can be solved optimally in polynomial time using
the results for the case of a data cloud made up of a single data center (§10.1).

    minimize   \sum_{l=1}^{L} \beta^*(l) Y(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} f(l) X_c(l)        (10.5)

    subject to
        X_c(l) \le Y(l),                      \forall c, l
        \sum_{l=w_c}^{L} X_c(l) = 1,          \forall c
        X_c(l) \ge 0,                         \forall c, l
        Y(l) \ge 0,                           \forall l
        X_c(l), Y(l) \in \{0, 1\},            \forall c, l

The remaining issue is to define β ∗ (l). Note that the reason for using trans-
formed costs β ∗ (l) instead of βv (l) is that βv (l) cannot be known precisely
without also optimizing the data placement. Thus, in defining β ∗ (l) we need
to anticipate the execution costs that result from data placement and repli-
cation given the purchase of data with quality level l. This anticipation then
allows a decomposition of data purchasing and data placement decisions. Note
that the only inaccuracy in the heuristic comes from the mismatch between
\beta^*(l) and \min_v \{\beta_v(l) + \sum_{c \in C^*(l)} \alpha_{v,c}(l)\}, where C^*(l) is the set of customers who
buy at quality level l in an optimal solution – if these match for the minimizer
of (9.5) then the heuristic is exact. Indeed, in order to minimize the cost of
locating quality levels to data centers, and allocating clients to data centers
and quality levels, the set of data centers v where an optimal solution chooses
to put quality level l has to minimize the cost of data transfer in the set v and
allocating all clients who get data at quality level l, i.e. C ∗ (l), to this set of
data centers v.

Many choices are possible for the transformed costs \beta^*(l). A conservative
choice is \beta^*(l) = \min_v \beta_v(l), which results in a solution (with Step 2) whose
OperCost + PurchCost is a lower bound to the corresponding costs in the
optimal solution of (9.5) (see Footnote 5). However, it is natural to think that more aggressive
estimates may be valuable. To evaluate this, we have performed experiments in
the setting of the case study (see §11) using the following parametric form:
\beta^*(l) = \min_v \{\beta_v(l) + \mu_1 \sum_{l' \le l} \sum_{c: w_c = l'} \alpha_{v,c}(l') e^{-\mu_2 (l - l')}\},
where \mu_1 and \mu_2 are parameters. This
form generalizes the conservative choice by providing a weighting of \alpha_{v,c}(l')
based on the “distance” of the quality deviation between l' and the target
quality level l. The idea behind this is that a client is more likely to be
served data with quality level close to the requested minimum quality level of
the client. Here we use the exponential decay term e^{-\mu_2 (l - l')} to capture the
possibility of serving data with quality level l to a client with minimum
quality level l' \le l. Interestingly, in the setting of our case study, the best
design is \mu_1 = \mu_2 = 0, i.e., the conservative estimate \beta^*(l) = \min_v \beta_v(l), and so
we adopt this \beta^*(l) in Datum.

Footnote 5: However, the ExecCost cannot be bounded, thus we cannot obtain a bound for
the total cost. The proof of this is simple and is not included in the paper due to space limits.
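
A minimal sketch of this conservative choice is given below: it enumerates candidate
subsets of data centers (capped at a maximum replication factor), computes \beta_v(l)
as the sum of the per-data-center costs, and takes \beta^*(l) = \min_v \beta_v(l). The function
name and the small cost table are illustrative only.

    from itertools import combinations

    def transformed_costs(beta, num_dcs, num_levels, max_replicas=2):
        """beta[d][l]: cost of transferring quality level l to data center d."""
        subsets = [v for r in range(1, max_replicas + 1)
                   for v in combinations(range(num_dcs), r)]
        beta_v = {(v, l): sum(beta[d][l] for d in v)
                  for v in subsets for l in range(num_levels)}
        beta_star = {l: min(beta_v[v, l] for v in subsets) for l in range(num_levels)}
        return subsets, beta_v, beta_star

    # Example with 3 data centers and 2 quality levels (costs are made up).
    beta = [[1.0, 2.0], [1.5, 1.5], [3.0, 0.5]]
    print(transformed_costs(beta, num_dcs=3, num_levels=2)[2])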
Step 2: At the completion of Step 1 the solution (X, Y ) to (10.5) determines
which quality levels should be purchased and which quality level should be
delivered to each client. What remains is to determine data placement and
data replication levels. To accomplish this, we substitute (X, Y ) into (10.4),
which yields (10.6).

    minimize   \sum_{l=1}^{L} \sum_{v=1}^{V} \beta_v(l) y_v(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} \alpha_{v,c}(l) x_{v,c}(l)
               + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} f(l) x_{v,c}(l)                    (10.6)

    subject to
        x_{v,c}(l) \le y_v(l),                                \forall c, l       (10.6a)
        \sum_{l=w_c}^{L} \sum_{v=1}^{V} x_{v,c}(l) = 1,       \forall c          (10.6b)
        \sum_{v=1}^{V} y_v(l) = Y(l),                         \forall l          (10.6c)
        \sum_{v=1}^{V} x_{v,c}(l) = X_c(l),                   \forall c, l       (10.6d)
        x_{v,c}(l) \ge 0,                                     \forall v, c, l    (10.6e)
        y_v(l) \ge 0,                                         \forall v, l       (10.6f)
        x_{v,c}(l), y_v(l) \in \{0, 1\},                      \forall v, c, l    (10.6g)

The key observation is that this is no longer a computationally hard ILP. In


fact, the inclusion of (X, Y ) means that it can be solved in closed form.

Let C(l) denote the set of clients that purchase data with quality level l, i.e.,
C(l) = {c : X_c(l) = 1}. Then (10.7) gives the optimal solution of (10.6). (A
proof is given in Appendix C.2.)

    y_v(l) = \begin{cases} 1, & \text{if } Y(l) = 1 \text{ and } v = \arg\min_{v} \{\beta_v(l) + \sum_{c \in C(l)} \alpha_{v,c}(l)\}, \\ 0, & \text{otherwise.} \end{cases}    (10.7a)

    x_{v,c}(l) = \begin{cases} y_v(l), & \text{if } c \in C(l), \\ 0, & \text{otherwise.} \end{cases}    (10.7b)
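
The sketch below shows how (10.7) could be evaluated given the purchased levels and
the client assignment produced by Step 1; the function signature and the input
structures are ours, not part of the formal development.

    def place_and_route(levels_purchased, clients_of_level, subsets, beta_v, alpha_v):
        """levels_purchased: levels l with Y(l) = 1.
        clients_of_level[l]: the set C(l) of clients served at level l.
        beta_v[v, l], alpha_v[v, c, l]: subset-level operation and execution costs."""
        y, x = {}, {}
        for l in levels_purchased:
            best_v = min(subsets,
                         key=lambda v: beta_v[v, l]
                         + sum(alpha_v[v, c, l] for c in clients_of_level[l]))
            y[best_v, l] = 1                              # (10.7a)
            for c in clients_of_level[l]:
                x[best_v, c, l] = 1                       # (10.7b)
        return y, x

    # Example call shape (illustrative): one purchased level, two candidate subsets.
    subsets = [(0,), (1,)]
    y_sol, x_sol = place_and_route([1], {1: [0, 1]}, subsets,
                                   beta_v={((0,), 1): 3.0, ((1,), 1): 2.5},
                                   alpha_v={((0,), 0, 1): 1.0, ((0,), 1, 1): 1.0,
                                            ((1,), 0, 1): 2.0, ((1,), 1, 1): 2.0})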

Figure 11.1: Illustration of the near-optimality of Datum as a function of the
complexity of client requests (i.e., the average number of providers data must
be procured from in order to complete a client request). (a) Total cost and
(b) bandwidth cost, each normalized by its optimum, versus the number of
providers per client request, for NearestDC, OptBand, Datum, and OptCost.

Chapter 11

CASE STUDY

We now illustrate the performance of Datum using a case study of a geo-


distributed data cloud running in North America. While the setting we use is
synthetic, we attempt to faithfully model realistic geography for data centers
in the data cloud, data providers, and clients. Our focus is on quantifying
the overall cost (including data purchasing and bandwidth/latency costs) of
Datum compared to two existing designs for geo-distributed data analytics
systems and the optimal. To summarize, the highlights of our analysis are

1. Datum provides consistently lower cost (> 45% lower) than existing designs
for geo-distributed data analytics systems.
2. Datum achieves near optimal total cost (within 1.6%) of optimal.
3. Datum achieves reduction in total cost by significantly lowering purchas-
ing costs without sacrificing bandwidth/latency costs, which stay typically
within 20-25% of the minimal bandwidth/latency costs necessary for deliv-
ery of the data to clients.

11.1 Experimental setup


The following outlines the setting in which we demonstrate the empirical per-
formance of Datum.

Geo-distributed data cloud. We consider a geographically distributed data cloud


with 10 data centers located in California, Washington, Oregon, Illinois, Geor-
gia, Virginia, Texas, Florida, North Carolina, and South Carolina. The loca-
tions of the data centers in our experiments mimic those in [37] and include
the locations of Google’s data centers in the United States.

Clients. Client locations are picked randomly among US cities, weighted pro-
portionally to city populations. Each client requests data from a subset of
data providers, chosen i.i.d. from a Uniform distribution. Unless otherwise
specified, the average number of providers per client request is P/2. The qual-
ity level requested from each chosen provider follows a Zipf distribution with
mean Lp /2 and shape parameter 30. P and Lp are defined as in §9.1 and §9.2.
We choose a Zipf distribution motivated by the fact that popularity typically
follows a heavy-tailed distribution [68]. Results are averaged over 20 random
instances. We observe that the results of the 20 instances for the same plot
are very close (within 5%), and thus do not show the confidence intervals on
the plots.

Data providers. We consider 20 data providers. We place data providers in


the second and third largest cities within a state containing a data center.
This ensures that the data providers are nearby, but not right on top of, data
center and client locations.

Operation and execution costs. To set operation and execution costs, we com-
pute the geographical distances between data centers, clients and providers.
The operation and execution costs are proportional to the geographical dis-
tances, such that the costs are effectively one dollar per gigameter. This
captures both the form of bandwidth costs adopted in [93] and the form of
latency costs adopted in [73].

Data purchasing costs. The per-query purchasing costs are drawn i.i.d. from
a Pareto distribution with mean 10 and shape parameter 2 unless otherwise
specified. We choose a Pareto distribution motivated by the fact that incomes
and prices often follow heavy-tailed distributions [68]. Results were averaged
over 20 random instances. To study the sensitivity of Datum to the relative
size of purchasing and bandwidth costs, we vary the ratio of them between
(0.01, 100).
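
The snippet below sketches how inputs with these distributional shapes could be
generated (truncated Zipf quality requests and Pareto per-query fees); the exact
parameters used in the thesis, e.g., the Zipf shape of 30 and the population
weighting of client locations, are not reproduced.

    import numpy as np

    rng = np.random.default_rng(0)
    P, L_p, C = 20, 8, 100

    # Quality level requested per (client, provider): truncated Zipf over {1, ..., L_p}.
    s = 2.0                                        # illustrative shape parameter
    weights = 1.0 / np.arange(1, L_p + 1) ** s
    weights /= weights.sum()
    requested_level = rng.choice(np.arange(1, L_p + 1), size=(C, P), p=weights)

    # Per-query purchasing fees: Pareto with shape a = 2 and mean 10 (scale x_m = 5,
    # since the Pareto mean is a * x_m / (a - 1)).
    a, x_m = 2.0, 5.0
    fees = x_m * (1.0 + rng.pareto(a, size=(L_p, P)))
    print(requested_level.shape, fees.mean())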

Baselines. We compare the performance of Datum to the following baselines.


Figure 11.2: Illustration of Datum's sensitivity to query parameters. (a) varies
the heaviness of the tail in the distribution of purchasing fees (the shape
parameter of the Pareto per-query fee distribution). (b) varies the number of
quality levels available. Both panels show total cost, normalized by the optimal
total cost, for NearestDC, OptBand, and Datum. Note that Figure 11.1 sets the
shape parameter of the Pareto governing purchasing fees to 2 and includes 8
quality levels.

• OptCost computes the optimal solution to the data cloud cost minimization
problem by solving the integer linear programming (9.5). Note that this
requires solving an NP-hard problem, and so is not feasible in practice. We
include it in order to benchmark the performance of Datum.

• OptBand computes the optimal solution to the bandwidth cost minimization


problem. It is obtained by minimizing only the operation cost and execution
cost in the objective of (9.5). Bandwidth cost minimization is commonly
considered as a primary goal for cost minimization in geo-distributed data
analytics systems [92]. Due to computational complexity, heuristics are usu-
ally applied to minimize the bandwidth cost. Here, instead of implementing
heuristic algorithms, we optimistically use OptBand in order to lower
bound the achievable performance. Note that this also requires solving an
NP-hard problem and thus is not feasible in practice.

• NearestDC is a greedy heuristic for the total cost minimization problem


that is often applied in practice. It serves the clients exactly what they ask
for by purchasing the data and storing it at the data center closest to the
data provider.
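
The following is a sketch of the NearestDC heuristic just described: each requested
quality level is purchased as-is and stored at the data center nearest to its
provider. The cost bookkeeping and array layout are illustrative assumptions.

    import numpy as np

    def nearest_dc_cost(dist_provider_dc, dist_dc_client, requested_level, fee):
        """dist_provider_dc[p, d], dist_dc_client[d, c]: pairwise distances (costs).
        requested_level[c, p]: level requested by client c from provider p (0 = none).
        fee[l, p]: per-query fee for level l from provider p (levels start at 1)."""
        home_dc = dist_provider_dc.argmin(axis=1)          # nearest data center per provider
        total = 0.0
        purchased = set()
        for (c, p), l in np.ndenumerate(requested_level):
            if l == 0:
                continue
            d = home_dc[p]
            if (p, d, l) not in purchased:                 # operation cost paid once per placement
                total += dist_provider_dc[p, d]
                purchased.add((p, d, l))
            total += dist_dc_client[d, c] + fee[l - 1, p]  # execution cost + per-query fee
        return total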

11.2 Experimental results


Quantifying cost reductions from Datum. Figure 11.1(a) illustrates the
cost savings Datum provides. Across levels of query complexity (number of
providers involved), Datum consistently provides > 45% savings over OptBand
and > 51% savings compared to NearestDC. Further, Datum is within 1.6% of
the optimal cost in all these cases. The improvement of Datum compared to
OptBand comes as a result of optimizing purchasing decisions at the expense
of increased bandwidth. Importantly, Figure 11.1(b) shows that the extra
bandwidth cost incurred is small, 20 − 25%. Thus, joint optimization of data
purchasing and data placement decisions leads to significant reductions in total
cost without adversely impacting bandwidth costs.

The form of client queries. To understand the sensitivity of the cost


reductions provided by Datum, we next consider the impact of parameters
related to client queries. Figure 11.1 shows that the complexity of queries has
little impact on the cost reductions of Datum. Figure 11.2 studies two other
parameters: the heaviness of the tail of the per-query purchasing fee and the
number of quality levels offered.

Across all settings, Datum is within 1.6% of optimal; however both of these
parameters have a considerable impact on the cost savings Datum provides
over our baselines. In particular, the lighter the tail of the prices of different
quality levels is, the less improvement can be achieved. This is a result of
more concentration of prices across quality levels leaving less room for opti-
mization. Similarly, fewer quality levels provides less opportunity to optimize
data purchasing decisions. At the extreme, with only one quality level available,
the opportunity to optimize data purchasing goes away and OptBand and
OptCost are equivalent.

Data purchasing vs. bandwidth costs. The most important determinant


of the magnitude of Datum’s cost savings is the relative importance of data
purchasing costs. In one extreme, if data is free, then the data purchasing
decisions disappear and the problem is simply to do data placement in a man-
ner that minimizes bandwidth costs. In the other extreme, if data purchasing
costs dominate then data placement is unimportant. In Figure 11.3 we only
compare total costs among OptCost, OptBand, and Datum. NearestDC is
far worse (more than 5 times worse than OptCost in some cases) and thus
is dropped from the plots. Figure 11.3(a) studies the impact of the relative
size of data purchasing and bandwidth costs. When the x-axis is 0, the data
purchasing and bandwidth costs of the data center are balanced. Positive val-
ues mean that bandwidth costs dominate and negative values mean that data
purchasing costs dominate. As expected, Datum’s cost savings are most dra-
matic in regimes where data purchasing costs dominate. Cost savings can be
54% in extreme settings.

Figure 11.3: Illustration of the impact of bandwidth and purchasing fees on
Datum's performance. NearestDC is excluded because its costs are off-scale.
(a) varies the ratio of bandwidth costs (summarized by α + β) to purchasing
costs (summarized by f). (b) varies the ratio of costs internal to the data
cloud (α) to costs external to the data cloud (β + f). Both panels show total
cost normalized by the optimal total cost for OptBand and Datum. Note that
in Figure 11.1 the ratios are set to log((α + β)/f) = −0.5 and log(α/(β + f)) = −1.

Data purchasing costs are expected to dominate in


the future – for some systems this is already true today. However, it is worth
noting that, in settings where bandwidth costs dominate, Datum can deviate
from the optimal cost by 10 − 20% in extreme circumstances, and can be out-
performed by the OptBand benchmark. Of course, Datum is not designed for
such settings given its prioritization of the minimization of data purchasing
costs.

Internal vs. external costs. An important aspect of the design of Datum is


the decomposition of data purchasing decisions from data placement decisions.
This provides a separation between the internal and external operations (and
costs) of Datum. Given this separation, it is important to evaluate the sensi-
tivity of Datum’s design to the relative size of internal and external costs.

Given that Datum prioritizes the optimization of external costs (optimizing


them in Step 1, see §10.2), it is natural to expect that Datum performs best
when these costs dominate. This is indeed the case, as illustrated in Figure
11.3(b). Like in Figure 11.3(a), when the x-axis is 0, the internal and exter-
nal costs are balanced. Positive values indicate the internal costs dominate
and negative values indicate the external costs dominate. In settings where
external costs dominate Datum can provide 50% cost savings and be within a
few percent of the optimal. However, in cases when internal costs dominate
Datum can deviate from the optimal cost by 10 − 30% in extreme circum-
stances, and can be outperformed by the OptBand benchmark. Note that, as
data purchasing costs grow in importance, external costs will dominate, and so
we can expect that Datum will provide near optimal performance in practical
settings.
Chapter 12

RELATED WORK

Our work focuses on the joint design of data purchasing and data placement in
a geo-distributed cloud data market. As such, it is related both to recent work
on data pricing and to geo-distributed data analytics systems. Further, the
algorithmic problem at the core of our design is the facility location problem,
and so our work builds on that literature. We discuss related work in these
three areas in the following.

Data pricing: The design of data markets has begun to attract increasing
interest in recent years, especially in the database community, see [6] for an
overview. The current literature mainly focuses on query-based pricing mecha-
nism designs [49, 51, 57] and seldom considers the operating cost of the market
service providers (i.e., the data cloud). There is also a growing body of work
related to data pricing with differentiated qualities [32, 57, 22], often motivated
by privacy. See §8.2 for more discussion. This work relates to data pricing on
the data provider side and is orthogonal to our discussion in this paper.

Geo-distributed data analytics systems: As cloud servers are increasingly


located in geo-distributed systems, analysis and optimization of data stored in
geographically distributed data centers has received increasing attention [92,
93, 73, 43]. Bandwidth constraints [92, 93] as well as latency [73] are the two
main challenges for system design, and a number of system designs have been
proposed, e.g., see §8.2 for more discussion. Our work builds on the model
of geo-distributed data analytics systems in [73, 92], but is distinct from this
literature because none of the work on geo-distributed data analytics systems
considers the costs associated with purchasing data.

Algorithms for facility location: Our data cloud cost minimization prob-
lem can be viewed as a variant of the uncapacitated facility location problem.
Though such problems have been widely studied, most of the results, espe-
cially algorithms with constant approximation ratios, require the assumption
of metric cost parameters [17, 38, 45], which is not the case in our problem.
In contrast, for the non-metric facility location problem the best known al-
gorithm is a greedy algorithm proposed in [42]. Beyond this algorithm, a
variety of heuristics have been proposed; however, none of the heuristics are
appealing for our problem because it is desirable to separate (external) data
purchasing decisions from (internal) data placement/replication decisions as
much as possible. As a result we propose a new algorithm, Datum, which is
both near-optimal in practical settings and provides the desired decomposition.
Datum may also be valuable more broadly for facility location problems.
Chapter 13

CONCLUSION

This work sits at the intersection of two recent trends: the emergence of online
data marketplaces and the emergence of geo-distributed data analytics sys-
tems. Both have received significant attention in recent years across academia
and industry, changing the way data is bought and sold and changing how
companies like Facebook run queries across geo-distributed databases. In this
paper we study the engineering challenges that come when online data market-
places are run on top of a geo-distributed data analytics infrastructure. Such
cloud data markets have the potential to be a significant disruptor (as we high-
light in §8). However, there are many unanswered economic and engineering
questions about their design. While there has been significant prior work on
economic questions, such as how to price data, the engineering questions have
been neglected to this point.

We presented the design of a geo-distributed cloud data market: Datum. Da-


tum jointly optimizes data purchasing decisions with data placement decisions
in order to minimize the overall cost. While the overall cost minimization
problem is NP-hard (via a reduction to/from the facility location problem),
Datum provides near-optimal performance (within 1.6% of optimal) in realistic
settings via a polynomial-time algorithm that is provably optimal in the case
of a data cloud running on a single data center. Additionally, Datum provides
> 45% improvement over current design proposals for geo-distributed data
analytics systems. Datum works by decomposing the total cost minimization
problem into subproblems that allow optimization of data purchasing and data
placement separately, which provides a practical route for implementation in
real systems. Further, Datum provides a unified solution across systems using
per-query pricing or bulk pricing, systems with data replication constraints
and/or regulatory constraints on data placement, and systems with SLA con-
straints on delivery.
BIBLIOGRAPHY

[1] Khalid Al-Sultan and M. Al-Fawzan. “A Tabu Search Approach to the


Uncapacitated Facility Location Problem”. In: Annals of Operations Re-
search (1999).
[2] Réka Albert, Hawoong Jeong, and Albert-László Barabási. “Internet:
Diameter of the world-wide web”. In: Nature 401.6749 (1999), pp. 130–
131.
[3] Noga Alon, Ronitt Rubinfeld, Shai Vardi, and Ning Xie. “Space-Efficient
Local Computation Algorithms”. In: Proc. 22nd ACM-SIAM Sympo-
sium on Discrete Algorithms (SODA). 2012, pp. 1132–1139.
[4] Reid Andersen et al. “Local Computation of PageRank Contributions”.
In: Internet Mathematics 5(1–2) (2008), pp. 23–45.
[5] Lachlan LH Andrew et al. “A Tale of Two Metrics: Simultaneous Bounds
on Competitiveness and Regret.” In: COLT. 2013, pp. 741–763.
[6] Magdalena Balazinska, Bill Howe, and Dan Suciu. “Data Markets in the
Cloud: An Opportunity for the Database Community”. In: Proceedings
of the VLDB Endowment (2011).
[7] Magdalena Balazinska et al. “A Discussion on Pricing Relational Data”.
In: In Search of Elegance in the Theory and Practice of Computation.
2013.
[8] John Beasley. “Lagrangean Heuristics for Location Problems”. In: Eu-
ropean Journal of Operational Research (1993).
[9] Jacques F Benders. “Partitioning procedures for solving mixed-variables
programming problems”. In: Numerische mathematik 4.1 (1962), pp. 238–
252.
[10] Dimitri P Bertsekas. Nonlinear programming. Athena scientific Belmont,
1999.
[11] Dimitris Bertsimas and John Tsitsiklis. Introduction to Linear Optimiza-
tion. 1997.
[12] Vincent D Blondel, Julien M Hendrickx, Alex Olshevsky, and John N
Tsitsiklis. “Convergence in multiagent coordination, consensus, and flock-
ing”. In: Proceedings of IEEE Conference on Decision and Control. IEEE.
2005, pp. 2996–3000.
[13] Sem Borst, Varun Gupta, and Anwar Walid. “Distributed caching al-
gorithms for content distribution networks”. In: Proceedings of IEEE
INFOCOM. IEEE. 2010, pp. 1–9.
[14] Stephen Boyd et al. “Distributed optimization and statistical learning
via the alternating direction method of multipliers”. In: Foundations and
Trends in Machine Learning 3.1 (2011), pp. 1–122.
[15] Niv Buchbinder and Joseph Naor. “The Design of Competitive Online
Algorithms via a Primal-Dual Approach”. In: Foundations and Trends
in Theoretical Computer Science 3.2-3 (2009), pp. 93–263.
[16] Yijia Cao et al. “An optimized EV charging model considering TOU
price and SOC curve”. In: IEEE Transactions on Smart Grid 3.1 (2012),
pp. 388–393.
[17] Moses Charikar, Sudipto Guha, Éva Tardos, and David Shmoys. “A
Constant-factor Approximation Algorithm for the K-median Problem
(Extended Abstract)”. In: STOC. 1999.
[18] Mung Chiang, Steven H Low, A Robert Calderbank, and John C Doyle.
“Layering as optimization decomposition: A mathematical theory of net-
work architectures”. In: Proceedings of the IEEE 95.1 (2007), pp. 255–
312.
[19] Patrick L Combettes and Jean-Christophe Pesquet. “A Douglas–Rachford
splitting approach to nonsmooth convex variational signal recovery”.
In: IEEE Journal of Selected Topics in Signal Processing 1.4 (2007),
pp. 564–574.
[20] Patrick L Combettes and Valérie R Wajs. “Signal recovery by proximal
forward-backward splitting”. In: Multiscale Modeling & Simulation 4.4
(2005), pp. 1168–1200.
[21] James Corbett et al. “Spanner: Google’s Globally Distributed Database”.
In: ACM Transactions on Computer Systems (2013).
[22] Rachel Cummings et al. “Accuracy for Sale: Aggregating Data with a
Variance Constraint”. In: ITCS. 2015.
[23] George Dantzig. Linear programming and extensions. Princeton univer-
sity press, 2016.
[24] George B Dantzig and Philip Wolfe. “Decomposition principle for linear
programs”. In: Operations research 8.1 (1960), pp. 101–111.
[25] Cynthia Dwork. “Differential Privacy”. In: Encyclopedia of Cryptography
and Security. 2011.
[26] Donald Erlenkotter. “A Dual-Based Procedure for Uncapacitated Facil-
ity Location”. In: Operations Research (1978).
[27] Tomaso Erseghe. “Distributed optimal power flow using ADMM”. In:
IEEE transactions on power systems 29.5 (2014), pp. 2370–2380.
[28] Hugh Everett III. “Generalized Lagrange multiplier method for solving
problems of optimum allocation of resources”. In: Operations research
11.3 (1963), pp. 399–417.
[29] Factual. https://www.factual.com/. 2015.
[30] Uriel Feige. “A Threshold of ln n for Approximating Set Cover”. In: J.
ACM (1998).
[31] Uriel Feige, Boaz Patt-Shamir, and Shai Vardi. On the probe complexity
of local computation algorithms. Under submission. 2017.
[32] Lisa Fleischer and Yu-Han Lyu. “Approximately Optimal Auctions for
Selling Privacy when Costs are Correlated with Data”. In: Proceedings
of the 13th ACM Conference on Electronic Commerce. 2012.
[33] Pedro A Forero, Alfonso Cano, and Georgios B Giannakis. “Consensus-
based distributed support vector machines”. In: Journal of Machine
Learning Research 11.May (2010), pp. 1663–1707.
[34] Daniel Gabay and Bertrand Mercier. “A dual algorithm for the solution
of nonlinear variational problems via finite element approximation”. In:
Computers & Mathematics with Applications 2.1 (1976), pp. 17–40.
[35] Lingwen Gan, Ufuk Topcu, and Steven Low. “Optimal decentralized pro-
tocol for electric vehicle charging”. In: IEEE Transactions on Power Sys-
tems 28.2 (2013), pp. 940–951. issn: 0885-8950. doi: 10.1109/TPWRS.2012.2210288.
[36] Diptesh Ghosh. “Neighborhood Search Heuristics for the Uncapacitated
Facility Location Problem ”. In: European Journal of Operational Re-
search (2003).
[37] Google Data Center FAQ. http://www.datacenterknowledge.com/archives/2012/05/15/google-data-center-faq/. 2012.
[38] Sudipto Guha and Samir Khuller. “Greedy Strikes Back: Improved Fa-
cility Location Algorithms”. In: Journal of Algorithms (1999).
[39] Yi Guo and Lynne E Parker. “A distributed and optimal motion plan-
ning approach for multiple mobile robots”. In: Robotics and Automation,
2002. Proceedings. ICRA’02. IEEE International Conference on. Vol. 3.
IEEE. 2002, pp. 2612–2619.
[40] Ashish Gupta et al. “Mesa: Geo-replicated, Near Real-time, Scalable
Data Warehousing”. In: Proceedings of the VLDB Endowment (2014).
[41] Elad Hazan. “Introduction to online convex optimization”. In: Foundations
and Trends in Optimization 2.3-4 (2016), pp. 157–325.
[42] Dorit Hochbaum. “Heuristics for the Fixed Cost Median Problem”. In:
Math. Program. (1982).
[43] Chien-Chun Hung, Leana Golubchik, and Minlan Yu. “Scheduling Jobs
across Geo-distributed Datacenters”. In: Proceedings of the 6th ACM
Symposium on Cloud Computing. 2015.
[44] Infochimps. http://www.infochimps.com/. 2015.
[45] Kamal Jain and Vijay Vazirani. “Approximation Algorithms for Metric
Facility Location and k-Median Problems Using the Primal-dual Schema
and Lagrangian Relaxation”. In: J. ACM (2001).
[46] Jonathan Katz and Luca Trevisan. “On the efficiency of local decod-
ing procedures for error-correcting codes”. In: Proc. 32nd Annual ACM
Symposium on the Theory of Computing (STOC). 2000, pp. 80–86.
[47] Frank P Kelly, Aman K Maulloo, and David KH Tan. “Rate control
for communication networks: shadow prices, proportional fairness and
stability”. In: Journal of the Operational Research society 49.3 (1998),
pp. 237–252.
[48] Manfred Korkel. “On the Exact Solution of Large-scale Simple Plant Lo-
cation Problems ”. In: European Journal of Operational Research (1989).
[49] Paraschos Koutris et al. “Query-based Data Pricing”. In: Proceedings of
the 31st symposium on Principles of Database Systems. 2012.
[50] Paraschos Koutris et al. “QueryMarket Demonstration: Pricing for On-
line Data Markets”. In: Proceedings of the VLDB Endowment (2012).
[51] Paraschos Koutris et al. “Toward Practical Query Pricing with Query-
Market”. In: SIGMOD. 2013.
[52] Jakob Krarup and Peter Pruzan. “The Simple Plant Location Problem:
Survey and Synthesis”. In: European Journal of Operational Research
(1983).
[53] Yoshiaki Kuwata and Jonathan P How. “Cooperative distributed robust
trajectory optimization using receding horizon MILP”. In: IEEE Trans-
actions on Control Systems Technology 19.2 (2011), pp. 423–431.
[54] Leon S Lasdon. Optimization theory for large systems. Courier Corpora-
tion, 1970.
[55] George Lee et al. “The Unified Logging Infrastructure for Data Analytics
at Twitter”. In: Proceedings of the VLDB Endowment (2012).
[56] Reut Levi, Ronitt Rubinfeld, and Anak Yodpinyanee. “Brief Announce-
ment: Local Computation Algorithms for Graphs of Non-Constant De-
grees”. In: Proceedings of the 27th ACM on Symposium on Parallelism
in Algorithms and Architectures, SPAA. 2015, pp. 59–61.
[57] Chao Li, Daniel Yang Li, Gerome Miklau, and Dan Suciu. “A Theory
of Pricing Private Data”. In: ACM Transactions on Database Systems
(2014).
[58] Na Li, Lijun Chen, and Steven H Low. “Optimal demand response based
on utility maximization in power networks”. In: Power and Energy So-
ciety General Meeting, 2011 IEEE. IEEE. 2011, pp. 1–8.
[59] Ying Liao, Huan Qi, and Weiqun Li. “Load-balanced clustering algo-
rithm with distributed self-organization for wireless sensor networks”.
In: IEEE Sensors Journal 13.5 (2013), pp. 1498–1506.
[60] Palma London, Niangjun Chen, Shai Vardi, and Adam Wierman. Dis-
tributed Optimization via Local Computation Algorithms. http://users.cms.caltech.edu/~plondon/loco.pdf. Under submission. 2017.
[61] Steven Low and David Lapsley. “Optimization flow control. I. Basic al-
gorithm and convergence”. In: IEEE/ACM Transactions on Networking
7.6 (1999), pp. 861–874. issn: 1063-6692. doi: 10.1109/90.811451.
[62] Steven H Low, Fernando Paganini, and John C Doyle. “Internet conges-
tion control”. In: IEEE control systems 22.1 (2002), pp. 28–43.
[63] Yishay Mansour, Aviad Rubinstein, Shai Vardi, and Ning Xie. “Con-
verting Online Algorithms to Local Computation Algorithms”. In: Pro-
ceedings of 39th International Colloquium on Automata, Languages and
Programming (ICALP). 2012, pp. 653–664.
[64] Laurent Massoulié and James Roberts. “Bandwidth sharing: objectives
and algorithms”. In: INFOCOM’99. Eighteenth Annual Joint Confer-
ence of the IEEE Computer and Communications Societies. Proceedings.
IEEE. Vol. 3. IEEE. 1999, pp. 1395–1403.
[65] Microsoft Azure. https://azure.microsoft.com/en-us/. 2015.
[66] Angelia Nedić and Asuman Ozdaglar. “Convergence rate for consensus
with delays”. In: Journal of Global Optimization 47.3 (2010), pp. 437–
456.
[67] Angelia Nedic and Asuman Ozdaglar. “Distributed subgradient meth-
ods for multi-agent optimization”. In: IEEE Transactions on Automatic
Control 54.1 (2009), pp. 48–61.
[68] M. E. J. Newman. “Power Laws, Pareto Distributions and Zipf’s Law”.
In: Contemporary physics (2005).
[69] Reza Olfati-Saber. “Distributed Kalman filtering for sensor networks”.
In: Decision and Control, 2007 46th IEEE Conference on. IEEE. 2007,
pp. 5492–5498.
[70] Venkata N Padmanabhan, Helen J Wang, Philip A Chou, and Kunwadee
Sripanidkulchai. “Distributing streaming media content using coopera-
tive networking”. In: Proceedings of workshop on Network and operating
systems support for digital audio and video. ACM. 2002, pp. 177–186.
[71] Daniel P Palomar and Mung Chiang. “Alternative distributed algo-
rithms for network utility maximization: Framework and applications”.
In: IEEE Transactions on Automatic Control 52.12 (2007), pp. 2254–
2269.
[72] Qiuyu Peng and Steven Low. “Distributed optimal power flow algorithm
for radial networks, I: Balanced single phase case”. In: IEEE Transac-
tions on Smart Grid (2016).
[73] Qifan Pu et al. “Low Latency Geo-distributed Data Analytics”. In: SIG-
COMM. 2015.
[74] A. Rabkin et al. “Aggregation and Degradation in JetStream: Streaming
Analytics in the Wide Area”. In: NSDI. 2014.
[75] Robin L Raffard, Claire J Tomlin, and Stephen P Boyd. “Distributed
optimization for cooperative agents: Application to formation flight”.
In: Decision and Control, 2004. CDC. 43rd IEEE Conference on. Vol. 3.
IEEE. 2004, pp. 2453–2459.
[76] Omer Reingold and Shai Vardi. “New techniques and tighter bounds
for local computation algorithms”. In: Journal of Computer and System
Science 82.7 (2016), pp. 1180–1200.
[77] Xiaoqi Ren, Palma London, Juba Ziani, and Adam Wierman. Joint Data
Purchasing and Data Placement in a Geo-Distributed Data Market. Pro-
ceedings of the 2016 ACM SIGMETRICS International Conference on
Measurement and Modeling of Computer Science. 2016.
[78] Ronitt Rubinfeld, Gil Tamir, Shai Vardi, and Ning Xie. “Fast Local
Computation Algorithms”. In: Proc. 2nd Symposium on Innovations in
Computer Science (ICS). 2011, pp. 223–238.
[79] Michael E. Saks and C. Seshadhri. “Local Monotonicity Reconstruction”.
In: SIAM Journal on Computing 39.7 (2010), pp. 2897–2926.
[80] Pedram Samadi et al. “Optimal real-time pricing algorithm based on
utility maximization for smart grid”. In: Proceedings of Smart Grid Com-
munications (SmartGridComm). IEEE. 2010, pp. 415–420.
[81] Ioannis D Schizas, Alejandro Ribeiro, and Georgios B Giannakis. “Con-
sensus in ad hoc WSNs with noisy links - Part I: Distributed estimation of
deterministic signals”. In: IEEE Trans. on Signal Processing 56.1 (2008),
pp. 350–364.
[82] Alexander Schrijver. Theory of Linear and Integer Programming. John
Wiley & Sons, 1998.
[83] Naum Zuselevich Shor. Minimization methods for non-differentiable func-
tions. Vol. 3. Springer Science & Business Media, 2012.
65
[84] Rayadurgam Srikant. The mathematics of Internet congestion control.
Springer Science & Business Media, 2012.
[85] Gabriele Steidl and Tanja Teuber. “Removing multiplicative noise by
Douglas-Rachford splitting methods”. In: Journal of Mathematical Imag-
ing and Vision 36.2 (2010), pp. 168–184.
[86] Ichiro Suzuki and Masafumi Yamashita. “Distributed anonymous mobile
robots: Formation of geometric patterns”. In: SIAM Journal on Com-
puting 28.4 (1999), pp. 1347–1363.
[87] The CAIDA UCSD AS Relationship Dataset, September 17, 2007. http://www.caida.org/data/as-relationships/.
[88] The IUPHAR/BPS Guide to Pharmacology. http://www.guidetopharmacology.org/. 2015.
[89] Dilek Tuzun and Laura Burke. “A Two-phase Tabu Search Approach to
the Location Routing Problem ”. In: European Journal of Operational
Research (1999).
[90] Vijay Vazirani. Approximation Algorithms. Springer, 2001.
[91] Visipedia Project. http://www.vision.caltech.edu/visipedia/. 2015.
[92] Ashish Vulimiri et al. “Global Analytics in the Face of Bandwidth and
Regulatory Constraints”. In: NSDI. 2015.
[93] Ashish Vulimiri et al. “WANalytics: Analytics for a Geo-distributed
Data-intensive World”. In: CIDR. 2015.
[94] Janet Wiener and Nathan Boston. Facebook’s top open data problems.
https://research.facebook.com/blog/1522692927972019/facebook-s-top-open-data-problems/. 2014.
[95] Xignite. http://www.xignite.com/. 2015.
[96] Yung Yi and Mung Chiang. “Stochastic network utility maximization -
a tribute to Kelly’s paper published in this journal a decade ago”. In:
European Transactions on Telecommunications 19.4 (2008), pp. 421–442.
[97] Jiawei Zhang. “Approximating the two-level facility location problem via
a quasi-greedy approach”. In: Mathematical Programming (2006).
Appendix A

PSEUDOCODE FOR GENERAL ONLINE FRACTIONAL PACKING

The following pseudocode is replicated from [15]. Constraints arrive in some order. During the j-th round, the dual variable y(j) and all the primal variables are increased. The minimum y(j) is found such that the primal constraints are satisfied.

Algorithm 1: General Online Fractional Packing

Input: A ∈ R^{m×n}, c ∈ R^n
Output: x, y
Initialize x = 0_n, y = 0_m
for j = 1, . . . , m do
    for i = 1, . . . , n do
        a_i(max) ← max_{k=1,...,j} {a(i, k)}
    while Σ_{i=1}^{n} a(i, j) x(i) < 1 do
        Increase y(j) continuously
        for i = 1, . . . , n do
            δ ← exp( (B / (2c(i))) Σ_{k=1}^{j} a(i, k) y(k) ) − 1
            x(i) ← max{ x(i), δ / (n · a_i(max)) }

Instead of increasing y(j) continuously, one can perform a binary search over possible values of y(j). For each candidate y(j), a corresponding new value of x is computed and the primal constraint is checked for feasibility. If it is infeasible, the candidate y(j) is increased in the next round of the search; if it is feasible, the new x is accepted and the candidate y(j) is decreased in the next round, so that the search converges to the smallest y(j) for which the constraint is satisfied.
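For concreteness, the following Python sketch (written for this appendix; the function name, the tolerance, and the upper bound y_max are assumptions, and the matrix A is indexed so that row k holds the coefficients a(·, k) of the k-th constraint) processes a single arriving constraint j using the binary-search variant just described:

    import numpy as np

    def process_constraint(A, c, x, y, j, B, tol=1e-9, y_max=1e6):
        # A: m x n (row k = coefficients a(., k) of constraint k), c: length-n costs,
        # x, y: current primal/dual iterates. B is the constant used in Algorithm 1.
        # y_max is assumed large enough that constraint j is satisfied at y(j) = y_max.
        n = A.shape[1]
        a_max = A[:j + 1, :].max(axis=0)              # a_i(max) = max_{k <= j} a(i, k)

        def primal_update(y_j):
            y_trial = y.copy()
            y_trial[j] = y_j
            # delta_i = exp(B / (2 c(i)) * sum_{k <= j} a(i, k) y(k)) - 1
            delta = np.exp(B / (2.0 * c) * (A[:j + 1, :].T @ y_trial[:j + 1])) - 1.0
            return np.maximum(x, delta / (n * a_max))

        lo, hi = y[j], y_max
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if A[j, :] @ primal_update(mid) >= 1.0:   # feasible: try a smaller y(j)
                hi = mid
            else:                                      # infeasible: increase y(j)
                lo = mid
        y[j] = hi
        return primal_update(hi), y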

A.1 ADMM
In our numerical results we compare LOCO to ADMM in the case of linear
NUM. For completeness, we describe the application of ADMM to that setting
here.

To apply ADMM, we first absorb the inequality constraint x ≤ x̄ into the inequality A′′x ≤ c′ by letting A′′ = [A, I]^T and c′ = [c, x̄]^T, where this notation indicates a stack of vectors. We introduce a slack variable s ≥ 0 such that the inequality constraint becomes A′′x + s = c′. Let x′ = [x, s]^T, A′ = [A′′ I], and b = [1_n, 0_n]^T. We can now write the problem in standard
ADMM form,

min
0
g(x0 ) + h(z)
x ,z

s.t. x0 − z = 0

where g = (x − x)+ is the indicator function associated with the constraints


x ≤ x and h(z 0 ) = −bT z where dom h = {z|A0 z = c0 }.

Writing down the scaled augmented Lagrangian L_ρ(x′, z, u) = g(x′) + h(z) + u^T(z − x′) + (ρ/2)‖x′ − z‖², we can see that all the update steps have closed form solutions (see [14, Chapter 5.2]). The updates become:

x′^{k+1} = (z^{k+1} + u^k)_+   (applied componentwise)

z^{k+1} = [ρI  A′^T; A′  0]^{−1} [ρ(x′^k − u^k) − b; c′]   (block-matrix notation; z^{k+1} is the first block of the solution)

u^{k+1} = u^k + (x′^{k+1} − z^{k+1})
The solution to the NUM problem is recovered from the first n entries of x′.
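As a concrete illustration, here is a minimal NumPy sketch of ADMM for this linear NUM formulation. It follows the standard scaled ADMM convention of [14, Chapter 5.2], so its sign and update-order conventions may differ slightly from the display above; the function and variable names are assumptions made here, and this is not the implementation used for the numerical results.

    import numpy as np

    def admm_linear_num(A, c, x_bar, rho=1.0, iters=1000):
        # Linear NUM: maximize 1^T x  subject to  A x <= c,  0 <= x <= x_bar.
        m, n = A.shape
        A2 = np.vstack([A, np.eye(n)])                  # A'': stacks A x <= c and x <= x_bar
        c2 = np.concatenate([c, x_bar])                 # c'
        Ap = np.hstack([A2, np.eye(m + n)])             # A' = [A'' I]; slack s >= 0
        b = np.concatenate([np.ones(n), np.zeros(m + n)])
        d = n + m + n                                   # dimension of x' = [x, s]

        xp = np.zeros(d)
        z = np.zeros(d)
        u = np.zeros(d)
        # KKT matrix for the equality-constrained z-update
        # (in practice it would be factored once and reused).
        K = np.block([[rho * np.eye(d), Ap.T],
                      [Ap, np.zeros((m + n, m + n))]])
        for _ in range(iters):
            xp = np.maximum(z - u, 0.0)                 # x'-update: project onto x' >= 0
            rhs = np.concatenate([rho * (xp + u) + b, c2])
            z = np.linalg.solve(K, rhs)[:d]             # z-update: first block of the KKT solution
            u = u + (xp - z)                            # scaled dual update
        return xp[:n]                                   # the rates x are the first n entries of x'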
Appendix B

PROOF OF LEMMA 3

We denote the set {0, 1, . . . , m} by [m]. Logarithms are base e. Let G = (V, E)
be a graph. For any vertex set S ⊆ V , denote by N (S) the set of vertices that
are not in S but are neighbors of some vertex in S: N(S) = (∪_{v∈S} N(v)) \ S.
The length of a path is the number of edges it contains. For a set S ⊆ V and
a function f : V → N, we use S ∩ f^{−1}(i) to denote the set {v ∈ S : f(v) = i}.

Let G = (V, E) be a graph, and let f : V → N be some function on the vertices.


An adaptive vertex exposure procedure A is one that does not know f a priori.
A is given a vertex v ∈ V and f(v), and starts with S = {v}; A then iteratively adds vertices from V \ S to S: for every vertex u that A adds to S, f(u) is revealed immediately after u is added. Let S^t denote S after the addition of the t-th vertex. The following
is a simple concentration bound whose proof is given for completeness.

Lemma 10 Let G = (V, E) be a graph, let Q > 0 be some constant, let γ = 15Q, and let f : V → [Q] be a function chosen uniformly at random from all such possible functions. Let A be an adaptive vertex exposure procedure that is given a vertex v ∈ V. Then, for any q ∈ [Q], the probability that there is some t, γ log n ≤ t ≤ n, for which |S^t ∩ f^{−1}(q)| > 2|S^t|/Q is at most 1/n^4.

Proof 11 Let v_j be the j-th vertex added to S by A, and let X_j be the indicator variable whose value is 1 iff f(v_j) = q. For any t ≤ n, E[Σ_{j=1}^{t} X_j] = t/Q. As X_i and X_j are independent for all i ≠ j, by the Chernoff bound, for γ log n ≤ t ≤ n,

Pr[ Σ_{j=1}^{t} X_j > 2t/Q ] ≤ e^{−t/(3Q)} ≤ e^{−5 log n}.

A union bound over all possible values of t : γ log n ≤ t ≤ n completes the


proof.
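As a quick numerical sanity check on this bound, the following Monte Carlo sketch (an illustration written for this appendix; the parameters and names are assumptions, and f is taken uniform over {1, . . . , Q}) tracks |S^t ∩ f^{−1}(q)| along an exposure order and reports how often the threshold 2t/Q is exceeded for some t with 15Q log n ≤ t ≤ n. Since the revealed values are i.i.d. uniform, a fixed exposure order suffices for this illustration.

    import numpy as np

    def exposure_check(n=5000, Q=5, q=1, trials=200, seed=0):
        rng = np.random.default_rng(seed)
        t_min = int(np.ceil(15 * Q * np.log(n)))
        violations = 0
        for _ in range(trials):
            f = rng.integers(1, Q + 1, size=n)      # values revealed in exposure order
            counts = np.cumsum(f == q)              # |S^t ∩ f^{-1}(q)| for t = 1, ..., n
            t = np.arange(1, n + 1)
            mask = t >= t_min
            if np.any(counts[mask] > 2 * t[mask] / Q):
                violations += 1
        return violations / trials                  # empirical probability of a violation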

Let r : V → [0, 1] be a function chosen uniformly at random from all such


possible functions. Partition [0, 1] into Q = 4(d+1) segments of equal measure,
I1 , . . . , IQ . For every v ∈ V , set f (v) = q if r(v) ∈ Iq (f is a quantization of
r).

Consider the following method of generating two sets of vertices: T and R,


where T ⊆ R. For some vertex v, set T = R = {v}. Continue inductively:
choose some vertex w ∈ T , add all N (w) to R and compute f (u) for all
u ∈ N (w). Add the vertices u such that u ∈ N (w) and f (u) ≥ f (w) to T .
The process ends when no more vertices can be added to T . T is the query set
with respect to f , hence |T | is an upper bound on the size of the actual query
set (i.e., the query set with respect to r). However, it is difficult to reason
about the size of T directly, as the ranks of its vertices are not independent.
The ranks of the vertices in R, though, are independent, as R is generated by
an adaptive vertex exposure procedure. R is a superset of T that includes T
and its boundary, hence |R| is also an upper bound on the size of the query
set.
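The following Python sketch (an illustration written for this appendix; the adjacency-list representation and names are assumptions) generates T and R for a given graph and rank function exactly as described above.

    from collections import deque

    def generate_T_and_R(adj, f, v):
        # adj: dict mapping each vertex to its list of neighbors; f: dict of ranks; v: start vertex.
        T = {v}
        R = {v}
        to_process = deque([v])
        while to_process:
            w = to_process.popleft()
            for u in adj[w]:
                R.add(u)                          # every neighbor of a vertex in T joins R
                if f[u] >= f[w] and u not in T:
                    T.add(u)                      # neighbors with rank at least f(w) join T
                    to_process.append(u)
        return T, R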

We now define Q + 1 “layers” T≤0, . . . , T≤Q: T≤q = T ∩ ∪_{i=0}^{q} f^{−1}(i). That is, T≤q is the set of vertices in T whose rank is at most q. (The range of f is [Q], hence T≤0 will be empty, but we include it to simplify the proof.)

Claim 12 Set Q = 4(d + 1), γ = 15Q. Assume without loss of generality that f(v) = 0. Then for all 0 ≤ i ≤ Q − 1,

Pr[|T≤i| ≤ 2^i γ log n ∧ |T≤i+1| ≥ 2^{i+1} γ log n] ≤ 1/n^4.

Proof 13 For all 0 ≤ i ≤ Q, let R≤i = T≤i ∪ N (T≤i ). Note that

R≤i ∩ f^{−1}(i) = T≤i ∩ f^{−1}(i),     (B.1)

because if there had been some u ∈ N(T≤i) with f(u) = i, then u would have been added
to T≤i .

Note that |T≤i| ≤ 2^i γ log n ∧ |T≤i+1| ≥ 2^{i+1} γ log n implies that

|T≤i+1 ∩ f^{−1}(i + 1)| > |T≤i+1|/2.     (B.2)

In other words, the majority of vertices v ∈ T≤i+1 must have f(v) = i + 1.

Given |T≤i+1| > 2^{i+1} γ log n, it holds that |R≤i+1| > 2^{i+1} γ log n because T≤i+1 ⊆ R≤i+1. Furthermore, R≤i+1 was constructed by an adaptive vertex exposure procedure and so the conditions of Lemma 10 hold for R≤i+1. From Equations (B.1) and (B.2) we get

Pr[|T≤i| ≤ 2^i γ log n ∧ |T≤i+1| ≥ 2^{i+1} γ log n]
≤ Pr[ |R≤i+1 ∩ f^{−1}(i + 1)| > |T≤i+1|/2 ]
≤ Pr[ |R≤i+1 ∩ f^{−1}(i + 1)| > 2|R≤i+1|/Q ]
≤ 1/n^4,

where the second inequality is because |R≤i+1| ≤ (d + 1)|T≤i+1|, as G’s degree is at most d; the last inequality is due to Lemma 10.

Lemma 14 Set Q = 4(d + 1). Let G = (V, E) be a graph with degree bounded by d, where |V| = n. For any vertex v ∈ G, Pr[|T_v| > 2^Q · 15Q log n] < 1/n^3.

Proof 15 To prove Lemma 14, we need to show that, for γ = 15Q,

Pr[|T≤Q| > 2^Q γ log n] < 1/n^3.

We show that for 0 ≤ i ≤ Q, Pr[|T≤i| > 2^i γ log n] < i/n^4, by induction. For the base of the induction, |T≤0| ≤ 1, and the claim holds. For the inductive step, assume that Pr[|T≤i| > 2^i γ log n] < i/n^4. Then, denoting by X the event |T≤i| > 2^i γ log n and by X̄ the event |T≤i| ≤ 2^i γ log n,

Pr[|T≤i+1| > 2^{i+1} γ log n]
= Pr[|T≤i+1| > 2^{i+1} γ log n | X] Pr[X]
+ Pr[|T≤i+1| > 2^{i+1} γ log n | X̄] Pr[X̄].

From the inductive step and Claim 12, using the union bound, the lemma follows.

Applying a union bound over all the vertices gives that the size of each query set is O(log n) with probability at least 1 − 1/n^2, completing the proof of Theorem 3.
Appendix C

PROOF OF THEOREM 8

To prove Theorem 8, we show a connection between the data cloud cost mini-
mization problem in (9.5) and the uncapacitated facility location problem. In
particular, we show both that the facility location problem can be reduced to
a data cloud optimization problem and vice versa.

First, we show that every instance of the uncapacitated facility location prob-
lem can be viewed as an instance of (9.5).

Take any instance of the uncapacitated facility location problem (UFLP). Let I be the set of customers, J the set of locations, α_{ij} the cost of assigning customer i to location j, and β_j the cost of opening a facility at location j. Binary variables y_j = 1 if and only if a facility is open at site j, and x_{j,i} = 1 if and only if customer i is assigned to location j. Then the UFLP can be formulated as follows.

min_{x,y}   Σ_{j∈J} β_j y_j + Σ_{i∈I, j∈J} α_{ij} x_{j,i}          (C.1)
subject to
x_{j,i} ≤ y_j,   ∀i, j
Σ_{j∈J} x_{j,i} = 1,   ∀i
x_{j,i}, y_j ∈ {0, 1},   ∀i, j
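To make the formulation concrete, here is a tiny brute-force Python solver for UFLP instances (an illustration with assumed names, written for this appendix; it is not an algorithm used in the thesis). It relies on the observation that once the set of open locations is fixed, each customer is optimally assigned to its cheapest open location.

    from itertools import combinations

    def solve_uflp(beta, alpha):
        # beta[j]: opening cost of location j; alpha[i][j]: cost of assigning customer i to j.
        # Exhaustive search over open sets, so only suitable for very small instances.
        J = range(len(beta))
        I = range(len(alpha))
        best_cost, best_open = float("inf"), None
        for k in range(1, len(beta) + 1):
            for open_set in combinations(J, k):
                cost = sum(beta[j] for j in open_set)
                cost += sum(min(alpha[i][j] for j in open_set) for i in I)
                if cost < best_cost:
                    best_cost, best_open = cost, open_set
        return best_cost, best_open

For example, solve_uflp([3, 5], [[1, 4], [6, 2]]) returns (10, (0,)): opening only location 0 costs 3 + 1 + 6 = 10, which beats opening only location 1 (cost 11) or both locations (cost 11).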

Mapping j to d and i to c yields an instance of (9.5) with |P | = |L| = 1,


f (l) = 0 and wc (l) = 0, in which case constraint (9.5c) becomes trivial.

Next, we show that every instance of (9.5) can be written as an instance of


UFLP.

We start by remarking that (9.5) (with p dropped) is equivalent to the following


ILP.
min_{x,y}   Σ_{d,l=1}^{D,L} β_d(l) y_d(l) + Σ_{d,l,c=1}^{D,L,C} (f(l) + α_{d,c}(l)) x_{d,c}(l)          (C.2)
subject to
x_{d,c}(l) ≤ y_d(l),   ∀c, l, d
Σ_{d=1}^{D} Σ_{l=1}^{L} x_{d,c}(l) = 1,   ∀c
x_{d,c}(l), y_d(l) ∈ {0, 1},   ∀c, l, d

with αd,c (l) = M , for M big enough, whenever l < wc . Indeed, in any feasible
solution of (9.5), we necessarily have xd,c (l) = 0 whenever l < wc , as each client
purchases exactly one quality level and this quality level has to be at least
the minimum required level wc ; by setting αd,c (l) big enough, we ensure that
any optimal solution must have xd,c (l) = 0 thus must be feasible for (9.5), and
has the same cost as in (9.5). Now, take J = [D] × [L] and I = [C], and the
problem can be rewritten as
min_{x,y}   Σ_{(d,l)∈J} β_d(l) y_d(l) + Σ_{(d,l)∈J, c∈I} (f(l) + α_{d,c}(l)) x_{d,c}(l)          (C.3)
subject to   x_{d,c}(l) ≤ y_d(l),   ∀(d, l) ∈ J, c ∈ I
Σ_{(d,l)∈J} x_{d,c}(l) = 1,   ∀c ∈ I
x_{d,c}(l), y_d(l) ∈ {0, 1},   ∀c ∈ I, (d, l) ∈ J

which is an UFLP.

C.1 Proof of Theorem 9


Assume without loss of generality that all clients can be satisfied by the highest
quality level, i.e., wc ≤ q(L), ∀c. Define Ci = {c : q(i − 1) < wc ≤ q(i)}
(q(0) = 0 by default). Given these assumptions, clients can be grouped into
L categories {C1 , C2 , . . . , CL } based on their minimum quality level. Note
that Ci ∩ Cj = ∅, ∀i, j and ∪Li=1 Ci = C. Without loss of generality, assume
Ci 6= ∅, ∀i.

As the clients in the same group Ci all face exactly the same choice of quality
levels and minimum quality requirements, there must always be an optimal so-
lution in which the data purchasing decisions of any clients within one category
are the same.
Let us denote the number of clients in category C_i by S_i, and denote the purchasing decision of category C_i by χ_i, e.g., χ_i(l) = x_c(l), ∀l, c ∈ C_i. Similar to the argument in the proof of Theorem 8, we can reformulate (10.2) as follows. Note the slight abuse of notation: clients and their associated required quality level are represented by the same letter, i, since clients in category C_i have minimum quality level i by definition.
minimize    Σ_{l=1}^{L} β(l) y(l) + Σ_{i=1}^{L} Σ_{l=i}^{L} S_i f(l) χ_i(l)          (C.4)
subject to  χ_i(l) ≤ y(l),   ∀i, l          (C.4a)
            Σ_{l=i}^{L} χ_i(l) = 1,   ∀i          (C.4b)
            χ_i(l) ≥ 0,   ∀i, l          (C.4c)
            y(l) ≥ 0,   ∀l          (C.4d)
            χ_i(l), y(l) ∈ {0, 1},   ∀i, l          (C.4e)

Consider the linear relaxation of (C.4), which drops the 0–1 integer constraint (C.4e). For any optimal solution {χ_i^r(l), y^r(l)} of the linear relaxation we have the following observations.

1. χ_L^r(L) = 1.

Proof 16 From (C.4b), let i = L; then χ_L^r(L) = 1. The intuition behind this is that, since C_L ≠ ∅, the highest quality data always has to be purchased to provide service for the clients in C_L.

2. y^r(l) = max_i{χ_i^r(l)} ∈ [0, 1] and y^r(L) = 1.

Proof 17 From (C.4a), the non-negativity of {β(l)}, and the optimality of {y^r(l)}, y^r(l) = max_i{χ_i^r(l)}. From the non-negativity of {χ_i^r(l)}, y^r(l) = max_i{χ_i^r(l)} ≤ Σ_{l=i}^{L} χ_i^r(l) = 1, and y^r(L) = χ_L^r(L) = 1.

3. ∀l ≥ i, if Σ_{k=i}^{l} y^r(k) ≤ 1, then χ_i^r(l) = y^r(l); otherwise, χ_i^r(l) = max{1 − Σ_{k=i}^{l−1} y^r(k), 0}.

Proof 18 For some fixed i, {S_i f(l)} is a positive, strictly increasing sequence as l increases. From constraints (C.4a) and (C.4b), χ_i^r(l) ≤ y^r(l) and Σ_{l=i}^{L} χ_i^r(l) = 1. Since {χ_i^r(l), y^r(l)} is optimal, ∀l ≥ i, if Σ_{k=i}^{l} y^r(k) ≤ 1, then χ_i^r(l) = y^r(l); otherwise, χ_i^r(l) = max{1 − Σ_{k=i}^{l−1} y^r(k), 0}.

Next, define m_i ∈ {i, . . . , L} such that Σ_{l=i}^{m_i−1} y^r(l) < 1 and Σ_{l=i}^{m_i} y^r(l) ≥ 1. Such an m_i must exist since y^r(l) ≥ 0 for all l and y^r(L) = 1. Recall that χ_L^r(L) = y^r(L) = 1. For any i = 1, 2, . . . , L − 1, if the values of {y^r(l)} are given, the optimal {χ_i^r(l)} satisfy the following closed form expression:

χ_i^r(l) = { y^r(l),                       i ≤ l < m_i,
           { 1 − Σ_{k=i}^{m_i−1} y^r(k),   l = m_i,          (C.5)
           { 0,                            m_i < l ≤ L.

Note that, if y^r is binary, then χ^r is binary. Suppose there exists an optimal solution {χ^r, y^r} with y^r ∉ {0, 1}^L. In the following we show that there exists a feasible binary solution {χ^∗, y^∗} of (C.4) such that the objective value generated by {χ^∗, y^∗} is better than or equal to that of {χ^r, y^r}.

Suppose the fractional solution y^r is an optimal solution of the linear relaxation, and calculate m_i as in (C.5). Write χ as a function of y, ∀i, l:

χ_i(l) = { y(l),                     i ≤ l < m_i,
         { 1 − Σ_{k=i}^{m_i−1} y(k),  l = m_i,          (C.6)
         { 0,                         m_i < l ≤ L.

Substituting (C.6) in the objective function (C.4), the objective function be-
comes a linear combination of {y(l)} that we denote L(y).
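This substitution can be made concrete with the following Python sketch (an illustration with assumed names and 0-based level indices, written for this appendix), which computes m_i and χ_i(l) from a given y as in (C.6); it assumes y[L−1] = 1, as guaranteed by the constraint y(L) = 1.

    import numpy as np

    def chi_from_y(y):
        y = np.asarray(y, dtype=float)
        L = len(y)
        chi = np.zeros((L, L))
        for i in range(L):
            cum = 0.0
            m_i = L - 1
            for l in range(i, L):
                if cum + y[l] >= 1.0:            # first level where the running sum reaches 1
                    m_i = l
                    break
                cum += y[l]
            chi[i, i:m_i] = y[i:m_i]             # chi_i(l) = y(l) for i <= l < m_i
            chi[i, m_i] = 1.0 - cum              # chi_i(m_i) = 1 - sum_{k=i}^{m_i - 1} y(k)
            # chi_i(l) = 0 for l > m_i
        return chi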

Consider the optimization problem in which {χi (l)} is expressed as a function


of {y(l)} in the linear relaxation:

minimize    L(y)          (C.7)
subject to  Σ_{l=i}^{m_i−1} y(l) ≤ 1,   ∀i = 1, . . . , L − 1
            Σ_{l=i}^{m_i} y(l) ≥ 1,     ∀i = 1, . . . , L − 1
            y(l) ≥ 0,                    ∀l = 1, . . . , L
            y(L) = 1

The following claims hold:


1. (C.7) is feasible and bounded, and always has an optimal solution at an
extreme point.

Proof 19 Clearly, ∀l, y(l) ∈ [0, 1]. And starting from y(L), it is easy to
construct a feasible solution of (C.7). Thus, (C.7) is feasible and bounded,
and always has an optimal solution at an extreme point.

2. {y r (l)} is a feasible solution of (C.7).

Proof 20 Since {y r (l)} is feasible for (C.4), y r (l) ≥ 0, ∀l, and y r (L) = 1.
By definition of m_i, Σ_{l=i}^{m_i−1} y^r(l) ≤ 1 and Σ_{l=i}^{m_i} y^r(l) ≥ 1.

3. Any extreme point {y(l)} of (C.7) is binary.

Proof 21 Since y(L) = 1, we can drop y(L) and write (C.7) in the following standard linear programming form:

min_y   L(y)          (C.8)
s.t.    Ay ≤ b
        y ≥ 0

Note that all entries of A are 0 or ±1, and every row of A has either consecutive 1s or consecutive −1s. Thus, from [82], A is a totally unimodular matrix, and hence the extreme points of (C.8) are all integral. In particular, since all y(l) ∈ [0, 1], the extreme points of (C.8) are all binary.

4. The {χ∗i (l)} obtained through (C.6) corresponding to an optimal binary


solution {y ∗ } is also binary.

Proof 22 Follows immediately from (C.6) and integrality of {y ∗ (l)}.

5. {χ∗i (l), y ∗ (l)} is a feasible solution of the linear relaxation of (C.4).


PL
Proof 23 Follows from (C.6) and l=i χ∗i (l) = 1

{χ_i^r(l), y^r(l)} and any optimal extreme point {χ_i^∗(l), y^∗(l)} of (C.7) see their corresponding objective values unchanged between (C.7) and the relaxation of (C.4), by construction of the χ_i(l)'s. Moreover, any such extremal and optimal {χ_i^∗(l), y^∗(l)} has a better or equal objective value compared to {χ_i^r(l), y^r(l)} in the relaxed (C.4). Since {χ_i^r(l), y^r(l)} is optimal for the relaxed (C.4), it follows that any optimal extreme point of (C.7) yields a binary and optimal solution for the relaxed (C.4), and hence for (C.4) itself. This provides a polynomial time algorithm to find such a binary optimal solution, which can be summarized as in §10.2.

C.2 Proof of Step 2 in §10.2


In this section we derive the closed form solutions of (10.7) for the optimization
in (10.6). We start by discussing the form of xv,c (l). Consider the following
two cases based on the value of Y (l).

1. For any quality level l′, if Y(l′) = 0, then Σ_{v=1}^{V} y_v(l′) = Y(l′) = 0. From the non-negativity of y_v(l′), ∀v, y_v(l′) = 0. Further, ∀v, c, x_{v,c}(l′) = 0 from (10.6a).
2. For any quality level l′, if Y(l′) = 1, then from the definition of y_v(l) and Y(l), ∃! v′ ∈ V such that y_{v′}(l′) = Y(l′) = 1. Recall that C(l′) = {c : X_c(l′) = 1} represents the set of clients that are assigned data with quality level l′ by Step 1 in §10.2.

a) For a client c′ ∈ C(l′), X_{c′}(l′) = 1. Since v′ is the unique data center across V such that y_{v′}(l′) = 1, from (10.6a) and (10.6b), x_{v′,c′}(l′) = 1 and x_{v,c′}(l) = 0 for all v ≠ v′ or l ≠ l′. In other words, x_{v,c}(l′) = y_v(l′), ∀v ∈ V, c ∈ C(l′).
b) For a client c ∉ C(l′), X_c(l′) = 0. From the definition of X_c(l′), x_{v,c}(l′) = 0, ∀v.

In all of the above cases, the optimal solution {x_{v,c}(l), y_v(l)} of (10.6) satisfies the following:

x_{v,c}(l) = { y_v(l),   if c ∈ C(l),
             { 0,        otherwise.          (C.11)

Next, we use this form for x_{v,c}(l) to derive y_v(l). After substituting (C.11) into (10.6), most constraints become trivial due to the form of (C.11) and the optimality of X_c(l) and Y(l), and we only need to optimize the objective function with the constraints stating that y_v(l) is binary and Σ_v y_v(l) = Y(l).
Thus, we only need to optimize the following problem.

minimize    Σ_{l:Y(l)=1} Σ_{v=1}^{V} β_v(l) y_v(l) + Σ_{l:Y(l)=1} Σ_{c∈C(l)} Σ_{v=1}^{V} (α_{v,c}(l) + f(l)) y_v(l)
subject to  Σ_{v=1}^{V} y_v(l) = Y(l),   ∀l
            y_v(l) ∈ {0, 1},             ∀v, l

The above optimization can be decoupled by l and optimized across v, yielding the following closed form solution:

y_v(l) = { 1,   if Y(l) = 1 and v = argmin{ β_v(l) + Σ_{c∈C(l)} α_{v,c}(l) },
         { 0,   otherwise.          (C.11)
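For illustration, a direct Python implementation of this closed form (the array shapes, names, and 0-based indices are assumptions made here) is:

    import numpy as np

    def place_data(beta, alpha, Y, C_of_l):
        # beta: V x L array of operating costs; alpha: V x C x L array of execution costs;
        # Y: length-L 0/1 vector from Step 1; C_of_l[l]: list of clients assigned level l by Step 1.
        V, L = beta.shape
        y = np.zeros((V, L), dtype=int)
        for l in range(L):
            if Y[l] != 1:
                continue
            # total cost of serving all clients in C(l) from data center v
            costs = [beta[v, l] + sum(alpha[v, c, l] for c in C_of_l[l]) for v in range(V)]
            y[int(np.argmin(costs)), l] = 1       # open level l at the cheapest data center
        return y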

C.3 Bulk Data Contracting


In bulk data contracting, the data cloud only has to pay a one-time fee f (l, p)
for data q(l, p), no matter how many times the data is replicated on the cloud
and transferred to clients. Compared to per-query contracting, the main differ-
ence lies in the purchasing fees modeling. Defining z(l, p) ∈ {0, 1} to be equal
to 1 if and only if data of quality q(l, p) from data provider p is transferred to
the data cloud, the whole optimization problem can still be formulated in a
form similar to (9.5), with the purchasing costs now given by (9.4) and with
the addition of the following constraint:

yp,d (l) ≤ z(l, p), ∀c, l, p, d (C.12)

This constraint states that any data placed in the data cloud must have been
purchased by the data cloud. As in the per-query contracting case, the data
purchasing/placement decision for data from one data provider does not im-
pact the data purchasing/placement decision for any other data providers.
Thus, we drop the index p in the following.

In general, the cost minimization problem for bulk contracting is NP-hard. To be specific, the 1-level UFLP can be reduced to the cost minimization problem for a geo-distributed data cloud, and the cost minimization problem can be reduced to the 2-level UFLP in the bulk case. In the 2-level UFLP, facilities are organized on 2 levels, J1 × J2; each customer i ∈ I has to be assigned to a valid path p ∈ J1 × J2. A path is valid if and only if both facilities along the path are open. More details on the 2-level UFLP can be found in [97].

The first reduction follows directly from the first part of the proof of Theorem 8. The second can be proved by defining the facilities in J1 to be the quality levels, and using the same reformulation as in the second part of the proof of Theorem 8 for the facilities in J2, i.e., defining the facilities in J2 to be pairs of quality levels and data centers. In the reduction, a facility j1 ∈ J1 is open if and only if the corresponding quality level l is purchased, and a facility j2 ∈ J2 is open if and only if data of quality level l is placed in data center d.

While the cost minimization in bulk contracting is generally hard, it can be


solved optimally in both the single data center and the geo-distributed data
cloud settings under certain assumptions.

For the single data center case, we always have z(l) = y(l) for all quality levels l; this follows immediately from dropping the dependence of y_d(l) on d, which implies that z(l) is only lower-bounded by y(l) in the constraints. Furthermore, if the execution costs are the same across quality levels, the cost minimization problem can be formulated as follows:
minimize    Σ_{l=1}^{L} (β(l) + f(l)) y(l)          (C.13)
subject to  x_c(l) ≤ y(l),             ∀c, l
            Σ_{l=w_c}^{L} x_c(l) = 1,   ∀c
            x_c(l) ≥ 0,                 ∀c, l
            y(l) ≥ 0,                   ∀l
            x_c(l), y(l) ∈ {0, 1},      ∀c, l

Since the decisions for variables {xc (l)} do not affect the objective value, (C.13)
can be written as follows:
minimize    Σ_{l=1}^{L} (β(l) + f(l)) y(l)          (C.14)
subject to  Σ_{l=w_c}^{L} y(l) ≥ 1,   ∀c
            y(l) ∈ {0, 1},             ∀l

Since there are customers that can only be served by the highest quality level, the highest quality level L is always purchased by the data cloud and y(L) = 1 in any feasible solution. Since all customers are satisfied and all costs are non-negative, an optimal solution for (C.14) is y(L) = z(L) = 1 and x_c(L) = 1 for all c, with all other variables set to 0. The result implies that the data cloud will only purchase the highest quality level of data and serve that data to every customer.

For a geo-distributed data cloud, the cost minimization problem is generally


hard. However, if we assume the operation cost and execution cost are inde-
pendent of l, i.e., βd (l) = βd and αd,c (l) = αd,c , it is easy to show that the
optimal solution will only purchase the highest quality data as in the single
data center case. We can then use Step 2 in §10.2 to give an optimal solution
to the data placement problem.