Distributed Optimization and Data Market Design
Thesis by
Palma London
2017
Submitted May 19, 2017
© 2017
Palma London
All rights reserved
ACKNOWLEDGEMENTS
Palma London, Niangjun Chen, Shai Vardi, and Adam Wierman. Distributed
Optimization via Local Computation Algorithms. http://users.cms.caltech.edu/~plondon/loco.pdf. Under submission. 2017.
P. London and S. Vardi came up with the results and proofs in this paper, and
P. London coded and ran all experiments. Article adapted and extended for
this thesis.
Xiaoqi Ren, Palma London, Juba Ziani, and Adam Wierman. Joint Data Pur-
chasing and Data Placement in a Geo-Distributed Data Market. Proceedings
of the 2016 ACM SIGMETRICS International Conference on Measurement
and Modeling of Computer Science. 2016.
X. Ren and P. London came up with the results and proofs in this paper, and
P. London coded and ran all experiments. Article adapted and extended for
this thesis.
TABLE OF CONTENTS
Acknowledgements
Abstract
Table of Contents
Chapter 1
INTRODUCTION
Chapter 2
While the algorithms described above are distributed, they are not local. A
local algorithm is one where a query about a small part of a solution to a
problem can be answered by communicating with only a small neighborhood
around the part queried¹ (see Subsection 2.2 for a more comprehensive definition and example). Clearly, neither iterative descent methods nor consensus methods are local: answering a query about a piece of the solution requires global communication.

¹ 'Local' is an overloaded term in the literature. We mean local in the sense of [78].
Despite the benefits of local algorithms for distributed optimization, the prob-
lem of designing a local, distributed optimization algorithm is open.
The key idea behind LOCO is an extension of recent results from the emerging
field of local computation algorithms (LCA) in theoretical computer science
(e.g., [63, 76, 56]). In particular, a key insight of the field is that online
algorithms can be converted into local algorithms in graph problems with
bounded degree [63]. However, much of the focus of local algorithms has,
to this point, been on graph problems (see related literature below). The
technical contribution of this work is the extension of these ideas to convex
programs.
The main idea of LCAs is to compute a piece of the solution to some algorith-
mic problem using only information that is close to that piece of the problem,
as opposed to a global solution, by exchanging information across distributed
agents. More concretely, an LCA receives a query and is expected to output
the part of the solution associated with that query. For example, an LCA for
maximal matching would receive as a query an edge, and its output would
be “yes/no”, corresponding to whether or not the edge is part of the required
matching. The two requirements are (i) the replies to all queries are consistent
with the same solution, and (ii) the reply to each query is “efficient”, for some
natural notion of efficient.
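To make the query model concrete, the following is a minimal sketch (ours, not part of any cited implementation) of an LCA-style oracle for greedy maximal matching: every edge receives a random rank, and an edge is in the matching exactly when no lower-ranked adjacent edge is in the matching, a condition that can be checked by recursing only through the queried edge's neighborhood.

```python
import random

def make_matching_oracle(adjacent_edges, seed=0):
    """LCA-style oracle for greedy maximal matching (illustrative sketch).

    adjacent_edges: dict mapping each edge id to the list of edge ids
    that share an endpoint with it. Edges are implicitly processed in
    increasing order of a random rank; an edge joins the matching iff
    no lower-ranked adjacent edge is already in the matching.
    """
    rng = random.Random(seed)
    rank = {e: rng.random() for e in adjacent_edges}
    cache = {}

    def in_matching(e):
        if e in cache:
            return cache[e]
        # e is matched unless some lower-ranked neighbor is matched
        answer = all(not in_matching(u)
                     for u in adjacent_edges[e] if rank[u] < rank[e])
        cache[e] = answer
        return answer

    return in_matching

# Toy example on a path a-b-c-d with edges 0=(a,b), 1=(b,c), 2=(c,d).
oracle = make_matching_oracle({0: [1], 1: [0, 2], 2: [1]})
print([oracle(e) for e in (0, 1, 2)])
```

Because all answers are derived from the same fixed ranking, replies to different queries are consistent with a single maximal matching, which is exactly requirement (i) above.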
Most of the work on LCAs has focused on graph problems such as matching,
maximal independent set, and coloring (e.g., [3, 56, 76, 31]) and the efficiency
criteria were the number of probes to the graph, the running time and the
amount of memory required. This paper extends the LCA literature by mov-
ing from graph problems to optimization problems, which have not been stud-
ied in the LCA community previously. Mansour et al. [63] showed a general
reduction from LCAs to online algorithms on graphs with bounded degree.
The key technical contribution of our work is extending that technique to design LCAs for convex programs. In contrast to previous work, whose primary focus was probe, time, and space complexities, the efficiency criterion we use is the number of messages required, as this is usually the expensive resource in networked control.
Chapter 3
3.1 Model
The NUM framework considers a network containing a set of links L =
{1, . . . , m} of capacity cj , for j ∈ L. A set N = {1, . . . , n} of sources shares
the network; source i ∈ N is characterized by (Li , fi , xi , x̄i ): a path Li ⊆ L
in the network; a (usually) concave utility function fi : R+ → R; and the
minimum and maximum transmission rates of i.
The goal in NUM is to maximize the sources’ aggregate utility. Source i attains
a concave utility fi (xi ) when it transmits at rate xi that satisfies xi ≤ xi ≤ x̄i ;
the optimization of aggregate utility can be formulated as follows:

$\max_{x} \; \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad Ax \le c, \quad \underline{x} \le x \le \bar{x},$

where $A \in \mathbb{R}_{+}^{m \times n}$ is defined by $A_{ji} = 1$ if $j \in L(i)$ and $A_{ji} = 0$ otherwise.
The NUM framework is general in that the choice of fi allows for the rep-
resentation of different goals of the network operator. For example, using
fi (xi ) = xi maximizes throughput; setting fi (xi ) = log(xi ) achieves propor-
tional fairness among the sources; setting fi (xi ) = −1/xi minimizes potential
delay; these are common goals in communication network applications [61, 64].
In this paper we focus on the throughput maximization case, i.e., fi (xi ) = xi ; in
this case NUM is an LP. Note that the classical dual decomposition approach
does not work for throughput maximization since it requires the objective
function to be strictly concave. However, ADMM can be applied.
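For concreteness, the sketch below builds a toy throughput-maximization instance and solves it with an off-the-shelf, centralized LP solver; the routing matrix, capacities, and bounds are made-up data, and the snippet also computes the row/column sparsity of the constraint matrix, defined formally in the next paragraph. It is a reference point only, not the distributed algorithm developed in this thesis.

```python
import numpy as np
from scipy.optimize import linprog

# Toy instance: m = 3 links, n = 3 sources (hypothetical data).
# A[j, i] = 1 if source i uses link j.
A = np.array([[1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]], dtype=float)
cap = np.array([1.0, 2.0, 1.5])        # link capacities
x_lo, x_hi = 0.0, 10.0                  # per-source rate bounds

# Throughput maximization: max sum_i x_i  <=>  min -sum_i x_i.
res = linprog(c=-np.ones(A.shape[1]),
              A_ub=A, b_ub=cap,
              bounds=[(x_lo, x_hi)] * A.shape[1],
              method="highs")
print("rates:", res.x, "throughput:", -res.fun)

# Row/column sparsity of A: max nonzeros in any row (alpha) and column (beta).
alpha = int((A != 0).sum(axis=1).max())
beta = int((A != 0).sum(axis=0).max())
print("sparsity:", max(alpha, beta))
```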
Our complexity results hinge on the assumption that the constraint matrix A
is sparse. The sparsity of A is defined as max{α, β}, where α and β denote the
maximum number of non-zero entries in a row and column of A respectively.
Formally, we say that A is sparse if the sparsity of A is bounded by a constant.
This assumption usually holds in network control applications since α is the
maximum number of sources sharing a link, which is typically small compared
to n, and β is the maximum number of links each source uses, which is typically
small compared to m.¹

¹ When α is large, many links will be congested and all sources will experience greater delay, so the routing protocol (IP) will start using different links; also, due to the small diameter of the Internet graph [2], β is small compared to m.
Figure 3.1: An illustration of LOCO on a toy graph with five nodes and
four edges, e1 , . . . , e4 . There are three sources, s1 , s2 , s3 , with paths ending in
destinations t1 , t2 , t3 respectively. The graph is depicted in (a); the constraint
matrix for NUM is given in (b); the bipartite graph representation of the
matrix in (c); and the dependency graph in (d). The rank of each constraint
(edge) is written in the node representing the constraint in the dependency
graph. The shaded nodes represent the query set for source s1 .
We evaluate algorithms using two metrics. The first is message complexity: the number of messages that are sent across links of the network in order to compute the solution. When the algorithm uses randomization, we want the message complexity bound to hold with probability at least $1 - \frac{1}{n^{\alpha}}$, where $n$ is the number of vertices in the network and $\alpha > 0$ can be an arbitrarily large constant. We denote this by $1 - \frac{1}{\mathrm{poly}(n)}$. We do not bound the size of the messages, but note that in both our algorithm and ADMM the message length will be of order $O(\log n)$.
The second is the approximation ratio, which measures the quality of the solution provided by the algorithm. Specifically, an algorithm is said to α-approximate a maximization problem if its solution is guaranteed to be at least $\mathrm{OPT}/\alpha$, where $\mathrm{OPT}$ is the value of the optimal solution. If the algorithm is randomized, the approximation ratio is with respect to the expected size of the solution. We will compare the performance of LOCO with iterative algorithms such as ADMM, for which the approximation ratio is not a standard measure. Thus, in our empirical results, comparison with the optimal solution is made using relative error, defined in Section 5.1, which is related to, but slightly different from, the approximation ratio.
Chapter 4
Concretely, there are two main steps in LOCO. In the first, LOCO generates
a localized neighborhood for each vertex. In the second, LOCO simulates an
online algorithm on the localized neighborhood. Importantly, the first step
is independent of the precise nature of the online algorithm, and the second
is independent of the method used to generate the localized neighborhoods.
Therefore, we can think of LOCO as a general methodology that can yield a
variety of algorithms. For example, we can use different online algorithms for
the second step of LOCO depending on whether we consider a linear NUM
problem or a strictly convex NUM problem. More specifically, the two steps
work as follows.
Step 1c, Constructing the query set In order to build the query set, we
generate a random ranking function on the vertices of H, r : V → [0, 1]. Given
the dependency graph H, an initial node y ∈ V and the ranking function r,
we build the query set of y, denoted S(y), using a variation of BFS, as follows.
Initialize S(y) to contain y. For every vertex v ∈ S(y), scan all of v’s neighbors,
denoted N (v). For each u ∈ N (v), if r(u) ≤ r(v), add u to S(y). Continue
iteratively until no more vertices can be added to S(y) (that is, for every vertex v ∈ S(y), all of its neighbors that are not themselves in S(y) have higher rank than v). If there are ties (i.e., two neighbors u, v such that r(u) = r(v)), we tie-break by ID.³
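The construction above is essentially a rank-constrained breadth-first search. A minimal sketch, assuming the dependency graph H is given as an adjacency list and using illustrative function and variable names of our own:

```python
import random

def build_query_set(H, y, r):
    """Build the query set S(y) for initial node y in dependency graph H.

    H: dict mapping each vertex to the list of its neighbors.
    r: dict mapping each vertex to its random rank in [0, 1].
    A neighbor u of a vertex v already in S(y) is added whenever
    r(u) <= r(v), with ties broken by vertex ID, so the set grows
    only toward lower-ranked constraints.
    """
    S = {y}
    frontier = [y]
    while frontier:
        v = frontier.pop()
        for u in H[v]:
            if u in S:
                continue
            # add u if its rank is no larger, tie-breaking by ID
            if (r[u], u) <= (r[v], v):
                S.add(u)
                frontier.append(u)
    return S

# Toy dependency graph on four constraints (hypothetical data).
H = {"e1": ["e2"], "e2": ["e1", "e3"], "e3": ["e2", "e4"], "e4": ["e3"]}
r = {v: random.random() for v in H}
print(build_query_set(H, "e2", r))
```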
In order to compute its own value in the solution, source i applies r to the
set of constraints in which it is contained, Y (i). For y = arg maxz∈Y (i) {r(z)},
it simulates the online algorithm on S(y). That is, it executes the online
algorithm on the neighborhood constructed in Step 1 for the “last arriving” constraint that contains i. Source i's value is its value at the end of the simulation. Claim 4 below shows that i's value is identical to its value when the online algorithm is executed on the entire program, with the constraints arriving in the order defined by r.

³ Any consistent tie-breaking rule suffices.
In particular, we have the following result, for NUM with a linear objective
function.
The approximation ratio in Theorem 2 comes from the online algorithm pre-
sented and analyzed in [15] (see Lemma 6). The analysis of the online al-
gorithm is for adversarial input; therefore it is natural to expect LOCO to
achieve a much better approximation ratio in practice, as LOCO randomizes the order in which the constraints “arrive”. It is an open question to give better theoretical bounds for stochastic inputs, and if such results are obtained they would immediately improve the bounds in Theorem 2.

⁴ See footnote 1.
The core technical lemma required for the proof of Theorem 1 is the following.
Claim 4 For any source i, the value of xi in the output of LOCO is identical
to its value in the output of the online algorithm.
Additionally, while iterative descent and consensus style approaches are global,
LOCO is local. Under LOCO, communication stays within the query set and so
the computation only needs to be updated if changes happen within the query
set. This means that LOCO is robust to churn, failures, and communication
problems outside of that set of nodes.
Another important difference is that LOCO does not compute the optimal so-
lution, while iterative descent and consensus style approaches will eventually
converge to the true optimal. The proven analytical bounds for LOCO are
based on worst-case adversarial input. We show in Section 5.2 that our em-
pirical results outperform the theoretical guarantees by a considerable margin.
This is in part because the ranking is done randomly rather than in an adver-
sarial fashion (we elaborate on this in Section 5).
Finally, note that there is an important difference in the form of the theoretical
guarantees for LOCO and iterative descent and consensus style algorithms.
LOCO has guarantees in terms of the approximation ratio, while iterative
descent and consensus style algorithms have convergence rate guarantees. For
example, ADMM has guarantees on convergence of the norms of the primal
and dual residuals [14, Chapter 3.3].
Chapter 5
CASE STUDY
For our second set of experiments, we use the real network from the graph
of Autonomous System (AS) relationships in [87]. The graph has 8020 nodes
and 36406 edges. In order to interpret the graph in a NUM framework, we
associate each source with a path of links, ending at a destination node. To
do this, for each node i in the graph, we randomly select a destination node ti
which is at distance `i , sampled i.i.d. from Unif[` − 0.5`, ` + 0.5`]. We repeat
this for several values of `. (The distance between two nodes is the length of the shortest path between them.)
[Figure: number of messages required by ADMM and LOCO (total, average, and maximum) as n and p vary.]
Then, we designate the path L(i) to be the set of links comprising the shortest path between the source and the destination. The vectors c, x, and x̄ are chosen in the same manner as for the synthetic networks.

¹ Note that this matrix does not have constant sparsity; however, this can only increase the message complexity. Regardless, it is possible to adapt the theoretical results to hold for this data as well, using techniques from [76].
² For the purposes of our simulations, such a permutation can be efficiently sampled, and guarantees perfect randomness. For larger n and m, it is possible to use pseudo-randomness with almost no loss in message complexity [76].
Algorithm tuning
Our results focus on comparing LOCO and ADMM. Running ADMM requires
tuning four parameters [14]. Unless otherwise specified, we set the relative and absolute tolerances to be ε_rel = 10⁻⁴ and ε_abs = 10⁻², the penalty parameter to be ρ = 1, and the maximum number of allowed iterations to be t_max = 10000.
This is done to provide the best performance for ADMM: the parameters are
tuned in the typical fashion to optimize ADMM [14]. Running LOCO requires
tuning only one parameter: B, which governs the worst-case guarantee for
the online algorithm used in step 2. A smaller B gives a “better guarantee”,
however some constraints may be violated. Setting B = 2 ln(1 + m) provides
the best worst-case guarantee, and is our choice in the experiments unless
stated otherwise. In fact, it is possible to tune B (akin to tuning ADMM)
to specific data, as the constraints are often still satisfied for smaller B. In
Figure 5.3 (c), we show the improvement in performance guarantee by tuning
B, while keeping the dual solution feasible.
Metrics
For our numeric results, we evaluate ADMM and LOCO with respect to the
quality of the solution provided and the number of messages sent.
Figure 5.3: Comparison of the relative error and the number of messages
required by LOCO and ADMM. Plots (a) and (b) show the Pareto optimal
curve for ADMM with a range of relative tolerances ε_rel ∈ {10⁻⁴, 10⁻¹}. Plot (c) depicts how tuning B affects the relative error. The rightmost point corresponds to B = 2 ln(1 + m).
To assess the quality of the solution we measure the relative error, which is defined as $\frac{|p^* - p^{\mathrm{LOCO}}|}{|p^*|}$, where $p^*$ is the optimal solution. For problem instances of small dimension, one can run an interior point method to check the optimal solution, but this becomes too costly for large problem sizes. In the large dimension cases we consider, we regard $p^*$ to be ADMM's solution with small tolerances, such that the maximum number of allowed iterations is never needed. Note that the relative error is an empirical, normalized version of the approximation ratio for a given instance.
CONCLUDING REMARKS
We view this work as a first step toward the investigation of local computation
algorithms for distributed optimization. In future work, we intend to continue
to investigate the performance of LOCO in more general network optimization
problems. Further, it would be interesting to apply other techniques from the
field of local computation algorithms to develop algorithms for other settings
in which distributed computing is useful, such as power systems and machine
learning.
Part III
Chapter 7
Ten years ago computing infrastructure was a commodity – the key bottleneck
for new tech startups was the cost of acquiring and scaling computational
power as they grew. Now, computing power and memory are services that
can be cheaply subscribed to and scaled as needed via cloud providers like
Amazon EC2, Microsoft Azure, etc.
We are beginning the same transition with respect to data. Data is broadly
being gathered, bought, and sold in various marketplaces. However, it is still a
commodity, often obtained through offline negotiations between providers and
companies. Thus, acquiring data is one of the key bottlenecks for new tech
startups nowadays.
This is beginning to change with the emergence of cloud data markets, which
offer a single, logically centralized point for buying and selling data. Multiple
data markets have recently emerged in the cloud, e.g., Microsoft Azure Data-
Market [65], Factual [29], InfoChimps [44], Xignite [95], IUPHAR [88], etc.
These marketplaces enable data providers to sell and upload data and clients
to request data from multiple providers (often for a fee) through a unified
query interface. They provide a variety of services: (i) aggregation of data
from multiple sources, (ii) cleaning of data to ensure quality across sources,
(iii) ease of use, through a unified API, and (iv) low-latency delivery through
a geographically distributed content distribution network.
Given the recent emergence of data markets, there are widely differing designs
in the marketplace today, especially with respect to pricing. For example,
The Azure DataMarket [65] sets prices with a subscription model that allows
a maximum number of queries (API calls) per month and limits the size of
records that can be returned for a single query. Other data markets, e.g.,
Infochimps [44], allow payments per query or per data set. In nearly all cases,
the data provider and the data market operator each then get a share of the fees
paid by the clients, though how this share is arrived at can differ dramatically
across data markets. The task of pricing is made even more challenging when
one considers that clients may be interested in data with differing levels of
precision/quality and privacy may be a concern.
Not surprisingly, the design of pricing (both on the client side and the data
provider side) has received significant attention in recent years, including pric-
ing of per-query access [49, 51] and pricing of private data [32, 57].
In contrast, the focus of this paper is not on the design of pricing strategies for
data markets. Instead, we focus on the engineering side of the design
of a data market, which has been ignored to this point. Supposing that
prices are given, there are important challenges that remain for the operation
of a data market. Specifically, two crucial challenges relate to data purchasing
and data placement.
Data placement: How should purchased data be stored and replicated through-
out a geo-distributed data market in order to minimize bandwidth and latency
costs? And which clients should be served from which replicas given the loca-
tions and data requirements of the clients?
Clearly, these two challenges are highly related: data placement decisions de-
pend on which data is purchased from where, so the bandwidth and latency
costs incurred because of data placement must be balanced against the pur-
chasing costs. Concretely, less expensive data that results in larger bandwidth
and latency costs is not desirable.
Though the task of jointly optimizing data purchasing and data placement is
computationally hard in the worst case, in practical settings there is structure
that can be exploited. In particular, we provide an algorithm with polynomial
running time that gives an exact solution in the case of a data market with a
single data center (§10.1). Then, using this structure, we generalize to the case
of a geo-distributed data cloud and provide an algorithm, named Datum (§10.2),
that is near optimal in practical settings.
Datum first optimizes data purchasing as if the data market was made up of
a single data center (given carefully designed “transformed” costs) and then,
given the data purchasing decisions, optimizes data placement/replication.
The “transformed” costs are designed to allow an architectural decomposition
of the joint problem into subproblems that manage data purchasing (external
operations of the data market) and data placement (internal operations of
the data market). This decomposition is of crucial operational importance
because it means that internal placement and routing decisions can proceed
without factoring in data purchasing costs, mimicking operational structures
of geo-distributed analytics systems today.
1. We initiate the study of jointly optimizing data purchasing and data place-
ment decisions in geo-distributed data markets.
2. We prove that the task of jointly optimizing data purchasing and data
placement decisions is NP-hard and can be equivalently viewed as a facility
location problem.
3. We provide an exact algorithm with polynomial running time for the case
of a data market with a single data center.
4. We provide an algorithm, Datum, for jointly optimizing data purchasing and
data placement in a geo-distributed data market that is within 1.6% of op-
timal in practical settings and improves by > 45% over designs that neglect
data purchasing costs. Importantly, Datum decomposes into subproblems
that manage data purchasing and data placement decisions separately.
Chapter 8
Data is now a traded commodity. It is being bought and sold every day, but
most of these transactions still happen offline through direct negotiations for
bulk purchases. This is beginning to change with the emergence of cloud
data markets such as Microsoft Azure DataMarket [65], Factual [29], In-
foChimps [44], Xignite [95]. As cloud data markets become more prominent,
data will become a service that can be acquired and scaled seamlessly, on
demand, similarly to computing resources available today in the cloud.
For example, consider an emerging potential competitor for Yelp. The biggest
development challenge is not algorithmic or computational. Instead, it is ob-
taining and managing high quality data at scale. The existence of a data
market, e.g., Azure DataMarket, with detailed local information about restaurants, attractions, etc., would eliminate this bottleneck entirely. In fact,
data markets such as Factual [29] are emerging to target exactly this need.
A final example considers computer vision. When tech startups need to de-
velop computer vision tools in house, a significant bottleneck (in terms of time
and cost) is obtaining labeled images with which to train new algorithms.
Emerging data markets have the potential to eliminate this bottleneck too.
For example, the emerging Visipedia project [91] (while free for now) provides
an example of the potential of such a data market.
Thus, like in the case of cloud computing, ease of access and scaling, combined
with the cost efficiency that comes with size, implies that cloud data markets
have the potential to eliminate one of the major bottlenecks for tech startups
today – data acquisition.
Pricing
While there is a large body of literature on selling physical goods, the problem
of pricing digital goods, such as data, is very different. Producing physical
goods usually has a moderate fixed cost, for example, for buying the space and
production machines needed, but this cost is partly recoverable: it is possible,
if the company cannot manage to sell its product, to resell the machinery and buildings it has been using. However, the cost of producing and acquiring
data is high and irrecoverable: if the data turns out to be worthless and nobody
wants it, then the whole procedure is wasted. Another major difference comes
from the fact that variable costs for data are low: once it has been produced,
data can be cheaply copied and replicated.
These differences lead to “versioning” as the most typical approach for selling
digital goods [7]. Versioning refers to selling different versions of the same
digital good at different prices in order to target different types of buyers. This
pricing model is common in the tech industry, e.g., companies like Dropbox sell
digital space at different prices depending on how much space customers need
and streaming websites such as Amazon often charge differently for streaming
movies at different quality levels.
While data pricing within cloud data markets has received increasing attention,
the engineering of the system itself has been ignored. The engineering of such
a geo-distributed “data cloud” is complex. In particular, the system must
jointly make both data purchasing decisions and data placement, replication
and delivery decisions, as described in the introduction.
Additional complexity is created by versioning the data, i.e., the fact that
clients have differing quality requirements for the data requested. For example,
if some clients are interested in high quality data and others are interested in
low quality data, then it may be worth it to provide high quality level data
to some clients that only need low quality data (thus incurring a higher price)
because of the savings in bandwidth and replication costs that result from
being able to serve multiple clients with the same data.
Chapter 9
This paper presents a design for a geo-distributed cloud data market, which
we refer to as a “data cloud.” This data cloud serves as an intermediary be-
tween data providers, which gather data and offer it for sale, and clients, which
interact with the data cloud through queries for particular subsets/qualities
of data. More concretely, the data cloud purchases data from multiple data
providers, aggregates it, cleans it, stores it (across multiple geographically dis-
tributed data centers), and delivers it (with low-latency) to clients in response
to queries, while aiming to minimize the operational cost, which consists of both bandwidth and data purchasing costs.
Our design builds on and extends the contributions of recent papers – specif-
ically [92, 73] – that have focused on building geo-distributed data analytic
systems but assume the data is already owned by the system and focus solely
on the interaction between a data cloud and its clients. Unfortunately, as we
highlight in §10, the inclusion of data providers means that the data cloud’s
goal of cost minimization can be viewed as a non-metric uncapacitated facility
location problem, which is NP-hard.
For reference, Figure 9.1 provides an overview of the interaction between these
three parties as well as some basic notations.
The data purchasing contract between data providers and the data cloud may take a variety of forms. For example, a data cloud may pay a data provider based on usage, i.e., per query, or it may buy the data in bulk in
advance. In this paper, we discuss both per-query data contracting and bulk
data contracting. See §9.3 for details.
More general models of queries are possible, e.g., by including a DAG modeling
the structure of the query and query execution planning (see [92] for details).
For ease of exposition, we do not include such detailed structure here, but it
can be added at the expense of more complicated notation.

¹ A common suggestion for guaranteeing privacy is to add Laplace noise to data provided to data markets; see, e.g., [25, 57].
² We distinguish data providers based on data, i.e., a data provider selling multiple data is treated as multiple data providers.
³ We distinguish clients based on queries, i.e., a client sending multiple queries is treated as multiple clients.
Depending on the situation, the client may or may not be expected to pay the
data cloud for access. If the clients are internal to the company running the
data cloud, client payments are unnecessary. However, in many situations the
client is expected to pay the data cloud for access to the data. There are many
different types of payment structures that could be considered. Broadly, these
fall into two categories: (i) subscription-based (e.g., Azure DataMarket [65])
or (ii) per-query-based (e.g. Infochimps [44]).
In this paper, we do not focus on (or model) the design of payment structure
between the clients and the data cloud. Instead, we focus on the operational
task of minimizing the cost of the data cloud operation (i.e., bandwidth and
data purchasing costs). This focus is motivated by the fact that minimizing
the operation costs improves the profit of the data cloud regardless of how
clients are charged. Interested readers can find analyses of the design of client
pricing strategies in [49, 51, 57].
Note that, even for the same data with the same quality, data transfer from
the data providers to the data cloud is not a one time event due to the need of
the data providers to update the data over time. We target the modeling and
optimization of the data cloud within a fixed time horizon, given the assumption
that queries from clients are known beforehand or can be predicted accurately.
This assumption is consistent with previous work [92, 73] and reports from
other organizations [94, 55]. Online versions of the problem are also of interest,
but are not the focus of this paper.
Modeling costs
Our goal is to provide a design that minimizes the operational costs of a data
cloud. These costs include both data purchasing and bandwidth costs.
Figure 9.1: An overview of the interaction between data providers, the data
cloud, and clients. The dotted line encircling the data centers (DC) represents
the geo-distributed data cloud. Data providers and clients interact only with
the cloud. Data provider p sends data of quality q(l, p) to data center d,
and the corresponding operation cost is βp,d (l)yp,d (l). Similarly, data center d
sends data of quality q(l, p) to client c, and the corresponding execution cost is
αd,c (l, p)xd,c (l, p). In bulk data contracting, the corresponding purchasing cost
is f (l, p)z(l, p). In per-query data contracting, the corresponding purchasing
cost is f (l, p)xd,c (l, p).
In order to describe these costs, we use the following notation, which is summarized in Figure 9.1.⁴
xd,c (l, p) ∈ {0, 1}: xd,c (l, p) = 1 if and only if data of quality q(l, p), originating
from data provider p, is transferred from data center d to client c.
αd,c (l, p): cost (including bandwidth and/or latency) to transfer data of qual-
ity q(l, p), originating from data provider p, from data center d to client
c
yp,d (l) ∈ {0, 1}: yp,d (l) = 1 if and only if data of quality q(l, p) is transferred
from data provider p to data center d.
βp,d (l): cost (including bandwidth and/or latency) to transfer data of quality
q(l, p) from data provider p to data center d.
z(l, p) ∈ {0, 1}: z(l, p) = 1 if and only if data of quality q(l, p), originating
from data provider p, is transferred to the data cloud.
f (l, p): purchasing cost of data with quality q(l, p), originating from data provider p.

⁴ Throughout, subscript indices refer to data transfer “from, to” a location, and parenthesized indices refer to data characteristics (e.g., quality, from which data provider).
Given the above notations, the costs of the data cloud can be broken into three
categories:
(i) The operation cost due to transferring data of all quality levels from
data providers to data centers is
$\mathrm{OperCost} = \sum_{p=1}^{P} \sum_{l=1}^{L_p} \sum_{d=1}^{D} \beta_{p,d}(l)\, y_{p,d}(l).$    (9.1)
(ii) The execution cost due to transferring data of all quality levels from
data centers to clients is
$\mathrm{ExecCost} = \sum_{c=1}^{C} \sum_{p \in G(c)} \sum_{l=1}^{L_p} \sum_{d=1}^{D} \alpha_{d,c}(l,p)\, x_{d,c}(l,p).$    (9.2)
(iii) The purchasing cost (PurchCost) due to buying data from the data
provider could result from a variety of differing contract styles. In this
paper we consider two extreme options: per-query and bulk data con-
tracting. These are the most commonly adopted strategies for data
purchasing today.
In per-query data contracting, the data provider charges the data cloud
a fixed rate for each query that uses the data provided by the data
provider. So, if the same data is used for two different queries, then the
data cloud pays the data provider twice. Given a per-query fee f (l, p)
for data q(l, p), the total purchasing cost is
$\mathrm{PurchCost(query)} = \sum_{c=1}^{C} \sum_{p \in G(c)} \sum_{l=1}^{L_p} \sum_{d=1}^{D} f(l,p)\, x_{d,c}(l,p).$    (9.3)
In bulk data contracting, the data cloud purchases the data in bulk
and then can distribute it without owing future payments to the data
provider. Given a one-time fee f (l, p) for data q(l, p), the total purchas-
ing cost is
$\mathrm{PurchCost(bulk)} = \sum_{p=1}^{P} \sum_{l=1}^{L_p} f(l,p)\, z(l,p).$    (9.4)
To keep the presentation of the paper simple, we focus on the per-query data
contracting model throughout the body of the paper and discuss the bulk data
contracting model (which is simpler) in Appendix C.3.
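For concreteness, the three cost terms can be evaluated directly from the decision variables. The sketch below does this for small dense arrays with made-up dimensions; it assumes the restriction to providers p ∈ G(c) and levels l ≤ Lp is encoded by zeros in the corresponding entries of x.

```python
import numpy as np

# Hypothetical dimensions: P providers, L quality levels, D data centers, C clients.
P, L, D, C = 2, 3, 2, 4
rng = np.random.default_rng(0)

beta = rng.random((P, L, D))               # beta[p, l, d]: provider p -> data center d
alpha = rng.random((D, C, L, P))           # alpha[d, c, l, p]: data center d -> client c
fee = rng.random((L, P))                   # f[l, p]: purchasing fee for quality q(l, p)
y = rng.integers(0, 2, size=(P, L, D))     # y[p, l, d] in {0, 1}
x = rng.integers(0, 2, size=(D, C, L, P))  # x[d, c, l, p] in {0, 1}

oper_cost = float(np.sum(beta * y))                          # eq. (9.1)
exec_cost = float(np.sum(alpha * x))                         # eq. (9.2)
purch_query = float(np.sum(fee[None, None, :, :] * x))       # eq. (9.3), per-query
z = (x.sum(axis=(0, 1)) > 0).astype(int)                     # z[l, p]: bought at all?
purch_bulk = float(np.sum(fee * z))                          # eq. (9.4), bulk

print(oper_cost, exec_cost, purch_query, purch_bulk)
```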
Cost Optimization
Given the cost models described above, we can now represent the goal of the
data cloud via the following integer linear program (ILP), where OperCost,
ExecCost, and PurchCost are as described in equations (9.1), (9.2) and (9.3),
respectively.
We refer the reader to [92, 93, 73] for more discussions of these additional
practical constraints. Each paper includes a subset of these factors in the
design of geo-distributed data analytics systems, but does not model data
purchasing decisions.
Chapter 10
Given the model of a geo-distributed data cloud described in the previous sec-
tion, the design task is now to provide an algorithm for computing the optimal
data purchasing and data placement/replication decisions, i.e., to solve data
cloud cost minimization problem in (9.5). Unfortunately, this cost minimization problem is an ILP, and ILPs are computationally difficult in general.¹
A classic NP-hard ILP is the uncapacitated facility location problem (UFLP) [52].
In the uncapacitated facility location problem, there is a set I of clients and a set J of potential facilities. Facility j ∈ J costs fj to open and can serve client i ∈ I at cost ci,j . The task is to determine the set of facilities that serves
the clients with minimal cost.
Our first result, stated below, highlights that cost minimization for a geo-
distributed data cloud can be reduced to the uncapacitated facility location
problem, and vice-versa. Thus, the task of operating a data cloud can then
be viewed as a facility location problem, where opening a facility parallels
purchasing a specific quality level from a data provider and placing it in a
particular data center in the data cloud.
More specifically, the reduction leading to Theorem 8 highlights that the data
cloud optimization problem is equivalent to the non-metric uncapacitated fa-
cility location problem – every instance of either problem can be written as an instance of the other. While constant-factor polynomial-time approximation algorithms are given for the metric uncapacitated facility location problem in [17, 38, 45], in the more general non-metric case the best known polynomial-time algorithm achieves a log(C)-approximation via a greedy algorithm, where C is the number of clients [42]. This is the best worst-case guarantee possible (unless NP has slightly superpolynomial time algorithms, as proven in [30]); however, some promising heuristics have been proposed for the non-metric case, e.g., [26, 8, 1, 48, 89, 36].

¹ Note that previous work on geo-distributed data analytics, where data providers and data purchasing were not considered, already leads to an ILP with limited structure. For example, [92] suggests only heuristic algorithms with no analytic guarantees.
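For reference, the greedy approach mentioned above can be sketched as follows: repeatedly pick the (facility, set of unserved clients) pair with the smallest cost per newly served client, open that facility, and assign those clients to it. This is a generic textbook version with our own naming, not necessarily the exact algorithm of [42].

```python
def greedy_facility_location(f, c):
    """Greedy approximation for non-metric uncapacitated facility location (sketch).

    f[j]: opening cost of facility j.
    c[i][j]: cost of serving client i from facility j.
    """
    n_clients, n_facilities = len(c), len(f)
    unassigned = set(range(n_clients))
    opened = set()
    assignment = {}
    while unassigned:
        best = None  # (cost per newly served client, facility, clients to assign)
        for j in range(n_facilities):
            open_cost = 0.0 if j in opened else f[j]
            ranked = sorted((c[i][j], i) for i in unassigned)
            total = open_cost
            for k, (cost, _) in enumerate(ranked, start=1):
                total += cost
                ratio = total / k
                if best is None or ratio < best[0]:
                    best = (ratio, j, [i for _, i in ranked[:k]])
        _, j, clients = best
        opened.add(j)
        for i in clients:
            assignment[i] = j
        unassigned -= set(clients)
    return opened, assignment

# Toy instance (hypothetical costs): 2 facilities, 3 clients.
print(greedy_facility_location([4.0, 3.0], [[1.0, 5.0], [2.0, 1.0], [6.0, 1.0]]))
```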
Nevertheless, even though our problem can, in general, be viewed as the non-
metric uncapacitated facility location, it does have a structure in real-world
situations that we can exploit to develop practical algorithms.
In particular, in this section we begin with the case of a data cloud made up
of a single data center. We show that, in this case, there is a structure that
allows us to design an algorithm with polynomial running time that gives an
exact solution (§10.1). Then, we move to the case of a data cloud made up
of geo-distributed data centers and highlight how to build on the algorithm
for the single data center case to provide an algorithm, Datum, for the general
case (§10.2). Importantly, Datum allows decomposition of the management of
data purchasing (operations outside of the data cloud) and data placement
(operations inside the data cloud). This feature of Datum is crucial in practice
because it means that the algorithm allows a data cloud to manage internal
operations without factoring in data purchasing costs, mimicking operations
today. While we do not provide analytic guarantees for Datum (as expected
given the reduction to/from the non-metric facility location problem), we show
that the heuristic performs well in practical settings using a case study in §11.
The assumption that the execution costs are the same across quality levels is
natural in many cases. For example, if quality levels correspond to the level of
noise added to numerical data, then the size of the data sets will be the same.
We adopt this assumption in what follows.
This assumption allows the elimination of the execution cost term from the
objective. Additionally, we can simplify notation by removing the index d for
the data center. Thus, in per-query data contracting, the data cloud optimiza-
tion problem can be simplified to (10.2). (We discuss the case of bulk data
contracting in Appendix C.3.)
minimize   $\sum_{l=1}^{L} \beta(l)\, y(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} f(l)\, x_c(l)$    (10.2)
subject to   $\sum_{l=w_c}^{L} x_c(l) = 1, \quad \forall c$    (10.2a)
$x_c(l) \le y(l), \quad \forall c, l$
$x_c(l) \ge 0, \quad \forall c, l$
$y(l) \ge 0, \quad \forall l$
$x_c(l),\, y(l) \in \{0, 1\}, \quad \forall c, l$
Note that constraint (10.2a) is a contraction of (9.5b) and (9.5c), and simply
means that any client c must be given exactly one quality level above wc ,
the minimum required quality level.2 While this problem is still an ILP, in
this case there is a structure that can be exploited to provide a polynomial
time algorithm that can find an exact solution. In particular, we prove in
Appendix C.1 that the solution to (10.2) can be found by solving the linear
program (LP) given in (10.3).
² While the two constraints are equivalent for an ILP, they lead to different feasible sets when considering its LP-relaxation; in particular, facility location algorithms based on LP-relaxations, such as randomized rounding algorithms, need to use the contracted version of the constraints to preserve the O(log C)-approximation ratio for non-metric facility location. It is equivalent to the reformulation given in Appendix C and does not introduce infinite costs that may lead to numerical errors.
minimize   $\sum_{l=1}^{L} \beta(l)\, y(l) + \sum_{i=1}^{L} S_i \sum_{l=i}^{L} f(l)\, \chi_i(l)$    (10.3)
subject to   $\chi_i(l) \le y(l), \quad \forall i, l$
$\sum_{l=i}^{L} \chi_i(l) = 1, \quad \forall i$
$\chi_i(l) \ge 0, \quad \forall i, l$
$y(l) \ge 0, \quad \forall l$
Note that this LP is not directly obtained by relaxing the integer constraints
in (10.2), but is obtained from relaxing the integer constraints in a reformu-
lation of (10.2) described in Appendix C.1. The theorem below provides a
tractable, exact algorithm for cost minimization in a data cloud made up of a
single data center. (A proof is given in Appendix C.1).
In summary, the following gives a polynomial time algorithm which yields the
optimal solution of (10.2).
Step 1: Rewrite (10.2) in the form given by (C.4).
Step 2: Solve the linear relaxation of (C.4), i.e., (10.3). If it gives an inte-
gral solution, this solution is an optimal solution of (10.2), and the algorithm
finishes. Otherwise, denote the fractional solution of the previous step by
$\{\chi^r(l), y^r(l)\}$ and continue to the next step.
Step 3: Find $m_i \in \{i, \ldots, n\}$ such that $\sum_{l=i}^{m_i - 1} y^r(l) < 1$ and $\sum_{l=i}^{m_i} y^r(l) \ge 1$.
(See Appendix C.1 for the existence of {mi }.) And express {χi (l)} as a func-
tion of {y(l)} based on (C.6). Substitute the expressions of {χi (l)} with {y(l)}
in (10.3) to obtain an instance of (C.7). Solve the linear programming prob-
lem (C.7) and find an optimal solution that is also an extreme point of (C.7).3
This yields a binary optimal solution of (C.7). Use transformation (C.6) to get
a binary optimal solution of (10.3), which can be reformulated as an optimal
solution of (10.2) from the definition of {χi (l)}.
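As an illustration of Steps 1 and 2, the LP relaxation (10.3) can be assembled and solved with a generic LP solver. The sketch below uses 0-indexed quality levels and made-up instance data, and it omits the rounding of Step 3 (which is only needed when the relaxation returns a fractional solution).

```python
import numpy as np
from scipy.optimize import linprog

def solve_single_dc_lp(beta, f, S):
    """Solve the LP relaxation (10.3) for the single data center case (a sketch).

    beta[l]: cost of bringing quality level l into the data center.
    f[l]:    per-query purchasing fee for quality level l.
    S[i]:    number of clients whose minimum required level is i.
    Client category i may be served at any level l >= i (0-indexed).
    Variable order: [y(0..L-1), chi(0, 0..L-1), ..., chi(L-1, 0..L-1)].
    """
    L = len(beta)
    n = L + L * L
    cost = np.zeros(n)
    cost[:L] = beta
    for i in range(L):
        for l in range(i, L):
            cost[L + i * L + l] = S[i] * f[l]

    A_ub, b_ub = [], []
    for i in range(L):
        for l in range(i, L):
            row = np.zeros(n)
            row[L + i * L + l] = 1.0      # chi_i(l) <= y(l)
            row[l] = -1.0
            A_ub.append(row); b_ub.append(0.0)

    A_eq, b_eq = [], []
    for i in range(L):
        row = np.zeros(n)
        row[L + i * L + i: L + i * L + L] = 1.0   # sum_{l >= i} chi_i(l) = 1
        A_eq.append(row); b_eq.append(1.0)

    bounds = [(0, None)] * L
    for i in range(L):
        for l in range(L):
            bounds.append((0, None) if l >= i else (0, 0))

    res = linprog(cost, A_ub=np.array(A_ub), b_ub=b_ub,
                  A_eq=np.array(A_eq), b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[:L], res.x[L:].reshape(L, L)

# Toy instance (hypothetical numbers).
y, chi = solve_single_dc_lp(beta=[3.0, 5.0, 9.0], f=[1.0, 2.0, 4.0], S=[4, 2, 1])
print(np.round(y, 3))
```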
The idea underlying Datum is to, first, optimize data purchasing decisions as if
the data market was made up of a single data center (given carefully designed
“transformed” costs), which can be done tractably as a result of Theorem 9.
Then, second, Datum optimizes data placement/replication decisions given the
data purchasing decisions.
minimize   $\sum_{l=1}^{L} \sum_{v=1}^{V} \beta_v(l)\, y_v(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} \alpha_{v,c}(l)\, x_{v,c}(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} f(l)\, x_{v,c}(l)$    (10.4)
Compared to (9.5), the main difference is that (10.4) has two extra con-
straints (10.4c) and (10.4d). Constraint (10.4c) ensures that data can only be
placed in at most one subset of data centers across V . And constraint (10.4d)
follows from constraint (10.4b). Using this reformulation Datum can now be
explained in two steps.
Step 1: Solve (10.5) while treating the geo-distributed data cloud as a single
data center. Specifically, define $Y(l) = \sum_{v=1}^{V} y_v(l)$ and $X_c(l) = \sum_{v=1}^{V} x_{v,c}(l)$. Note that Y (l) and Xc (l) are 0–1 variables by Constraints (10.4c) and (10.4d).
Further, ignore the middle term in the objective, i.e., the ExecCost. Finally,
for each quality level l, consider a “transformed” cost β ∗ (l). We discuss how
to define β ∗ (l) below. This leaves the “single data center” problem (10.5).
Crucially, this formulation can be solved optimally in polynomial time using
the results for the case of a data cloud made up of a single data center (§10.1).
minimize   $\sum_{l=1}^{L} \beta^*(l)\, Y(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} f(l)\, X_c(l)$    (10.5)
subject to   $\sum_{l=w_c}^{L} X_c(l) = 1, \quad \forall c$
$X_c(l) \le Y(l), \quad \forall c, l$
$X_c(l) \ge 0, \quad \forall c, l$
$Y(l) \ge 0, \quad \forall l$
$X_c(l),\, Y(l) \in \{0, 1\}, \quad \forall c, l$
The remaining issue is to define β ∗ (l). Note that the reason for using trans-
formed costs β ∗ (l) instead of βv (l) is that βv (l) cannot be known precisely
without also optimizing the data placement. Thus, in defining β ∗ (l) we need
to anticipate the execution costs that result from data placement and repli-
cation given the purchase of data with quality level l. This anticipation then
allows a decomposition of data purchasing and data placement decisions. Note
that the only inaccuracy in the heuristic comes from the mismatch between
$\beta^*(l)$ and $\min_v\{\beta_v(l) + \sum_{c \in C^*(l)} \alpha_{v,c}(l)\}$, where $C^*(l)$ is the set of customers who buy at quality level l in an optimal solution – if these match for the minimizer
of (9.5) then the heuristic is exact. Indeed, in order to minimize the cost of
locating quality levels to data centers, and allocating clients to data centers
and quality levels, the set of data centers v where an optimal solution chooses
to put quality level l has to minimize the cost of data transfer in the set v and
allocating all clients who get data at quality level l, i.e. C ∗ (l), to this set of
data centers v.
Many choices are possible for the transformed costs β ∗ (l). A conservative
choice is $\beta^*(l) = \min_v \beta_v(l)$, which results in a solution (with Step 2) whose OperCost + PurchCost is a lower bound on the corresponding costs in the optimal solution of (9.5).⁵ However, it is natural to think that more aggressive
estimates may be valuable. To evaluate this, we have performed experiments in the setting of the case study (see §11) using the following parametric form: $\beta^*(l) = \min_v\{\beta_v(l) + \mu_1 \sum_{l' \le l} \sum_{c:\, w_c = l'} \alpha_{v,c}(l')\, e^{-\mu_2 (l - l')}\}$, where $\mu_1$ and $\mu_2$ are parameters. This
form generalizes the conservative choice by providing a weighting of $\alpha_{v,c}(l')$ based on the “distance” of the quality deviation between $l'$ and the target quality level $l$. The idea behind this is that a client is more likely to be served data with quality level close to the requested minimum quality level of the client. Here we use the exponential decay term $e^{-\mu_2(l - l')}$ to capture the possibility of serving the data with quality level $l$ to a client with minimum quality level $l' \le l$. Interestingly, in the setting of our case study, the best design is $\mu_1 = \mu_2 = 0$, i.e., the conservative estimate $\beta^*(l) = \min_v \beta_v(l)$, and so we adopt this $\beta^*(l)$ in Datum.

⁵ However, the ExecCost cannot be bounded, and thus we cannot obtain a bound for the total cost. The proof of this is simple and is omitted due to space limitations.
Step 2: At the completion of Step 1 the solution (X, Y ) to (10.5) determines
which quality levels should be purchased and which quality level should be
delivered to each client. What remains is to determine data placement and
data replication levels. To accomplish this, we substitute (X, Y ) into (10.4),
which yields (10.6).
minimize   $\sum_{l=1}^{L} \sum_{v=1}^{V} \beta_v(l)\, y_v(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} \alpha_{v,c}(l)\, x_{v,c}(l) + \sum_{c=1}^{C} \sum_{l=1}^{L} \sum_{v=1}^{V} f(l)\, x_{v,c}(l)$    (10.6)
Let C(l) denote the set of clients that purchase data with quality level l, i.e.,
C(l) = {c : Xc (l) = 1}. Then (10.7) gives the optimal solution of (10.6). (A
proof is given in Appendix C.2.)
$y_v(l) = \begin{cases} 1, & \text{if } Y(l) = 1 \text{ and } v = \arg\min_{v}\{\beta_v(l) + \sum_{c \in C(l)} \alpha_{v,c}(l)\},\\ 0, & \text{otherwise.} \end{cases}$    (10.7a)

$x_{v,c}(l) = \begin{cases} y_v(l), & \text{if } c \in C(l),\\ 0, & \text{otherwise.} \end{cases}$    (10.7b)
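Step 2 is a direct evaluation of (10.7): for each purchased quality level, the data is placed at the candidate placement minimizing operation plus execution cost over the clients served at that level, and those clients are routed there. A minimal sketch with hypothetical inputs:

```python
def datum_step2(Y, X, beta, alpha):
    """Placement and routing per (10.7), as a sketch.

    Y[l]: 1 if quality level l was purchased in Step 1.
    X[c][l]: 1 if client c is served at quality level l in Step 1.
    beta[v][l]: cost of bringing level l into candidate placement v.
    alpha[v][c][l]: cost of serving client c at level l from placement v.
    Returns y[v][l] and x[v][c][l].
    """
    V, L = len(beta), len(beta[0])
    C = len(X)
    y = [[0] * L for _ in range(V)]
    x = [[[0] * L for _ in range(C)] for _ in range(V)]
    for l in range(L):
        if not Y[l]:
            continue
        clients = [c for c in range(C) if X[c][l] == 1]   # C(l)
        # pick the placement minimizing beta_v(l) + sum_{c in C(l)} alpha_{v,c}(l)
        best_v = min(range(V),
                     key=lambda v: beta[v][l] + sum(alpha[v][c][l] for c in clients))
        y[best_v][l] = 1
        for c in clients:
            x[best_v][c][l] = 1
    return y, x
```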
[Figure: cost of NearestDC, OptBand, Datum, and OptCost as the number of providers per client request varies.]
Chapter 11
CASE STUDY
1. Datum provides consistently lower cost (> 45% lower) than existing designs
for geo-distributed data analytics systems.
2. Datum achieves near optimal total cost (within 1.6%) of optimal.
3. Datum achieves reduction in total cost by significantly lowering purchas-
ing costs without sacrificing bandwidth/latency costs, which stay typically
within 20-25% of the minimal bandwidth/latency costs necessary for deliv-
ery of the data to clients.
Clients. Client locations are picked randomly among US cities, weighted pro-
portionally to city populations. Each client requests data from a subset of
data providers, chosen i.i.d. from a Uniform distribution. Unless otherwise
specified, the average number of providers per client request is P/2. The qual-
ity level requested from each chosen provider follows a Zipf distribution with
mean Lp /2 and shape parameter 30. P and Lp are defined as in §9.1 and §9.2.
We choose a Zipf distribution motivated by the fact that popularity typically
follows a heavy-tailed distribution [68]. Results are averaged over 20 random
instances. We observe that the results of the 20 instances for the same plot
are very close (within 5%), and thus do not show the confidence intervals on
the plots.
Operation and execution costs. To set operation and execution costs, we com-
pute the geographical distances between data centers, clients and providers.
The operation and execution costs are proportional to the geographical dis-
tances, such that the costs are effectively one dollar per gigameter. This
captures both the form of bandwidth costs adopted in [93] and the form of
latency costs adopted in [73].
Data purchasing costs. The per-query purchasing costs are drawn i.i.d. from
a Pareto distribution with mean 10 and shape parameter 2 unless otherwise
specified. We choose a Pareto distribution motivated by the fact that incomes
and prices often follow heavy-tailed distributions [68]. Results were averaged
over 20 random instances. To study the sensitivity of Datum to the relative
size of purchasing and bandwidth costs, we vary the ratio of them between
(0.01, 100).
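As a sketch of this instance generation, per-query fees with Pareto shape 2 and mean 10 correspond to a scale (minimum value) of 5, since the mean of a Pareto distribution is shape·scale/(shape − 1); provider subsets are drawn uniformly. The sizes and variable names below are illustrative, and the Zipf draw for quality levels is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
P, C = 20, 100                 # hypothetical numbers of providers and clients

# Per-query fees: Pareto with shape 2 and mean 10  =>  scale x_m = 5.
fees = 5.0 * (1.0 + rng.pareto(2.0, size=P))
print(fees.mean())             # close to 10 for large P

# Each client requests data from a uniformly chosen subset of providers,
# with roughly P/2 providers per request on average.
requests = [rng.choice(P, size=rng.integers(1, P + 1), replace=False)
            for _ in range(C)]
```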
Figure 11.2: Illustration of Datum’s sensitivity to query parameters. (a) varies
the heaviness of the tail in the distribution of purchasing fees. (b) varies
the number of quality levels available. Note that Figure 11.1 sets the shape
parameter of the Pareto governing purchasing fees to 2 and includes 8 quality
levels.
• OptCost computes the optimal solution to the data cloud cost minimization
problem by solving the integer linear program (9.5). Note that this
requires solving an NP-hard problem, and so is not feasible in practice. We
include it in order to benchmark the performance of Datum.
Across all settings, Datum is within 1.6% of optimal; however both of these
parameters have a considerable impact on the cost savings Datum provides
over our baselines. In particular, the lighter the tail of the prices of different
quality levels is, the less improvement can be achieved. This is a result of
more concentration of prices across quality levels leaving less room for opti-
mization. Similarly, fewer quality levels provides less opportunity to optimize
data purchasing decisions. At the extreme, with only one quality level available, the opportunity to optimize data purchasing goes away and OptBand and
OptCost are equivalent.
Figure 11.3: Illustration of the impact of bandwidth and purchasing fees on
Datum’s performance. NearestDC is excluded because its costs are off-scale.
(a) varies the ratio of bandwidth costs (summarized by α + β) to purchasing
costs (summarized by f ). (b) varies the ratio of costs internal to the data
cloud (α) to costs external to the data cloud (β + f ). Note that in Figure 11.1
the ratios are set to $\log\left(\frac{\alpha+\beta}{f}\right) = -0.5$ and $\log\left(\frac{\alpha}{\beta+f}\right) = -1$.
RELATED WORK
Our work focuses on the joint design of data purchasing and data placement in
a geo-distributed cloud data market. As such, it is related both to recent work
on data pricing and to geo-distributed data analytics systems. Further, the
algorithmic problem at the core of our design is the facility location problem,
and so our work builds on that literature. We discuss related work in these
three areas in the following.
Data pricing: The design of data markets has begun to attract increasing
interest in recent years, especially in the database community, see [6] for an
overview. The current literature mainly focuses on query-based pricing mecha-
nism designs [49, 51, 57] and seldom considers the operating cost of the market
service providers (i.e., the data cloud). There is also a growing body of work
related to data pricing with differentiated qualities [32, 57, 22], often motivated
by privacy. See §8.2 for more discussion. This work relates to data pricing on
the data provider side and is orthogonal to our discussion in this paper.
Algorithms for facility location: Our data cloud cost minimization prob-
lem can be viewed as a variant of the uncapacitated facility location problem.
Though such problems have been widely studied, most of the results, espe-
cially algorithms with constant approximation ratios, require the assumption
of metric cost parameters [17, 38, 45], which is not the case in our problem.
In contrast, for the non-metric facility location problem the best known al-
gorithm is a greedy algorithm proposed in [52]. Beyond this algorithm, a
variety of heuristics have been proposed; however, none of the heuristics are
appealing for our problem because it is desirable to separate (external) data
purchasing decisions from (internal) data placement/replication decisions as
much as possible. As a result we propose a new algorithm, Datum, which is
both near-optimal in practical settings and provides the desired decomposition.
Datum may also be valuable more broadly for facility location problems.
Chapter 13
CONCLUSION
This work sits at the intersection of two recent trends: the emergence of online
data marketplaces and the emergence of geo-distributed data analytics sys-
tems. Both have received significant attention in recent years across academia
and industry, changing the way data is bought and sold and changing how
companies like Facebook run queries across geo-distributed databases. In this
paper we study the engineering challenges that come when online data market-
places are run on top of a geo-distributed data analytics infrastructure. Such
cloud data markets have the potential to be a significant disruptor (as we high-
light in §8). However, there are many unanswered economic and engineering
questions about their design. While there has been significant prior work on
economic questions, such as how to price data, the engineering questions have
been neglected to this point.
Instead of increasing y(j) continuously, one can perform a binary search over
possible values of y(j). For each candidate y(j), a corresponding new value
of x is computed and the primal constraints are checked for feasibility. If
feasible, the new x is accepted, and y(j) will be increased in the next round
of the binary search. If infeasible, the new x is rejected, and the value of y(j)
will be decreased in the next round of the search.
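A minimal sketch of that search, assuming we are given (hypothetical) routines that recompute x from a candidate y(j) and check the primal constraints:

```python
def binary_search_dual(update_x, is_feasible, y_lo=0.0, y_hi=1.0, tol=1e-6):
    """Binary search over the dual variable y(j), a sketch.

    update_x(yj):   recompute the primal iterate x for a candidate y(j).
    is_feasible(x): check the primal constraints.
    Returns the largest y(j) in [y_lo, y_hi] (up to tol) whose induced x
    is feasible, together with that x.
    """
    best = (y_lo, update_x(y_lo))        # assumes y_lo itself is feasible
    while y_hi - y_lo > tol:
        mid = 0.5 * (y_lo + y_hi)
        x = update_x(mid)
        if is_feasible(x):
            best = (mid, x)              # accept; try a larger y(j)
            y_lo = mid
        else:
            y_hi = mid                   # reject; try a smaller y(j)
    return best
```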
A.1 ADMM
In our numerical results we compare LOCO to ADMM in the case of linear
NUM. For completeness, we describe the application of ADMM to that setting
here.
$\min_{x',\, z} \; g(x') + h(z) \qquad \text{s.t.} \quad x' - z = 0$

The solution to the NUM problem is recovered from the first n entries of x'.
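For reference, the scaled-form consensus ADMM iteration from [14] can be sketched generically as follows, with the proximal operators of g and h supplied by the caller and a residual-based stopping rule using the tolerances from the tuning discussion in Chapter 5. This is a schematic loop, not our experimental code.

```python
import numpy as np

def consensus_admm(prox_g, prox_h, n, rho=1.0, eps_rel=1e-4, eps_abs=1e-2, t_max=10000):
    """Scaled-form consensus ADMM for min g(x') + h(z) s.t. x' - z = 0 (a sketch).

    prox_g(v, rho), prox_h(v, rho): proximal operators of g and h, i.e.
    argmin_u g(u) + (rho/2) * ||u - v||^2 (and likewise for h).
    """
    x = np.zeros(n)
    z = np.zeros(n)
    u = np.zeros(n)                        # scaled dual variable
    for _ in range(t_max):
        x = prox_g(z - u, rho)             # x'-update
        z_old = z
        z = prox_h(x + u, rho)             # z-update
        u = u + x - z                      # dual update
        r = np.linalg.norm(x - z)                  # primal residual
        s = rho * np.linalg.norm(z - z_old)        # dual residual
        eps_pri = np.sqrt(n) * eps_abs + eps_rel * max(np.linalg.norm(x), np.linalg.norm(z))
        eps_dual = np.sqrt(n) * eps_abs + eps_rel * rho * np.linalg.norm(u)
        if r <= eps_pri and s <= eps_dual:
            break
    return x
```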
Appendix B
PROOF OF LEMMA 3
We denote the set {0, 1, . . . , m} by [m]. Logarithms are base e. Let G = (V, E)
be a graph. For any vertex set S ⊆ V , denote by N (S) the set of vertices that
are not in S but are neighbors of some vertex in S: N (S) = {N (v) : v ∈ S}\S.
The length of a path is the number of edges it contains. For a set S ⊆ V and
a function f : V → N, we use S ∩ f −1 (i) to denote the set {v ∈ S : f (v) = i}.
We now define Q + 1 “layers” $T_{\le 0}, \ldots, T_{\le Q}$: $T_{\le q} = T \cap \bigcup_{i=0}^{q} f^{-1}(i)$. That is, $T_{\le q}$ is the set of vertices in T whose rank is at most q. (The range of f is [Q], hence $T_{\le 0}$ will be empty, but we include it to simplify the proof.)
Claim 12 Set Q = 4(d + 1), γ = 15Q. Assume without loss of generality that
f (v) = 0. Then for all 0 ≤ i ≤ Q − 1,
$\Pr\big[\,|T_{\le i}| \le 2^{i}\gamma\log n \;\wedge\; |T_{\le i+1}| \ge 2^{i+1}\gamma\log n\,\big] \le \frac{1}{n^{4}}.$
because if there had been some u ∈ N (T≤i ), f (u) = i, u would have been added
to T≤i .
$|T_{\le i+1} \cap f^{-1}(i+1)| > \frac{|T_{\le i+1}|}{2}.$    (B.2)
Given $|T_{\le i+1}| > 2^{i+1}\gamma\log n$, it holds that $|R_{\le i+1}| > 2^{i+1}\gamma\log n$ because $T_{\le i+1} \subseteq R_{\le i+1}$. Furthermore, $R_{\le i+1}$ was constructed by an adaptive vertex exposure procedure and so the conditions of Lemma 10 hold for $R_{\le i+1}$. From Equations (B.1) and (B.2) we get
Lemma 14 Set Q = 4(d + 1). Let G = (V, E) be a graph with degree bounded by d, where |V| = n. For any vertex $v \in G$, $\Pr\big[\,|T_v| > 2^{Q} \cdot 15Q \log n\,\big] < \frac{1}{n^{3}}$.
From the inductive step and Claim 12, using the union bound, the lemma
follows.
Applying a union bound over all the vertices gives that the size of each query set is O(log n) with probability at least 1 − 1/n², completing the proof of Theorem 3.
Appendix C
PROOF OF THEOREM 8
To prove Theorem 8, we show a connection between the data cloud cost mini-
mization problem in (9.5) and the uncapacitated facility location problem. In
particular, we show both that the facility location problem can be reduced to
a data cloud optimization problem and vice versa.
First, we show that every instance of the uncapacitated facility location prob-
lem can be viewed as an instance of (9.5).
Take any instance of the uncapacitated facility location problem (UFLP). Let
I be the set of customers, J the set of locations, αij the cost of assigning
customer i to location j, and βj the cost of opening facility j. The binary variable yj = 1 if and only if a facility is open at site j, and xj,i = 1 if and only if customer i is assigned to location j. Then the UFLP can be formulated as follows.
$\min_{x,y} \; \sum_{j \in F} \beta_j\, y_j + \sum_{i \in I,\, j \in F} \alpha_{ij}\, x_{j,i}$    (C.1)
subject to   $x_{j,i} \le y_j, \quad \forall i, j$
$\sum_{j \in F} x_{j,i} = 1, \quad \forall i$
subject to   $x_{d,c}(l) \le y_d(l), \quad \forall c, l, d$
$\sum_{d=1}^{D} \sum_{l=1}^{L} x_{d,c}(l) = 1, \quad \forall c$
with α_{d,c}(l) = M, for M big enough, whenever l < w_c. Indeed, in any feasible solution of (9.5), we necessarily have x_{d,c}(l) = 0 whenever l < w_c, as each client purchases exactly one quality level and this quality level has to be at least the minimum required level w_c; by setting α_{d,c}(l) big enough, we ensure that any optimal solution must have x_{d,c}(l) = 0 whenever l < w_c, and thus must be feasible for (9.5) and have the same cost as in (9.5). Now, take J = [D] × [L] and I = [C], and the problem can be rewritten as
min_{x,y}   Σ_{(d,l)∈J} β_d(l) y_d(l) + Σ_{(d,l)∈J, c∈I} (f(l) + α_{d,c}(l)) x_{d,c}(l)        (C.3)

which is a UFLP.
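To make the construction concrete, the sketch below builds the UFLP opening and assignment costs from an instance of (9.5). The names beta, alpha, f, and w are illustrative placeholders for the quantities above (with 0-indexed quality levels), and M plays the role of the big constant.

def build_uflp_costs(beta, alpha, f, w, M=1e9):
    """Build UFLP costs from an instance of (9.5).

    beta[d][l]     -- cost of placing quality level l at data center d
    alpha[d][c][l] -- cost of serving client c from data center d at level l
    f[l]           -- purchasing cost of quality level l
    w[c]           -- minimum quality level required by client c
    A facility is a pair (d, l); assigning client c to (d, l) with l < w[c]
    costs M, so it never appears in an optimal solution.
    """
    D, L, C = len(beta), len(f), len(w)
    open_cost = {(d, l): beta[d][l] for d in range(D) for l in range(L)}
    assign_cost = {
        ((d, l), c): (M if l < w[c] else f[l] + alpha[d][c][l])
        for d in range(D) for l in range(L) for c in range(C)
    }
    return open_cost, assign_cost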
As the clients in the same group C_i all face exactly the same choice of quality levels and the same minimum quality requirement, there must always be an optimal solution in which the data purchasing decisions of all clients within one category are the same.
Let us denote the number of clients in category C_i by S_i. Denote the purchasing decision of category C_i by χ_i, i.e., χ_i(l) = x_c(l), ∀l, c ∈ C_i. Similar to the argument in the proof of Theorem 8, we can reformulate (10.2) as follows. Note the slight abuse of notation: clients and their associated required quality level are represented by the same letter, i, since clients in category C_i have minimum quality level i by definition.
minimize   Σ_{l=1}^{L} β(l) y(l) + Σ_{i=1}^{L} Σ_{l=i}^{L} S_i f(l) χ_i(l)        (C.4)
Consider the linear relaxation of (C.4), which drops the 0-1 integer constraint (C.4e). For any optimal solution {χ_i^r(l), y^r(l)} of the linear relaxation we have the following observations.
1. χ_L^r(L) = 1.

Proof 16 From (C.4b), let i = L; then χ_L^r(L) = 1. The intuition behind this is that, since C_L ≠ ∅, the highest quality data always has to be purchased to provide service for the clients in C_L.
3. ∀l ≥ i, if Σ_{k=i}^{l} y^r(k) ≤ 1, then χ_i^r(l) = y^r(l); otherwise, χ_i^r(l) = max{1 − Σ_{k=i}^{l−1} y^r(k), 0}.
Proof 18 For some fixed i, {S_i f(l)} is a positive, strictly increasing sequence as l increases. From constraints (C.4a) and (C.4b), χ_i^r(l) ≤ y^r(l) and Σ_{l=i}^{L} χ_i^r(l) = 1. Since {χ_i^r(l), y^r(l)} is optimal, ∀l ≥ i, if Σ_{k=i}^{l} y^r(k) ≤ 1, then χ_i^r(l) = y^r(l); otherwise, χ_i^r(l) = max{1 − Σ_{k=i}^{l−1} y^r(k), 0}.
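The greedy rule established above can be sketched as follows (names are illustrative, levels are 0-indexed): χ_i^r is filled level by level with y^r until a total mass of one is reached.

def chi_from_y(y_r, i):
    """Greedy construction of chi_i^r from y^r (Observation 3).

    y_r -- relaxed values y^r(l) for l = 0, ..., L-1
    i   -- minimum quality level of category C_i
    Sets chi[l] = y_r[l] while the running sum stays at most 1, and puts
    the leftover mass max{1 - sum, 0} at the first level that overshoots.
    """
    L = len(y_r)
    chi = [0.0] * L
    cum = 0.0   # running sum of y_r over levels i, ..., l-1
    for l in range(i, L):
        if cum + y_r[l] <= 1.0:
            chi[l] = y_r[l]
        else:
            chi[l] = max(1.0 - cum, 0.0)
        cum += y_r[l]
    return chi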
Next, define m_i ∈ {i, . . . , L} such that Σ_{l=i}^{m_i−1} y^r(l) < 1 and Σ_{l=i}^{m_i} y^r(l) ≥ 1. Such an m_i must exist since y^r(l) ≥ 0 for all l and y^r(L) = 1. Recall that χ_L^r(L) = 1.
Note that, if the y^r are binary, then the χ^r are binary. Suppose there exists an optimal solution {χ^r, y^r} with y^r ∉ {0, 1}^L. In the following we show that there exists a feasible binary solution {χ^*, y^*} of (C.4) such that the objective value generated by {χ^*, y^*} is better than or equal to that of {χ^r, y^r}.
Substituting (C.6) into the objective function of (C.4), the objective becomes a linear combination of {y(l)}, which we denote L(y).
y(l) ≥ 0, ∀l = 1, . . . , L
y(L) = 1
Proof 19 Clearly, ∀l, y(l) ∈ [0, 1], and starting from y(L) it is easy to construct a feasible solution of (C.7). Thus, (C.7) is feasible and bounded, and always has an optimal solution at an extreme point.
Proof 20 Since {y^r(l)} is feasible for (C.4), y^r(l) ≥ 0, ∀l, and y^r(L) = 1. By the definition of m_i, Σ_{l=i}^{m_i−1} y^r(l) ≤ 1 and Σ_{l=i}^{m_i} y^r(l) ≥ 1.
Proof 21 Since y(L) = 1, we can drop y(L) and write (C.7) in the following standard linear programming form:

minimize   L(y)        (C.8)
s.t.   Ay ≤ b
       y ≥ 0

Note that all entries of A are 0 or ±1, and every row of A has either consecutive 1s or consecutive −1s. Thus, from [82], A is a totally unimodular matrix, and hence the extreme points of (C.8) are all integral. In particular, since all y(l) ∈ [0, 1], the extreme points of (C.8) are all binary.
{χ_i^r(l), y^r(l)} and any optimal extreme point {χ_i^*(l), y^*(l)} see their corresponding objective values unchanged between (C.7) and the relaxation of (C.4), by construction of the χ_i(l)'s. Moreover, any such extremal and optimal {χ_i^*(l), y^*(l)} has an objective value that is better than or equal to that of {χ_i^r(l), y^r(l)} in the relaxed (C.4). Since {χ_i^r(l), y^r(l)} is optimal for (C.7), this implies that any optimal extreme point of the relaxed (C.4) yields a binary and optimal solution for (C.7). This provides a polynomial time algorithm to find such a binary optimal solution, which can be summarized as in §10.2.
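One way to carry this out in practice is to hand the relaxation to an off-the-shelf LP solver and read off a vertex optimum. The sketch below assumes the data of (C.8) are supplied as a cost vector and an inequality system (an assumption about the interface, not part of the proof), and uses a simplex-type method so that the returned solution is an extreme point, hence binary by total unimodularity.

import numpy as np
from scipy.optimize import linprog

def solve_relaxation_as_binary(c, A_ub, b_ub):
    """Solve min c^T y s.t. A_ub y <= b_ub, 0 <= y <= 1 as a linear program.

    When A_ub is totally unimodular and b_ub is integral, as in (C.8),
    every extreme point of the feasible region is binary, so a vertex
    optimum of the relaxation is already an optimal 0-1 solution.
    """
    n = len(c)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, 1)] * n,
                  method="highs-ds")   # dual simplex returns a vertex
    if not res.success:
        raise RuntimeError(res.message)
    return np.round(res.x)   # remove floating-point noise; the vertex is 0-1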
1. For any quality level l_0, if Y(l_0) = 0, then Σ_{v=1}^{V} y_v(l_0) = Y(l_0) = 0. From the non-negativity of y_v(l_0), ∀v, y_v(l_0) = 0. Further, ∀v, c, x_{v,c}(l_0) = 0 from (10.6a).
2. For any quality level l_0, if Y(l_0) = 1, then from the definition of y_v(l) and Y(l), ∃! v′ ∈ V such that y_{v′}(l_0) = Y(l_0) = 1. Recall that C(l_0) = {c : X_c(l_0) = 1} represents the set of clients that are assigned data with quality level l_0 by Step 1 in §10.2.

a) For a client c ∈ C(l_0), X_c(l_0) = 1. Since v′ is the unique data center across V such that y_{v′}(l_0) = 1, from (10.6a) and (10.6b), x_{v′,c}(l_0) = 1 and x_{v,c}(l) = 0 for all v ≠ v′ or l ≠ l_0. In other words, x_{v,c}(l_0) = y_v(l_0), ∀v ∈ V, c ∈ C(l_0).

b) For a client c ∉ C(l_0), X_c(l_0) = 0. From the definition of X_c(l_0), x_{v,c}(l_0) = 0, ∀v.
In all of the above cases, the optimal solution {x_{v,c}(l), y_v(l)} of (10.6) satisfies the following:

x_{v,c}(l) = y_v(l), if c ∈ C(l);   x_{v,c}(l) = 0, otherwise.        (C.11)
Next, we use this form for x_{v,c}(l) to derive y_v(l). After substituting (C.11) into (10.6), most constraints become trivial due to the form of (C.11) and the optimality of X_c(l) and Y(l). We then only need to optimize the objective function subject to the constraints that y_v(l) is binary and Σ_v y_v(l) = Y(l). Thus, we only need to solve the following problem.
minimize   Σ_{l:Y(l)=1} Σ_{v=1}^{V} β_v(l) y_v(l) + Σ_{l:Y(l)=1} Σ_{c∈C(l)} Σ_{v=1}^{V} (α_{v,c}(l) + f(l)) y_v(l)
subject to   Σ_{v=1}^{V} y_v(l) = Y(l),   ∀l
             y_v(l) ∈ {0, 1},   ∀v, l

This problem is separable across quality levels, so an optimal solution is

y_v(l) = 1, if Y(l) = 1 and v = argmin_v { β_v(l) + Σ_{c∈C(l)} α_{v,c}(l) };   y_v(l) = 0, otherwise.        (C.12)
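The placement rule (C.12) can be read off directly: for each purchased quality level, the data cloud opens the single data center with the smallest combined placement and service cost. A small illustrative sketch follows; the variable names are assumptions rather than the notation of (10.6).

def place_levels(Y, beta, alpha, C_of):
    """Placement rule (C.12): for each purchased level l, open the data
    center minimizing beta[v][l] + sum of alpha[v][c][l] over c in C(l).

    Y     -- Y[l] in {0, 1}: whether quality level l is purchased
    beta  -- beta[v][l]: placement cost of level l at data center v
    alpha -- alpha[v][c][l]: cost of serving client c from v at level l
    C_of  -- C_of[l]: list of clients assigned quality level l
    Returns y with y[v][l] = 1 only at the chosen data center.
    """
    V, L = len(beta), len(Y)
    y = [[0] * L for _ in range(V)]
    for l in range(L):
        if Y[l] != 1:
            continue
        cost = [beta[v][l] + sum(alpha[v][c][l] for c in C_of[l])
                for v in range(V)]
        v_star = min(range(V), key=lambda v: cost[v])
        y[v_star][l] = 1
    return y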
This constraint states that any data placed in the data cloud must have been purchased by the data cloud. As in the per-query contracting case, the data purchasing/placement decision for data from one data provider does not impact the data purchasing/placement decision for any other data provider. Thus, we drop the index p in the following.
The first reduction follows directly from the first part of the proof of Theorem 8. It can be easily proved by defining the facilities in J_1 to be the quality levels, and using the same reformulation as in the second part of the proof of Theorem 8 for the facilities in J_2, i.e., defining the facilities in J_2 to be pairs of quality levels and data centers. In the reduction, a facility j_1 ∈ J_1 is open if and only if the corresponding quality level l is purchased, and a facility j_2 ∈ J_2 is open if and only if data of quality level l is placed in data center d.
For the single data center case, we always have z(l) = y(l) for all quality levels l; this follows immediately from dropping the dependence of y_d(l) on d, which implies that z(l) is only lower-bounded by y(l) in the constraints. Furthermore, if the execution costs are the same across quality levels, the cost minimization problem can be formulated as follows:
minimize   Σ_{l=1}^{L} (β(l) + f(l)) y(l)        (C.13)
subject to
x_c(l) ≥ 0,   ∀c, l
y(l) ≥ 0,   ∀l
x_c(l), y(l) ∈ {0, 1},   ∀c, l
Since the decisions for the variables {x_c(l)} do not affect the objective value, (C.13) can be written as follows:

minimize   Σ_{l=1}^{L} (β(l) + f(l)) y(l)        (C.14)
subject to   Σ_{l=w_c}^{L} y(l) ≥ 1,   ∀c
Since there are customers buying the highest quality level, the highest quality level L is always purchased by the data cloud and y(L) = 1 in any feasible solution. Since all customers can be satisfied by level L and all costs are non-negative, an optimal solution for (C.14) is y(L) = z(L) = 1 and x_c(L) = 1, with all other variables set to 0. This result implies that the data cloud will only purchase the highest quality level of data and serve that data to every customer.
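As a purely illustrative check of this observation, the sketch below computes the cost of the single-purchase solution and verifies that every customer can indeed be served at level L; the array names are placeholders.

def top_level_only_cost(beta, f, w):
    """Cost of the solution y(L) = 1, all other y(l) = 0, for (C.14).

    beta[l], f[l] -- purchase and execution costs of level l (0-indexed)
    w[c]          -- minimum level required by customer c
    Any feasible solution with a customer requiring level L must purchase
    level L, and further purchases only add cost, so this is optimal.
    """
    L = len(beta) - 1
    assert all(wc <= L for wc in w), "every customer is served by level L"
    return beta[L] + f[L]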