Stratified Sampling Meets Machine Learning

Kevin Lang (LANGK@YAHOO-INC.COM), Yahoo Research
Edo Liberty (EDO@YAHOO-INC.COM), Yahoo Research
Konstantin Shmakov (KSHMAKOV@YAHOO-INC.COM), Yahoo Research

Abstract

This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which evaluating aggregate queries can be done both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a cost-zero solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. Our extensive experimental results significantly improve over both uniform sampling and standard stratified sampling, which are the de facto industry standards.

1. Introduction

Given a database of n records 1, 2, . . . , n, we define the result y of an aggregate query q to be y = Σ_i q_i. Here, q_i is the scalar result of evaluating query q on record i.¹ For example, consider a database containing user actions on a popular site such as the Yahoo homepage. Here, each record corresponds to a single user and contains his/her past actions on the site. The value q_i can be the number of times user i read a full news story if they are New York based, and q_i = 0 otherwise. The result y = Σ_i q_i is the number of articles read by Yahoo's New York based users. In an internal system at Yahoo (YAM+), such queries are performed in rapid succession by advertisers when designing advertising campaigns.

Answering such queries efficiently poses a challenge. On the one hand, the number of users n is too large for an efficient linear scan, i.e., evaluating y explicitly. This is costly both in terms of running time and space (disk) usage. On the other hand, the values q_i could be the results of applying arbitrarily complex functions to records. Consider, for example, the Flurry SDK (Yahoo), where users issue arbitrary queries on their dataset. This means no indexing, intermediate pre-aggregation, or sketching based solutions can be applied. Fortunately, executing queries on a random sample of records can provide good approximate answers. See the work of Olken, Rotem and Hellerstein (Olken & Rotem, 1986; 1990; Olken, 1993; Hellerstein et al., 1997) for motivations and efficient algorithms for database sampling. It is well known (and easy to show) that a uniform sample of records provides a provably good solution to this problem.

1.1. Uniform Sampling

Let S ⊂ {1, 2, . . . , n} be the set of sampled records and Pr(i ∈ S) = p_i independently for all i. The Horvitz-Thompson estimator for y is given by ỹ = Σ_{i∈S} q_i/p_i. If p_i ≥ ζ > 0, the following statements hold true.

• E[ỹ − y] = 0
• σ[ỹ − y] ≤ y · sqrt(1/(ζ · card(q)))
• Pr[|ỹ − y| ≥ εy] ≤ e^{−O(ε² ζ · card(q))}

Here, σ[·] stands for the standard deviation and card(q) := Σ_i |q_i| / max_i |q_i| is the numeric cardinality of a query.²

¹ For notational brevity, the index i ranges over 1, 2, . . . , n unless otherwise specified.
² Note that for binary, or 'select', queries the numeric cardinality card(q) is equal to the cardinality of the set of selected records.

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).
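To make the estimator concrete, here is a minimal Python sketch of our own (not from the paper); the record count, the toy 'select' query, and the uniform sampling rate are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 100_000                                # number of records
q = (rng.random(n) < 0.05).astype(float)   # toy 'select' query: q_i is 0 or 1
y = q.sum()                                # exact answer y = sum_i q_i

p = np.full(n, 0.01)                       # uniform sampling probabilities p_i = zeta
S = rng.random(n) < p                      # include record i with probability p_i
y_tilde = np.sum(q[S] / p[S])              # Horvitz-Thompson estimate

print(y, y_tilde)                          # y_tilde is an unbiased estimate of y
```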

The first and second facts follow from direct expectation and variance computations. The third follows from applying Bernstein's inequality to the sum of independent, mean-zero random variables that make up ỹ − y.

Note that ζ can be inversely proportional to card(q), which means high-cardinality queries can be well approximated by very few uniform samples. Acharya, Gibbons and Poosala (Acharya et al., 2000) showed that uniform sampling is also the optimal strategy against adversarial queries. Later it was shown by Liberty, Mitzenmacher, Thaler and Ullman (Liberty et al., 2014) that uniform sampling is also space optimal in the information theoretic sense for any compression mechanism (not limited to selecting records). That means that no summary of a database can be more accurate than uniform sampling in the worst case.

Nevertheless, in practice, queries and databases are not adversarial. This gives some hope that a non-uniform sample could produce better results for practically encountered datasets and query distributions. This motivated several investigations into this problem.

1.2. Prior Art

Sampling populations non-uniformly, as in Stratified Sampling, is a standard technique in statistics. One example is known as Neyman allocation³ (Neyman, 1934; Cochran, 1977), which selects records with probability inversely proportional to the size of the stratum they belong to. Strata in this context are a mutually exclusive partitioning of the records which mirrors the structure of future queries. This structure is overly restrictive for our setting, where the queries are completely unrestricted.

Acharya et al. (Acharya et al., 2000) introduce congressional sampling, a hybrid of uniform sampling and Neyman allocation. The stratification is performed with respect to the relations in the database. Later, Chaudhuri, Das and Narasayya (Chaudhuri et al., 2007) considered the notion of a distribution over queries and asserted that the query log is a random sample from that distribution, an assumption we later make as well. Both papers describe standard stratified sampling on the finest partitioning (or fundamental regions), which often degenerates to single records in our setting. Nevertheless, if the formulation of (Chaudhuri et al., 2007) is taken to its logical conclusion, their result resembles our ERM based approach. Their solution of the optimization problem, however, does not carry over.

The work of Joshi and Jermaine (Joshi & Jermaine, 2008) is closely related to ours. They generate a large number of distributions by taking convex combinations of Neyman allocations of individual strata of single queries. The chosen solution is the one that minimizes the observed variance on the query log. They report favorable results, but they suggest an inefficient algorithm, offer no formal guarantees, and fail to recognize the critical importance of regularization.

³ Also known as Neyman optimal allocation.

Recent results (Agarwal et al., 2013; Laptev et al., 2012; Agarwal et al., 2014) investigate a more interactive or dynamic database setting. These ideas, combined with modern data infrastructures, lead to impressive practical results.

1.3. Our Contributions

In this paper we approach this problem in its fullest generality. We allow each record to be sampled with a different probability. Then, we optimize these probabilities to minimize the expected error of estimating future queries. Our only assumption is that past and future queries are drawn independently from the same unknown distribution. This embeds the stratification task into the PAC model.

1. We formalize stratified sampling as a special regression problem in the PAC model (Section 2).
2. We propose a simple and efficient one-pass algorithm for solving the regularized ERM problem (Section 3).
3. We report extensive experimental results on both synthetic and real data that showcase the effectiveness of our proposed solution (Section 4).

This gives the first solution to this problem which is simultaneously provable, practical, efficient and fully automated.

2. Sampling in the PAC Model

In the PAC model, one assumes that examples are drawn i.i.d. from an unknown distribution (e.g. (Valiant, 1984; Kearns & Vazirani, 1994)). Given a random collection of such samples – a training set – the goal is to train a model that is accurate in expectation for future examples (over the unknown distribution). Our setting is very similar. Let p_i be the probability with which we pick record i. Let q_i denote the query q evaluated for record i and let y = Σ_i q_i be the correct exact answer for that query. Let ỹ = Σ_{i∈S} q_i/p_i, where i ∈ S with probability p_i, be the Horvitz-Thompson estimator for y. The value ỹ is analogous to the prediction of our regressor at point q. The model in this analogy is the vector of probabilities p.

A standard objective in regression problems is to minimize the squared loss, L(ỹ, y) = (ỹ − y)². In our setting, however, the prediction ỹ is itself a random variable. By taking the expectation over the random bits of the sampling algorithm and by overloading the loss function L(p, q) := Σ_i q_i²(1/p_i − 1), our goal is modified to minimize

E_q[E_ỹ(ỹ − y)²] = E_q Σ_i q_i²(1/p_i − 1) = E_q L(p, q).
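As a sanity check on this closed form, the following minimal sketch of ours (the query values, sampling probabilities, and sizes are arbitrary illustrative choices) compares Σ_i q_i²(1/p_i − 1) with the Monte-Carlo average of (ỹ − y)².

```python
import numpy as np

rng = np.random.default_rng(1)

n = 1_000
q = rng.exponential(size=n) * (rng.random(n) < 0.2)  # toy query values, mostly zero
p = np.clip(rng.random(n), 0.05, 1.0)                # arbitrary sampling probabilities
y = q.sum()

# Closed form: E[(y_tilde - y)^2] = sum_i q_i^2 * (1/p_i - 1)
closed_form = np.sum(q ** 2 * (1.0 / p - 1.0))

# Monte-Carlo estimate of the same quantity.
trials = 20_000
sq_err = np.empty(trials)
for t in range(trials):
    S = rng.random(n) < p
    sq_err[t] = (np.sum(q[S] / p[S]) - y) ** 2

print(closed_form, sq_err.mean())  # the two values should be close
```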

Optimizing for the relative squared loss L(ỹ, y) = (ỹ/y − 1)² is possible simply by dividing the loss by y². For notational reasons, the absolute squared loss is used for the algorithm presentation and mathematical derivation. The experimental section uses the relative loss, which turns out to be preferred by most practitioners. The reader should keep in mind that both absolute and relative squared losses fall under the exact same formulation.

The absolute value loss L(ỹ, y) = |ỹ − y| was considered by (Chaudhuri et al., 2007). While it is a very reasonable measure of loss, it is problematic in the context of optimization. First, there is no simple closed form expression for its expectation over ỹ. While this does not rule out gradient descent based methods, it makes them much less efficient. A more critical issue with setting L(ỹ, y) = |ỹ − y| is the fact that L(p, q) is, in fact, not convex in p. To verify, consider a dataset with only two records and a single query (q_1, q_2) = (1, 1). Setting (p_1, p_2) = (0.1, 0.5) or (p_1, p_2) = (0.5, 0.1) gives E_ỹ[|ỹ − y|] = 1.8. Setting (p_1, p_2) = (0.3, 0.3) yields E_ỹ[|ỹ − y|] = 1.96. This contradicts the convexity of L with respect to p.

3. Empirical Risk Minimization

Empirical Risk Minimization (ERM) is a standard approach in machine learning in which the chosen model is the minimizer of the empirical risk. The empirical risk R_emp(p) is defined as the average loss of the model over the training set Q. Here Q is a query log containing a random collection of queries q drawn independently from the unknown query distribution.

p_emp = arg min_p R_emp(p) = arg min_p (1/|Q|) Σ_{q∈Q} L(p, q)

Notice that, unlike most machine learning problems, one could trivially obtain zero loss by setting all sampling probabilities to 1. This clearly gives very accurate "estimates" but also, obviously, achieves no reduction in the database size. In this paper we assume that retaining record i incurs cost c_i and constrain the sampling to a fixed budget B. One can think of c_i, for example, as the size of record i on disk and B as the total available storage. The interesting scenario for sampling is when Σ_i c_i ≫ B. By enforcing that Σ_i p_i c_i ≤ B, the expected cost of the sample fits the budget and the trivial solution is disallowed.

ERM is usually coupled with regularization because aggressively minimizing the loss on the training set runs the risk of overfitting. We introduce a regularization mechanism by enforcing that p_i ≥ ζ for some small threshold 0 ≤ ζ ≤ B/Σ_i c_i. When ζ = 0 no regularization is applied. When ζ = B/Σ_i c_i the regularization is so severe that uniform sampling is the only feasible solution. This type of regularization both ensures that the variance is never infinite and guarantees some accuracy for arbitrary queries (see Section 1.1). To sum up, p_emp is the solution to the following constrained optimization problem:

p_emp = arg min_p (1/|Q|) Σ_{q∈Q} Σ_i q_i²(1/p_i − 1)
s.t. Σ_i p_i c_i ≤ B and ∀i p_i ∈ [ζ, 1]

This optimization is computationally feasible because it minimizes a convex function over a convex set. However, a (nearly) closed form solution to this constrained optimization problem is obtainable using the standard method of Lagrange multipliers. The ERM solution, p_emp, minimizes

max_{α,β,γ} [ (1/|Q|) Σ_{q∈Q} Σ_i q_i²(1/p_i − 1) − Σ_i α_i(p_i − ζ) − Σ_i β_i(1 − p_i) − γ(B − Σ_i p_i c_i) ]

where α_i, β_i and γ are nonnegative. By complementary slackness conditions, if ζ < p_i < 1 then α_i = β_i = 0. Taking the derivative with respect to p_i we get that

p_i ∝ sqrt( (1/(c_i |Q|)) Σ_{q∈Q} q_i² ).

This yields p_i = CLIP¹_ζ(λ z_i) for some constant λ, where z_i = sqrt( (1/(c_i |Q|)) Σ_{q∈Q} q_i² ) and CLIP¹_ζ(z) = max(ζ, min(1, z)). The value of λ is the maximal value such that Σ_i p_i c_i ≤ B and can be computed by binary search. This method for computing p_emp is summarized by Algorithm 1, which only makes a single pass over the training data (in Line 5).

Algorithm 1 Train: regularized ERM algorithm
1: input: training queries Q,
2: budget B, record costs c,
3: regularization factor η ∈ [0, 1]
4: ζ = η · (B/Σ_i c_i)
5: ∀i z_i = sqrt( (1/(c_i |Q|)) Σ_{q∈Q} q_i² )
6: Binary search for λ satisfying Σ_i c_i CLIP¹_ζ(λ z_i) = B
7: output: ∀i p_i = CLIP¹_ζ(λ z_i)

3.1. Model Generalization

The reader is reminded that we would have wanted to find the minimizer p* of the real risk R(p). However, Algorithm 1 finds p_emp, which minimizes R_emp(p), the empirical risk. Generalization, in this context, refers to the risk associated with the empirical minimizer, R(p_emp). Standard generalization results reason about R(p_emp) − R(p*) as a function of the number of training examples and the complexity of the learned concept.

In terms of model complexity, a comprehensive study of the VC-dimension of SQL queries was presented by Riondato et al. (Riondato et al., 2011). For regression problems such as ours, Rademacher complexity (see for example (Bartlett & Mendelson, 2003) and (Shalev-Shwartz & Ben-David, 2014)) is a more appropriate measure. Moreover, it is directly measurable on the training set, which is of great practical importance.

Luckily, here, we can bound the generalization directly. Let z_i* = sqrt( (1/c_i) E_q q_i² ). Notice that, if we replace z_i by z_i* in Algorithm 1, we obtain the optimal solution p*.

We will show that z_i* and z_i are 1 ± ε approximations of one another, and that ε diminishes proportionally to sqrt(1/|Q|). This will yield that the values of λ and λ*, of p_i and p_i*, and finally of R(p) and R(p*) are also 1 ± O(ε) multiplicative approximations of one another, which establishes our claim.

For a single record, the variable z_i² is a sum of i.i.d. random variables. Moreover, z_i*² = E_q z_i². Using Hoeffding's inequality we can reason about the difference between the two values:

Pr[ |z_i² − z_i*²| ≥ ε z_i*² ] ≤ 2 e^{−2|Q|ε² / skew²(i)}.

Definition: The skew of a record is defined as

skew(i) = (max_q q_i²) / (E_q q_i²).

It captures the variability in the values a single record contributes to different queries. Note that skew(i) is not directly observable. Nevertheless, skew(i) is usually a small constant times the reciprocal of the probability of record i being selected by a query.

Taking the union bound over all records, we get the minimal value of ε for which we succeed with probability 1 − δ:

ε = O( max_i skew(i) sqrt( log(n/δ)/|Q| ) ).

From this point on, it is safe to assume z_i*/(1 + ε) ≤ z_i ≤ (1 + ε) z_i* for all records i simultaneously. To prove that λ* ≤ (1 + ε)λ, assume by negation that λ* > (1 + ε)λ. Because CLIP¹_ζ is a monotone non-decreasing function, we have that

B = Σ_i c_i CLIP¹_ζ(λ* z_i*) > Σ_i c_i CLIP¹_ζ(λ(1 + ε) z_i*) ≥ Σ_i c_i CLIP¹_ζ(λ z_i) = B.

The contradiction proves that λ* ≤ (1 + ε)λ. Using the fact that CLIP¹_ζ(x) ≥ CLIP¹_ζ(ax)/a for all a ≥ 1, we observe

p_i = CLIP¹_ζ(λ z_i) ≥ CLIP¹_ζ(λ z_i (1 + ε)²)/(1 + ε)² ≥ CLIP¹_ζ(λ* z_i*)/(1 + ε)² = p_i*/(1 + ε)².

Finally, a straightforward calculation shows that

R(p) = Σ_i (1/p_i − 1) E_q q_i²
     ≤ Σ_i ( (1 + ε)²/p_i* − 1 ) E_q q_i²
     ≤ (1 + 3ε) Σ_i (1/p_i* − 1) E_q q_i² + 3ε Σ_i E_q q_i²
     ≤ (1 + O(ε)) R(p*).

The last inequality requires that Σ_i E_q q_i² is not much larger than R(p*) = Σ_i (1/p_i* − 1) E_q q_i². This is a very reasonable assumption. In fact, in most cases we expect Σ_i E_q q_i² to be much smaller than Σ_i (1/p_i − 1) E_q q_i² because the sampling probabilities tend to be rather small. This concludes the proof of our generalization result:

R(p) ≤ R(p*) ( 1 + O( max_i skew(i) sqrt( log(n/δ)/|Q| ) ) ).

4. Experiments

In the previous section we proved that if ERM is given a sufficiently large number of training queries, it will generate sampling probabilities that are nearly optimal for answering future queries.

In this section we present an array of experimental results using our algorithm. We compare it to uniform sampling and stratified sampling. We also study the effects of varying the number of training examples and the strength of the regularization. This is done for both synthetic and real datasets.

Our experiments focus exclusively on the relative error defined by L(ỹ, y) = (ỹ/y − 1)². As a practical shortcut, this is achievable without modifying Algorithm 1 at all. The only modification needed is normalizing all training queries such that y = 1 before executing Algorithm 1. The reader can easily verify that this is mathematically identical to minimizing the relative error. Algorithm 2 describes the testing phase reported below.

Algorithm 2 Test: measure expected test error
1: input: test queries Q, probability vector p
2: for q ∈ Q do
3:   y_q ← Σ_i q_i
4:   v_q² = E(ỹ_q/y_q − 1)² = (1/y_q²) Σ_i q_i²(1/p_i − 1)
5: end for
6: output: (1/|Q|) Σ_{q∈Q} v_q²

4.1. Details of Datasets

Cube Dataset  The Cube Dataset uses synthetic records and synthetic queries, which allows us to dynamically generate queries and test the entire parameter space.

A record is a 5-tuple {x_k; 1 ≤ k ≤ 5} of random real values, each drawn uniformly at random from the interval [0, 1]. The dataset contained 10,000 records. A query {(t_k, s_k); 1 ≤ k ≤ 5} is a 5-tuple of pairs, each containing a random threshold t_k in [0, 1] (uniformly) and a randomly chosen sign s_k ∈ {−1, 1} with equal probability. We set q_x = 1 iff ∀k, s_k(x_k − t_k) ≥ 0, and zero otherwise. We also set all record costs to c_i = 1. The length of the tuples and the number of records are arbitrary; changing them yields qualitatively similar results.

DBLP Dataset  In this dataset we use a real database from DBLP and synthetic queries. Records correspond to 2,101,151 academic papers from the DBLP public database (database). From the publicly available DBLP database XML file we selected all papers from the 1000 most populous venues. A venue could be either a conference or a journal. From each paper we extracted the title, the number of authors, and the publication date. From the titles we extracted the 5000 most commonly occurring words (deleting several typical stop-words such as "a", "the", etc.).

Next, 50,000 random queries were generated as follows. Select one title word w uniformly at random from the set of 5000 commonly occurring words. Select a year y uniformly at random from 1970, . . . , 2015. Select a number k of authors from 1, . . . , 5. The query matches papers whose titles contain w and which satisfy one of the following four conditions: (1) the paper was published on or before y; (2) the paper was published after y; (3) the number of authors is ≤ k; (4) the number of authors is > k. Each condition is selected with equal probability. A candidate query is rejected if it was generated already or if it matches fewer than 100 papers. The 50,000 random queries were split into 40,000 for training and 10,000 for testing.

YAM+ Dataset  The YAM+ dataset was obtained from an advertising system at Yahoo. Among its many functions, YAM+ must efficiently estimate the reach of advertising campaigns. It is a real dataset with real queries issued by campaign managers. In this task, each record contains a single user's advertising related actions. The result of a query is the number of users, clicks or impressions that meet some conditions.

In this task, record costs c_i correspond to their volume on disk, which varies significantly between records. The budget is the pre-specified allotted disk space available for storing the samples. Moreover, unlike the above two examples, the values q_i often represent the number of matching events for a given user. These are not binary but instead vary between 1 and 10,000. To set up our experiment, 1600 contracts (campaigns) were evaluated on 60 million users, yielding 1.6 billion nonzero values of q_i.

The queries were subdivided into training and testing sets, each containing 800 queries. All training queries were chronologically issued before any of the testing queries. The training and testing sets each contained roughly 60 queries that matched fewer than 1000 users. These were discarded since they are considered by YAM+ users as too small to be of any interest. As such, approximating them well is unnecessary.

Dataset                     | Cube  | DBLP  | YAM+
Sampling Rate               | 0.1   | 0.01  | 0.01
Uniform Sampling            | 0.664 | 0.229 | 0.104
Neyman Allocation           | 0.643 | 0.640 | 0.286
Regularized Neyman          | 0.582 | 0.228 | 0.102
ERM-η, small training set   | 0.637 | 0.222 | 0.096
ERM-ρ, small training set   | 0.623 | 0.213 | 0.092
ERM-η, large training set   | 0.233 | 0.182 | 0.064
ERM-ρ, large training set   | 0.233 | 0.179 | 0.059

Figure 1. Average expected relative squared errors on the test set for two standard baselines (uniform sampling and Neyman allocation), one novel baseline (Regularized Neyman), and ERM using two regularization methods. Note that Neyman allocation is worse than uniform sampling for two of the three datasets, and that "Regularized Neyman" works better than either of them on all three datasets. The best result for each dataset is, in all cases, achieved by regularized ERM. Also, more training data reduces the testing error, which is to be expected. Surprisingly, a heuristic variant of the regularization (Section 4.4) slightly outperforms the one analyzed in the paper.

4.2. Baseline Methods

We used three algorithms to establish baselines for judging the performance of regularized ERM. The first two, uniform sampling and Neyman allocation, are well known and widely used. The third (see Section 4.5) is a novel hybrid of uniform sampling and Neyman allocation that was inspired by our best-performing version of regularized ERM.

Standard Baseline Methods  The most important baseline method is uniform random sampling. It is widely used by practitioners and has been proved optimal for adversarial queries. Moreover, as shown in Section 1, it is theoretically well justified.

The second most important (and well known) baseline is Stratified Sampling, specifically Neyman allocation (also known as Neyman optimal allocation). Stratified Sampling as a whole requires the input records to be partitioned into disjoint sets called strata. In the most basic setting, the optimal sampling scheme divides the budget equally between the strata and then uniformly samples within each stratum. This causes the sampling probability of a given record to be inversely proportional to the cardinality of its stratum. Informally, this works well when future queries correlate with the strata and therefore have plenty of matching samples.
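This basic equal-budget-per-stratum allocation is easy to express in code. The following minimal sketch is our own and purely illustrative; it assumes unit record costs, strata given as one integer label per record, and a budget measured in expected number of retained records.

```python
import numpy as np

def basic_stratified_probabilities(strata, budget):
    """Split the budget equally across strata and sample uniformly inside each
    stratum, so a record's probability is inversely proportional to the size of
    its stratum (capped at 1). Assumes unit record costs."""
    labels, sizes = np.unique(strata, return_counts=True)
    per_stratum = budget / len(labels)              # equal share of the budget
    p_by_label = np.minimum(1.0, per_stratum / sizes)
    return p_by_label[np.searchsorted(labels, strata)]

# Toy usage: three strata of very different sizes and a budget of 60 records.
strata = np.array([0] * 1000 + [1] * 100 + [2] * 10)
p = basic_stratified_probabilities(strata, budget=60.0)
print(np.unique(p))   # small strata receive much larger per-record probabilities
```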

[Figure 2: three panels, one per dataset (Cube, DBLP, YAM+). Each panel plots the expected error (y-axis) against the regularization parameter η (x-axis, weaker to stronger), with one curve per training-set size and a horizontal line for uniform sampling at the same sampling rate.]

Figure 2. The three plots correspond to the three datasets. The y-axis is the average expected normalized squared error on the testing queries (lower values are better). The different curves in each plot correspond to different sizes of training sets (see the legend). The black horizontal line corresponds to uniform sampling using a similar sampling rate. The value of η (strength of regularization) varies along the x-axis. The plots make clear that (a) training on more examples reduces the future expected error, (b) more regularization is needed for smaller training sets, and (c) overfitting is a real concern.

Strata for Neyman Allocation  The difficulty in applying Neyman allocation to a given data system lies in designing the strata. This task incurs a large overhead for developing insights about the database and queries. Our experiment used the most reasonable strata we could come up with. It turned out that the only dataset where Neyman allocation beat uniform random sampling was the synthetic Cube Dataset, whose structure we understood completely (since we designed it). This, however, does not preclude the possibility that better strata would have produced better results and possibly improved on uniform random sampling for the other datasets as well.

Strata for Cube Data Set  For the Cube Dataset, we happened to know that good coverage of the corners of the cube is important. We therefore carved out the 32 corners of the cube and assigned them to a separate "corners" stratum as follows. A point was assigned to this stratum if ∀k ∈ {1, . . . , 5}, min(x_k, 1 − x_k) < C, where the threshold C = (1/160)^(1/5) ≈ 0.362 was chosen so that the total volume of the corners stratum was 20 percent of the volume of the cube. This corners stratum was then allocated 50 percent of the sampling scheme's space budget. This caused the sampling probabilities of points in the corners stratum to be 4 times larger than the probabilities of other points.

Strata for DBLP Data Set  For the DBLP dataset, we experimented with three different stratifications that could plausibly correlate with queries: 1) by paper venue, 2) by number of authors, and 3) by year. Stratification by year turned out to work best, with number of authors a fairly close second.

Strata for YAM+ Data Set  For the YAM+ dataset, users were put into separate partitions by the type of device they use most often (smartphone, laptop, tablet, etc.) and the available ad-formats on these devices. This creates 71 strata. YAM+ supports Yahoo ads across many devices and ad-formats, and advertisers often choose one or a few formats for their campaigns. Therefore, this partition respects the structure of most queries. Other reasonable partitions we experimented with did not perform as well. For example, partitioning by user age and/or gender would have been reasonable, but it correlates poorly with the kind of queries issued to the system.

Results for Baseline Methods  Figure 1 tabulates the baseline results against which the accuracy of regularized ERM is judged. The sampling rate is B/Σ_i c_i. The rest of the rows contain the quantity (1/|Q|) Σ_{q∈Q} v_q², the output of Algorithm 2. A comparison of the two standard baseline methods shows that uniform random sampling worked better than Neyman allocation for both of the datasets that used real records and whose structure was therefore complex and somewhat obscure.

4.3. Main Experimental Results

Figure 2 shows the results of applying Algorithm 1 to the three datasets above. There is one plot per dataset. In all three plots the y-axis is the average expected normalized squared error as measured on the testing queries; lower values are better. The different curves in each plot in Figure 2 report the results for different sizes of training set. The worst results (highest curve) correspond to the smallest training set. The best results (lowest curve) are for the largest training set. There is also a black line across the middle of each plot showing the performance of uniform random sampling at the same average sampling rate (budget). More training data yields better generalization (and clearly does not affect uniform sampling). This confirms our hypothesis that the right model is learned.

[Figure 3: three log-log scatter plots, one per dataset (Cube, DBLP, YAM+), of per-query expected error (y-axis) against the numeric cardinality of the test query (x-axis), comparing regularized ERM with uniform sampling.]

Figure 3. These three plots show the expected error of each test query. Clearly, for all three datasets, error is generally a decreasing
function of the numeric cardinality of the queries. The advantage of ERM over uniform random sampling lies primarily at the more
difficult low cardinality end of the spectrum.

The x-axis in Figure 2 varies with the value of the parameter η, which controls the strength of regularization. Moving from left to right means that stronger regularization is being applied. When the smallest training set is used (top curve), ERM only beats uniform sampling when very strong regularization is applied (towards the right side of the plot). However, the larger the training set becomes, the less regularization is needed. This effect is frequently observed in many machine learning tasks, where smaller training sets require stronger regularization to prevent overfitting.

4.4. Mixture Regularization

In Algorithm 1, the amount of regularization is determined by a probability floor whose height is controlled by the user parameter η. We have also experimented with a different regularization that seems to work slightly better. In this method, unregularized sampling probabilities p are generated by running Algorithm 1 with η = 0. Then, regularized probabilities are computed via the formula p′ = (1 − ρ)p + ρu, where u = B/(Σ_i c_i) is the uniform sampling rate that would hit the space budget. Note that p′ is a convex combination of two feasible solutions to our optimization problem and is therefore also a feasible solution. Test error as a function of training set size and the value of ρ is almost identical to that achieved by η-regularization (Figure 2). The only difference is that the minimum testing errors achieved by mixture regularization are slightly lower. Some of these minima are tabulated in Figure 1. This behavior could be specific to the data used but could also apply more generally.

4.5. Neyman with Mixture Regularization

The Mixture Regularization method described in Section 4.4 can be applied to any probability vector, including a vector generated by Neyman allocation. The resulting probability vector is a convex combination of a uniform vector and a Neyman vector, with the fraction of uniform controlled by a parameter ρ ∈ [0, 1]. This idea is similar in spirit to Congressional Sampling (Acharya et al., 2000).

The estimation accuracy of Neyman with Mixture Regularization is tabulated in the "Regularized Neyman" row of Figure 1. Each number was measured using the best value of ρ for the particular dataset (tested in 0.01 increments). We note that this hybrid method worked better than either uniform sampling or standard Neyman allocation.

4.6. Accuracy as a Function of Query Size

Our main experimental results show that (with appropriate regularization) ERM can work better overall than uniform random sampling. However, there is no free lunch. The method intuitively works by redistributing the overall supply of sampling probability, increasing the probability of records involved in hard queries by taking it away from records that are only involved in easy queries. This decreases the error of the system on the hard queries while increasing its error on the easy queries. This tradeoff is acceptable because easy queries initially exhibit minuscule error rates and remain well below an acceptable error rate even if increased.

We illustrate this phenomenon using scatter plots that have a separate plotted point for each test query showing its expected error as a function of its numeric cardinality. As discussed in Section 1.1, the numeric cardinality is a good measure of how hard it is to approximate a query result well using a downsampled database.

These scatter plots appear in Figure 3. There is one plot for each of the three datasets. Also, within each plot, each query is plotted with two points: a blue one showing its error with uniform sampling, and a red one showing its error with regularized ERM sampling.

For high cardinality (easy) queries, ERM typically exhibits more error than uniform sampling. For example, the extreme cardinality queries for the Cube dataset experience a 0.001 error rate with uniform random sampling. With our solution, the error increases to 0.005. This is a five-fold increase but still well below an average 0.25 error in this setting.

For low cardinality (hard) queries, ERM typically achieves less error than uniform sampling. However, it doesn't exhibit lower error on all of the hard queries. That is because error is measured on testing queries that were not seen during training. Predicting the future isn't easy.

[Figure 4: Left panel shows smoothed histograms of average test error (x-axis) for uniform sampling and regularized ERM on the YAM+ dataset. Right panel plots expected error against the effective sampling rate B/(Σ_i c_i), on a log-log scale, for uniform sampling and regularized ERM.]

Figure 4. Left: These smoothed histograms show the variability of results caused by random sampling decisions. Clearly, the distribution of outcomes for regularized ERM is preferable to that of uniform random sampling. Right: ERM with mixture regularization versus uniform random sampling at various effective sampling rates (B/Σ_i c_i). The gains might appear unimpressive in the log-log scale plot but are, in fact, 40%-50% throughout, which is significant.

4.7. Variability Caused by Sampling Choices

The quantity (1/|Q|) Σ_{q∈Q} v_q² output by Algorithm 2 is the average expected normalized squared error on the queries of the testing set. While this expected test error is minimized by the algorithm, the actual test error is a random variable that depends on the random bits of the sampling algorithm. Therefore, for any specific sampling, the test error could be either higher or lower than its expected value. The same is true for uniform random sampling. Given this additional source of variability, it is possible that a concrete sample obtained using ERM could perform worse than a concrete sample obtained by uniform sampling, even if the expected error of ERM is better.

To study the variability caused by sampling randomness, we first computed two probability vectors, p_e and p_u, for the YAM+ dataset. The former was the output of ERM with mixture regularization with ρ = 0.71 (its best value for this dataset). The latter was a uniform probability vector with the same effective sampling rate (0.01). These were kept fixed throughout the experiment.

Next, the following experiment was repeated 3000 times. In each trial a random vector r of n random numbers was created. Each of the values r_i was chosen uniformly at random from [0, 1]. The two concrete samples specified by these values are i ∈ S_e if r_i ≤ p_{e,i} and i ∈ S_u if r_i ≤ p_{u,i}. Finally, we measured the average normalized squared error over the testing queries for the concrete samples S_e and S_u. The reason for this construction is that the two algorithms use the exact same random bits.

Smoothed histograms of these measurements for regularized ERM and for uniform sampling appear in Figure 4. For esthetic reasons, these histograms were smoothed by convolving the discrete data points with a narrow Gaussian (σ = 0.006). They approximate the true distribution of concrete outcomes.

The two distributions overlap. With probability 7.2%, a specific ERM outcome was actually worse than the outcome of uniform sampling with the same vector of random numbers. Even so, from Figure 4 we clearly see the distribution for regularized ERM shifted to the left. This corresponds to the reduced expected loss but also shows that the mode of the distribution is lower.

Moreover, the ERM outcomes are more sharply concentrated around their mean, exhibiting a standard deviation of 0.049 versus 0.062 for uniform sampling. This is despite the fact that the right tail of the ERM distribution was slightly worse, with 17/3000 outcomes in the interval [0.4, 0.6] versus 11/3000 for uniform. The increased concentration is surprising because usually reducing expected loss comes at the expense of increasing its variance. This should serve as additional motivation for using the ERM solution.

5. Concluding discussion

Using three datasets, we demonstrate that our machine learning based sampling and estimation scheme provides a useful level of generalization from past queries to future queries. That is, the estimation accuracy on future queries is better than it would have been had we used uniform or stratified sampling. Moreover, it is a disciplined approach that does not require any manual tuning or data insights such as those needed for using Stratified Sampling (creating strata). Since we believe most systems of this nature already store a historical query log, this method should be widely applicable.

The ideas presented extend far beyond the squared loss and the specific ERM algorithm analyzed. Machine learning theory allows us to apply this framework to any convex loss function using gradient descent based algorithms (Hazan & Kale, 2014). One interesting function to minimize is the deviation indicator function L(ỹ, y) = 1 if |ỹ − y| ≥ εy and zero otherwise. This choice does not yield a closed form solution for L(p, q), but using Bernstein's inequality yields a tight bound that turns out to be convex in p. Online convex optimization (Zinkevich, 2003) could give provably low regret results for any arbitrary sequence of queries. This avoids the i.i.d. assumption and could be especially appealing in situations where the query distribution is expected to change over time.

References

Acharya, Swarup, Gibbons, Phillip B., and Poosala, Viswanath. Congressional samples for approximate answering of group-by queries. SIGMOD Record, 29(2):487–498, May 2000. doi: 10.1145/335191.335450.

Agarwal, Sameer, Mozafari, Barzan, Panda, Aurojit, Milner, Henry, Madden, Samuel, and Stoica, Ion. BlinkDB: Queries with bounded errors and bounded response times on very large data. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pp. 29–42, New York, NY, USA, 2013. ACM. doi: 10.1145/2465351.2465355.

Agarwal, Sameer, Milner, Henry, Kleiner, Ariel, Talwalkar, Ameet, Jordan, Michael, Madden, Samuel, Mozafari, Barzan, and Stoica, Ion. Knowing when you're wrong: Building fast and reliable approximate query processing systems. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pp. 481–492, New York, NY, USA, 2014. ACM. doi: 10.1145/2588555.2593667.

Bartlett, Peter L. and Mendelson, Shahar. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, March 2003.

Chaudhuri, Surajit, Das, Gautam, and Narasayya, Vivek. Optimized stratified sampling for approximate query processing. ACM Transactions on Database Systems, 32(2), June 2007. doi: 10.1145/1242524.1242526.

Cochran, William G. Sampling Techniques, 3rd Edition. John Wiley, 1977. ISBN 0-471-16240-X.

database, DBLP XML. http://dblp.uni-trier.de/xml/.

Hazan, Elad and Kale, Satyen. Beyond the regret minimization barrier: Optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, January 2014.

Hellerstein, Joseph M., Haas, Peter J., and Wang, Helen J. Online aggregation. SIGMOD Record, 26(2):171–182, June 1997. doi: 10.1145/253262.253291.

Joshi, Shantanu and Jermaine, Christopher. Robust stratified sampling plans for low selectivity queries. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, ICDE '08, pp. 199–208, Washington, DC, USA, 2008. IEEE Computer Society. doi: 10.1109/ICDE.2008.4497428.

Kearns, Michael J. and Vazirani, Umesh Virkumar. An Introduction to Computational Learning Theory. MIT Press, 1994.

Laptev, Nikolay, Zeng, Kai, and Zaniolo, Carlo. Early accurate results for advanced analytics on MapReduce. Proceedings of the VLDB Endowment, 5(10):1028–1039, June 2012. doi: 10.14778/2336664.2336675.

Liberty, Edo, Mitzenmacher, Michael, Thaler, Justin, and Ullman, Jonathan. Space lower bounds for itemset frequency sketches. CoRR, abs/1407.3740, 2014. URL http://arxiv.org/abs/1407.3740.

Neyman, Jerzy. On the two different aspects of the representative method: the method of stratified sampling and the method of purposive selection. Journal of the Royal Statistical Society, pp. 558–625, 1934.

Olken, Frank. Random Sampling from Databases. PhD thesis, University of California at Berkeley, 1993.

Olken, Frank and Rotem, Doron. Simple random sampling from relational databases. In Proceedings of the 12th International Conference on Very Large Data Bases, VLDB '86, pp. 160–169, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc.

Olken, Frank and Rotem, Doron. Random sampling from database files: A survey. In Statistical and Scientific Database Management, 5th International Conference SSDBM, Charlotte, NC, USA, April 3-5, 1990, Proceedings, pp. 92–111, 1990. doi: 10.1007/3-540-52342-1_23.

Riondato, Matteo, Akdere, Mert, Çetintemel, Uǧur, Zdonik, Stanley B., and Upfal, Eli. The VC-dimension of SQL queries and selectivity estimation through sampling. In Gunopulos, Dimitrios, Hofmann, Thomas, Malerba, Donato, and Vazirgiannis, Michalis (eds.), Machine Learning and Knowledge Discovery in Databases, volume 6912 of Lecture Notes in Computer Science, pp. 661–676. Springer Berlin Heidelberg, 2011. doi: 10.1007/978-3-642-23783-6_42.

Shalev-Shwartz, Shai and Ben-David, Shai. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, New York, NY, USA, 2014.

Valiant, L. G. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, November 1984. doi: 10.1145/1968.1972.

Yahoo. https://developer.yahoo.com/flurry/.

Zinkevich, Martin. Online convex programming and generalized infinitesimal gradient ascent. In Machine Learning, Proceedings of the Twentieth International Conference (ICML 2003), August 21-24, 2003, Washington, DC, USA, pp. 928–936, 2003.
