0% found this document useful (0 votes)

18 views27 pages

Data Programming: Creating Large Training Sets, Quickly

Uploaded by

Jay Khinchi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

18 views27 pages

Data Programming: Creating Large Training Sets, Quickly

Uploaded by

Jay Khinchi

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 27

Data Programming:

Creating Large Training Sets, Quickly

Alexander Ratner, Christopher De Sa, Sen Wu, Daniel Selsam, Christopher Ré
Stanford University
{ajratner,cdesa,senwu,dselsam,chrismre}@stanford.edu
arXiv:1605.07723v3 [stat.ML] 8 Jan 2017

Thursday 26th May, 2016

Abstract
Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers
of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and
expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of
training sets called data programming in which users express weak supervision strategies or domain heuristics as
labeling functions, which are programs that label subsets of the data, but that are noisy and may conflict. We show
that by explicitly representing this training set labeling process as a generative model, we can “denoise” the generated
training set, and establish theoretically that we can recover the parameters of these generative models in a handful
of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate
our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the
2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and
also show that applying data programming to an LSTM model leads to a TAC-KBP score almost 6 F1 points over
a state-of-the-art LSTM baseline (and into second place in the competition). Additionally, in initial user studies we
observed that data programming may be an easier way for non-experts to create machine learning models when training
data is limited or unavailable.

1 Introduction
Many of the major machine learning breakthroughs of the last decade have been catalyzed by the release of a new
labeled training dataset.1 Supervised learning approaches that use such datasets have increasingly become key building
blocks of applications throughout science and industry. This trend has also been fueled by the recent empirical success
of automated feature generation approaches, notably deep learning methods such as long short term memory (LSTM)
networks [14], which ameliorate the burden of feature engineering given large enough labeled training sets. For many
real-world applications, however, large hand-labeled training sets do not exist, and are prohibitively expensive to create
due to requirements that labelers be experts in the application domain. Furthermore, applications’ needs often change,
necessitating new or modified training sets.
To help reduce the cost of training set creation, we propose data programming, a paradigm for the programmatic
creation and modeling of training datasets. Data programming provides a simple, unifying framework for weak
supervision, in which training labels are noisy and may be from multiple, potentially overlapping sources. In data
programming, users encode this weak supervision in the form of labeling functions, which are user-defined programs
that each provide a label for some subset of the data, and collectively generate a large but potentially overlapping set of
training labels. Many different weak supervision approaches can be expressed as labeling functions, such as strategies
which utilize existing knowledge bases (as in distant supervision [22]), model many individual annotator’s labels (as
in crowdsourcing), or leverage a combination of domain-specific patterns and dictionaries. Because of this, labeling
functions may have widely varying error rates and may conflict on certain data points. To address this, we model the
labeling functions as a generative process, which lets us automatically denoise the resulting training set by learning the
accuracies of the labeling functions along with their correlation structure. In turn, we use this model of the training
set to optimize a stochastic version of the loss function of the discriminative model that we desire to train. We show
1 http://www.spacemachine.net/views/2016/3/datasets- over- algorithms

1
that, given certain conditions on the labeling functions, our method achieves the same asymptotic scaling as supervised
learning methods, but that our scaling depends on the amount of unlabeled data, and uses only a fixed number of
labeling functions.
Data programming is in part motivated by the challenges that users faced when applying prior programmatic
supervision approaches, and is intended to be a new software engineering paradigm for the creation and management of
training sets. For example, consider the scenario when two labeling functions of differing quality and scope overlap and
possibly conflict on certain training examples; in prior approaches the user would have to decide which one to use, or
how to somehow integrate the signal from both. In data programming, we accomplish this automatically by learning a
model of the training set that includes both labeling functions. Additionally, users are often aware of, or able to induce,
dependencies between their labeling functions. In data programming, users can provide a dependency graph to indicate,
for example, that two labeling functions are similar, or that one “fixes” or “reinforces” another. We describe cases
in which we can learn the strength of these dependencies, and for which our generalization is again asymptotically
identical to the supervised case.
One further motivation for our method is driven by the observation that users often struggle with selecting features
for their models, which is a traditional development bottleneck given fixed-size training sets. However, initial feedback
from users suggests that writing labeling functions in the framework of data programming may be easier [12]. While
the impact of a feature on end performance is dependent on the training set and on statistical characteristics of the
model, a labeling function has a simple and intuitive optimality criterion: that it labels data correctly. Motivated by this,
we explore whether we can flip the traditional machine learning development process on its head, having users instead
focus on generating training sets large enough to support automatically-generated features.

Summary of Contributions and Outline Our first contribution is the data programming framework, in which users
can implicitly describe a rich generative model for a training set in a more flexible and general way than in previous
approaches. In Section 3, we first explore a simple model in which labeling functions are conditionally independent.
We show here that under certain conditions, the sample complexity is nearly the same as in the labeled case. In
Section 4, we extend our results to more sophisticated data programming models, generalizing related results in
crowdsourcing [17]. In Section 5, we validate our approach experimentally on large real-world text relation extraction
tasks in genomics, pharmacogenomics and news domains, where we show an average 2.34 point F1 score improvement
over a baseline distant supervision approach—including what would have been a new competition-winning score for the
2014 TAC-KBP Slot Filling competition. Using LSTM-generated features, we additionally would have placed second
in this competition, achieving a 5.98 point F1 score gain over a state-of-the-art LSTM baseline [32]. Additionally, we
describe promising feedback from a usability study with a group of bioinformatics users.

2 Related Work
Our work builds on many previous approaches in machine learning. Distant supervision is one approach for program-
matically creating training sets. The canonical example is relation extraction from text, wherein a knowledge base of
known relations is heuristically mapped to an input corpus [8, 22]. Basic extensions group examples by surrounding
textual patterns, and cast the problem as a multiple instance learning one [15, 25]. Other extensions model the accuracy
of these surrounding textual patterns using a discriminative feature-based model [26], or generative models such as
hierarchical topic models [1, 27, 31]. Like our approach, these latter methods model a generative process of training
set creation, however in a proscribed way that is not based on user input as in our approach. There is also a wealth of
examples where additional heuristic patterns used to label training data are collected from unlabeled data [7] or directly
from users [21, 29], in a similar manner to our approach, but without any framework to deal with the fact that said labels
are explicitly noisy.
Crowdsourcing is widely used for various machine learning tasks [13, 18]. Of particular relevance to our problem
setting is the theoretical question of how to model the accuracy of various experts without ground truth available,
classically raised in the context of crowdsourcing [10]. More recent results provide formal guarantees even in the
absence of labeled data using various approaches [4, 9, 16, 17, 24, 33]. Our model can capture the basic model of the
crowdsourcing setting, and can be considered equivalent in the independent case (Sec. 3). However, in addition to
generalizing beyond getting inputs solely from human annotators, we also model user-supplied dependencies between
the “labelers” in our model, which is not natural within the context of crowdsourcing. Additionally, while crowdsourcing

2
d e f lambda_1 ( x ) :
Y
r e t u r n 1 i f ( x . gene , x . pheno ) i n KNOWN_RELATIONS_1 e l s e 0

d e f lambda_2 ( x ) :
r e t u r n -1 i f r e . match ( r ’ . ∗ n o t c a u s e . ∗ ’ , x . t e x t _ b e t w e e n ) e l s e 0

d e f lambda_3 ( x ) : λ1 λ2 λ3
r e t u r n 1 i f r e . match ( r ’ . ∗ a s s o c i a t e d . ∗ ’ , x . t e x t _ b e t w e e n )
and ( x . gene , x . pheno ) i n KNOWN_RELATIONS_2 e l s e 0
(b) The generative model of a training
(a) An example set of three labeling functions written by a user. set defined by the user input (unary fac-
tors omitted).

Figure 1: An example of extracting mentions of gene-disease relations from the scientific literature.

results focus on the regime of a large number of labelers each labeling a small subset of the data, we consider a small
set of labeling functions each labeling a large portion of the dataset.
Co-training is a classic procedure for effectively utilizing both a small amount of labeled data and a large amount
of unlabeled data by selecting two conditionally independent views of the data [5]. In addition to not needing a set
of labeled data, and allowing for more than two views (labeling functions in our case), our approach allows explicit
modeling of dependencies between views, for example allowing observed issues with dependencies between views to
be explicitly modeled [19].
Boosting is a well known procedure for combining the output of many “weak” classifiers to create a strong classifier
in a supervised setting [28]. Recently, boosting-like methods have been proposed which leverage unlabeled data in
addition to labeled data, which is also used to set constraints on the accuracies of the individual classifiers being
ensembled [3]. This is similar in spirit to our approach, except that labeled data is not explicitly necessary in ours, and
richer dependency structures between our “heuristic” classifiers (labeling functions) are supported.
The general case of learning with noisy labels is treated both in classical [20] and more recent contexts [23]. It has
also been studied specifically in the context of label-noise robust logistic regression [6]. We consider the more general
scenario where multiple noisy labeling functions can conflict and have dependencies.

3 The Data Programming Paradigm

In many applications, we would like to use machine learning, but we face the following challenges: (i) hand-labeled
training data is not available, and is prohibitively expensive to obtain in sufficient quantities as it requires expensive
domain expert labelers; (ii) related external knowledge bases are either unavailable or insufficiently specific, precluding
a traditional distant supervision or co-training approach; (iii) application specifications are in flux, changing the model
we ultimately wish to learn.
In such a setting, we would like a simple, scalable and adaptable approach for supervising a model applicable to our
problem. More specifically, we would ideally like our approach to achieve expected loss with high probability, given
O(1) inputs of some sort from a domain-expert user, rather than the traditional Õ( −2 ) hand-labeled training examples
required by most supervised methods (where Õ notation hides logarithmic factors). To this end, we propose data
programming, a paradigm for the programmatic creation of training sets, which enables domain-experts to more rapidly
train machine learning systems and has the potential for this type of scaling of expected loss. In data programming,
rather than manually labeling each example, users instead describe the processes by which these points could be labeled
by providing a set of heuristic rules called labeling functions.
In the remainder of this paper, we focus on a binary classification task in which we have a distribution π over object
and class pairs (x, y) ∈ X × {−1, 1}, and we are concerned with minimizing the logistic loss under a linear model given
some features, h i
l(w) = E(x,y)∼π log(1 + exp(−wT f (x)y)) ,
where without loss of generality, we assume that k f (x)k ≤ 1. Then, a labeling function λi : X 7→ {−1, 0, 1} is a
user-defined function that encodes some domain heuristic, which provides a (non-zero) label for some subset of the
objects. As part of a data programming specification, a user provides some m labeling functions, which we denote in
vectorized form as λ : X 7→ {−1, 0, 1}m .
Example 3.1. To gain intuition about labeling functions, we describe a simple text relation extraction example. In
Figure 1, we consider the task of classifying co-occurring gene and disease mentions as either expressing a causal

3
relation or not. For example, given the sentence “Gene A causes disease B”, the object x = (A, B) has true class y = 1.
To construct a training set, the user writes three labeling functions (Figure 1a). In λ1 , an external structured knowledge
base is used to label a few objects with relatively high accuracy, and is equivalent to a traditional distant supervision
rule (see Sec. 2). λ2 uses a purely heuristic approach to label a much larger number of examples with lower accuracy.
Finally, λ3 is a “hybrid” labeling function, which leverages a knowledge base and a heuristic.
A labeling function need not have perfect accuracy or recall; rather, it represents a pattern that the user wishes to
impart to their model and that is easier to encode as a labeling function than as a set of hand-labeled examples. As
illustrated in Ex. 3.1, labeling functions can be based on external knowledge bases, libraries or ontologies, can express
heuristic patterns, or some hybrid of these types; we see evidence for the existence of such diversity in our experiments
(Section 5). The use of labeling functions is also strictly more general than manual annotations, as a manual annotation
can always be directly encoded by a labeling function. Importantly, labeling functions can overlap, conflict, and even
have dependencies which users can provide as part of the data programming specification (see Section 4); our approach
provides a simple framework for these inputs.

Independent Labeling Functions We first describe a model in which the labeling functions label independently,
given the true label class. Under this model, each labeling function λi has some probability βi of labeling an object
and then some probability αi of labeling the object correctly; for simplicity we also assume here that each class has
probability 0.5. This model has distribution
m
1Y
µα,β (Λ, Y) = βi αi 1{Λi =Y} + βi (1 − αi )1{Λi =−Y} + (1 − βi )1{Λi =0} ,

(1)
2 i=1

where Λ ∈ {−1, 0, 1}m contains the labels output by the labeling functions, and Y ∈ {−1, 1} is the predicted class. If we
allow the parameters α ∈ Rm and β ∈ Rm to vary, (1) specifies a family of generative models. In order to expose the
scaling of the expected loss as the size of the unlabeled dataset changes, we will assume here that 0.3 ≤ βi ≤ 0.5 and
0.8 ≤ αi ≤ 0.9. We note that while these arbitrary constraints can be changed, they are roughly consistent with our
applied experience, where users tend to write high-accuracy and high-coverage labeling functions.
Our first goal will be to learn which parameters (α, β) are most consistent with our observations—our unlabeled
training set—using maximum likelihood estimation. To do this for a particular training set S ⊂ X, we will solve the
problem  
X X  X 
(α̂, β̂) = arg max log P(Λ,Y)∼µα,β (Λ = λ(x)) = arg max log   µα,β (λ(x), y )
0
(2)
α,β α,β
x∈S x∈S y0 ∈{−1,1}

In other words, we are maximizing the probability that the observed labels produced on our training examples occur
under the generative model in (1). In our experiments, we use stochastic gradient descent to solve this problem; since
this is a standard technique, we defer its analysis to the appendix.

Noise-Aware Empirical Loss Given that our parameter learning phase has successfully found some α̂ and β̂ that
accurately describe the training set, we can now proceed to estimate the parameter w which minimizes the expected risk
of a linear model over our feature mapping f , given α̂, β̂. To do so, we define the noise-aware empirical risk Lα̂,β̂ with
regularization parameter ρ, and compute the noise-aware empirical risk minimizer
1 X h T
i
ŵ = arg min Lα̂,β̂ (w; S ) = arg min E(Λ,Y)∼µα̂,β̂ log 1 + e−w f (x)Y Λ = λ(x) + ρ kwk2 (3)
w w |S |
x∈S

This is a logistic regression problem, so it can be solved using stochastic gradient descent as well.
We can in fact prove that stochastic gradient descent running on (2) and (3) is guaranteed to produce accurate
estimates, under conditions which we describe now. First, the problem distribution π needs to be accurately modeled by
some distribution µ in the family that we are trying to learn. That is, for some α∗ and β∗ ,

∀Λ ∈ {−1, 0, 1}m , Y ∈ {−1, 1}, P(x,y)∼π∗ (λ(x) = Λ, y = Y) = µα∗ ,β∗ (Λ, Y). (4)

Second, given an example (x, y) ∼ π∗ , the class label y must be independent of the features f (x) given the labels λ(x).
That is,
(x, y) ∼ π∗ ⇒ y ⊥ f (x) | λ(x). (5)

4
Y Y Y
f r

λ2 λ1 λ3
λ1 s λ2 λ1 e λ2
lambda_1 ( x ) = f ( ’ . ∗ c a u s e . ∗ ’ )
lambda_1 ( x ) = f ( x . word ) lambda_2 ( x ) = f ( ’ . ∗ n o t c a u s e . ∗ ’ ) lambda_1 ( x ) = x i n DISEASES_A
lambda_2 ( x ) = f ( x . lemma ) lambda_3 ( x ) = f ( ’ . ∗ c a u s e . ∗ ’ ) lambda_2 ( x ) = x i n DISEASES_B

S i m i l a r ( lambda_1 , lambda_2 ) F i x e s ( lambda_1 , lambda_2 ) E x c l u d e s ( lambda_1 , lambda_2 )

R e i n f o r c e s ( lambda_1 , lambda_3 )

Figure 2: Examples of labeling function dependency predicates.

This assumption encodes the idea that the labeling functions, while they may be arbitrarily dependent on the features,
provide sufficient information to accurately identify the class. Third, we assume that the algorithm used to solve (3) has
bounded generalization risk such that for some parameter χ,
h i h i
Eŵ ES Lα̂,β̂ (ŵ; S ) − min ES Lα̂,β̂ (w; S ) ≤ χ. (6)
w

Under these conditions, we make the following statement about the accuracy of our estimates, which is a simplified
version of a theorem that is detailed in the appendix.
Theorem 1. Suppose that we run data programming, solving the problems in (2) and (3) using stochastic gradient
descent to produce (α̂, β̂) and ŵ. Suppose further that our setup satisfies the conditions (4), (5), and (6), and suppose
that m ≥ 2000. Then for any > 0, if the number of labeling functions m and the size of the input dataset S are large
enough that
356 m
|S | ≥ 2 log
3
then our expected parameter error and generalization risk can be bounded by
h i
2

E kα̂ − α∗ k2 ≤ m 2 E β̂ − β∗ ≤ m 2 E l(ŵ) − min l(w) ≤ χ + .
w 27ρ
We select m ≥ 2000 to simplify the statement of the theorem and give the reader a feel for how scales with respect
to |S |. The full theorem with scaling in each parameter (and for arbitrary m) is presented in the appendix. This result
establishes that to achieve both expected loss and parameter estimate error , it suffices to have only m = O(1) labeling
functions and |S | = Õ( −2 ) training examples, which is the same asymptotic scaling exhibited by methods that use
labeled data. This means that data programming achieves the same learning rate as methods that use labeled data, while
requiring asymptotically less work from its users, who need to specify O(1) labeling functions rather than manually
label Õ( −2 ) examples. In contrast, in the crowdsourcing setting [17], the number of workers m tends to infinity while
here it is constant while the dataset grows. These results provide some explanation of why our experimental results
suggest that a small number of rules with a large unlabeled training set can be effective at even complex natural language
processing tasks.

4 Handling Dependencies
In our experience with data programming, we have found that users often write labeling functions that have clear
dependencies among them. As more labeling functions are added as the system is developed, an implicit dependency
structure arises naturally amongst the labeling functions: modeling these dependencies can in some cases improve
accuracy. We describe a method by which the user can specify this dependency knowledge as a dependency graph, and
show how the system can use it to produce better parameter estimates.

Label Function Dependency Graph To support the injection of dependency information into the model, we augment
the data programming specification with a label function dependency graph, G ⊂ D × {1, . . . , m} × {1, . . . , m}, which

5
is a directed graph over the labeling functions, each of the edges of which is associated with a dependency type from
a class of dependencies D appropriate to the domain. From our experience with practitioners, we identified four
commonly-occurring types of dependencies as illustrative examples: similar, fixing, reinforcing, and exclusive (see
Figure 2).
For example, suppose that we have two functions λ1 and λ2 , and λ2 typically labels only when (i) λ1 also labels,
(ii) λ1 and λ2 disagree in their labeling, and (iii) λ2 is actually correct. We call this a fixing dependency, since λ2 fixes
mistakes made by λ1 . If λ1 and λ2 were to typically agree rather than disagree, this would be a reinforcing dependency,
since λ2 reinforces a subset of the labels of λ1 .

Modeling Dependencies The presence of dependency information means that we can no longer model our labels using
the simple Bayesian network in (1). Instead, we model our distribution as a factor graph. This standard technique lets us
describe the family of generative distributions in terms of a known factor function h : {−1, 0, 1}m × {−1, 1} 7→ {−1, 0, 1} M
(in which each entry hi represents a factor), and an unknown parameter θ ∈ R M as

µθ (Λ, Y) = Zθ−1 exp(θT h(Λ, Y)),

where Zθ is the partition function which ensures that µ is a distribution. Next, we will describe how we define h using
information from the dependency graph.
To construct h, we will start with some base factors, which we inherit from (1), and then augment them with
additional factors representing dependencies. For all i ∈ {1, . . . , m}, we let

h0 (Λ, Y) = Y, hi (Λ, Y) = Λi Y, hm+i (Λ, Y) = Λi , h2m+i (Λ, Y) = Λ2i Y, h3m+i (Λ, Y) = Λ2i .

These factors alone are sufficient to describe any distribution for which the labels are mutually independent, given the
class: this includes the independent family in (1).
We now proceed by adding additional factors to h, which model the dependencies encoded in G. For each
dependency edge (d, i, j), we add one or more factors to h as follows. For a near-duplicate dependency on (i, j), we
add a single factor hι (Λ, Y) = 1{Λi = Λ j }, which increases our prior probability that the labels will agree. For a fixing
dependency, we add two factors, hι (Λ, Y) = −1{Λi = 0 ∧ Λ j , 0} and hι+1 (Λ, Y) = 1{Λi = −Y ∧ Λ j = Y}, which encode
the idea that λ j labels only when λi does, and that λ j fixes errors made by λi . The factors for a reinforcing dependency
are the same, except that hι+1 (Λ, Y) = 1{Λi = Y ∧ Λ j = Y}. Finally, for an exclusive dependency, we have a single factor
hι (Λ, Y) = −1{Λi , 0 ∧ Λ j , 0}.

Learning with Dependencies We can again solve a maximum likelihood problem like (2) to learn the parameter θ̂.
Using the results, we can continue on to find the noise-aware empirical loss minimizer by solving the problem in (3).
In order to solve these problems in the dependent case, we typically invoke stochastic gradient descent, using Gibbs
sampling to sample from the distributions used in the gradient update. Under conditions similar to those in Section 3,
we can again provide a bound on the accuracy of these results. We define these conditions now. First, there must be
some set Θ ⊂ R M that we know our parameter lies in. This is analogous to the assumptions on αi and βi we made in
Section 3, and we can state the following analogue of (4):

∃θ∗ ∈ Θ s.t. ∀(Λ, Y) ∈ {−1, 0, 1}m × {−1, 1}, P(x,y)∼π∗ (λ(x) = Λ, y = Y) = µθ∗ (Λ, Y). (7)

Second, for any θ ∈ Θ, it must be possible to accurately learn θ from full (i.e. labeled) samples of µθ . More specifically,
there exists an unbiased estimator θ̂(T ) that is a function of some dataset T of independent samples from µθ such that,
for some c > 0 and for all θ ∈ Θ,
Cov θ̂(T ) (2c |T |)−1 I. (8)
Third, for any two feasible models θ1 and θ2 ∈ Θ,
h i
E(Λ1 ,Y1 )∼µθ1 Var(Λ2 ,Y2 )∼µθ2 (Y2 |Λ1 = Λ2 ) ≤ cM −1 . (9)

That is, we’ll usually be reasonably sure in our guess for the value of Y, even if we guess using distribution µθ2 while the
the labeling functions were actually sampled from (the possibly totally different) µθ1 . We can now prove the following
result about the accuracy of our estimates.

6
KBP (News) Genomics Pharmacogenomics
Features Method Prec. Rec. F1 Prec. Rec. F1 Prec. Rec. F1
ITR 51.15 26.72 35.10 83.76 41.67 55.65 68.16 49.32 57.23
Hand-tuned DP 50.52 29.21 37.02 83.90 43.43 57.24 68.36 54.80 60.83
ITR 37.68 28.81 32.66 69.07 50.76 58.52 32.35 43.84 37.23
LSTM
DP 47.47 27.88 35.78 75.48 48.48 58.99 37.63 47.95 42.17

Table 1: Precision/Recall/F1 scores using data programming (DP), as compared to distant supervision ITR approach,
with both hand-tuned and LSTM-generated features.

Theorem 2. Suppose that we run stochastic gradient descent to produce θ̂ and ŵ, and that our setup satisfies the
conditions (5)-(9). Then for any > 0, if the input dataset S is large enough that

2 kθ0 − θ∗ k2
!
2
|S | ≥ 2 2 log ,
c
then our expected parameter error and generalization risk can be bounded by

2
c
E θ̂ − θ∗ ≤ M 2 E l(ŵ) − min l(w) ≤ χ + .
w 2ρ

As in the independent case, this shows that we need only |S | = Õ( −2 ) unlabeled training examples to achieve
error O(), which is the same asymptotic scaling as supervised learning methods. This suggests that while we pay
a computational penalty for richer dependency structures, we are no less statistically efficient. In the appendix, we
provide more details, including an explicit description of the algorithm and the step size used to achieve this result.

5 Experiments
We seek to experimentally validate three claims about our approach. Our first claim is that data programming can
be an effective paradigm for building high quality machine learning systems, which we test across three real-world
relation extraction applications. Our second claim is that data programming can be used successfully in conjunction
with automatic feature generation methods, such as LSTM models. Finally, our third claim is that data programming is
an intuitive and productive framework for domain-expert users, and we report on our initial user studies.

Relation Mention Extraction Tasks In the relation mention extraction task, our objects are relation mention can-
didates x = (e1 , e2 ), which are pairs of entity mentions e1 , e2 in unstructured text, and our goal is to learn a model
that classifies each candidate as either a true textual assertion of the relation R(e1 , e2 ) or not. We examine a news
application from the 2014 TAC-KBP Slot Filling challenge2 , where we extract relations between real-world entities
from articles [2]; a clinical genomics application, where we extract causal relations between genetic mutations and
phenotypes from the scientific literature3 ; and a pharmacogenomics application where we extract interactions between
genes, also from the scientific literature [21]; further details are included in the Appendix.
For each application, we or our collaborators originally built a system where a training set was programmatically
generated by ordering the labeling functions as a sequence of if-then-return statements, and for each candidate, taking
the first label emitted by this script as the training label. We refer to this as the if-then-return (ITR) approach, and note
that it often required significant domain expert development time to tune (weeks or more). For this set of experiments,
we then used the same labeling function sets within the framework of data programming. For all experiments, we
evaluated on a blind hand-labeled evaluation set. In Table 1, we see that we achieve consistent improvements: on average
by 2.34 points in F1 score, including what would have been a winning score on the 2014 TAC-KBP challenge [30].
We observed these performance gains across applications with very different labeling function sets. We describe the
labeling function summary statistics—coverage is the percentage of objects that had at least one label, overlap is the
percentage of objects with more than one label, and conflict is the percentage of objects with conflicting labels—and see
in Table 2 that even in scenarios where m is small, and conflict and overlap is relatively less common, we still realize
2 http://www.nist.gov/tac/2014/KBP/
3 https://github.com/HazyResearch/dd-genomics

7
performance gains. Additionally, on a disease mention extraction task (see Usability Study), which was written from
scratch within the data programming paradigm, allowing developers to supply dependencies of the basic types outlined
in Sec. 4 led to a 2.3 point F1 score boost.

F1 Score Improvement
Application # of LFs Coverage |S λ,0 | Overlap Conflict
HT LSTM
KBP (News) 40 29.39 2.03M 1.38 0.15 1.92 3.12
Genomics 146 53.61 256K 26.71 2.05 1.59 0.47
Pharmacogenomics 7 7.70 129K 0.35 0.32 3.60 4.94
Diseases 12 53.32 418K 31.81 0.98 N/A N/A

Table 2: Labeling function (LF) summary statistics, sizes of generated training sets S λ,0 (only counting non-zero labels), and
relative F1 score improvement over baseline IRT methods for hand-tuned (HT) and LSTM-generated (LSTM) feature sets.

Automatically-generated Features We additionally compare both hand-tuned and automatically-generated features,

where the latter are learned via an LSTM recurrent neural network (RNN) [14]. Conventional wisdom states that deep
learning methods such as RNNs are prone to overfitting to the biases of the imperfect rules used for programmatic
supervision. In our experiments, however, we find that using data programming to denoise the labels can mitigate this
issue, and we report a 9.79 point boost to precision and a 3.12 point F1 score improvement on the benchmark 2014
TAC-KBP (News) task, over the baseline if-then-return approach. Additionally for comparison, our approach is a 5.98
point F1 score improvement over a state-of-the-art LSTM approach [32].

Usability Study One of our hopes is that a user without expertise in ML will be more productive iterating on labeling
functions than on features. To test this, we arranged a hackathon involving a handful of bioinformatics researchers,
using our open-source information extraction framework Snorkel4 (formerly DDLite). Their goal was to build a
disease tagging system which is a common and important challenge in the bioinformatics domain [11]. The hackathon
participants did not have access to a labeled training set nor did they perform any feature engineering. The entire effort
was restricted to iterative labeling function development and the setup of candidates to be classified. In under eight
hours, they had created a training set that led to a model which scored within 10 points of F1 of the supervised baseline;
the gap was mainly due to recall issue in the candidate extraction phase. This suggests data programming may be a
promising way to build high quality extractors, quickly.

6 Conclusion and Future Work

We introduced data programming, a new approach to generating large labeled training sets. We demonstrated that our
approach can be used with automatic feature generation techniques to achieve high quality results. We also provided
anecdotal evidence that our methods may be easier for domain experts to use. We hope to explore the limits of our
approach on other machine learning tasks that have been held back by the lack of high-quality supervised datasets,
including those in other domains such imaging and structured prediction.

Acknowledgements Thanks to Theodoros Rekatsinas, Manas Joglekar, Henry Ehrenberg, Jason Fries, Percy Liang,
the DeepDive and DDLite users and many others for their helpful conversations. The authors acknowledge the support
of: DARPA FA8750-12-2-0335; NSF IIS-1247701; NSFCCF-1111943; DOE 108845; NSF CCF-1337375; DARPA
FA8750-13-2-0039; NSF IIS-1353606;ONR N000141210041 and N000141310129; NIH U54EB020405; DARPA’s
SIMPLEX program; Oracle; NVIDIA; Huawei; SAP Labs; Sloan Research Fellowship; Moore Foundation; American
Family Insurance; Google; and Toshiba. The views and conclusions expressed in this material are those of the authors
and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or
implied, of DARPA, AFRL, NSF, ONR, NIH, or the U.S. Government.
4 snorkel.stanford.edu

8
References
[1] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical
topic model. In Proceedings of the ACL.
[2] G. Angeli, S. Gupta, M. Jose, C. D. Manning, C. Ré, J. Tibshirani, J. Y. Wu, S. Wu, and C. Zhang. Stanford’s
2014 slot filling systems. TAC KBP, 695, 2014.
[3] A. Balsubramani and Y. Freund. Scalable semi-supervised aggregation of classifiers. In Advances in Neural
Information Processing Systems, pages 1351–1359, 2015.
[4] D. Berend and A. Kontorovich. Consistency of weighted majority votes. In NIPS 2014.
[5] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh
annual conference on Computational learning theory, pages 92–100. ACM, 1998.
[6] J. Bootkrajang and A. Kabán. Label-noise robust logistic regression and its applications. In Machine Learning
and Knowledge Discovery in Databases, pages 143–158. Springer, 2012.
[7] R. Bunescu and R. Mooney. Learning to extract relations from the web using minimal supervision. In Annual
meeting-association for Computational Linguistics, volume 45, page 576, 2007.
[8] M. Craven, J. Kumlien, et al. Constructing biological knowledge bases by extracting information from text sources.
In ISMB, volume 1999, pages 77–86, 1999.
[9] N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of
the 22Nd International Conference on World Wide Web, WWW ’13, pages 285–294, 2013.
[10] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the em algorithm.
Applied statistics, pages 20–28, 1979.
[11] R. I. Doğan and Z. Lu. An improved corpus of disease mentions in pubmed citations. In Proceedings of the 2012
workshop on biomedical natural language processing.
[12] H. R. Ehrenberg, J. Shin, A. J. Ratner, J. A. Fries, and C. Ré. Data programming with ddlite: putting humans in a
different part of the loop. In HILDA@ SIGMOD, page 13, 2016.
[13] H. Gao, G. Barbier, R. Goolsby, and D. Zeng. Harnessing the crowdsourcing power of social media for disaster
relief. Technical report, DTIC Document, 2011.
[14] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[15] R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for
information extraction of overlapping relations. In Proceedings of the ACL.
[16] M. Joglekar, H. Garcia-Molina, and A. Parameswaran. Comprehensive and reliable crowd assessment algorithms.
In Data Engineering (ICDE), 2015 IEEE 31st International Conference on.
[17] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in neural
information processing systems, pages 1953–1961, 2011.
[18] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al.
Visual genome: Connecting language and vision using crowdsourced dense image annotations. arXiv preprint
arXiv:1602.07332, 2016.
[19] M.-A. Krogel and T. Scheffer. Multi-relational learning, text mining, and semi-supervised learning for functional
genomics. Machine Learning, 57(1-2):61–81, 2004.
[20] G. Lugosi. Learning with an unreliable teacher. Pattern Recognition, 25(1):79 – 87, 1992.
[21] E. K. Mallory, C. Zhang, C. Ré, and R. B. Altman. Large-scale extraction of gene interactions from full-text
literature using deepdive. Bioinformatics, 2015.

9
[22] M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In
Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL, 2009.
[23] N. Natarajan, I. S. Dhillon, P. K. Ravikumar, and A. Tewari. Learning with noisy labels. In Advances in Neural
Information Processing Systems 26.

[24] F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled data.
Proceedings of the National Academy of Sciences, 111(4):1253–1258, 2014.
[25] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In Machine
Learning and Knowledge Discovery in Databases, pages 148–163. Springer, 2010.
[26] B. Roth and D. Klakow. Feature-based models for improving the quality of noisy training data for relation
extraction. In Proceedings of the 22nd ACM Conference on Knowledge management.
[27] B. Roth and D. Klakow. Combining generative and discriminative model scores for distant supervision. In
EMNLP, pages 24–29, 2013.
[28] R. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT press, 2012.

[29] J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using deepdive.
Proceedings of the VLDB Endowment, 8(11):1310–1321, 2015.
[30] M. Surdeanu and H. Ji. Overview of the english slot filling track at the tac2014 knowledge base population
evaluation. In Proc. Text Analysis Conference (TAC2014), 2014.

[31] S. Takamatsu, I. Sato, and H. Nakagawa. Reducing wrong labels in distant supervision for relation extraction. In
Proceedings of the ACL.
[32] P. Verga, D. Belanger, E. Strubell, B. Roth, and A. McCallum. Multilingual relation extraction using compositional
universal schema. arXiv preprint arXiv:1511.06396, 2015.
[33] Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet em: A provably optimal algorithm for
crowdsourcing. In Advances in Neural Information Processing Systems 27, pages 1260–1268. 2014.

10
A General Theoretical Results
In this section, we will state the full form of the theoretical results we alluded to in the body of the paper. First, we
restate, in long form, our setup and assumptions.
We assume that, for some function h : {−1, 0, 1}m × {−1, 1} 7→ {−1, 0, 1} M of sufficient statistics, we are concerned
with learning distributions, over the set Ω = {−1, 0, 1}m × {−1, 1}, of the form
1
πθ (Λ, Y) = exp(θT h(Λ, Y)), (10)
Zθ
where θ ∈ R M is a parameter, and Zθ is the partition function that makes this a distribution. We assume that we are
given, i.e. can derive from the data programming specification, some set Θ of feasible parameters. This set must have
the following two properties.
First, for any θ ∈ Θ, learning the parameter θ from (full) samples from πθ is possible, at least in some sense. More
specifically, there exists an unbiased estimator θ̂ that is a function of some number D samples from πθ (and is unbiased
for all θ ∈ Θ) such that, for all θ ∈ Θ and for some c > 0,
I
Cov θ̂ . (11)
2cD
Second, for any θ1 , θ2 ∈ Θ,
h i c
E(λ2 ,y2 )∼πθ2 Var(λ1 ,y1 )∼πθ1 (y1 |λ1 = λ2 ) ≤ . (12)
M
That is, we’ll always be reasonably certain in our guess for the value of y, even if we are totally wrong about the true
parameter θ.
On the other hand, we are also concerned with a distribution π∗ which ranges over the set X × {−1, 1}, and represents
the distribution of training and test examples we are using to learn. These objects are associated with a labeling function
λ : X 7→ {−1, 0, 1}m and a feature function f : X 7→ Rn . We make three assumptions about this distribution. First, we
assume that, given (x, y) ∼ π∗ , the class label y is independent of the features f (x) given the labels λ(x). That is,

(x, y) ∼ π∗ ⇒ y ⊥ f (x) | λ(x). (13)

Second, we assume that we can describe the relationship between λ(x) and y in terms of our family in (10) above. That
is, for some parameter θ∗ ∈ Θ,
P(x,y)∼π∗ (λ(x) = Λ, y = Y) = πθ∗ (Λ, Y). (14)
Third, we assume that the features themselves are bounded; for all x ∈ X,

k f (x)k ≤ 1. (15)

Our goal is twofold. First, we want to recover some estimate θ̂ of the true parameter θ∗ . Second, we want to produce
a parameter ŵ that minimizes the regularized logistic loss
h i
l(w) = E(x,y)∼π∗ log(1 + exp(−wT f (x)y)) + ρ kwk2 .

We actually accomplish this by minimizing a noise-aware loss function, given our recovered parameter θ̂,
h h ii
lθ̂ (w) = E( x̄,ȳ)∼π∗ E(Λ,Y)∼πθ̂ log(1 + exp(−wT f ( x̄)Y)) Λ = λ( x̄) + ρ kwk2 .

In fact we can’t even minimize this; rather, we will be minimizing the empirical noise-aware loss function, which is
only this in expectation. Since the analysis of logistic regression is not itself interesting, we assume that we are able to
run some algorithm that produces an estimate ŵ which satisfies, for some χ > 0,

E lθ̂ (ŵ) − min lθ̂ (w) θ̂ ≤ χ. (16)
w

The algorithm chosen can be anything, but in practice, we use stochastic gradient descent.
We learn θ̂ and ŵ by running the following algorithm.
Under these assumptions, we are able to prove the following theorem about the behavior of Algorithm 1.

11
Algorithm 1 Data Programming
Require: Step size η, dataset S ⊂ X, and initial parameter θ0 ∈ Θ.
θ → θ0
for all x ∈ S do
Independently sample (Λ, Y) from πθ , and (Λ̄, Ȳ) from πθ conditionally given Λ = λ(x).
θ ← θ + η(h(Λ, Y) − h(Λ̄, Ȳ)).
θ = PΘ (θ) . Here, PΘ denotes orthogonal projection onto Θ.
end for
Compute ŵ using the algorithm described in (15)
return (θ, ŵ).

Theorem A.1. Suppose that we run Algorithm 1 on a data programming specification that satisfies conditions (11),
(12), (13), (14), (15), and (16). Suppose further that, for some parameter > 0, we use step size
c 2
η=
4
and our dataset is of a size that satisfies
2 kθ0 − θ∗ k2
!
2
|S | = log .
c2 2
Then, we can bound the expected parameter error with

2
E θ̂ − θ∗ ≤ 2 M

and the expected risk with c

E l(ŵ) − min l(w) ≤ χ + .
w 2ρ
This theorem’s conclusions and assumptions can readily be seen to be identical to those of Theorem 2 in the main
body of the paper, except that they apply to the slightly more general case of arbitrary h, rather than h of the explicit
form described in the body. Therefore, in order to prove Theorem 2, it suffices to prove Theorem A.1, which we will do
in Section C.

B Theoretical Results for Independent Model

For the independent model, we can obtain a more specific version of Theorem A.1. In the independent model, the
variables are, as before, Λ ∈ {−1, 0, 1}m and Y ∈ {−1, 1}. The sufficient statistics are Λi Y and Λ2i .
To produce results that make intuitive sense, we also define the alternate parameterization
 1+γ

 βi i Λi = Y
 2


Pπ (Λi |Y) =  (1 − βi ) Λ = 0 .



βi 1−γi

 Λi = −Y
2

In comparison to the parameters used in the body of the paper, we have

1 + γi
αi = .
2
Now, we are concerned with models that are feasible. For a model to be feasible (i.e. for θ ∈ Θ), we require that it
satisfy, for some constants γmin > 0, γmax > 0, and βmin ,
1
γmin ≤ γi ≤ γmax βmin ≤ βi ≤ .
2
For 0 ≤ β ≤ 1 and −1 ≤ γ ≤ 1.
For this model, we can prove the following corollary to Theorem A.1

12
Corollary B.1. Suppose that we run Algorithm 1 on an independent data programming specification that satisfies
conditions (13), (14), (15), and (16). Furthermore, assume that the number of labeling functions we use satisfies
!
9.34 artanh(γmax ) 24m
m≥ log .
(γβ)min γmin
2 βmin

Suppose further that, for some parameter > 0, we use step size

βmin 2
η=
16
and our dataset is of a size that satisfies

2 kθ0 − θ∗ k2
!
32
|S | = log .
β2min 2

Then, we can bound the expected parameter error with

2
E θ̂ − θ∗ ≤ 2 M

and the expected risk with βmin

E l(ŵ) − min l(w) ≤ χ + .
w 8ρ
We can see that if, as stated in the body of the paper, βi ≥ 0.3 and 0.8 ≤ αi ≤ 0.9 (which is equivalent to
0.6 ≤ γi ≤ 0.8), then !
9.34 artanh(0.8) 24 · 2000
2000 ≥ 1896.13 = log .
0.3 · 0.63 0.3
This means that, as stated in the paper, m = 2000 is sufficient for this corollary to hold with

2m(artanh(0.8) − artanh(0.6))2
!
32 356 m
|S | = log = 2 log .
0.3 ·
2 2 3

Thus, proving Corollary B.1 is sufficient to prove Theorem 1 from the body of the paper. We will prove Corollary B.1
in Section E

C Proof of Theorem A.1

First, we state some lemmas that will be useful in the proof to come.

Lemma D.1. Given a family of maximum-entropy distributions

1
πθ (x) = exp(θT h(x)),
Zθ

for some function of sufficient statistics h : Ω 7→ R M , if we let J : R M 7→ R be the maximum log-likelihood objective for
some event A ⊆ Ω,
J(θ) = log P x∼πθ (x ∈ A) ,
then its gradient is
∇J(θ) = E x∼πθ [h(x)|x ∈ A] − E x∼πθ [h(x)]
and its Hessian is
∇2 J(θ) = Cov x∼πθ (h(x)|x ∈ A) − Cov x∼πθ (h(x)) .

13
Lemma D.2. Suppose that we are looking at a distribution from a data programming label model. That is, our
maximum-entropy distribution can now be written in terms of two variables, the labeling function values λ ∈ {−1, 0, 1}
and the class y ∈ {−1, 1}, as
1
πθ (λ, y) = exp(θT h(λ, y)),
Zθ
where we assume without loss of generality that for some M, h(λ, y) ∈ R M and kh(λ, y)k∞ ≤ 1. If we let J : R M 7→ R
be the maximum expected log-likelihood objective, under another distribution π∗ , for the event associated with the
observed labeling function values λ,
h i
J(θ) = E(λ∗ ,y∗ )∼π∗ log P(λ,y)∼πθ (λ = λ∗ ) ,

then its Hessian can be bounded with

h i
∇2 J(θ) MIE(λ∗ ,y∗ )∼π∗ Var(λ,y)∼πθ (y|λ = λ∗ ) − I(θ),

where I(θ) is the Fisher information.

Lemma D.3. Suppose that we are looking at a data programming distribution, as described in the text of Lemma D.2.
Suppose further that we are concerned with some feasible set of parameters Θ ⊂ R M , such that the any model with
parameters in this space satisfies the following two conditions.
First, for any θ ∈ Θ, learning the parameter θ from (full) samples from πθ is possible, at least in some sense. More
specifically, there exists an unbiased estimator θ̂ that is a function of some number D samples from πθ (and is unbiased
for all θ ∈ Θ) such that, for all θ ∈ Θ and for some c > 0,
I
Cov θ̂ .
2cD
Second, for any θ, θ∗ ∈ Θ, h i c
E(λ∗ ,y∗ )∼π∗ Var(λ,y)∼πθ (y|λ = λ∗ ) ≤ .
M
That is, we’ll always be reasonably certain in our guess for the value of y, even if we are totally wrong about the true
parameter θ∗ .
Under these conditions, the function J is strongly concave on Θ with parameter of strong convexity c.
Lemma D.4. Suppose that we are looking at a data programming maximum likelihood estimation problem, as described
in the text of Lemma D.2. Suppose further that the objective function J is strongly concave with parameter c > 0.
If we run stochastic gradient descent on objective J, using unbiased samples from a true distribution πθ∗ , where
θ ∈ Θ, then if we use step size
∗

c 2
η=
4
and run (using a fresh sample at each iteration) for T steps, where

2 kθ0 − θ∗ k2
!
2
T = 2 2 log
c
then we can bound the expected parameter estimation error with

2
E θ̂ − θ∗ ≤ 2 M.

Lemma D.5. Assume in our model that, without loss of generality, k f (x)k ≤ 1 for all x, and that in our true model π∗ ,
the class y is independent of the features f (x) given the labels λ(x).
Suppose that we now want to solve the expected loss minimization problem wherein we minimize the objective
h i
l(w) = E(x,y)∼π∗ log(1 + exp(−wT f (x)y)) + ρ kwk2 .

We actually accomplish this by minimizing our noise-aware loss function, given our chosen parameter θ̂,
h h ii
lθ̂ (w) = E( x̄,ȳ)∼π∗ E(Λ,Y)∼πθ̂ log(1 + exp(−wT f ( x̄)Y)) Λ = λ( x̄) + ρ kwk2 .

14
In fact we can’t even minimize this; rather, we will be minimizing the empirical noise-aware loss function, which is only
this in expectation. Suppose that doing so produces an estimate ŵ which satisfies, for some χ > 0,

E lθ̂ (ŵ) − min lθ̂ (w) θ̂ ≤ χ.
w

(Here, the expectation is taken with respect to only the random variable ŵ.) Then, we can bound the expected risk with
c
E l(ŵ) − min l(w) ≤ χ + .
w 2ρ
Now, we restate and prove our main theorem.
Theorem A.1. Suppose that we run Algorithm 1 on a data programming specification that satisfies conditions (11),
(12), (13), (14), (15), and (16). Suppose further that, for some parameter > 0, we use step size
c 2
η=
4
and our dataset is of a size that satisfies
2 kθ0 − θ∗ k2
!
2
|S | = log .
c2 2
Then, we can bound the expected parameter error with

2
E θ̂ − θ∗ ≤ 2 M

and the expected risk with c

E l(ŵ) − min l(w) ≤ χ + .
w 2ρ
Proof. The bounds on the expected parameter estimation error follow directly from Lemma D.4, and the remainder of
the theorem follows directly from Lemma D.5.

D Proofs of Lemmas
Lemma D.1. Given a family of maximum-entropy distributions
1
πθ (x) = exp(θT h(x)),
Zθ
for some function of sufficient statistics h : Ω 7→ R M , if we let J : R M 7→ R be the maximum log-likelihood objective for
some event A ⊆ Ω,
J(θ) = log P x∼πθ (x ∈ A) ,
then its gradient is
∇J(θ) = E x∼πθ [h(x)|x ∈ A] − E x∼πθ [h(x)]
and its Hessian is
∇2 J(θ) = Cov x∼πθ (h(x)|x ∈ A) − Cov x∼πθ (h(x)) .
Proof. For the gradient,
∇J(θ) = ∇ log Pπθ (A)
T
P !
x∈A exp(θ h(x))
= ∇ log P T
x∈Ω exp(θ h(x))
   
X X
= ∇ log  exp(θ h(x)) − ∇ log  exp(θ h(x))
T
 T


x∈A x∈Ω
T
h(x) exp(θT h(x))
P P
h(x) exp(θ h(x))
= x∈A
P T
− x∈Ω P T
x∈A exp(θ h(x)) x∈Ω exp(θ h(x))
= E x∼πθ [h(x)|x ∈ A] − E x∼πθ [h(x)] .

15
And for the Hessian,
h(x) exp(θT h(x)) h(x) exp(θT h(x))
P P
∇ J(θ) = ∇
2 x∈A
P T
−∇ x∈Ω
P T
x∈A exp(θ h(x)) x∈Ω exp(θ h(x))
P T
P T T
T T x∈A h(x) exp(θ h(x)) x∈A h(x) exp(θ h(x))
P
x∈A h(x)h(x) exp(θ h(x))
= P T
− 2
x∈A exp(θ h(x)) T
P
x∈A exp(θ h(x))
 P P T 
T T
 P
h(x)h(x) T
exp(θ T
h(x)) x∈Ω h(x) exp(θ h(x)) x∈Ω h(x) exp(θ h(x)) 
−  x∈ΩP −
 
T T h(x)) 2
x∈Ω exp(θ h(x))
P 
 x∈Ω exp(θ
h i
= E x∼πθ h(x)h(x)T x ∈ A − E x∼πθ [h(x)|x ∈ A] E x∼πθ [h(x)|x ∈ A]T
h i
− E x∼πθ h(x)h(x)T − E x∼πθ [h(x)] E x∼πθ [h(x)]T
= Cov x∼πθ (h(x)|x ∈ A) − Cov x∼πθ (h(x)) .

Lemma D.2. Suppose that we are looking at a distribution from a data programming label model. That is, our
maximum-entropy distribution can now be written in terms of two variables, the labeling function values λ ∈ {−1, 0, 1}
and the class y ∈ {−1, 1}, as
1
πθ (λ, y) = exp(θT h(λ, y)),
Zθ
where we assume without loss of generality that for some M, h(λ, y) ∈ R M and kh(λ, y)k∞ ≤ 1. If we let J : R M 7→ R
be the maximum expected log-likelihood objective, under another distribution π∗ , for the event associated with the
observed labeling function values λ,
h i
J(θ) = E(λ∗ ,y∗ )∼π∗ log P(λ,y)∼πθ (λ = λ∗ ) ,
then its Hessian can be bounded with
h i
∇2 J(θ) MIE(λ∗ ,y∗ )∼π∗ Var(λ,y)∼πθ (y|λ = λ∗ ) − I(θ),
where I(θ) is the Fisher information.
Proof. From the result of Lemma D.1, we have that
h i
∇2 J(θ) = E(λ∗ ,y∗ )∼π∗ Cov(λ,y)∼πθ (h(λ, y)|λ = λ∗ ) − Cov(λ,y)∼πθ (h(λ, y)) . (17)
We start byu defining h0 (λ) and h1 (λ) such that
1+y 1 − y h(λ, 1) + h(λ, −1) h(λ, 1) − h(λ, −1)
h(λ, y) = h(λ, 1) + h(λ, −1) = +y = h0 (λ) + yh1 (λ).
2 2 2 2
This allows us to reduce (17) to
h i
∇2 J(θ) = E(λ∗ ,y∗ )∼π∗ h1 (λ∗ )h1 (λ∗ )T Var(λ,y)∼πθ (y|λ = λ∗ ) − Cov(λ,y)∼πθ (h(λ, y)) .
On the other hand, the Fisher information of this model at θ is
h i
I(θ) = E ∇θ log πθ (x) 2
 !!2 
exp(θT h(x))
= E  ∇θ log P
 
 
exp(θ T h(z))
z∈Ω
  2 
 X  
= E ∇θ log exp(θ h(x)) − ∇θ log  exp(θ h(z)) 
T T
 
z∈Ω
 P T !2 
z∈Ω h(z) exp(θ h(z)) 
= E  h(x) − P
 
T

z∈Ω exp(θ h(z))
h i
= E (h(x) − E [h(z)])2
= Cov (h(x)) .

16
Therefore, we can write the second derivative of J as
h i
∇2 J(θ) = E(λ∗ ,y∗ )∼π∗ h1 (λ∗ )h1 (λ∗ )T Var(λ,y)∼πθ (y|λ = λ∗ ) − I(θ).

If we apply the fact that

h1 (λ∗ )h1 (λ∗ )T I kh1 (λ∗ )k2 MI kh1 (λ∗ )k2∞ MI,
then we can reduce this to h i
∇2 J(θ) MIE(λ∗ ,y∗ )∼π∗ Var(λ,y)∼πθ (y|λ = λ∗ ) − I(θ).
This is the desired result.

Lemma D.3. Suppose that we are looking at a data programming distribution, as described in the text of Lemma D.2.
Suppose further that we are concerned with some feasible set of parameters Θ ⊂ R M , such that the any model with
parameters in this space satisfies the following two conditions.
First, for any θ ∈ Θ, learning the parameter θ from (full) samples from πθ is possible, at least in some sense. More
specifically, there exists an unbiased estimator θ̂ that is a function of some number D samples from πθ (and is unbiased
for all θ ∈ Θ) such that, for all θ ∈ Θ and for some c > 0,
I
Cov θ̂ .
2cD
Second, for any θ, θ∗ ∈ Θ, h i c
E(λ∗ ,y∗ )∼π∗ Var(λ,y)∼πθ (y|λ = λ∗ ) ≤ .
M
That is, we’ll always be reasonably certain in our guess for the value of y, even if we are totally wrong about the true
parameter θ∗ .
Under these conditions, the function J is strongly concave on Θ with parameter of strong convexity c.
Proof. From the Cramér-Rao bound, we know in general that the variance of any unbiased estimator is bounded by the
reciprocal of the Fisher information
Cov θ̂ (I(θ))−1 .
Since for the estimator described in the lemma statement, we have D independent samples from the distribution, it
follows that the Fisher information of this experiment is D times the Fisher information of a single sample. Combining
this with the bound in the lemma statement on the covariance, we get
I
Cov θ̂ (DI(θ))−1 .
2cD
It follows that
I(θ) 2cI.
On the other hand, also from the lemma statement, we can conclude that
h i
MIE(λ∗ ,y∗ )∼π∗ Var(λ,y)∼πθ (y|λ = λ∗ ) cI.

Therefore, for all θ ∈ Θ, h i

∇2 J(θ) MIE(λ∗ ,y∗ )∼π∗ Var(λ,y)∼πθ (y|λ = λ∗ ) − I(θ) −cI.
This implies that J is strongly concave over Θ, with constant c, as desired.

Lemma D.4. Suppose that we are looking at a data programming maximum likelihood estimation problem, as described
in the text of Lemma D.2. Suppose further that the objective function J is strongly concave with parameter c > 0.
If we run stochastic gradient descent on objective J, using unbiased samples from a true distribution πθ∗ , where
θ∗ ∈ Θ, then if we use step size
c 2
η=
4

17
and run (using a fresh sample at each iteration) for T steps, where

2 kθ0 − θ∗ k2
!
2
T= log
c2 2

then we can bound the expected parameter estimation error with

2
E θ̂ − θ∗ ≤ 2 M.

Proof. First, we note that, in the proof to follow, we can ignore the projection onto the feasible set Θ, since this
projection always takes us closer to the optimum θ∗ .
If we track the expected distance to the optimum θ∗ , then at the next timestep,
2
kθt+1 − θ∗ k2 = kθt − θ∗ k2 + 2γ(θt − θ∗ )∇ J(θ
˜ t ) + γ2 ∇ J˜t (θt ) .

Since we can write our stochastic samples in the form

∇ J˜t (θt ) = h(λt , yt ) − h(λ̄t , ȳt ),

for some samples λt , yt , λ̄t , and ȳt , we can conclude that

2 2
∇ J˜t (θt ) ≤ M ∇ J˜t (θt ) ∞
≤ 4M.

Therefore, taking the expected value conditioned on the filtration,

h i
E kθt+1 − θ∗ k2 Ft = kθt − θ∗ k2 + 2γ(θt − θ∗ )∇J(θt ) + 4γ2 M.

Since J is strongly concave,

(θt − θ∗ )∇J(θt ) ≤ −c kθt − θ∗ k2 ;
and so, h i
E kθt+1 − θ∗ k2 Ft ≤ (1 − 2γc) kθt − θ∗ k2 + 4γ2 M.
If we take the full expectation and subtract the fixed point from both sides,
h i 2γM h i 2γM h i 2γM !
E kθt+1 − θ∗ k2 − ≤ (1 − 2γc)E kθt − θ∗ k2 + 4γ2 M − = (1 − 2γc) E kθt − θ∗ k2 − .
c c c

Therefore, !
h i 2γM 2γM
E kθt − θ∗ k2 − ≤ (1 − 2γc)t kθ0 − θ∗ k2 − ,
c c
and so
h i 2γM
E kθt − θ∗ k2 ≤ exp(−2γct) kθ0 − θ∗ k2 + .
c
In order to ensure that h i
E kθt − θ∗ k2 ≤ 2 ,
it therefore suffices to pick
c 2
γ=
4M
and
2 kθ0 − θ∗ k2
!
2M
t= log .
c2 2
Substituting 2 → 2 M produces the desired result.

18
Lemma D.5. Assume in our model that, without loss of generality, k f (x)k ≤ 1 for all x, and that in our true model π∗ ,
the class y is independent of the features f (x) given the labels λ(x).
Suppose that we now want to solve the expected loss minimization problem wherein we minimize the objective
h i
l(w) = E(x,y)∼π∗ log(1 + exp(−wT f (x)y)) + ρ kwk2 .

In fact we can’t even minimize this; rather, we will be minimizing the empirical noise-aware loss function, which is only
this in expectation. Suppose that doing so produces an estimate ŵ which satisfies, for some χ > 0,

E lθ̂ (ŵ) − min lθ̂ (w) θ̂ ≤ χ.
w

(Here, the expectation is taken with respect to only the random variable ŵ.) Then, we can bound the expected risk with
c
E l(ŵ) − min l(w) ≤ χ + .
w 2ρ

Proof. (To simplify the symbols in this proof, we freely use θ when we mean θ̂.)
The loss function we want to minimize is, in expectation,
h i
l(w) = E(x,y)∼π∗ log(1 + exp(−wT f (x)y)) + ρ kwk2 .

By the law of total expectation,

h h ii
l(w) = E( x̄,ȳ)∼π∗ E(x,y)∼π∗ log(1 + exp(−wT f ( x̄)y)) x = x̄ + ρ kwk2 ,

and by our conditional independence assumption,

h h ii
l(w) = E( x̄,ȳ)∼π∗ E(x,y)∼π∗ log(1 + exp(−wT f ( x̄)y)) λ(x) = λ( x̄) + ρ kwk2 .

Since we know from our assumptions that, for the optimum parameter θ∗ ,

P(x,y)∼π∗ (λ(x) = Λ, y = Y) = P(λ,y)∼πθ∗ (λ = Λ, y = Y) ,

we can rewrite this as

h h ii
l(w) = E( x̄,ȳ)∼π∗ E(Λ,Y)∼πθ∗ log(1 + exp(−wT f ( x̄)Y)) Λ = λ( x̄) + ρ kwk2 .

On the other hand, if we are minimizing the model we got from the previous step, we will be actually minimizing
h h ii
lθ (w) = E( x̄,ȳ)∼π∗ E(Λ,Y)∼πθ log(1 + exp(−wT f ( x̄)Y)) Λ = λ( x̄) + ρ kwk2 .

We can reduce this further by noticing that

h i
E(Λ,Y)∼πθ log(1 + exp(−wT f ( x̄)Y)) Λ = λ( x̄)
1+Y
" #
1−Y
= E(Λ,Y)∼πθ log(1 + exp(−wT f ( x̄))) + log(1 + exp(wT f ( x̄))) Λ = λ( x̄)
2 2
log(1 + exp(−w f ( x̄))) + log(1 + exp(w f ( x̄)))
T T
=
2
log(1 + exp(−wT f ( x̄))) − log(1 + exp(wT f ( x̄)))
+ E(Λ,Y)∼πθ [Y|Λ = λ( x̄)]
2
log(1 + exp(−wT f ( x̄))) + log(1 + exp(wT f ( x̄)))
=
2
wT f ( x̄)
− E(Λ,Y)∼πθ [Y|Λ = λ( x̄)] .
2

19
It follows that the difference between the loss functions will be
" T #
w f ( x̄)
|l(w) − lθ (w)| = E( x̄,ȳ)∼π E(Λ,Y)∼πθ [Y|Λ = λ( x̄)] − E(Λ,Y)∼πθ∗ [Y|Λ = λ( x̄)] .

∗
2

Now, we can compute that

exp(θT h(λ, 1)) − exp(θT h(λ, −1))

∇θ E(Λ,Y)∼πθ [Y|Λ = λ] = ∇θ
exp(θT h(λ, 1)) + exp(θT h(λ, −1))
exp(θT h1 (λ)) − exp(−θT h1 (λ))
= ∇θ
exp(θT h1 (λ)) + exp(θT h1 (λ))
= ∇θ tanh(θT h1 (λ))

= h1 (λ) 1 − tanh2 (θT h1 (λ))
= h1 (λ)Var(Λ,Y)∼πθ (Y|Λ = λ) .

It follows by the mean value theorem that for some ψ, a linear combination of θ and θ∗ ,
" T #
w f ( x̄)
|l(w) − lθ (w)| = E( x̄,ȳ)∼π∗ (θ − θ ) h1 (λ)Var(Λ,Y)∼πψ (Y|Λ = λ) .
∗ T
2

Since Θ is convex, clearly ψ ∈ Θ. From our assumption on the bound of the variance, we can conclude that
h i c
E( x̄,ȳ)∼π∗ Var(Λ,Y)∼πψ (Y|Λ = λ) ≤ .
M
By the Cauchy-Schwarz inequality,
1 h i
|l(w) − lθ (w)| ≤ E( x̄,ȳ)∼π∗ kwk k f ( x̄)k kθ − θ∗ k kh1 (λ)k Var(Λ,Y)∼πψ (Y|Λ = λ) .
2
√
Since (by assumption) k f (x)k ≤ 1 and kh1 (λ)k ≤ M,
√
kwk kθ − θ∗ k M h i
|l(w) − lθ (w)| ≤ E( x̄,ȳ)∼π∗ Var(Λ,Y)∼πψ (Y|Λ = λ)
2
√
kwk kθ − θ∗ k M c
≤ ·
2 M
c kwk kθ − θ k∗
= √ .
2 M
Now, for any w that could conceivably be a solution, it must be the case that
1
kwk ≤ ,
2ρ
since otherwise the regularization term would be too large Therefore, for any possible solution w,
c kθ − θ∗ k
|l(w) − lθ (w)| ≤ √ .
4ρ M
Now, we apply the assumption that we are able to solve the empirical problem, producing an estimate ŵ that satisfies

E lθ (ŵ) − lθ (w∗θ ) ≤ χ,

where w∗θ is the true solution to

w∗θ = arg min lθ (w).
w

20
Therefore,

E l(ŵ) − l(w∗ ) = E lθ (ŵ) − lθ (w∗θ ) + lθ (w∗θ ) − lθ (ŵ) + l(ŵ) − l(w∗ )

≤ χ + E lθ (w∗ ) − lθ (ŵ) + l(ŵ) − l(w∗ )

≤ χ + E |lθ (w∗ ) − l(w∗ )| + |lθ (ŵ) − l(ŵ)|

 c kθ − θ∗ k 
 
≤ χ + E  √ 
2ρ M
c
=χ+ √ E kθ − θ∗ k

2ρ M
c
q h i
≤χ+ √ E kθ − θ∗ k2 .
2ρ M
We can now bound this using the result of Lemma D.4, which results in
c √ 2
E l(ŵ) − l(w∗ ) ≤ χ +

√ M
2ρ M
c
=χ+ .
2ρ
This is the desired result.

E Proofs of Results for the Independent Model

To restate, in the independent model, the variables are, as before, Λ ∈ {−1, 0, 1}m and Y ∈ {−1, 1}. The sufficient
statistics are Λi Y and Λ2i . That is, for expanded parameter θ = (ψ, φ),
1
πθ (Λ, Y) = exp(ψT ΛY + φT Λ2 ).
Z
This can be combined with the simple assumption that P (Y) = 12 to complete a whole distribution. Using this, we can
prove the following simple result about the moments of the sufficient statistics.
Lemma E.1. The expected values and covariances of the sufficient statistics are, for all i , j,

E [Λi Y] = βi γi
h i
E Λ2i = βi
Var (Λi Y) = βi − β2i γi2

Var Λ2i = βi − β2i

Cov Λi Y, Λ j Y = 0

Cov Λ2i , Λ2j = 0

Cov Λi Y, Λ2j = 0.

We also prove the following basic lemma that relates ψi to γi .

Lemma E.2. It holds that
γi = tanh(ψi ).
We also make the following claim about feasible models.
Lemma E.3. For any feasible model, it will be the case that, for any other feasible parameter vector ψ̂,

m(γβ)min γmin
2
 
m
P ψ̂ ΛY ≤ γmin (γβ)min ≤ exp −
T
 .
 
2 9.34 artanh(γmax )

21
We can also prove the following simple result about the conditional covariances
Lemma E.4. The covariances of the sufficient statistics, conditioned on Λ, are for all i , j,

Cov Λi Y, Λ j Y Λ = Λi Λ j sech2 (ψT Λ)

Cov Λ2i , Λ2j Λ = 0.

We can combine these two results to bound the expected variance of these conditional statistics.
Lemma E.5. If θ and θ∗ are two feasible models, then for any u,

mβ2min γmin
3
 
 .
 
Eθ∗ [Varθ (Y|Λ)] ≤ 3 exp −
8 artanh(γmax )
We can now proceed to restate and prove the main corollary of Theorem A.1 that applies in the independent case.
Corollary B.1. Suppose that we run Algorithm 1 on an independent data programming specification that satisfies
conditions (13), (14), (15), and (16). Furthermore, assume that the number of labeling functions we use satisfies
!
9.34 artanh(γmax ) 24m
m≥ log .
(γβ)min γmin
2 βmin

Suppose further that, for some parameter > 0, we use step size

βmin 2
η=
16
and our dataset is of a size that satisfies

2 kθ0 − θ∗ k2
!
32
|S | = 2 2 log .
βmin

Then, we can bound the expected parameter error with

2
E θ̂ − θ∗ ≤ 2 M

and the expected risk with βmin

E l(ŵ) − min l(w) ≤ χ + .
w 8ρ
Proof. In order to apply Theorem A.1, we have to verify all its conditions hold in the independent case.
First, we notice that (11) is used only to bound the covariance of the sufficient statistics. From Lemma E.1, we
know that these can be bounded by βi − β2i γi2 ≥ βmin
2 . It follows that we can choose

βmin
c= ,
4
and we can consider (11) satisfied, for the purposes of applying the theorem.
Second, to verify (12), we can use Lemma E.5. For this to work, we need

m(γβ)min γmin
2
βmin
 
c
= .
 
3 exp −  ≤
9.34 artanh(γmax ) M 8m
This happens whenever the number of labeling functions satisfies
!
9.34 artanh(γmax ) 24m
m≥ log .
(γβ)min γmin
2 βmin

The remaining assumptions, (13), (14), (15), and (16), are satisfied directly by the assumptions of this corollary. So,
we can apply Theorem A.1, which produces the desired result.

22
F Proofs of Independent Model Lemmas
Lemma E.1. The expected values and covariances of the sufficient statistics are, for all i , j,

E [Λi Y] = βi γi
h i
E Λ2i = βi
Var (Λi Y) = βi − β2i γi2

Var Λ2i = βi − β2i

Cov Λi Y, Λ j Y = 0

Cov Λ2i , Λ2j = 0

Cov Λi Y, Λ2j = 0.

Proof. We prove each of the statements in turn. For the first statement,

E [Λi Y] = P (Λi = Y) − P (Λi = −Y)

1 + γi 1 − γi
= βi − βi
2 2
= βi γi .

For the second statement,

h i
E Λ2i = P (Λ = Y) + P (Λ = −Y)
1 + γi 1 − γi
= βi + βi
2 2
= βi .

For the remaining statements, we derive the second moments; converting these to an expression of the covariance is
trivial. For the third statement, h i h i h i
E (Λi Y)2 = E Λ2i Y 2 = E Λ2i = βi .
For the fourth statement, h i h i h i
E (Λ2i )2 = E Λ4i = E Λ2i = βi .
For subsequent statements, we first derive that
1 + γi 1 − γi
E [Λi Y|Y] = βi − βi = βi γi
2 2
and
h i 1 + γi 1 − γi
E Λ2i Y = βi + βi = βi .
2 2
Now, for the fifth statement,
h i h h ii
E (Λi Y)(Λ j Y) = E E [Λi Y|Y] E Λ j Y Y = βi γi β j γ j .

For the sixth statement, h i h h i h ii

E (Λ2i )(Λ2j ) = E E Λ2i Y E Λ2i Y = βi β j .
Finally, for the seventh statement,
h i h h ii
E (Λi Y)(Λ2j ) = E E [Λi Y|Y] E Λ2i Y = βi γi β j .

This completes the proof.

Lemma E.2. It holds that
γi = tanh(ψi ).

23
Proof. From the definitions,
exp(ψi + φi ) + exp(−ψi + φi )
βi =
exp(ψi + φi ) + exp(−ψi + φi ) + 1
and
exp(ψi + φi ) − exp(−ψi + φi )
βi γi = .
exp(ψi + φi ) + exp(−ψi + φi ) + 1
Therefore,
exp(ψi + φi ) − exp(−ψi + φi )
γi = = tanh(ψi ),
exp(ψi + φi ) + exp(−ψi + φi )
which is the desired result.

Lemma E.3. For any feasible model, it will be the case that, for any other feasible parameter vector ψ̂,

m(γβ)min γmin
2
 
m
P ψ̂ ΛY ≤ γmin (γβ)min ≤ exp −
T
 .
 
2 9.34 artanh(γmax )

Proof. We start by noticing that

m
X
ψ̂T ΛY = ψ̂i Λi Y.
i=1

Since in this model, all the Λi Y are independent of each other, we can bound this sum using a concentration bound.
First, we note that
ψ̂i Λi Y ≤ ψ̂i .
Second, we note that h i
E ψ̂i Λi Y = ψ̂i βi γi
and
Var ψ̂i Λi Y = ψ̂2i βi − β2i γi2
but
ψ̂i Λi Y ≤ ψ̂i ≤ artanh(γmax ) , ψ̂max
because, for feasible models, by definition

γmin ≤ artanh(γmin ) ≤ ψ̂i ≤ artanh(γmax ).

Therefore, applying Bernstein’s inequality gives us, for any t,

 m m
  
X X 3t2
P  ψ̂i Λi Y − ψ̂i βi γi ≤ −t ≤ exp − Pm 2  .
  
 
ψ̂ γ β γ + ψ̂

i=1 i=1
6 i=1 i i i i 2 max t

It follows that, if we let

m
1X
t= ψ̂i βi γi ,
2 i=1

24
then we get
 P 2 
3 21 m ψ̂i βi γi
 m m
  
X i=1
X
P  ψ̂i Λi Y − ψ̂i βi γi ≤ −t ≤ exp − P
  
  
 
 6 m ψ̂2 γi βi γi + 2ψ̂max 1 Pm ψ̂i βi γi 
i=1 i=1 i=1 i 2 i=1

3 m ψ̂ β γ
P !
i=1 i i i
≤ exp −
24γmax ψ̂max + 4ψ̂max
3m(1 − γmax )
!
≤ exp −
28ψ̂max
 P 2 
 3 m i=1 ψ̂i βi γi 
≤ exp − P P 
 24 m ψ̂2 βi + 4ψ̂max m ψ̂i βi γi 
i=1 i i=1
P P 
3γmin i=1 ψ̂i βi
m m
ψ̂i βi γi

i=1
 
≤ exp − P 
24ψ̂max i=1 ψ̂i βi + 4ψ̂max m i=1 ψ̂i βi
Pm
P 
 3γmin i=1 ψ̂i βi γi 
m

≤ exp − 
28ψ̂max
 2 
 mγmin (γβ)min 
≤ exp −  .
9.34ψ̂max


This is the desired expression.

Lemma E.4. The covariances of the sufficient statistics, conditioned on Λ, are for all i , j,

Cov Λi Y, Λ j Y Λ = Λi Λ j sech2 (ψT Λ)

Cov Λ2i , Λ2j Λ = 0.

Proof. The second result is obvious, so it suffices to prove only the first result. Clearly,

Cov Λi Y, Λ j Y Λ = Λi Λ j Var (Y|Λ) = Λi Λ j 1 − E [Y|Λ]2 .

Plugging into the distribution formula lets us conclude that

exp(ψT Λ + φT Λ2 ) − exp(−ψT Λ + φT Λ2 )
E [Y|Λ] = = tanh2 (ψT Λ),
exp(ψT Λ + φT Λ2 ) + exp(−ψT Λ + φT Λ2 )
and so
Cov Λi Y, Λ j Y Λ = Λi Λ j 1 − tanh2 (ψT Λ) = Λi Λ j sech2 (ψT Λ),
which is the desired result.

Lemma E.5. If θ and θ∗ are two feasible models, then for any u,

mβ2min γmin
3
 
 .
 
Eθ∗ [Varθ (Y|Λ)] ≤ 3 exp −
8 artanh(γmax )

Proof. First, we note that, by the result of Lemma E.4,

Varθ (Y|Λ) = sech2 (ψT Λ).

Therefore, h i
Eθ∗ [Varθ (Y|Λ)] = Eθ∗ sech2 (ψT Λ) .

25
Applying Lemma E.3, we can bound this with

m(γβ)min γmin
2
 m  
h i 
Eθ∗ Varθ uT ΛY Λ ≤ sech2 (γβ)min γmin
2
+ exp −
 

2 9.34 artanh(γmax )
m(γβ)min γmin
2
 m  
≤ 2 exp − (γβ)min γmin + exp −
2
  

2 9.34 artanh(γmax )
m(γβ)min γmin
2
 
 .
 
≤ 3 exp −
9.34 artanh(γmax )

This is the desired expression.

G Additional Experimental Details

G.1 Relation Extraction Experiments
G.1.1 Systems
The original distantly-supervised experiments which we compare against as baselines–which we refer to as using
the if-then-return (ITR) approach of distant or programmatic supervision–were implemented using DeepDive, an
open-source system for building extraction systems.5 For our primary experiments, we adapted these programs to the
framework and approach described in this paper, directly utilizing distant supervision rules as labeling functions.
In the disease tagging user experiments, we used an early version of our new lightweight extraction framework based
around data programming, formerly called DDLite [12], now Snorkel.6 Snorkel is based around a Jupyter-notebook
based interface, allowing users to iteratively develop labeling functions in Python for basic extraction tasks involving
simple models. Details of the basic discriminative models used can be found in the Snorkel repository; in particular,
Snorkel uses a simple logistic regression model with generic features defined in part over dependency paths7 , and a
basic LSTM model implemented using the Theano library.8 Snorkel is currently under continued development, and all
versions are open-source.

G.1.2 Applications
We consider three primary applications which involve the extraction of binary relation mentions of some specific type
from unstructured text input data. At a high level, all three system pipelines consist of an initial candidate extraction
phase which leverages some upstream model or suite of models to extract mentions of involved entities, and then
considers each pair of such mentions that occurs within the same local neighborhood in a document as a candidate
relation mention to be potentially extracted. In each case, the discriminative model that we are aiming to train–and that
we evaluate in this paper–is a binary classifier over these candidate relation mentions, which will decide which ones to
output as final true extractions. In all tasks, we preprocessed raw input text with Stanford CoreNLP9 , and then either
used CoreNLP’s NER module or our own entity-extraction models to extract entity mentions. Further details of the
basic information extraction pipeline utilized can be seen in the tutorials of the systems used, and in the referenced
papers below.
In the 2014 TAC-KBP Slot Filling task, which we also refer to as the News application, we train a set of extraction
models for a variety of relation types from news articles [30]. In reported results in this paper, we average over scores
from each relation type. We utilized CoreNLP’s NER module for candidate extraction, and utilized CoreNLP outputs
in developing the distant supervision rules / labeling functions for these tasks. We also considered a slightly simpler
discriminative model than the one submitted in the 2014 competition, as reported in [2]: namely, we did not include any
joint factors in our model in this paper.
In the Genomics application, our goal with our collaborators at Stanford Medicine was to extract mentions of genes
that if mutated may cause certain phenotypes (symptoms) linked to Mendelian diseases, for use in a clinical diagnostic
5 http://deepdive.stanford.edu
6 http://snorkel.stanford.edu
7 https://github.com/HazyResearch/treedlib
8 http://deeplearning.net/software/theano/
9 stanfordnlp.github.io/CoreNLP/

26
��

��
��
��

��
��

��
��
��
��
��
��
��
��
��

��

� ��

� � ��

(a) m = 20 (b) m = 100 (c) Adding dependencies.

Figure 3: Comparisons of data programming to two oracle methods on synthetic data.

setting. The code for this project is online, although it remains partially under development and thus some material
from our collaborators is private.10
In the Pharmacogenomics application, our goal was to extract interactions between genes for use in downstream
pharmacogenomics research analyses; full results and system details are reported in [21].
In the Disease Tagging application, which we had our collaborators work on during a set of short hackathons as a
user study, the goal was to tag mentions of human diseases in PubMed abstracts. We report results of this hackathon
in [12], as well as in our Snorkel tutorial online.

G.1.3 Labeling Functions

In general, we saw two broad types of labeling functions in both prior applications (when they were referred to as
“distant supervision rules”) and in our most recent user studies. The first type of labeling function leverages some weak
supervision signal, such as an external knowledgebase (as in traditional distant supervision), very similar to the example
illustrated in Fig. 1(a). All of the applications studied in this paper used some such labeling function or set of labeling
functions.
The second type of labeling function uses simple heuristic patterns as positive or negative signals. For our text
extraction examples, these heuristic patterns primarily consisted of regular expressions, also similar to the example
pseudocode in Fig. 1(a). Further specific details of both types of labeling functions, as well as others used, can be seen
in the linked code repositories and referenced papers.

G.2 Synthetic Experiments

In Fig. 3(a-b), we ran synthetic experiments with labeling functions having constant coverage β = 0.1, and accuracy
drawn from α ∼ Uniform(µα − 0.25, µα + 0.25) where µα = 0.75 in the above plots. In both cases we used 1000
normally-drawn features having mean correlation with the true label class of 0.5.
In this case we compare data programming (DP-Pipelined) against two baselines. First, we compare against an
if-then-return setup where the ordering is optimal (ITR-Oracle). Second, we compare against simple majority vote
(MV).
In Fig. 3(c), we show an experiment where we add dependent labeling functions to a set of mind = 50 independent
labeling functions, and either provided this dependency structure (LDM-Aware) or did not (Independent). In this
case, the independent labeling functions had the same configurations as in (a-b), and the dependent labeling functions
corresponded to “fixes” or “reinforces”-type dependent labeling functions.

10 https://github.com/HazyResearch/dd-genomics

Machine Learning-2
No ratings yet
Machine Learning-2
16 pages
CSE445 NSU Week - 1
No ratings yet
CSE445 NSU Week - 1
28 pages
Course Two
No ratings yet
Course Two
133 pages
Usc 08
No ratings yet
Usc 08
46 pages
ML Merge
No ratings yet
ML Merge
145 pages
Supervised Setting
No ratings yet
Supervised Setting
80 pages
AI Unit 4
No ratings yet
AI Unit 4
91 pages
Snorkel: Rapid Training Data Creation With Weak Supervision
No ratings yet
Snorkel: Rapid Training Data Creation With Weak Supervision
17 pages
MACHINE LEARNING LAB Manual
No ratings yet
MACHINE LEARNING LAB Manual
48 pages
Complete ML Notes
No ratings yet
Complete ML Notes
62 pages
Early Stopping, Dropout, Augmentation, Optimizers New
No ratings yet
Early Stopping, Dropout, Augmentation, Optimizers New
91 pages
NIPS 2017 Decoupling When To Update From How To Update Paper
No ratings yet
NIPS 2017 Decoupling When To Update From How To Update Paper
11 pages
Deep Learning Module 3-1
No ratings yet
Deep Learning Module 3-1
31 pages
Can Large Language Models Design Accurate Label Functions?: Naiqing Guan Kaiwen Chen Nick Koudas
No ratings yet
Can Large Language Models Design Accurate Label Functions?: Naiqing Guan Kaiwen Chen Nick Koudas
9 pages
ML Lab Manual CSE
No ratings yet
ML Lab Manual CSE
50 pages
Lect3 Machine Learning
No ratings yet
Lect3 Machine Learning
27 pages
Lesson 1 - History, Definitions and Basic Concepts
No ratings yet
Lesson 1 - History, Definitions and Basic Concepts
6 pages
A1579305753 - 23783 - 8 - 2019 - Machine Learning
No ratings yet
A1579305753 - 23783 - 8 - 2019 - Machine Learning
18 pages
Data Programming Using Continuous and Quality-Guided Labeling Functions
No ratings yet
Data Programming Using Continuous and Quality-Guided Labeling Functions
8 pages
Unit 3
No ratings yet
Unit 3
47 pages
Deep Learning
No ratings yet
Deep Learning
78 pages
Unit 2
No ratings yet
Unit 2
18 pages
School of Computing and Information Systems The University of Melbourne COMP90049 Introduction To Machine Learning (Semester 1, 2022)
No ratings yet
School of Computing and Information Systems The University of Melbourne COMP90049 Introduction To Machine Learning (Semester 1, 2022)
4 pages
Unit 1
No ratings yet
Unit 1
12 pages
IT 802 ML Unit-2 Notes
No ratings yet
IT 802 ML Unit-2 Notes
19 pages
Introduction To Deep Learning AI 2025
No ratings yet
Introduction To Deep Learning AI 2025
78 pages
3 Datasets
No ratings yet
3 Datasets
5 pages
HW 1 Eeowh 3
No ratings yet
HW 1 Eeowh 3
6 pages
Machine Learning1
No ratings yet
Machine Learning1
8 pages
ML Assignment
No ratings yet
ML Assignment
7 pages
Machine Learning - v1
No ratings yet
Machine Learning - v1
30 pages
Lecture 04 - Optimization - 4p
No ratings yet
Lecture 04 - Optimization - 4p
11 pages
Data Management For Production Quality Deep Learn Models
No ratings yet
Data Management For Production Quality Deep Learn Models
9 pages
Machine Learning
No ratings yet
Machine Learning
6 pages
December Deep Learning
No ratings yet
December Deep Learning
10 pages
Unit 1 - Machine Learning
No ratings yet
Unit 1 - Machine Learning
21 pages
I. The Types of Machine Learning
No ratings yet
I. The Types of Machine Learning
8 pages
hw1 PDF
No ratings yet
hw1 PDF
6 pages
2015 James Ruse Agricultural HS Ext 2
No ratings yet
2015 James Ruse Agricultural HS Ext 2
26 pages
20RMI17 - PG - Notes - Tejaswini B J
No ratings yet
20RMI17 - PG - Notes - Tejaswini B J
50 pages
Statistics in Music Education Research (Joshua A. Russell) (Z-Library)
No ratings yet
Statistics in Music Education Research (Joshua A. Russell) (Z-Library)
353 pages
EGIG - 10th - Report - Final - 09-03-2018 - v1
No ratings yet
EGIG - 10th - Report - Final - 09-03-2018 - v1
50 pages
MIT 302 - Statistical Computing II - Tutorial 03
No ratings yet
MIT 302 - Statistical Computing II - Tutorial 03
16 pages
Power System Operation Control
No ratings yet
Power System Operation Control
4 pages
Angry Birds Mathematics - Parabolas Vectors PDF
No ratings yet
Angry Birds Mathematics - Parabolas Vectors PDF
7 pages
Unit-1 Data Representation
No ratings yet
Unit-1 Data Representation
57 pages
Configuration of Fibers in Staple Yarn
No ratings yet
Configuration of Fibers in Staple Yarn
8 pages
Matlab Homework Solutions
100% (1)
Matlab Homework Solutions
10 pages
CUHK MATH1510 Chapter - 14 Notes
No ratings yet
CUHK MATH1510 Chapter - 14 Notes
21 pages
G8 - Unit 4 - CP - Where's The Proof - Students' PDF
No ratings yet
G8 - Unit 4 - CP - Where's The Proof - Students' PDF
272 pages
Indian Poultry Scenario
No ratings yet
Indian Poultry Scenario
54 pages
Kalo Solutions
No ratings yet
Kalo Solutions
36 pages
Paper 2 2025 TTC
No ratings yet
Paper 2 2025 TTC
32 pages
Rotational Motion Project
No ratings yet
Rotational Motion Project
18 pages
Borello Electrohydraulic Servovalves
No ratings yet
Borello Electrohydraulic Servovalves
12 pages
Circle Theorems pdf2
No ratings yet
Circle Theorems pdf2
11 pages
Gate-Diffusion Input (GDI) A Power Efficient Method For Digital Combinatorial Circuits
No ratings yet
Gate-Diffusion Input (GDI) A Power Efficient Method For Digital Combinatorial Circuits
16 pages
04 Seismic Fragility Methodology Workshop - Read-Only1
No ratings yet
04 Seismic Fragility Methodology Workshop - Read-Only1
35 pages
03 - Power Optimization
No ratings yet
03 - Power Optimization
20 pages
Maths Ch-1 Real Numbers Test 01
No ratings yet
Maths Ch-1 Real Numbers Test 01
2 pages
Consumer Theory
No ratings yet
Consumer Theory
19 pages
Fundamental Assignments
No ratings yet
Fundamental Assignments
21 pages
Solomon S2 F QP
No ratings yet
Solomon S2 F QP
4 pages
Adaptive Backstepping
No ratings yet
Adaptive Backstepping
10 pages
Thermodynamics Lecture 1
No ratings yet
Thermodynamics Lecture 1
15 pages
Paper Published2
No ratings yet
Paper Published2
8 pages
PH 114 Lab 4 - Kenneth
No ratings yet
PH 114 Lab 4 - Kenneth
13 pages
Experiment 4
No ratings yet
Experiment 4
3 pages
Falcon LLM: Architecture and Application: The Complete Guide for Developers and Engineers
From Everand
Falcon LLM: Architecture and Application: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Active Machine Learning with Python: Refine and elevate data quality over quantity with active learning
From Everand
Active Machine Learning with Python: Refine and elevate data quality over quantity with active learning
Margaux Masson-Forsythe
No ratings yet
OneFlow for Parallel and Distributed Deep Learning Systems: The Complete Guide for Developers and Engineers
From Everand
OneFlow for Parallel and Distributed Deep Learning Systems: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Efficient Experiment Tracking with Aim: The Complete Guide for Developers and Engineers
From Everand
Efficient Experiment Tracking with Aim: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
MACHINE LEARNING FOR BEGINNERS: A Practical Guide to Understanding and Applying Machine Learning Concepts (2023 Beginner Crash Course)
From Everand
MACHINE LEARNING FOR BEGINNERS: A Practical Guide to Understanding and Applying Machine Learning Concepts (2023 Beginner Crash Course)
Elaine Tate
No ratings yet
Alpaca Fine-Tuning with LLaMA: The Complete Guide for Developers and Engineers
From Everand
Alpaca Fine-Tuning with LLaMA: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
KenLM: Efficient Language Modeling in Practice
From Everand
KenLM: Efficient Language Modeling in Practice
William Smith
No ratings yet
Cohere Rerank in Practice: The Complete Guide for Developers and Engineers
From Everand
Cohere Rerank in Practice: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
From Everand
DATA MINING AND MACHINE LEARNING. PREDICTIVE TECHNIQUES: REGRESSION, GENERALIZED LINEAR MODELS, SUPPORT VECTOR MACHINE AND NEURAL NETWORKS
César Pérez López
No ratings yet
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
From Everand
DATA MINING and MACHINE LEARNING. PREDICTIVE TECHNIQUES: ENSEMBLE METHODS, BOOSTING, BAGGING, RANDOM FOREST, DECISION TREES and REGRESSION TREES.: Examples with MATLAB
César Pérez López
No ratings yet
VICUNA with LLaMA: Techniques and Applications: The Complete Guide for Developers and Engineers
From Everand
VICUNA with LLaMA: Techniques and Applications: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Practical MXNet Applications: Definitive Reference for Developers and Engineers
From Everand
Practical MXNet Applications: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Mastering Machine Learning: A Comprehensive Guide to Success
From Everand
Mastering Machine Learning: A Comprehensive Guide to Success
Rick Spair
No ratings yet
Bootstrapping Language-Image Pretraining: The Complete Guide for Developers and Engineers
From Everand
Bootstrapping Language-Image Pretraining: The Complete Guide for Developers and Engineers
William Smith
No ratings yet
Deep Learning with Fast.ai: Definitive Reference for Developers and Engineers
From Everand
Deep Learning with Fast.ai: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Effective Cucumber Automation: Definitive Reference for Developers and Engineers
From Everand
Effective Cucumber Automation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
XGBoost in Practice: Definitive Reference for Developers and Engineers
From Everand
XGBoost in Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
LightGBM in Practice: Definitive Reference for Developers and Engineers
From Everand
LightGBM in Practice: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Applied Techniques for GPT-3: Definitive Reference for Developers and Engineers
From Everand
Applied Techniques for GPT-3: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Self-Supervised Learning: Teaching AI with Unlabeled Data
From Everand
Self-Supervised Learning: Teaching AI with Unlabeled Data
Robert Johnson
No ratings yet
Few-Shot Machine Learning: Doing More with Less Data
From Everand
Few-Shot Machine Learning: Doing More with Less Data
Robert Johnson
No ratings yet
Constrained Conditional Model: Fundamentals and Applications
From Everand
Constrained Conditional Model: Fundamentals and Applications
Fouad Sabry
No ratings yet

Data Programming: Creating Large Training Sets, Quickly

Uploaded by

Data Programming: Creating Large Training Sets, Quickly

Uploaded by

Data Programming:

Creating Large Training Sets, Quickly

Thursday 26th May, 2016

3 The Data Programming Paradigm

S i m i l a r ( lambda_1 , lambda_2 ) F i x e s ( lambda_1 , lambda_2 ) E x c l u d e s ( lambda_1 , lambda_2 )

Figure 2: Examples of labeling function dependency predicates.

µθ (Λ, Y) = Zθ−1 exp(θT h(Λ, Y)),

Automatically-generated Features We additionally compare both hand-tuned and automatically-generated features,

6 Conclusion and Future Work

(x, y) ∼ π∗ ⇒ y ⊥ f (x) | λ(x). (13)

and the expected risk with   c

B Theoretical Results for Independent Model

In comparison to the parameters used in the body of the paper, we have

Then, we can bound the expected parameter error with

and the expected risk with   βmin 

C Proof of Theorem A.1

Lemma D.1. Given a family of maximum-entropy distributions

then its Hessian can be bounded with

where I(θ) is the Fisher information.

and the expected risk with   c

If we apply the fact that

Therefore, for all θ ∈ Θ, h i

then we can bound the expected parameter estimation error with

Since we can write our stochastic samples in the form

∇ J˜t (θt ) = h(λt , yt ) − h(λ̄t , ȳt ),

for some samples λt , yt , λ̄t , and ȳt , we can conclude that

Therefore, taking the expected value conditioned on the filtration,

Since J is strongly concave,

By the law of total expectation,

and by our conditional independence assumption,

P(x,y)∼π∗ (λ(x) = Λ, y = Y) = P(λ,y)∼πθ∗ (λ = Λ, y = Y) ,

we can rewrite this as

We can reduce this further by noticing that

Now, we can compute that

exp(θT h(λ, 1)) − exp(θT h(λ, −1))

where w∗θ is the true solution to

E l(ŵ) − l(w∗ ) = E lθ (ŵ) − lθ (w∗θ ) + lθ (w∗θ ) − lθ (ŵ) + l(ŵ) − l(w∗ )

≤ χ + E lθ (w∗ ) − lθ (ŵ) + l(ŵ) − l(w∗ )

≤ χ + E |lθ (w∗ ) − l(w∗ )| + |lθ (ŵ) − l(ŵ)|

E Proofs of Results for the Independent Model

We also prove the following basic lemma that relates ψi to γi .

Then, we can bound the expected parameter error with

and the expected risk with   βmin 

E [Λi Y] = P (Λi = Y) − P (Λi = −Y)

For the second statement,

For the sixth statement, h i h h i h ii

This completes the proof.

Proof. We start by noticing that

γmin ≤ artanh(γmin ) ≤ ψ̂i ≤ artanh(γmax ).

Therefore, applying Bernstein’s inequality gives us, for any t,

It follows that, if we let

This is the desired expression.

Plugging into the distribution formula lets us conclude that

Proof. First, we note that, by the result of Lemma E.4,

Varθ (Y|Λ) = sech2 (ψT Λ).

This is the desired expression.

G Additional Experimental Details

��� ���� ����

(a) m = 20 (b) m = 100 (c) Adding dependencies.

Figure 3: Comparisons of data programming to two oracle methods on synthetic data.

G.1.3 Labeling Functions

G.2 Synthetic Experiments

You might also like

and the expected risk with c

and the expected risk with βmin

and the expected risk with c

and the expected risk with βmin

��