
Learning from Multiple Noisy Partial Labelers

Peilin Yu (Brown University)    Tiffany Ding (UC Berkeley)    Stephen H. Bach (Brown University)

Abstract

Programmatic weak supervision creates models without hand-labeled training data by combining the outputs of heuristic labelers. Existing frameworks make the restrictive assumption that labelers output a single class label. Enabling users to create partial labelers that output subsets of possible class labels would greatly expand the expressivity of programmatic weak supervision. We introduce this capability by defining a probabilistic generative model that can estimate the underlying accuracies of multiple noisy partial labelers without ground truth labels. We show how to scale up learning, for example learning on 100k examples in one minute, a 300× speed up compared to a naive implementation. We also prove that this class of models is generically identifiable up to label swapping under mild conditions. We evaluate our framework on three text classification and six object classification tasks. On text tasks, adding partial labels increases average accuracy by 8.6 percentage points. On image tasks, we show that partial labels allow us to approach some zero-shot object classification problems with programmatic weak supervision by using class attributes as partial labelers. On these tasks, our framework has accuracy comparable to recent embedding-based zero-shot learning methods, while using only pre-trained attribute detectors.

1 INTRODUCTION

The need for large-scale labeled datasets has driven recent research on methods for programmatic weak supervision (PWS), such as data programming (Ratner et al., 2016, 2020), adversarial label learning (Arachie and Huang, 2021), learning rules from labeled exemplars (Awasthi et al., 2020), and weak supervision with self-training (Karamanolakis et al., 2021). In PWS, labeling functions, such as user-written rules and other heuristics, provide votes on the true labels for unlabeled examples or abstain from voting. Then, a label model, such as a probabilistic generative model or minimax game, is often used to estimate the true labels in a way that accounts for unknown differences in accuracies and other properties of the labeling functions. Finally, these estimated labels are used to train an end model on the unlabeled data to generalize beyond the information contained in the labeling functions. This approach has had recent success with applications in natural language processing (Mallinar et al., 2019; Safranchik et al., 2020), computer vision (Chen et al., 2019), medicine (Fries et al., 2019; Saab et al., 2020; Fries et al., 2021), and the Web (Bach et al., 2019). However, all of these methods assume that labeling functions cast votes for individual classes. In this work, we propose to generalize PWS to support labeling functions that cast votes for a subset of classes, called partial labels. We refer to such labeling functions as partial labeling functions (PLFs). Our goal is to aggregate information from multiple partial labeling functions that are noisy (i.e., have imperfect accuracy) in order to estimate labels for unlabeled data.

Incorporating partial labels into PWS would enable users to take advantage of a wider range of domain knowledge. In typical PWS frameworks, only heuristics that are specific to one class can be incorporated. As a result, creating labeling functions requires careful task-specific engineering to avoid features that are shared by more than one class. For example, consider the task of classifying images of animals with a label from the set {HORSE, TIGER, LION, ZEBRA}. There are many useful heuristics that can be learned from other labeled data sets, such as detectors for claws or stripes (Lampert et al., 2009). Such heuristics divide the label space with multiple partitions. A claw detector could produce two partial labels: {TIGER, LION} if a claw is detected and {HORSE, ZEBRA} if not. Likewise, a stripe detector could output {TIGER, ZEBRA} if stripes are detected and {HORSE, LION} if not (Figure 1). However, these heuristics cannot be used as labeling functions in current PWS frameworks. More generally, we observe a need for partial labeling functions in many multiclass applications where users want to express heuristics that narrow down the set of possible class labels but are not specific to a single class.

[Footnote: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022, Valencia, Spain. PMLR: Volume 151. Copyright 2022 by the author(s).]

[Figure 1 graphic: left, token matchers for "president", "won", and "points" that each vote for two of the classes {POLITICS, BUSINESS, SPORTS}; right, "stripes" and "claws" detectors that each split {TIGER, LION, ZEBRA, HORSE} into two halves.]

Figure 1: Examples of the expressivity of partial labeling functions. On the left, three functions each vote for two of three classes if they observe a particular token in a news article and abstain otherwise. On the right, two functions each vote for two of four classes if they detect a particular attribute in an image. Otherwise, they vote for the other two.

Learning from multiple noisy PLFs is challenging because we must resolve ambiguity arising from three sources: (1) PLF imprecision, i.e., voting for a set of classes instead of a single class, (2) PLF inaccuracy, i.e., voting for a set of classes that does not contain the true class, and (3) conflict among multiple PLFs. A further requirement is that PWS frameworks should support labeling functions that abstain, meaning they can choose not to label certain examples. This is particularly critical for hand-engineered rules that might be highly specialized. A framework for learning from multiple noisy PLFs should therefore be able to resolve all these types of ambiguities in a principled way while also maintaining the expressive capabilities of existing PWS frameworks.

This problem setting is quite general and is related to multiple lines of work in machine learning, although each of them only addresses part of the problem considered here. As mentioned above, previous PWS frameworks generally require labeling functions to provide a single class label (Ratner et al., 2016, 2020; Arachie and Huang, 2021; Awasthi et al., 2020; Karamanolakis et al., 2021; Safranchik et al., 2020). One exception is Snorkel MeTaL (Ratner et al., 2019), which is capable of handling labeling functions with a multi-task tree structure where higher-level labeling functions are grouped into super classes that encompass fine-grained classes. This requirement of a tree structure makes modeling partial labels that divide the label space into overlapping subsets practically infeasible. There is also a wide body of work on learning from partial labels, also called superset learning (Jin and Ghahramani, 2002; Nguyen and Caruana, 2008; Luo and Orabona, 2010; Cour et al., 2011; Liu and Dietterich, 2012, 2014; Hüllermeier and Cheng, 2015; Cabannnes et al., 2020; Wang et al., 2020; Cabannes et al., 2021; Zhang et al., 2021). In these settings, there is generally one partial label per example. The ambiguity in the imprecise labels in such settings can be resolved as a maximum likelihood or risk minimization problem. Many methods additionally learn the likely confusions between partial and true labels (Durand et al., 2019; Li et al., 2020; Yan and Guo, 2020; Xie and Huang, 2021). However, such methods do not handle the case of multiple partial labelers that can disagree and abstain. Finally, some work on zero-shot learning (ZSL) creates attribute detectors that can be viewed as partial labelers (Farhadi et al., 2009; Lampert et al., 2009; Palatucci et al., 2009; Jayaraman and Grauman, 2014). PWS with partial labels can also be viewed as a generalization of the transductive ZSL setting (Xian et al., 2018; Wang et al., 2019), in which labelers are allowed to abstain and a class may be associated with multiple attribute values. Across all these areas, there remains a need for learning from multiple noisy partial labelers.

To address this issue, we propose a generalized PWS framework that supports partial labels and handles the additional ambiguity caused by the imprecise outputs of the PLFs. First, we introduce a probabilistic generative model that estimates the agreement between the outputs of each partial labeling function and the latent, true label. Second, we show how to learn the model parameters efficiently for large datasets. For example, we can learn on 100k examples with 10 PLFs in one minute. Since PWS is inherently human-in-the-loop, fast iteration is crucial. Third, we prove that this model's parameters are generically identifiable up to label swapping under mild conditions on the PLFs. This result means that we can estimate the accuracy of each partial labeling function without access to ground truth labels in a principled way. Using the learned parameters, we can compute the posterior distribution over true labels for each example. These probabilistic training labels can then be used to train an end model in the same manner as other PWS frameworks.

We demonstrate this framework with experiments on three text and six object classification tasks. On the text classification tasks, we show that the additional flexibility provided by partial labelers enables heuristics that significantly improve over single-class labelers alone. We find an average 8.6 percentage point improvement in accuracy. On the object classification tasks, we find that modeling the accuracies of the PLFs
explicitly enables us to achieve accuracy comparable to recent embedding-based ZSL methods using only pre-trained attribute detectors. These results provide a foundation for constructing and learning more modular, reusable knowledge sources for weak supervision.

2 RELATED WORK

In the past few years, programmatic weak supervision (PWS) has emerged as a systematic approach to efficiently create labeled training data (Ratner et al., 2016, 2020; Arachie and Huang, 2021; Awasthi et al., 2020; Karamanolakis et al., 2021). A typical PWS framework consists of three stages. First, domain experts engineer weak supervision sources in the form of labeling functions, such as rules or classifiers related to the target task. Second, a label model, such as a probabilistic generative model, is used to estimate the latent true labels using the labeling function outputs. Third, the estimated labels are used to train an end model that generalizes beyond the information in the supervision sources. The core of a typical PWS framework is the label modeling stage. The choice of label model determines what types of supervision sources are supported. Many frameworks are based on crowdsourcing methods (Dawid and Skene, 1979; Nitzan and Paroush, 1982; Gao and Zhou, 2013), where providing a single label is a natural assumption. In the original data programming framework (Ratner et al., 2016), labeling functions can output a single label or abstain. The label model is generative, meaning that each true label is a latent variable and the observed votes of the labeling functions are conditioned on the true labels. The parameters of the label model are learned by maximizing the marginal likelihood of the observed votes. Statistical dependencies such as correlations among the votes can be modeled, and methods exist to learn specific types of dependencies from unlabeled data (Bach et al., 2017; Varma et al., 2019).

The Snorkel MeTaL (Ratner et al., 2019) framework extends data programming to learn across multiple, related tasks organized in a tree structure. For example, in a fine-grained named entity recognition task, one might use a set of labeling functions that vote on coarse-grained entity types and separate sets of labeling functions to further vote on the subtypes within each coarse type. The outputs of labeling functions at higher levels of the tree can be thought of as a restricted form of partial labels, in the sense that all labeling functions must follow the same tree-structured organization of the classes. In contrast, in our setting, each partial labeling function can organize the classes into its own, possibly overlapping groups.

Other PWS frameworks have approached labeling functions and the label modeling processes in different ways, but all so far assume that each labeling function votes for a single class. Adversarial label learning (Arachie and Huang, 2021), performance-guaranteed majority vote (Mazzetto et al., 2021b), adversarial multiclass learning (Mazzetto et al., 2021a), and related work in semi-supervised ensemble learning (Balsubramani and Freund, 2015, 2016b,a) solve minimax games based on assumed or estimated constraints on labeling function accuracies. Awasthi et al. (2020) proposed learning from rules and exemplars for those rules, learning to downweight the confidence in the rules on data instances not similar to the exemplars. Karamanolakis et al. (2021) proposed integrating PWS with semi-supervised self-training.

Other work on learning with partial labels has focused on the case where there is a single partial label per example. Classifiers can be learned via maximum likelihood estimation (Jin and Ghahramani, 2002; Liu and Dietterich, 2012) or empirical risk minimization (Nguyen and Caruana, 2008; Luo and Orabona, 2010; Cour et al., 2011; Liu and Dietterich, 2014; Hüllermeier and Cheng, 2015; Cabannnes et al., 2020; Cabannes et al., 2021; Feng et al., 2020b). Many methods additionally learn the likely confusions between partial and true labels (Durand et al., 2019; Li et al., 2020; Yan and Guo, 2020; Xie and Huang, 2021). Wang et al. (2020) proposed learning multiple partially labeled tasks simultaneously, in order to exploit structure among the tasks, but during training there is still only one partial label per prediction. Partial labels are also related to complementary labels (Ishida et al., 2017; Feng et al., 2020a), which are annotations that indicate which label the example does not have.

Our problem setting is also related to some forms of zero-shot learning (ZSL) (Xian et al., 2018; Wang et al., 2019). In zero-shot classification, a model learns to match semantic descriptions of classes to examples of those classes. Once learned, the model can be applied to novel classes. Many early approaches to ZSL created detectors for different attributes (Farhadi et al., 2009; Lampert et al., 2009; Palatucci et al., 2009; Jayaraman and Grauman, 2014). In the transductive setting (Xian et al., 2018; Wang et al., 2019), in which the target classes are known and unlabeled examples of them are available during model development, these detectors can be viewed as restricted partial labeling functions that always divide the label set into non-overlapping groups and never abstain. More recently, much work on ZSL has moved away from relying entirely on attribute detectors, and recent work can be grouped into either embedding-based or generative-based methods (Pourpanah et al., 2020). Embedding-based methods align representation spaces
between classes and examples in order to classify unlabeled data (Socher et al., 2013; Frome et al., 2013; Romera-Paredes and Torr, 2015; Xian et al., 2016, 2018; Wang et al., 2019). Some work, e.g., Liu et al. (2019) and Liu et al. (2020), also learns to exploit and expand attribute-based information, but generally still does not use separate attribute detectors. On the other hand, generative-based ZSL methods generate examples of the unseen classes with deep generative models and then train a classifier with that data (Bucher et al., 2017; Verma et al., 2018; Felix et al., 2018; Sariyildiz and Cinbis, 2019; Xian et al., 2019; Narayan et al., 2020). In our experiments, we compare with transductive embedding-based methods, which are more similar to PWS because both involve trying to label a fixed, unlabeled data set. We leave incorporating zero-shot data generation into PWS for future work.

3 A FRAMEWORK FOR PLFs

Following prior work in PWS (Ratner et al., 2016, 2020), our framework consists of three stages. First, we define partial labeling functions as sources of weak supervision (Section 3.1). Second, we propose and analyze a label model to aggregate their outputs (Section 3.2). Third, the learned label model is used to compute the posterior distribution for the true label of each unlabeled example, which is used to train a noise-aware classifier (Section 3.3).

3.1 Partial Labeling Functions

We propose generalizing labeling functions to partial labeling functions (PLFs) in order to make use of many available weak supervision sources that are informative but not specific enough to identify a single class. PLFs can range in granularity, from dividing the label space into two large groups down to identifying a specific class, i.e., a regular labeling function. This flexibility allows users to take advantage of many additional supervision signals, as we illustrate in our experiments.

A PLF G is a function that maps an unlabeled example to a proper subset of the possible labels or abstains by outputting the full set of all possible labels. Formally, our goal is to learn a classifier C : X → Y, where X is the space of inputs and Y = {y1, ..., yk} is the set of possible labels. A PLF is then a function G : X → G ⊆ P(Y) \ {∅}, where P(Y) is the power set of Y. G(X) is a partial label for X ∈ X, i.e., the set of labels that the PLF indicates the example X could have (although this information could be incorrect). If G(X) = Y, the PLF is said to abstain, because it provides no information about the true label. As described further in Section 3.2, a key characteristic of a PLF is its codomain G, excluding when the PLF abstains. We denote this set of partial labels for a PLF G as T(G) = G \ {Y}. To ensure that our label model is well-defined, we impose two conditions on T(G): (1) each label y ∈ Y appears in at least one element of T(G), and (2) no label y ∈ Y appears in every element of T(G). These are very mild conditions that can easily be satisfied by adding a "dummy" output to the codomain G that the PLF might not actually produce. A PLF can be defined based on a variety of noisy supervision heuristics using domain knowledge and/or available resources, such as classifiers for related tasks. To better understand PLFs, consider the following text and object classification examples corresponding to Figure 1:

Example 1. Consider a news classification task where Y = {POLITICS, SPORTS, BUSINESS}. In this task, some words can be very informative as supervision sources even if they do not narrow the example down to a specific class. For example, the word "president" may frequently appear in both political and business contexts. We can construct a PLF G based on a simple token matcher for "president" such that G : X → {{BUSINESS, POLITICS}, {SPORTS}, Y}. If the token "president" appears in the example X, then G(X) = {BUSINESS, POLITICS}. Otherwise, G(X) = Y, i.e., G abstains, because the absence of the token is not enough to conclude anything about the label with high confidence. In this example, T(G) = {{BUSINESS, POLITICS}, {SPORTS}}. Notice that here {SPORTS} is a "dummy" label set added to satisfy the conditions on T(G) described above.

Example 2. Consider an object classification task where Y = {HORSE, TIGER, LION, ZEBRA}. Following work in zero-shot learning, we can build a binary classifier for the visual attribute of having stripes by training on other classes of animals for which we already have labels. We can then use the classifier's output to define a PLF G1 : X → {{TIGER, ZEBRA}, {HORSE, LION}}. For an example X, if the stripes detector returns a positive label, then G1(X) = {TIGER, ZEBRA}. Otherwise, G1(X) = {HORSE, LION}. We can similarly construct a PLF with a claw detector as G2 : X → {{TIGER, LION}, {HORSE, ZEBRA}}.

PLFs are a generalization of the labeling functions used in prior work on PWS; traditional labeling functions can be represented as PLFs with codomain G = {{y1}, ..., {yk}, Y}. PLFs give users the additional flexibility of incorporating weak supervision heuristics with differing granularities and ways of dividing the label space.
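The attribute-detector PLFs of Example 2 can be sketched in code as follows. This is a minimal illustration, not part of the paper's released implementation; `detect_stripes` and `detect_claws` are hypothetical stand-ins for the pre-trained binary attribute classifiers, stubbed out here so the sketch runs.

```python
# Sketch of the PLFs from Example 2. The two detector functions are
# hypothetical stubs standing in for pre-trained binary attribute
# classifiers; each PLF maps an example to a partial label (a list of
# candidate classes) and never abstains.
HORSE, TIGER, LION, ZEBRA = "HORSE", "TIGER", "LION", "ZEBRA"

def detect_stripes(x):
    # Stub: pretend x is a set of observed attributes.
    return "stripes" in x

def detect_claws(x):
    # Stub: pretend x is a set of observed attributes.
    return "claws" in x

def stripes_PLF(x):
    # Splits the label space into two halves based on the detector output.
    return [TIGER, ZEBRA] if detect_stripes(x) else [HORSE, LION]

def claws_PLF(x):
    return [TIGER, LION] if detect_claws(x) else [HORSE, ZEBRA]
```

For an example where both detectors fire, the two partial labels [TIGER, ZEBRA] and [TIGER, LION] intersect in {TIGER}, illustrating how combining coarse PLFs can narrow the label set down to a single class.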
3.2 Label Model

In our framework, users provide two inputs: PLFs and unlabeled examples in X. Like other PWS frameworks, at the core of our method is a probabilistic label model that captures the properties of the weak supervision sources by representing the unknown ground-truth labels as latent variables. In this subsection, we propose and analyze a probabilistic label model for PLFs. Our label model is straightforward to define as a generalization of existing ones for labelers that provide a single label (Dawid and Skene, 1979; Ratner et al., 2016). It defines a correct output of a partial labeler as one that is consistent with the unknown, true label, i.e., that the true label is in the output set. If the partial labeler is trivially correct because it outputs the set of all possible labels, it is said to abstain, which is not counted towards its estimated accuracy.

While straightforward to define, using the model introduces new challenges. The first challenge is practical. Existing methods for optimizing the marginal likelihood of the label model in the single-label case do not work for PLFs. The matrix factorization approach proposed by Ratner et al. (2019) exploits the fact that the matrix of agreements among labelers can be expressed as a product of the model parameters, but this does not hold in the PLF case. Bach et al. (2019) proposed optimizing the likelihood in an auto-differentiation framework like PyTorch (Paszke et al., 2017). We find that naively applying this approach is impractically slow, and that carefully defining the computation graph leads to a 300× speedup.

The other challenge we address is theoretical. We consider when it is possible to learn the parameters of the model without access to ground truth in a principled way, i.e., when the model is identifiable. Prior theoretical work on either PWS or learning with partial labels is not applicable to our scenario. Prior work on identifiability for PWS does not consider partial labels (Ratner et al., 2019; Safranchik et al., 2020), and new conditions and arguments are needed to handle them. Prior work on risk bounds for learning with partial labels addresses neither noise nor multiple labelers (Cour et al., 2011; Liu and Dietterich, 2014; Hüllermeier and Cheng, 2015; Feng et al., 2020c). Therefore, our main contributions in this section are a fast learning technique and a theorem characterizing sufficient identifiability conditions for our model.

Setup  For a classification task with input space X and label space Y = {y1, ..., yk}, we are given m unlabeled examples X = (X1, ..., Xm) with unknown ground truth labels Y = (Y1, ..., Ym) such that (X, Y) are i.i.d. samples from some distribution D. We are also given n PLFs G = (G1, ..., Gn). We use G as shorthand for the m × n array of PLF outputs, where Gai = Gi(Xa), when it is clear from context.

Joint Distribution  We define a joint distribution P(G, Y) over the outputs of the PLFs on X and the latent, true labels Y. Like prior work (Ratner et al., 2016), we assume that the PLF outputs are conditionally independent given the true labels, i.e., the naive Bayes assumption. In practice this works well, but extending work on learning more complex distributions for other types of PWS is a potential direction for future exploration (Bach et al., 2017; Varma et al., 2019). Analogous to prior work (Ratner et al., 2019), for each PLF Gi, we define parameters αi ∈ [0, 1]^k and βi ∈ [0, 1]. Each element αij is the accuracy of Gi on examples of class yj, i.e., the probability that yj ∈ Gai given that Xa has label yj and Gai is not Y. βi is the propensity of Gi to vote, i.e., not abstain. In other words, βi = P(Gai ≠ Y). In our framework, the class balance P(Y) can either be a learned distribution or fixed. We assume that if a PLF Gi makes a mistake, it outputs an incorrect partial label from T(Gi) uniformly at random.

To define the joint distribution P(G, Y), for each PLF Gi we also need to refer to the sets in T(Gi) that are consistent or inconsistent with each label. Let Nij = {L ∈ T(Gi) | yj ∈ L} be the set of label sets in the codomain of Gi that contain label yj (excluding Y). Likewise, let N^C_ij = {L ∈ T(Gi) | yj ∉ L} be the set of label sets in the codomain of Gi that do not contain label yj. Then, the joint distribution is

    P(G, Y) = ∏_{a=1}^{m} P(Ya) ∏_{i=1}^{n} P(Gai | Ya)                       (1)

where

    P(Gai | Ya = yj) = 1 − βi                       if Gai = Y,
                       βi αij / |Nij|               if yj ∈ Gai ≠ Y,          (2)
                       βi (1 − αij) / |N^C_ij|      if yj ∉ Gai.

Learning  Given the unlabeled examples and PLF outputs G, our goal is to estimate the parameters of P(G, Y) (denoted collectively as Θ) and compute the posterior P(Y | G) over the unknown labels. To estimate Θ, we maximize the marginal likelihood of the observed outputs of the PLFs:

    Θ̂ = argmax_Θ P_Θ(G) = argmax_Θ Σ_Y P_Θ(G, Y).                            (3)

This optimization is implemented in PyTorch (Paszke et al., 2017). The marginal log likelihood of a batch of examples is computed in the forward pass, and
stochastic gradient descent is used to update the parameters. We find that the way the likelihood computation is implemented in the forward pass can lead to an orders-of-magnitude difference in training time. For every example, we need to compute its conditional likelihood for every class based on votes from every PLF. Naively, this requires three nested for loops over examples, PLFs, and classes. We can speed up the computation by expressing the conditional log likelihood computation as a sequence of matrix operations. This optimization trades space for speed by precomputing intermediate values and taking advantage of vectorized matrix operations.

Let m be the number of instances in one batch, n be the number of PLFs, and k be the number of classes. For each batch we precompute accuracy indicator matrices AI ∈ {−1, 1}^{m×n×k} and count matrices N ∈ Z^{m×n×k}, where entry AI_{a,i,j} = 1 and N_{a,i,j} = −log |N_{i,j}| if class yj is in the label subset output by the i-th PLF on the a-th example, and AI_{a,i,j} = −1 and N_{a,i,j} = −log |N^C_{i,j}| otherwise. We also precompute propensity indicator matrices PI ∈ {0, 1}^{m×n}, where entry PI_{a,i} = 1 if the a-th instance received a non-abstaining vote (a vote that is not Y) from the i-th PLF. Let A ∈ R^{n×k} be the log of the accuracy parameters and B ∈ R^n be the log of the propensity parameters. We can map these parameters back to probability space as

    α_{i,j} = exp(A_{i,j}) / (exp(A_{i,j}) + exp(−A_{i,j}))  and
    β_i = exp(B_i) / (exp(B_i) + 1).                                          (4)

We extend PI, A, and B to PI_ext, A_ext, and B_ext in three dimensions, with PI replicated along the third axis k times, A replicated along the first axis m times, and B replicated along the first axis m times and the third axis k times. Then, during each forward pass, we only need to calculate normalizing matrices ZA ∈ R^{n×k} and ZB ∈ R^n for accuracy and propensity, respectively, where

    ZA_{i,j} = −log(exp(A_{i,j}) + exp(−A_{i,j}))  and
    ZB_i = −log(exp(B_i) + 1).                                                (5)

We similarly extend ZA and ZB to ZA_ext and ZB_ext. During the forward pass we calculate the batch conditional log likelihood by first computing an m×n×k tensor

    T = A_ext ⊙ AI + N + B_ext + ZA_ext

and then summing over the n PLFs:

    log P(G | Y) = Σ_n (ZB_ext + PI_ext ⊙ T),                                 (6)

where ⊙ is element-wise multiplication, yielding an m×k matrix. The marginal likelihood is then computed by summing over the k possible classes. This modification removes for loops from the computation graph in our code and instead uses only optimized matrix operations. This approach leads to a 300× speedup in training time compared to a naive approach. This speedup makes the framework practical for iterative PLF development. For example, learning with 100k examples and 10 PLFs on an Intel i5-6600k CPU takes one minute.

Identifiability  An important theoretical question is whether it is reasonable to try to learn the parameters of P(G, Y) even though Y is never observed. We answer this question affirmatively by showing that as long as the codomains of the PLFs are sufficiently targeted or diverse, it is possible to determine the parameters of the label model (up to label swapping) using only the distribution of PLF outputs P(G), except on a measure zero subset of the space of possible parameter values. This property is the strongest useful notion of identifiability for models with latent variables (Allman et al., 2015). A model whose parameters can be determined except on a measure zero subset is called generically identifiable. Label swapping refers to the fact that unobserved classes in a latent variable model can be relabeled without changing the observed distribution. This means that the map going from the observed distribution of a label model with k classes to parameter values is at best k!-to-one and cannot be one-to-one even under ideal conditions. In practice, label swapping is not an issue because most PLFs are more accurate than random guessing. We state a condition on the PLF codomains that is sufficient for identifiability in the following theorem.

Theorem 1. The parameters of the model P(G, Y) described in Section 3.2 are generically identifiable up to label swapping provided that the collection G of partial labeling functions can be partitioned into three disjoint non-empty sets S1, S2, and S3 such that, for j = 1, 2 and all classes y ∈ Y, we can choose label sets ti ∈ T(Gi) satisfying ∩_{Gi ∈ Sj} ti = {y}.

The proof is given in Appendix B. This theorem tells us that it is reasonable to try to estimate the PLF accuracies even though the true class labels are never observed. Our proof adapts ideas presented in Theorem 4 of Allman et al. (2009), which uses Kruskal's unique factorization theorem and feature grouping to establish conditions for the generic identifiability of a naive Bayes model with arbitrary parameters. Since the space of models we consider is equivalent to a measure zero subset of the parameters in an arbitrary naive Bayes model, an additional proof is needed to show that these parameters are generically identifiable. We develop a novel argument to show that the above is
a sufficient condition for identifiability. We show that for any distribution satisfying the condition described in Theorem 1, we can construct matrices representing factors of the joint distribution over observations that generically have full Kruskal rank. This is sufficient to satisfy the conditions in Kruskal's theorem.

In words, the condition described in Theorem 1 requires that, for each class y, we can select a label group from the codomain of each PLF in S1 such that the intersection of these label groups contains only the class y. This condition also applies to S2. One way to satisfy this condition is to create PLFs that produce single-class label groups. For example, if PLF Gi contains {1}, {2}, ..., {k} in its codomain, then any set Sj that contains Gi will satisfy the Theorem 1 condition. However, even if no PLFs output any single-class label sets, it is still possible for the label model parameters to be identifiable, because the condition can also be satisfied by using multiple PLFs with different codomains. Suppose that we want to show that the condition is satisfied for class 1 and we have {1, 2, 3} ∈ T(G1), {1, 3, 4} ∈ T(G2), and {1, 2, 4} ∈ T(G3). The intersection of these sets is {1}.

    # Class label constants (defined here so the function is self-contained).
    ABBR, DESC, ENTY, HUM, LOC, NUM = "ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"

    def first_word_PLF(instance):
        '''
        Label by first word of question.
        ABBR - Abbreviation
        DESC - Description and concepts
        ENTY - Entities
        HUM - Human beings
        LOC - Locations
        NUM - Numeric values
        '''
        word = instance.split()[0].lower()
        if word == "who": return [HUM]
        elif word == "where": return [LOC]
        elif word == "when": return [NUM]
        elif word == "why": return [DESC]
        elif word == "how": return [DESC, NUM]
        elif word == "name": return [ENTY, HUM]
        else: return [ABBR, DESC, ENTY,
                      HUM, LOC, NUM]  # Abstain

Figure 2: A partial labeling function developed for the TREC-6 question-type task. It uses the first word of the sentence to possibly narrow down the set of labels.

3.3 Noise-Aware Classifier

The final stage of our framework is to train a classifier. After P(G, Y) is estimated with unlabeled data,

shows that discrete attribute detectors can be competitive with recent ZSL approaches. These two sets of experiments are complementary, showing that our framework benefits both hand-engineered rules and partial labeling schemes defined in prior work.

We make available the code for our framework¹ and experiments.² Additional details about the experiments, datasets, and methods are available in Appendix D.
we compute the posterior P (Y |G). Then, we minimize 4.1 Text Classification
the expected empirical risk with respect to this distri-
bution. For classifiers that output probabilistic pre- We first evaluate the benefit of incorporating PLFs
dictions, the loss function becomes the cross-entropy into text classification with hand-engineered rules.
loss weighted by the posterior over true labels. As Datasets We consider three datasets. First, SciCite
in other PWS frameworks (Ratner et al., 2016, 2020, (Cohan et al., 2019) is a citation classification dataset
2019; Safranchik et al., 2020), many off-the-shelf neu- sourced from scientific literature. The corresponding
ral networks can be chosen based on the task. task is to classify a citation as referring to either back-
ground information, method details, or results. Sec-
4 EXPERIMENTAL RESULTS ond, TREC-6 (Li and Roth, 2002) is a question classifi-
cation dataset containing open-domain questions. The
We demonstrate benefits of incorporating partial la- task is to classifiy each question as asking about one
bels into PWS on applications in text and object clas- of six semantic categories. Finally, AG-News (Gulli,
sification. In Section 4.1, we compare our framework 2005) is a large-scale news dataset. The task is to
with baselines that (1) use only traditional labeling classify each example as one of four topics.
functions and (2) heuristically aggregate partial labels
PLF Development
without a probabilistic model. Our proposed approach
For text tasks, we develop PLFs by inspecting exam-
significantly improves accuracy over both baselines. In
ples from the development set for each dataset (916
Section 4.2, we use pretrained visual attribute detec-
examples for SciCite, 389 for TREC-6, and 500 for
tors as PLFs for classifying unseen objects. Our frame-
AG-News). We implement heuristic rules as Python
work achieves accuracy that is competitive with recent
functions that take as input the example text and any
embedding-based transductive ZSL methods. While
available metadata (such as the name of the section
our framework is not designed specifically for ZSL, we
in which the sentence appears for SciCite). Most rules
present this comparison to demonstrate its flexibility
rely on checking for keywords or other surface patterns
and show another scenario where modeling the noise of
multiple partial labeling functions can significantly im- 1
github.com/BatsResearch/nplm
2
prove performance relative to a heuristic approach. It github.com/BatsResearch/yu-aistats22-code
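For classifiers with probabilistic outputs, the noise-aware objective of Section 3.3 is a cross-entropy weighted by the label-model posterior. A minimal NumPy sketch (the function and array names here are illustrative assumptions, not from the released code):

```python
import numpy as np

def noise_aware_loss(log_probs, posterior):
    """Expected cross-entropy w.r.t. the label-model posterior P(Y | G).

    log_probs: (m, k) end-model log predictions for m examples, k classes.
    posterior: (m, k) label-model posterior P(Y | G) for the same examples.
    Returns the mean over examples of -sum_y P(y | G) * log q(y | x).
    """
    return -np.mean(np.sum(posterior * log_probs, axis=1))

# With a one-hot posterior this reduces to ordinary cross-entropy:
loss = noise_aware_loss(np.log([[0.8, 0.2]]), np.array([[1.0, 0.0]]))
```

When the label model is confident, the objective behaves like standard supervised training; when the posterior is spread over several classes, each compatible class receives a proportional share of the gradient.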
SciCite TREC-6 AG-News


ACC F1 ACC F1 ACC F1
Supervised (Dev. Set) 78.5±1.1 75.5±1.5 88.3±1.6 76.2±2.0 80.1±1.7 79.9±1.9
LFs Only (w/o End) 65.1 44.7 22.0 23.8 33.0 25.1
NC (w/o End) 73.2 69.2 29.6 32.6 46.1 43.5
NPLM (w/o End) 71.5 69.4 38.2 43.0 51.1 49.4
LFs Only 78.6±0.7 76.7±0.8 67.2±0.9 69.3±1.0 79.8±0.9 79.6±0.8
NC 80.2±0.6 77.7±0.6 67.8±1.5 56.5±0.3 78.0±1.0 77.2±1.0
NPLM 81.4±1.3 79.5±1.3 85.0±0.8 85.7±0.7 85.0±0.5 84.7±0.5
NPLM vs. LFs Only ↑ 2.8 ↑ 2.8 ↑ 17.8 ↑ 16.4 ↑ 5.2 ↑ 5.1
NPLM vs. NC ↑ 1.2 ↑ 1.8 ↑ 17.2 ↑ 29.2 ↑ 7.0 ↑ 7.5

Table 1: Results for text classification with mean accuracy (ACC), macro F1 (F1) and 95% CIs.
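The NC baseline reported in Table 1 generalizes majority vote: each partial label votes for every class it contains, and the class compatible with the most partial labels wins. A small illustrative sketch (helper name and tie-breaking are our assumptions):

```python
from collections import Counter

def nearest_class(partial_labels):
    """partial_labels: list of label sets output by the PLFs for one example.
    Returns the class compatible with the greatest number of partial labels."""
    votes = Counter()
    for label_set in partial_labels:
        for y in label_set:
            votes[y] += 1
    return votes.most_common(1)[0][0]

# With TREC-6-style outputs, "NUM" is compatible with two of three votes:
# nearest_class([{"HUM"}, {"DESC", "NUM"}, {"NUM"}])
```

Unlike the NPLM label model, this heuristic weighs all PLFs equally regardless of their accuracies.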

Figure 2 shows an example for TREC-6, in which we use the first word of a question to vote on what type of question it is. This example illustrates some of the utility of PLFs. Some words like "who" are sufficient to reliably identify a single class. Others like "how" greatly narrow the set of possible labels, even though they do not specify a single label. The full set of PLFs is in Appendix E and the experiment code.

Methods  We evaluate methods for aggregating the outputs of the PLFs to train an end model. First, as a baseline, we consider using only the PLFs that are equivalent to traditional labeling functions, i.e., they always output one label or abstain. We call this baseline LFs Only. Second, as another baseline, we use a heuristic called Nearest Class (NC), which chooses the class with the highest number of compatible partial labels. This baseline is a generalized majority vote heuristic for PLFs. Finally, our method, called Noisy Partial Label Model (NPLM), is our label model from Section 3.2. In all cases, we use the estimated labels to fine-tune a pretrained BERT (Devlin et al., 2019) base uncased English model with a three-layer classification head. Following prior work (Ratner et al., 2016, 2020), we use the expected cross entropy w.r.t. P(Y | G) as our training loss. Hyperparameters and additional end model details are in Appendix D. As ablations, we also report performance using the aggregated PLF outputs directly as predictions, without training an end model, denoted (w/o End).

Results  We report mean micro-averaged accuracy and macro-averaged F1 of the compared methods in Table 1 on the standard test sets. Results using the end model are shown with 95% confidence intervals obtained using five different random seeds. NPLM consistently improves F1 and accuracy relative to LFs Only (8.6 and 8.1 percentage points on average, respectively) and NC (8.5 and 12.8 percentage points on average, respectively). The performance advantage over LFs Only demonstrates the benefits of additional weak supervision that can be expressed as PLFs, and the advantage over NC demonstrates that the proposed label model is learning useful information. The ablated versions of the methods significantly underperform their counterparts, showing that in all cases the end model learns to generalize beyond the information contained in the weak supervision heuristics. Many of the errors are on examples for which all supervision sources abstain, where a label is chosen arbitrarily or according to the class prior P(Y). For context, we also report the performance of the end model trained on the development set with ground-truth labels. NPLM significantly outperforms it in most cases. On TREC-6, the supervised baseline has a higher accuracy but much lower macro-averaged F1, indicating that our method does significantly better on the rarer classes.

4.2 Object Classification

In this task, we show how our framework can be used to model discrete visual attribute detectors, and that this approach can achieve results competitive with recent embedding-based ZSL methods. Although they have not been used often in recent ZSL work, discrete attribute detectors have benefits such as modularity and interpretability. These experiments show that modeling them as PLFs with our unsupervised label model can lead to good accuracy.

Datasets  We consider the Large-Scale Attribute Dataset (LAD) (Zhao et al., 2019) and Animals with Attributes 2 (AwA2) (Xian et al., 2018), which both provide class-level discrete visual attributes. LAD is a recently proposed attribute-based dataset with 78k instances that organizes common objects into five sub-datasets: electronics, vehicles, fruits, hairstyles, and animals. For each sub-dataset, the classes are divided into five folds of seen and unseen classes, and average performance over all tasks is used as a benchmark for ZSL. AwA2 is a popular ZSL animal classification dataset consisting of ∼30k instances with 85 binary attributes, 40 seen classes, and 10 unseen classes.
LAD (ACC) AwA2 (MCA)


Animals Fruit Vehicles Electronics Hairstyles Avg. U S H
ConSE—Norouzi et al. (2013) 36.9 29.8 37.5 28.3 24.6 31.4 0.5 90.6 1.0
ESZSL—Romera-Paredes and Torr (2015) 50.2 37.2 45.8 32.8 31.8 39.6 77.8 5.9 11.0
SynC—Changpinyo et al. (2016) 61.6 51.4 54.9 43.0 29.1 48.0 90.5 10.0 18.0
VCL—Wan et al. (2019) 75.4± 0.8 35.0± 1.0 62.4± 0.5 36.7± 0.5 33.8± 0.7 48.7± 0.3 21.4 89.6 34.6
QFSL—Song et al. (2018) - - - - - - 66.2 93.1 77.4
WDVSc—Wan et al. (2019) 97.2± 0.8 43.3± 1.3 82.1± 0.6 54.8± 1.1 31.1± 2.6 61.7± 0.6 76.4 88.1 81.8
NC (w/o End) 65.8 31.2 60.3 40.3 39.1 47.3 47.7 - -
NPLM (w/o End) 86.0 38.7 73.5 51.8 45.9 59.2 68.2 - -
NC 71.9±1.2 36.2±0.6 65.3±1.2 48.0±0.7 40.9±0.5 52.5±0.3 43.1±1.2 91.8±0.2 58.6± 1.1
NPLM 87.6± 0.2 42.4± 0.8 77.0± 0.2 57.7± 0.7 46.9± 0.9 62.3± 0.2 71.1±0.6 91.9±0.1 80.1±0.3
NPLM vs. NC ↑ 15.7 ↑ 6.2 ↑ 11.7 ↑ 9.7 ↑ 6.0 ↑ 9.8 ↑ 28.0 - ↑ 21.5

Table 2: Results for object classification. For LAD, we report mean accuracy (ACC) with 95% CIs across the five standard splits for each of the five subtasks. For AwA2, we report mean class accuracy (MCA) with 95% CIs. We evaluate AwA2 in a generalized setting: S and U denote MCA on the seen and unseen classes, respectively. H is the harmonic mean of S and U.
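The H column can be reproduced from S and U; it is the standard generalized zero-shot summary metric:

```python
def harmonic_mean(seen_acc, unseen_acc):
    # H = 2 * S * U / (S + U), the generalized zero-shot summary metric.
    return 2 * seen_acc * unseen_acc / (seen_acc + unseen_acc)

# e.g., NPLM's S = 91.9 and U = 71.1 give H of about 80.2,
# consistent with the H column in Table 2 up to rounding.
```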

PLF Development  Following early work on zero-shot object classification (Farhadi et al., 2009; Lampert et al., 2009; Palatucci et al., 2009; Jayaraman and Grauman, 2014), we model each visual attribute in the datasets with a binary classifier. In all cases, the classifiers are trained on the seen classes for that task or fold, and the unseen classes are not used at all, not even as validation data. To create classifiers for LAD, we extract features from a ResNet-50 (He et al., 2016) pretrained on ILSVRC (Russakovsky et al., 2015), in order to compare fairly with prior work. For AwA2, we fine-tune a pretrained ResNet-101 on the seen classes. Each classifier is trained with respect to the class-wise attribute annotations on the training sets of the seen classes. We define the PLFs according to the provided attribute annotations from the respective datasets.

Methods  We incorporate PLFs into the NC and NPLM methods as described in Section 4.1. We use examples of the unseen classes as our unlabeled data. In all cases, our end models are three-layer perceptrons trained on the extracted features (ResNet-50 for LAD and ResNet-101 for AwA2). As with the text data, we use the expected cross entropy with respect to P(Y | G) as our training loss. Following the literature, for LAD we evaluate in a strict zero-shot setting, meaning that the model is only evaluated on unseen classes. For AwA2, we evaluate in a generalized zero-shot setting, meaning that the model is evaluated on both seen and unseen classes, so we mix the unseen classes with estimated labels and seen classes with given labels during training. Again, additional details are in Appendix D. We compare with three recent transductive, embedding-based ZSL methods: QFSL (Song et al., 2018), VCL (Wan et al., 2019), and WDVSc (Wan et al., 2019). For context, we also report results from three standard inductive methods, ConSE (Norouzi et al., 2013), ESZSL (Romera-Paredes and Torr, 2015), and SynC (Changpinyo et al., 2016), although they are at a disadvantage because they do not access the unlabeled data nor any information about the unseen classes. For LAD, we replicate and report the results of WDVSc and VCL using the same features.

Results  We report the average results and 95% confidence intervals based on five random seeds in Table 2. Similar to the text classification tasks, NPLM significantly outperforms NC (an average of 9.8 percentage points on LAD and 21.5 percentage points on AwA2), and the ablations show that the end model generalizes beyond the PLFs. NPLM is also competitive with WDVSc, the top-performing ZSL method, either slightly underperforming or outperforming it.

5 CONCLUSION

We have introduced a new capability for programmatic weak supervision (PWS): the ability to learn from partial labeling functions using a novel probabilistic label model. We demonstrated a scalable way to learn these models, and our theoretical analysis shows they are generically identifiable up to label swapping, the strongest useful notion of identifiability for latent variable models (Allman et al., 2015). Our experiments show that our framework can (1) significantly improve the accuracy of PWS on text classification tasks and (2) enable pre-trained attribute detectors to achieve performance comparable to embedding-based methods for transductive ZSL on object classification tasks. We aim to enable the incorporation of a wider range of supervision sources into PWS systems. As future work, we envision creating libraries of rules and pre-trained models that are more generic and modular because they are freed from the requirement that they narrow the label space down to a single class.
Acknowledgements

This material is based on research sponsored by Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL) under agreement number FA8750-19-2-1006. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Defense Advanced Research Projects Agency (DARPA) and Air Force Research Laboratory (AFRL) or the U.S. Government. We gratefully acknowledge support from Google and Cisco. Disclosure: Stephen Bach is an advisor to Snorkel AI, a company that provides software and services for weakly supervised machine learning.

References

Allman, E. S., Matias, C., Rhodes, J. A., et al. (2009). Identifiability of parameters in latent structure models with many observed variables. The Ann. of Stat., 37(6A):3099–3132.

Allman, E. S., Rhodes, J. A., Stanghellini, E., and Valtorta, M. (2015). Parameter identifiability of discrete Bayesian networks with hidden variables. J. of Causal Inference, 3(2):189–205.

Arachie, C. and Huang, B. (2021). A general framework for adversarial label learning. J. of Mach. Learn. Research, 22:1–33.

Awasthi, A., Ghosh, S., Goyal, R., and Sarawagi, S. (2020). Learning from rules generalizing labeled exemplars. In ICLR.

Bach, S. H., He, B., Ratner, A., and Ré, C. (2017). Learning the structure of generative models without labeled data. In ICML.

Bach, S. H., Rodriguez, D., Liu, Y., Luo, C., Shao, H., Xia, C., Sen, S., Ratner, A., Hancock, B., Alborzi, H., Kuchhal, R., Ré, C., and Malkin, R. (2019). Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In SIGMOD.

Balsubramani, A. and Freund, Y. (2015). Optimally combining classifiers using unlabeled data. In COLT.

Balsubramani, A. and Freund, Y. (2016a). Optimal binary classifier aggregation for general losses. In NeurIPS.

Balsubramani, A. and Freund, Y. (2016b). Scalable semi-supervised aggregation of classifiers. In NeurIPS.

Bucher, M., Herbin, S., and Jurie, F. (2017). Generating visual representations for zero-shot classification. In ICCV Workshops.

Cabannes, V., Bach, F., and Rudi, A. (2021). Disambiguation of weak supervision with exponential convergence rates. arXiv preprint arXiv:2102.02789.

Cabannnes, V., Rudi, A., and Bach, F. (2020). Structured prediction with partial labelling through the infimum loss. In ICML.

Changpinyo, S., Chao, W.-L., Gong, B., and Sha, F. (2016). Synthesized classifiers for zero-shot learning. In CVPR.

Chapelle, O., Scholkopf, B., and Zien, A. (2009). Semi-Supervised Learning. MIT Press.

Chen, V. S., Varma, P., Krishna, R., Bernstein, M., Re, C., and Fei-Fei, L. (2019). Scene graph prediction with limited labels. In ICCV.

Cohan, A., Ammar, W., Van Zuylen, M., and Cady, F. (2019). Structural scaffolds for citation intent classification in scientific publications. In NAACL.

Cohn, D. A., Ghahramani, Z., and Jordan, M. I. (1996). Active learning with statistical models. J. of Artificial Intelligence Research, 4:129–145.

Cour, T., Sapp, B., and Taskar, B. (2011). Learning from partial labels. The J. of Mach. Learn. Research, 12:1501–1536.

Dawid, A. P. and Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society C, 28(1):20–28.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Durand, T., Mehrasa, N., and Mori, G. (2019). Learning a deep convnet for multi-label classification with partial labels. In CVPR.

Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. A. (2009). Describing objects by their attributes. In CVPR.

Felix, R., Reid, I., Carneiro, G., et al. (2018). Multi-modal cycle-consistent generalized zero-shot learning. In ECCV.

Feng, L., Kaneko, T., Han, B., Niu, G., An, B., and Sugiyama, M. (2020a). Learning with multiple complementary labels. In ICML.

Feng, L., Lv, J., Han, B., Xu, M., Niu, G., Geng, X., An, B., and Sugiyama, M. (2020b). Provably consistent partial-label learning. In NeurIPS.

Fries, J., Varma, P., Chen, V., Xiao, K., Tejeda, H., Priyanka, S., Dunnmon, J., Chubb, H., Maskatia, S., Fiterau, M., Delp, S., Ashley, E., Ré, C., and Priest, J. (2019). Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences. Nature Communications, 10(1).

Fries, J. A., Steinberg, E., Khattar, S., Fleming, S. L., Posada, J., Callahan, A., and Shah, N. H. (2021). Ontology-driven weak supervision for clinical entity classification in electronic health records. Nature Communications, 12(1):1–11.

Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In NeurIPS.

Gao, C. and Zhou, D. (2013). Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. CoRR, abs/1207.0016.

Gardner, M., Grus, J., Neumann, M., Tafjord, O., Dasigi, P., Liu, N. F., Peters, M., Schmitz, M., and Zettlemoyer, L. S. (2017). AllenNLP: A deep semantic natural language processing platform.

Gulli, A. (2005). AG's corpus of news articles.
He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

Hüllermeier, E. and Cheng, W. (2015). Superset learning based on generalized loss minimization. In ECML PKDD.

Ishida, T., Niu, G., Hu, W., and Sugiyama, M. (2017). Learning from complementary labels. In NeurIPS.

Jayaraman, D. and Grauman, K. (2014). Zero shot recognition with unreliable attributes. In NeurIPS.

Jin, R. and Ghahramani, Z. (2002). Learning with multiple labels. In NeurIPS.

Karamanolakis, G., Mukherjee, S., Zheng, G., and Awadallah, A. H. (2021). Self-training with weak supervision. In NAACL.

Kruskal, J. B. (1977). Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra and its Applications, 18(2):95–138.

Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. In CVPR.

Li, C., Li, X., and Ouyang, J. (2020). Learning with noisy partial labels by simultaneously leveraging global and local consistencies. In CIKM.

Li, X. and Roth, D. (2002). Learning question classifiers. In COLING.

Liu, L. and Dietterich, T. (2014). Learnability of the superset label learning problem. In ICML.

Liu, L. and Dietterich, T. G. (2012). A conditional multinomial mixture model for superset label learning. In NeurIPS.

Liu, L., Zhou, T., Long, G., Jiang, J., and Zhang, C. (2020). Attribute propagation network for graph zero-shot learning. In AAAI.

Liu, Y., Guo, J., Cai, D., and He, X. (2019). Attribute attention for semantic disambiguation in zero-shot learning. In ICCV.

Loshchilov, I. and Hutter, F. (2019). Decoupled weight decay regularization. In ICLR.

Luo, J. and Orabona, F. (2010). Learning from candidate labeling sets. Technical report.

Mallinar, N., Shah, A., Ugrani, R., Gupta, A., Gurusankar, M., Ho, T. K., Liao, Q. V., Zhang, Y., Bellamy, R. K., Yates, R., et al. (2019). Bootstrapping conversational agents with weak supervision. In AAAI.

Mazzetto, A., Cousins, C., Sam, D., Bach, S. H., and Upfal, E. (2021a). Adversarial multiclass learning under weak supervision with performance guarantees. In ICML.

Mazzetto, A., Sam, D., Park, A., Upfal, E., and Bach, S. H. (2021b). Semi-supervised aggregation of dependent weak supervision sources with performance guarantees. In AISTATS.

Mehrabi, N., Morstatter, F., Saxena, N., Lerman, K., and Galstyan, A. (2019). A survey on bias and fairness in machine learning. arXiv preprint arXiv:1908.09635.

Narayan, S., Gupta, A., Khan, F. S., Snoek, C. G., and Shao, L. (2020). Latent embedding feedback and discriminative features for zero-shot classification. In ECCV.

Nguyen, N. and Caruana, R. (2008). Classification with partial labels. In KDD.

Nitzan, S. and Paroush, J. (1982). Optimal decision rules in uncertain dichotomous choice situations. International Economic Review, 23(2):289–97.

Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., and Dean, J. (2013). Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650.

Palatucci, M., Pomerleau, D., Hinton, G. E., and Mitchell, T. M. (2009). Zero-shot learning with semantic output codes. In NeurIPS.

Pan, S. J. and Yang, Q. (2009). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In NeurIPS Autodiff Workshop.

Pourpanah, F., Abdar, M., Luo, Y., Zhou, X., Wang, R., Lim, C., and Wang, X. (2020). A review of generalized zero-shot learning methods. ArXiv, abs/2011.08641.

Ratner, A. J., Bach, S. H., Ehrenberg, H. E., Fries, J., Wu, S., and Ré, C. (2020). Snorkel: Rapid training data creation with weak supervision. The VLDB Journal, 29(2):709–730.

Ratner, A. J., De Sa, C. M., Wu, S., Selsam, D., and Ré, C. (2016). Data programming: Creating large training sets, quickly. In NeurIPS.

Ratner, A. J., Hancock, B., Dunnmon, J., Sala, F., Pandey, S., and Ré, C. (2019). Training complex models with multi-task weak supervision. In AAAI.

Romera-Paredes, B. and Torr, P. (2015). An embarrassingly simple approach to zero-shot learning. In ICML.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2015). ImageNet Large Scale Visual Recognition Challenge. International J. of Computer Vision.

Saab, K., Dunnmon, J., Ré, C., Rubin, D., and Lee-Messer, C. (2020). Weak supervision as an efficient approach for automated seizure detection in electroencephalography. NPJ Digital Medicine, 3(1):1–12.

Safranchik, E., Luo, S., and Bach, S. H. (2020). Weakly supervised sequence tagging from noisy rules. In AAAI.

Saleiro, P., Kuester, B., Hinkson, L., London, J., Stevens, A., Anisfeld, A., Rodolfa, K. T., and Ghani, R. (2018). Aequitas: A bias and fairness audit toolkit. arXiv preprint arXiv:1811.05577.

Sariyildiz, M. B. and Cinbis, R. G. (2019). Gradient matching generative networks for zero-shot learning. In CVPR.

Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning, 6(1):1–114.

Socher, R., Ganjoo, M., Manning, C. D., and Ng, A. (2013). Zero-shot learning through cross-modal transfer. In NeurIPS.

Song, J., Shen, C., Yang, Y., Liu, Y., and Song, M. (2018). Transductive unbiased embedding for zero-shot learning. In CVPR.
Van Engelen, J. E. and Hoos, H. H. (2020). A survey on semi-supervised learning. Machine Learning, 109(2):373–440.

Varma, P., Sala, F., He, A., Ratner, A., and Ré, C. (2019). Learning dependency structures for weak supervision models. In ICML.

Verma, V. K., Arora, G., Mishra, A., and Rai, P. (2018). Generalized zero-shot learning via synthesized examples. In CVPR.

Wan, Z., Chen, D., Li, Y., Yan, X., Zhang, J., Yu, Y., and Liao, J. (2019). Transductive zero-shot learning with visual structure constraint. In NeurIPS.

Wang, H., Liu, W., Zhao, Y., Hu, T., Chen, K., and Chen, G. (2020). Learning from multi-dimensional partial labels. In IJCAI.

Wang, W., Zheng, V. W., Yu, H., and Miao, C. (2019). A survey of zero-shot learning: Settings, methods, and applications. ACM Transactions on Intelligent Systems and Technology.

Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., and Schiele, B. (2016). Latent embeddings for zero-shot classification. In CVPR.

Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. (2018). Zero-shot learning: A comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Xian, Y., Sharma, S., Schiele, B., and Akata, Z. (2019). f-VAEGAN-D2: A feature generating framework for any-shot learning. In CVPR.

Xie, M.-K. and Huang, S.-J. (2021). Partial multi-label learning with noisy label identification. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Yan, Y. and Guo, Y. (2020). Partial label learning with batch label correction. In AAAI.

Zhang, X., Zhao, J., and LeCun, Y. (2015). Character-level convolutional networks for text classification. In NeurIPS.

Zhang, Z.-R., Zhang, Q.-W., Cao, Y., and Zhang, M.-L. (2021). Exploiting unlabeled data via partial label assignment for multi-class semi-supervised learning. In AAAI.

Zhao, B., Fu, Y., Liang, R., Wu, J., Wang, Y., and Wang, Y. (2019). A large-scale attribute dataset for zero-shot learning. In CVPR Workshops.

Zhuang, F., Qi, Z., Duan, K., Xi, D., Zhu, Y., Zhu, H., Xiong, H., and He, Q. (2020). A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.
Supplementary Material:
Learning from Multiple Noisy Partial Labelers

A LIMITATIONS AND BROADER IMPACTS

Our work expands the space of supervision sources that can be incorporated into PWS systems. Weak supervision
is complementary to many other techniques, such as semi-supervised learning (Chapelle et al., 2009; Van Engelen
and Hoos, 2020), transfer learning (Pan and Yang, 2009; Zhuang et al., 2020), active learning (Cohn et al., 1996;
Settles, 2012), and zero-shot data generation (Bucher et al., 2017; Verma et al., 2018; Felix et al., 2018; Sariyildiz
and Cinbis, 2019; Xian et al., 2019; Narayan et al., 2020). A limitation of our work is that exploring how partial
labeling functions interact with these techniques is left as future work. The same is true for complementary
techniques within weak supervision, such as adversarial label learning (Arachie and Huang, 2021), learning rules
from labeled exemplars (Awasthi et al., 2020), and weak supervision with self-training (Karamanolakis et al.,
2021). Additionally, while PWS can enable more rapid development, its dependence on heuristics introduces the
potential for bias. For this reason, auditing any created models for potential negative impacts is as important,
if not more important, in PWS as in traditional supervised learning (Saleiro et al., 2018; Mehrabi et al., 2019).

B PROOF OF THEOREM 1

Theorem 1 provides sufficient conditions for the generic identifiability of the label model described in Section
3.2. In this section, we prove this theorem. Our proof is non-trivial because our label model yields probability
distributions that are a measure-zero subset of the distributions considered by Allman et al. (2009). Allman
et al. (2009) allows each entry of the class-conditional distributions to be any value in the interval [0, 1] such that
the entries sum to 1, whereas our label model imposes additional algebraic constraints on the entries. Allman
et al. (2009) establishes identifiability except for on a measure-zero subset of the distributions that they consider,
but we are unable to directly apply their results because our family of distributions might be contained in the
measure-zero subset they exclude. It is therefore necessary to establish that the set of distributions for which
identifiability does not exist is of measure zero with respect to the distributions that can be produced by our
label model, or, equivalently, the set of values of the accuracies αi,j , propensities βi , and class balance P (Y ) for
which identifiability does not exist has measure zero with respect to the set of all possible parameter values.
Background The key tool that we use in our proof is Kruskal’s unique factorization theorem, which relies on
the concept of Kruskal rank (Kruskal, 1977). The Kruskal rank of a matrix is defined to be the largest integer n
such that every set of n rows is linearly independent. A useful fact is that a matrix with full row rank also has
full Kruskal rank. Kruskal’s theorem says that if, for u = 1, 2, 3, we have a k × ru matrix Mu with Kruskal rank
Iu , and these Iu satisfy
I1 + I2 + I3 ≥ 2k + 2 (7)
then, given only the three-dimensional tensor M whose entry (a, b, c) is given by

    M(a, b, c) = Σ_{j=1}^{k} M1(j, a) M2(j, b) M3(j, c)    (8)

we can recover the original matrices Mu.
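For small matrices, the Kruskal rank used in Kruskal's theorem can be computed by brute force directly from its definition; an illustrative sketch:

```python
import numpy as np
from itertools import combinations

def kruskal_rank(M):
    """Largest n such that every set of n rows of M is linearly independent."""
    n_rows = M.shape[0]
    for n in range(n_rows, 0, -1):
        if all(np.linalg.matrix_rank(M[list(rows), :]) == n
               for rows in combinations(range(n_rows), n)):
            return n
    return 0
```

Note that full row rank implies full Kruskal rank, as stated above, because if all rows together are independent, then so is every subset of them.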


Allman et al. (2009) makes the connection that the probability distribution of a latent variable model with three observed variables and one latent variable that takes on a finite set of values can be described by the tensor M. Each Mu can be interpreted as a conditional probability matrix where row c is a probability distribution over the possible values of feature u given that the latent variable has value c.
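Equation 8 expresses M as a sum of k rank-one terms, which can be formed in a single einsum; a small sketch with arbitrary dimensions:

```python
import numpy as np

k, r1, r2, r3 = 3, 4, 5, 6
rng = np.random.default_rng(0)
M1 = rng.random((k, r1))
M2 = rng.random((k, r2))
M3 = rng.random((k, r3))

# Eq. (8): M[a, b, c] = sum_j M1[j, a] * M2[j, b] * M3[j, c]
M = np.einsum('ja,jb,jc->abc', M1, M2, M3)
assert M.shape == (r1, r2, r3)
```

In the latent variable interpretation, the shared index j plays the role of the unobserved class.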
In situations where there are more than three observed variables, variables can be combined to form “grouped”
variables that satisfy the theorem conditions, if needed. In our case, it is possible that individual PLFs do not
have codomains that are large and diverse enough to satisfy the Kruskal rank requirement. In these situations,
the condition can be satisfied by amalgamating multiple PLFs to form a grouped PLF, which can be viewed as
an observed variable with a codomain that is the Cartesian product of the codomains of its member PLFs.
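Forming such a grouped variable is just a Cartesian product of the member codomains; a sketch with frozensets standing in for label groups (names are our own):

```python
from itertools import product

def grouped_codomain(codomains):
    """Codomain of a grouped PLF: tuples pairing one label group per member PLF."""
    return set(product(*codomains))

# Two PLFs with two label groups each yield a grouped PLF with four outputs:
G1 = {frozenset({1, 2}), frozenset({3})}
G2 = {frozenset({1}), frozenset({2, 3})}
grouped = grouped_codomain([G1, G2])
```

The grouped variable's conditional probability matrix then has one column per such tuple, which is how its Kruskal rank can be made large enough.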
We show that the conditions of Theorem 1 ensure that, for a generic choice of parameters in a k-class model, there
is a tripartition of the PLFs such that two of the corresponding conditional probability matrices have full Kruskal
rank (k) and the third conditional probability matrix has a Kruskal rank of at least 2. Thus, the conditions
of Kruskal’s unique factorization theorem are satisfied, so we can recover the conditional probability matrices,
from which the accuracies αi,j , propensities βi , and class balance P (Y ) can be computed by solving a system of
equations.

Proof of Theorem 1. S1 , S2 , and S3 partition the PLFs into three disjoint subsets. For u = 1, 2, 3, define some
ordering to the PLFs in subset Su so that Su,i gives the i-th PLF in the subset. We will treat Su as a “grouped”
PLF with codomain G(Su ) = {(t1 , t2 , ..., t|Su | ) | t1 ∈ G(Su,1 ), t2 ∈ G(Su,2 ), ..., t|Su | ∈ G(Su,|Su | )}, where G(G)
denotes the codomain of PLF G. Let Mu denote the k × |G(Su )| conditional probability matrix for the combined
output of all PLFs in subset Su , where each entry is a product containing some combination of βi , (1 − βi ), αi,j ,
(1 − αi,j ), and normalizing constants. We assume that the class balance P (Y ) has positive entries and all PLFs
have non-zero propensities βi , because any extraneous class labels or non-voting PLFs would be removed. Define
M̃1 = diag(P (Y ))M1 , where diag(v) denotes the matrix with the entries of vector v along its main diagonal and
zeros elsewhere. P (G), the observed distribution of PLF outputs, corresponds to the three-dimensional tensor
obtained from applying Equation 8 to M̃1 , M2 , and M3 . We will consider the Kruskal ranks of M̃1 , M2 , and
M3 , which we respectively denote I1 , I2 , and I3 .
We first consider M2 . The (row) rank of a matrix A is equal to the largest integer n for which there exists
an n × n submatrix of A that has a nonzero determinant. The determinant of such a submatrix is called an
n-minor. M2 has less than full row rank if and only if all of its k-minors are zero. This condition can be
expressed as the vanishing of a polynomial in the entries of M2 , which are themselves functions of the label
model parameters. In other words, the set of parameter values for which M2 does not have full row rank is the zero
set of this polynomial. As described in Allman et al. (2009), so long as the polynomial is not identically zero,
the parameter values yielding less than full row rank form a measure-zero subset of the full parameter space. To
show that this polynomial is not zero for all values in the parameter space, it is sufficient to show that there
exists at least one set of parameter values for which the polynomial is nonzero, or, equivalently, that there is a
set of parameter values for which M2 has full row rank.
The values of the propensities βi and class balance P (Y ) do not affect row rank as long as they are positive,
as assumed above. We now show that there is a setting of the accuracies αi,j for which the Kruskal rank of M2
is k. Set all αi,j = 1. By the conditions of Theorem 1, for each class c, there is an output in the codomain of S2
for which c appears in all of the individual PLF outputs and no other class appears in all outputs. This implies
the following two statements about the column in M2 that is associated with this output: (1) the c-th entry of
this column does not contain (1 − αi,j ) in its product, and (2) all other entries are products containing at least
one (1 − αi,j ). When αi,j = 1, these entries containing (1 − αi,j ) are all zero. In other words, M2 has k columns
that are all zero except for a single entry, and the row containing this entry is different across the k columns.
These columns span a column space of dimension k. For any matrix A, dim(Col A) = dim(Row A), which is the
row rank of A. Thus, the row rank of M2 is k. Since M2 has full row rank when all αi,j = 1, it also has full
Kruskal rank. This shows that the polynomial whose nonvanishing determines whether M2 has full row rank is
not identically zero, so M2 generically has full row and Kruskal rank.
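As a toy check of this step (the numbers below are made up), a conditional probability matrix that contains, for each class, a column whose only nonzero entry is in that class's row has full row rank:

```python
import numpy as np

k = 3
# Columns 0..k-1 model the alpha_{i,j} = 1 case: each is supported on a single,
# distinct class. The remaining columns can hold arbitrary nonnegative entries.
M2 = np.column_stack([np.eye(k), np.full((k, 2), 0.25)])
rank = np.linalg.matrix_rank(M2)
```

Full row rank then implies full Kruskal rank, matching the fact quoted above that a matrix with full row rank also has full Kruskal rank.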
We now consider M̃1 . The arguments applied to M2 can be applied exactly to M1 , but we are interested in
M̃1 = diag(P (Y ))M1 . However, since we assumed that P (Y ) contains only positive entries, and multiplying each
row of a matrix by a nonzero scalar does not change its row rank, the same arguments can be applied to M̃1 .
We conclude that M̃1 also generically has full Kruskal rank.
Finally, we consider M3 . The Kruskal rank of a matrix is less than two only if there are two rows that are scalar
multiples of each other. This can happen in our model only when the class conditional accuracies for two classes
are exactly equal, which corresponds to a measure zero subset of the parameter space. Thus, we can generically
assume that M3 has a Kruskal rank of I3 ≥ 2.
Since, generically, M̃1 and M2 have Kruskal ranks of k and M3 has a Kruskal rank of at least 2, we have
I1 + I2 + I3 ≥ 2k + 2, so Kruskal’s unique factorization theorem tells us that we can recover M̃1 , M2 , and M3
from P (G), the observed distribution of PLF outputs. Once M̃1 , M2 , and M3 are known, the accuracies αi,j ,
propensities βi , and class balance P (Y ) can be computed using algebraic manipulations.

C DATASET INFORMATION

SciCite (Cohan et al., 2019) is a citation purpose classification dataset containing 8243 train, 916 development,
and 1861 test instances of 3 categories sourced from scientific literature. It is publicly available under the Apache
License 2.0.
TREC-6 (Li and Roth, 2002) is a publicly available dataset for research use. It is a question classification
dataset containing a broad range of open-domain questions from 6 semantic categories. Since the original
dataset lacks a validation/development set, we sample 389 instances from the training set, yielding a train/dev/test
split of 5063/389/500.
AG-News (Gulli, 2005; Zhang et al., 2015) is a publicly available dataset for research use. It is a large-scale
news topic classification dataset containing 4 categories. We similarly sample 500 training instances as our
development set. The train/dev/test sizes are 119.5k/500/7600, respectively.
LAD (Zhao et al., 2019) is a publicly available dataset for research use. It has approximately 78k instances
organized into five sub-datasets of common objects: electronics, vehicles, fruits, hairstyles, and animals. Each
sub-dataset is associated with 5 different seen/unseen class splits.
AwA2 (Xian et al., 2018) is a publicly available dataset for research use. It has 85 binary attributes for 50
animal classes. For our experiment, following the dataset authors, we adopt the proposed split that divides the
50 classes into 40 seen classes and 10 unseen classes.
All datasets we use are publicly available standard research datasets. These datasets generally do not contain
personally identifiable information. Public figures are sometimes mentioned in the text datasets.

D Additional Experiment Details

For the PLF development and label modeling stages of the text classification task, the experiments are run on a
local PC with an Intel i5-6600K CPU and 32 GB of RAM. For the discriminative modeling and PLF development
that involve neural network inference/training for the object classification task, we perform our experiments on
virtual computing instances with an Intel Xeon E5-2698 v4 CPU, one NVIDIA V100 GPU or one NVIDIA RTX 3090
GPU, and 32 GB of RAM.

D.1 Text Classification

For both LFs Only and NPLM, following prior practice in programmatic weak supervision (Ratner et al., 2020;
Bach et al., 2019), we filter the training instances by retaining only those with at least one PLF/LF vote;
the filtered instances are used for LFs Only/NPLM label and end model training. For the optimization of
LFs Only and NPLM, we use an initial learning rate of 0.01 and a reduce-learning-rate-on-plateau learning rate
scheduler with a decreasing factor of 0.1. We train the NPLM/LFs Only label models for 5 epochs.
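A minimal sketch of this coverage filtering, under our own hypothetical encoding in which an abstaining PLF outputs None:

```python
def filter_covered(instances, plf_outputs):
    """Keep only instances that received at least one non-abstain PLF vote.

    plf_outputs[i][j] is PLF j's partial label (a set of classes) for
    instance i, or None if that PLF abstained.
    """
    return [x for x, votes in zip(instances, plf_outputs)
            if any(v is not None for v in votes)]

covered = filter_covered(
    ['doc_a', 'doc_b', 'doc_c'],
    [[None, {0, 1}], [None, None], [{2}, None]])
```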
For the end model, we adopt a pretrained (English) BERT-base model (Devlin et al., 2019) (bert-base-uncased)
with a 3-layer classification head. The model is implemented with AllenNLP (Gardner et al., 2017). The 3-layer
classification head has a hidden size of 256 with LeakyReLU activation, batch normalization, and a dropout
layer with 50% probability. For all of the end models, we use the AdamW (Adam with weight decay) optimizer
(Loshchilov and Hutter, 2019) and train with a batch size of 16. For AG-News, we train for 20 epochs with a
starting learning rate of 5e-7 and a gradient clipping threshold of 2.0. For TREC-6, we train for a total of 25
epochs with a 2.0 gradient clip and a 3e-5 initial learning rate. For SciCite, we train for a total of 20 epochs
with a 2.0 gradient clip and a 3e-6 initial learning rate. The best
end model is picked based on the best validation macro-averaged F1. Please refer to <dataset>_pipeline.ipynb,
run_end_model_<dataset>.py, and /end/backbones/text_classifiers.py in the experiment code repository for
corresponding code.3

D.2 Object Classification

We use AwA2 for our generalized zero-shot experiments and the sub-tasks of LAD for zero-shot evaluations. For
AwA2, we follow the proposed splits; for LAD, we follow the seen/unseen class split guide noted by the original
authors. We adopt previous practices and guidelines in evaluating generalized AwA2 results, using average per-
class top-1 accuracy (or mean class accuracy, MCA) as the main performance metric for both unseen and seen
classes at test time, and then report the harmonic mean. For LAD, we follow the authors' practice of reporting
the average accuracy over the 5 sub-categories, each with 5 different seen/unseen class splits.
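The two quantities used here can be sketched as follows (our own minimal implementation for illustration, not the paper's evaluation code):

```python
import numpy as np

def mean_class_accuracy(y_true, y_pred):
    """Average per-class top-1 accuracy (MCA)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    classes = np.unique(y_true)
    return float(np.mean([np.mean(y_pred[y_true == c] == c) for c in classes]))

def harmonic_mean(seen_mca, unseen_mca):
    """Harmonic mean of seen and unseen MCA, the usual GZSL summary metric."""
    return 2 * seen_mca * unseen_mca / (seen_mca + unseen_mca)
```

MCA weights every class equally regardless of its frequency, which is why it is preferred over plain accuracy on imbalanced zero-shot test sets.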
While the train/validation split among the seen classes is given in AwA2, LAD does not supply one. We
randomly sample at least one and at most 10% of the seen classes as validation classes for the detector and use
the remaining seen data for training.
For both AwA2 and LAD, we train detectors for each attribute with a 3-layer MLP. We consider setting the
hidden dimensions to either 512 or 1024, and select the size that gives the higher minimum per-class accuracy
(in other words, whichever improves the worst-scoring class the most). We use ILSVRC-pretrained ResNet-50
features for LAD and seen-class-finetuned ResNet-101 features for AwA2. The minority class is balanced by
oversampling. We also apply batch normalization and 50% dropout at each layer during training; the activation
function is LeakyReLU. For the optimization, we adopt an Adam optimizer with an initial learning rate of
1e-4 and a multi-step learning rate scheduler. We train the detectors for {100, 300, 500} epochs with learning
rate scheduling step sizes of {30, 80, 200}, respectively. The best model is selected based on the best validation
accuracy measured on the held-out seen classes.
For the NPLM label model, we use an Adam optimizer with a reduce-learning-rate-on-plateau learning rate
scheduler. The full set of hyperparameters can be found in the experiment code repository. The end model is a
3-layer MLP with both hidden layers of size 1024. We apply batch normalization and 50% dropout at each layer,
with LeakyReLU activation. We optimize the end discriminative model with an initial learning rate of 1e-4 and
adopt a reduce-learning-rate-on-plateau learning rate scheduler with a decreasing factor of 0.1. For the generalized
task on AwA2, we train the model for 11 epochs. For LAD, we pick the model with the lowest training loss. As
with the text tasks, the training objective is soft cross entropy.
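The soft cross-entropy objective can be sketched as follows (a minimal NumPy version we wrote for illustration; the repository's training code uses its framework's equivalent). The targets here are the label model's posterior distributions over classes rather than one-hot labels:

```python
import numpy as np

def soft_cross_entropy(logits, target_probs):
    """Cross entropy against soft (probabilistic) targets, averaged over the batch."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(target_probs * log_probs).sum(axis=1).mean())
```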

E PLFs for Text Classification

In this section, we list the detailed partial labeling functions involved in our experiments. Corresponding code
for the PLFs can also be found in the supplementary code as IPython notebook files.
For the text experiments, to make PLF development faster and more convenient, we developed an abstraction
class WeakRule(exec_module, label_maps) for the PLFs, which has two crucial arguments: exec_module,
the programmatically defined decision function, and label_maps, which maps the result of the decision
function to a partial label group.
To make the development of some rules even simpler, we developed a further abstraction inherited from
WeakRule, called BinaryRERules. BinaryRERules examines whether a specified regular expression matches a given
sentence and gives either positive/negative feedback for each partial label group, or positive/abstain feedback if
it is a unipolar binary PLF (meaning that it only casts supervision signals onto a single partial label group or
abstains).
For SciCite (10 PLFs):
Label Index Mapping:
0 - Background 1 - Method 2 - Result

def df1(instance):
    sen = instance['preprocessed_string'].lower()
    if (' we ' in sen
            or ' our ' in sen
            or ' by us ' in sen):
        return 1
    return -1
firstperson_rule = WeakRule(
    exec_module=df1,
    label_maps={0:[0], 1:[1,2]})

3
github.com/BatsResearch/yu-aistats22-code

def sectionTitleRule(instance):
    results_pat = 'result|discussion|conclusion|observation'
method_pat = 'method|approach|experiment|evaluation'
intro_pat = 'introduction|background'
title = str(instance['sectionName']).lower()
if re.search(method_pat, title):
return 1
elif re.search(intro_pat, title):
return 0
elif re.search(results_pat, title):
return 2
return -1
sectionTitle_rule = WeakRule(
exec_module=sectionTitleRule,
label_maps={0:[0], 1:[1], 2:[0, 2]})

def df2(instance):
    sen = instance['preprocessed_string'].lower()
    if (' result ' in sen
            or ' results ' in sen):
        return 1
    return 0
result_rule = WeakRule(
exec_module=df2,
label_maps={0:[0,1], 1:[2]})

def length_citation(instance):
m = instance['citation']
if len(m.split(';')) > 2:
return 1
return -1
cit_len_rule = WeakRule(
exec_module=length_citation,
label_maps={0:[2], 1:[0,1]})

def df3(instance):
def result_related(inst):
patterns = ['equivocal result',
'similar result',
'same result',
'different result',
'expected result']
for pattern in patterns:
if pattern in inst:
return 1
return -1
return result_related(instance['preprocessed_string'].lower())
res_rule = WeakRule(
exec_module=df3,
label_maps={0:[1], 1:[0,2]})

re_patt_0 = ('using|measuring|used|the method|'
             'data|state-of-art|calculated|applied|'
             'according to|approach')
simple_re_rule_1 = BinaryRERules(
re_pattern=re_patt_0,
preproc=lambda inst:inst['string'],
label_maps={0:[0,2], 1:[1]}, unipolar=False)

def df4(instance):
def comparison(inst):
patterns = [
'in line with',
'discordant with',
'consistent with',
'keeping with',
'accordance with',
'agreement with',
'similar with',
'compared to',
'contrast to',
'contrary to',
'comparable to',
'contradict to',
'affirmed by',
'supported by',
'in support of',
]
for pattern in patterns:
if pattern in inst:
return 1
return -1
return comparison(instance['preprocessed_string'].lower())
resultp_rule = WeakRule(
exec_module=df4,
label_maps={0:[0,1], 1:[2]})

def df5(instance):
def match(inst):
patterns = [
'has been',
'in order to',
'considered to',
'initially',
'even if',
'have shown',
'has shown'
]
for pattern in patterns:
if pattern in inst:
return 1
return -1
return match(instance['preprocessed_string'].lower())
add_rule = WeakRule(
exec_module=df5,
label_maps={0:[2], 1:[0,1]})

re_patt_1 = ('our result|this is in keeping|'
             'with previous|this study differ|'
             'this is in agreement with|this conclusion|'
             'this finding|'
             'similar result.*(found|observed|obtained)')
re_tb = BinaryRERules(
re_pattern=re_patt_1,
preproc=lambda inst:inst['string'],
label_maps={0:[0,1], 1:[2]}, unipolar=True)
re_patt_2 = ('we (employ|utilize)|iap-as|'
             'metaanalyses|temporal transition|'
             'was estimated|quantitative (analyses|analysis)'
             '|this procedure|implementation|sequence analysis'
             '|regularization method|were analysed|we adopt|'
             'bayesian|were sampled|quantitative method|fracture|'
             'simulating|this design|algorithm|developed|'
             'model performance|(was|were) evaluated|as control'
             '|scheme|control management|(was|were|is|are) measured'
             '|the rats|the pigs|we appl')
re_tm = BinaryRERules(
re_pattern=re_patt_2,
preproc=lambda inst:inst['string'],
label_maps={0:[0,2], 1:[1]}, unipolar=True)

For TREC-6 (16 PLFs):


Label Index Mapping:
0 - Abbreviation 1 - Description and concepts 2 - Entities 3 - Human beings 4 - Locations 5 - Numeric values
def first_word(instance):
st = instance['string']
word = st.split()[0].lower()
if word == "who":
return 0
elif word == "where":
return 1
elif word == "when":
return 2
elif word == "why":
return 3
elif word == "how":
return 4
elif word == "name":
return 5
else:
return -1
first_word_rule = WeakRule(
exec_module=first_word,
label_maps={
0: [3],
1: [4],
2: [5],
3: [1],
4: [1, 5],
5: [2, 3],
6: [0]
})

def called(instance):
st = instance['string'].lower().split()
if "called" in st:
return 1
return -1
r1 = WeakRule(
exec_module=called,
label_maps={
0: [0, 1, 5],
1: [2, 3, 4]
})

def mean(instance):
st = instance['string'].lower().split()
if "mean" in st or "meaning" in st:
return 1
return -1
r2 = WeakRule(
exec_module=mean,
label_maps={
0: [2, 3, 4, 5],
1: [0, 1]
})
def abbre(instance):
st = instance['string'].lower()
if "stand for" in st or "abbreviat" in st:
return 1
return -1
r3 = WeakRule(
exec_module=abbre,
label_maps={
0: [1, 2, 3, 4, 5],
1: [0]
})

def desc(instance):
st = instance['string'].lower()
tokens = st.split()
if "definition" in tokens or \
"come from" in st or \
"origin" in tokens:
return 1
return -1
r4 = WeakRule(
exec_module=desc,
label_maps={
0: [0, 2, 3, 4, 5],
1: [1]
})

def enty(instance):
subject = get_subject(instance['string'])
if subject in ("animal", "body",
"color", "creative",
"currency", "disease",
"event", "food", "instrument",
"language", "letter", "plant",
"product", "religion",
"sport", "substance", "symbol",
"technique", "term",
"vehicle", "word"):
return 1
return -1
r5 = WeakRule(
exec_module=enty,
label_maps={
0: [0, 1, 3, 4, 5],
1: [2]
})

def loc(instance):
subject = get_subject(instance['string'])
if subject in ("city", "country", "mountain", "state", "capital"):
return 1
return -1
r7 = WeakRule(
exec_module=loc,
label_maps={
0: [0, 1, 2, 3, 5],
1: [4]
})

def num(instance):
st = instance['string'].lower()
if "what year" in st:
return 1
return -1
r8 = WeakRule(
exec_module=num,
label_maps={
0: [0, 1, 2, 3, 4],
1: [5]
})

def num1(instance):
st = instance['string'].lower()
if "how many" in st or 'how much' in st or 'how old' in st:
return 1
return -1
r9 = WeakRule(
exec_module=num1,
label_maps={
0: [0, 1, 2, 3, 4],
1: [5]
})

whatmean = BinaryRERules(
name='what_*_mean',
re_pattern='what.*mean',
preproc=lambda inst:inst['string'].lower(),
label_maps={0:[0, 2, 3, 4,5], 1:[1]}, unipolar=True)

desc_r1 = BinaryRERules(
name='what_*_',
re_pattern='what.*use of|what.*origin of|why do',
preproc=lambda inst:inst['string'].lower(),
label_maps={0:[0, 2, 3, 4,5], 1:[1]}, unipolar=True)

numpattern1 = BinaryRERules(
name='num_patt',
re_pattern='how far|what.*birthday|how long|how deep|'+
'when did|when was|how tall|what month|population|toll'+
'|how big|how long|what year',
preproc=lambda inst:inst['string'].lower(),
label_maps={0:[0,1, 2, 3, 4], 1:[5]}, unipolar=True)

descpatt_1 = BinaryRERules(
name='num_patt',
re_pattern='what is the origin|what is the history'+
'|what.*mean|how do you buy|what is the difference'+
'|how can I|how do I|what effect',
preproc=lambda inst:inst['string'].lower(),
label_maps={0:[0, 2, 3, 4,5], 1:[1]},
unipolar=True)

descpatt_2 = BinaryRERules(
name='num_patt',
re_pattern='how.*tell|how d.*affect|how do.*work'+
'|how do you fix|how do you get|how do you find'+
'|how do I find|how.*made',
preproc=lambda inst:inst['string'].lower(),
label_maps={0:[0, 2, 3, 4,5], 1:[1]}, unipolar=True)

newhow = BinaryRERules(
name='num_patt',
re_pattern='how do|how was|how are|how is'+
'|how was|how could|how can',
preproc=lambda inst:inst['string'].lower(),
label_maps={0:[0, 2, 3, 4,5], 1:[1]}, unipolar=True)

wsf = BinaryRERules(
name='what_*_sf',
re_pattern='what.*stand for',
preproc=lambda inst:inst['string'].lower(),
label_maps={0:[1, 2, 3, 4,5], 1:[0]}, unipolar=True)

For AG-News (11 PLFs):


Label Index Mapping: 0 - World 1 - Sports 2 - Business 3 - Science/Technology

def df1(instance):
entity_doc = instance['sen_ner']
if 'EVENT' in entity_doc and 'GPE' in entity_doc:
return 1
return -1
gpe_event_title = WeakRule(
name='gpe+event_title',
exec_module=df1,
label_maps={0:[0,2,3], 1:[1]})

def df2(instance):
entity_doc = instance['title_ner']
if 'LOC' in entity_doc:
return 1
return -1
loc_title_rule = WeakRule(name='loc_title',
    exec_module=df2,
    label_maps={0:[1,2,3], 1:[0]})

sp0 = BinaryRERules(
name='sports_names',
re_pattern='kelvim escobar|red sox|'+
'formula one|grand prix|svetlana kuznetsova'+
'|billy wagner| kobe | beckham|johnny damon'+
'|robin ventura|olivier panis',
preproc=lambda inst:inst['sen_lemma'].lower(),
label_maps={0:[0,2,3], 1:[1]}, unipolar=True)

def bexclusive_lemmas(instance):
business_keywords = {'profit', 'bankrupt', 'yen', 'financial'}
doc = instance['sen_lemma'].lower().split()
for word in doc:
for keyword in business_keywords:
if word.startswith(keyword):
return 1
return -1
bexclusive_lemmas_rule = WeakRule(
name='bus_lemma',
exec_module=bexclusive_lemmas,
label_maps={0:[0,1,3], 1:[2]})

tech_patt = ('space.com|space station|'
             'network authentication|(python|java|matlab|c) developer|'
             'application|virus|browser hijack|search engine|internet-based|'
             'windows update|smart phone|source code|mangement software'
             '|software.*develop|internet connection|interactive gam'
             '|game console|transfer datum|internet security|g network'
             '|internet company|storage capacity|music player|microsystem'
             '|comsumer electronic|operat.*system|wireness network|motherboard'
             '|spacecraft|malicious program|video game')
techr = BinaryRERules(name='tech_terms',
re_pattern=tech_patt,
preproc=lambda inst:inst['sen_lemma'].lower(),
label_maps={0:[0,1,2], 1:[3]}, unipolar=True)

def exclusive_lemmas(instance):
world_pol_keywords = {'mideast', 'iraq',
'baghdad', 'pakistan',
'afghan', 'kurd', 'arab',
'egypt', 'iran', 'turkey',
'syria', 'bahrain',
'israel', 'jordan',
'kuwait', 'lebanon',
'oman', 'palestine',
'qatar', 'saudi', 'uae',
'yemen' ,'chechnya'}
world_pol_keywords |= {'al-qaeda', 'taliban'}
world_pol_keywords |= {'hostage', 'abduct',
'hijack'}
world_pol_keywords |= {'invasion', ' coup ',
'curfew', 'army', 'troop',
'peace', 'militant', 'missile'}
world_pol_keywords |= {'murder', 'death'}
sports_keywords = {'baseball', 'football', 'soccer',
'hockey', 'basketball',
'tennis', 'golf'}
sports_keywords |= {'stadium', 'arena'}
sports_keywords |= {'season', 'playoff', 'tournament'}
sports_keywords |= {'mlb', 'nfl', 'nba', 'mls',
'nhl', 'ncaa', 'league', 'racing'}
sports_keywords |= {'premiership'}
sports_keywords |= {'quarterback', 'centerback',
'fullback', 'pitcher'}
business_keywords = {'profit', 'bankrupt', 'financial'}
bt_keywords = {'robot', 'robotic'}
bt_keywords |= {'web', 'internet'}
bt_keywords |= {'linux'}
bt_keywords |= {'stem-cell', 'biotechnology'}
bt_keywords |= {'xbox', 'playstation'}
bt_keywords |= {'microsoft'}
bt_keywords |= {'space', 'nasa'}
bt_keywords |= {'adobe', 'ipod', 'apple', 'xerox', 'ibm'}
doc = instance['sen_lemma'].lower().split()
for word in doc:
for keyword in bt_keywords:
if word.startswith(keyword):
return 3
for word in doc:
for keyword in world_pol_keywords:
if word.startswith(keyword):
return 0
for word in doc:
for keyword in sports_keywords:
if word.startswith(keyword):
return 1
for word in doc:
for keyword in business_keywords:
if word.startswith(keyword):
return 2
return -1
exclusive_lemmas_rule = WeakRule(
exec_module=exclusive_lemmas,
label_maps={0:[0], 1:[1], 2:[2], 3:[2,3]})

def df3(instance):
entity_doc = instance['title_ner']
if 'PERSON' in entity_doc and 'GPE' in entity_doc:
return 1
return -1
person_gpe_title_rule = WeakRule(
name='p+gpe_title',
exec_module=df3,
label_maps={0:[2, 3], 1:[0, 1]})

def df4(instance):
entity_doc = instance['sen_ner']
if 'PERSON' in entity_doc and 'EVENT' in entity_doc:
return 1
return -1
person_event_sen_rule = WeakRule(
name='person+event_sen',
exec_module=df4,
label_maps={0:[2,3], 1:[0,1]})

def df5(instance):
entity_doc = instance['title_ner']
if 'PRODUCT' in entity_doc:
return 1
return -1
sen_prod_t_rule = WeakRule(
exec_module=df5,
label_maps={0:[0,1], 1:[2,3]})

dpw2 = BinaryRERules(
name='dpw2',
re_pattern='minister|chairman',
preproc=lambda inst:inst['sen_lemma'].lower(),
label_maps={0:[1,3], 1:[0,2]}, unipolar=True)

intnr = BinaryRERules(
name='tech',
re_pattern='internet',
preproc=lambda inst:inst['sen_lemma'].lower(),
label_maps={0:[1,2], 1:[0,3]}, unipolar=True)
