Agnostic Learning of Monomials by Halfspaces Is Hard
December 1, 2010
Abstract
We prove the following strong hardness result for learning: Given a distribution of labeled
examples from the hypercube such that there exists a monomial consistent with a (1 − ε) fraction of
the examples, it is NP-hard to find a halfspace that is correct on a (1/2 + ε) fraction of the examples,
for an arbitrary constant ε > 0. In learning theory terms, weak agnostic learning of monomials
is hard, even if one is allowed to output a hypothesis from the much bigger concept class of
halfspaces. This hardness result subsumes a long line of previous results, including two recent
hardness results for the proper learning of monomials and halfspaces. As an immediate corollary
of our result we show that weak agnostic learning of decision lists is NP-hard.
Our techniques are quite different from previous hardness proofs for learning. We define
distributions on positive and negative examples for monomials whose first few moments match.
We use the invariance principle to argue that regular halfspaces (all of whose coefficients have
small absolute value relative to the total ℓ2 norm) cannot distinguish between distributions
whose first few moments match. For highly non-regular halfspaces, we use a structural lemma
from recent work on fooling halfspaces to argue that they are “junta-like” and one can zero
out all but the top few coefficients without affecting the performance of the halfspace. The
top few coefficients form the natural list decoding of a halfspace in the context of dictatorship
tests/Label Cover reductions.
We note that unlike previous invariance principle based proofs which are only known to give
Unique-Games hardness, we are able to reduce from a version of Label Cover problem that
is known to be NP-hard. This has inspired follow-up work on bypassing the Unique Games
conjecture in some optimal geometric inapproximability results.
∗ An extended abstract appeared in the Proceedings of the 50th IEEE Symposium on Foundations of Computer Science, 2009.
† IBM Almaden Research Center, San Jose, CA. [email protected].
‡ Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. [email protected].
§ College of Computing, Georgia Institute of Technology, Atlanta, GA. [email protected]. Some of this work was done when visiting Carnegie Mellon University.
¶ IBM Almaden Research Center, San Jose, CA. [email protected]. Most of this work was done when the author was at Carnegie Mellon University.
1 Introduction
Boolean conjunctions (or monomials), decision lists, and halfspaces are among the most basic
concept classes in learning theory. They are all long-known to be efficiently PAC learnable, when
the given examples are guaranteed to be consistent with a function from any of these concept
classes [42, 7, 39]. However, in practice data is often noisy or too complex to be consistently
explained by a simple concept. A common practical approach to such problems is to find a predictor
in a certain space of hypotheses that best fits the given examples. A general model for learning that
addresses this scenario is the agnostic learning model [21, 26]. An agnostic learning algorithm for
a class of functions C using a hypothesis space H is required to perform the following task: Given
examples drawn from some unknown distribution, the algorithm must find a hypothesis in H that
classifies the examples nearly as well as is possible by a hypothesis from C. The algorithm is said
to be a proper learning algorithm if C = H.
In this work we address the complexity of agnostic learning of monomials by algorithms that
output a halfspace as a hypothesis. Learning methods that output a halfspace as a hypothesis such
as Perceptron [40], Winnow [34], Support Vector Machines [43] as well as most boosting algorithms
are well-studied in theory and widely used in practical prediction systems. These classifiers are
often applied to labeled data sets which are not linearly separable. Hence it is of great interest to
determine the classes of problems that can be solved by such methods in the agnostic setting. In
this work we demonstrate a strong negative result on agnostic learning by halfspaces. We prove
that non-trivial agnostic learning of even the relatively simple class of monomials by halfspaces is
an NP-hard problem.
Theorem 1.1. For any constant ε > 0, it is NP-hard to find a halfspace that correctly labels a
(1/2 + ε)-fraction of given examples over {0, 1}^n even when there exists a monomial that agrees
with a (1 − ε)-fraction of the examples.
Note that this hardness result is essentially optimal since it is trivial to find a hypothesis with
agreement rate 1/2 — output either the function that is always 0 or the function that is always
1. Also note that Theorem 1.1 measures agreement of a halfspace and a monomial with the given
set of examples rather than the probability of agreement of h with an example drawn randomly
from an unknown distribution. Uniform convergence results based on the VC dimension imply that
these settings are essentially equivalent (see for example [21, 26]).
The class of monomials is a subset of the class of decision lists which in turn is a subset of the
class of halfspaces. Therefore our result immediately implies an optimal hardness result for proper
agnostic learning of decision lists.
Previous work
Before describing the details of the prior body of work on hardness results for learning, we note that
our result subsumes all these results with just one exception (the hardness of learning monomials
by t-CNFs [32]). This is because we obtain the optimal inapproximability factor and allow learning
of monomials by the much richer class of halfspaces.
The results of the paper are noteworthy in the broader context of hardness of approximation.
Previously, hardness proofs based on the invariance principle were only known to give Unique-Games
hardness. In this work, we are able to harness invariance principles to show an NP-hardness result by
working with a version of Label Cover whose projection functions are only required to be unique-
on-average. This could be one potential approach to revisit the many strong inapproximability
results conditioned on the Unique Games conjecture (UGC), with an eye towards bypassing the
UGC assumption. Such a goal was achieved for some geometric problems recently [?]; see Section
2.3.
Agnostic learning of monomials, decision lists and halfspaces has been studied in a number of
previous works. Proper agnostic learning of a class of functions C is equivalent to the ability to
come up with a function in C which has the optimal agreement rate with the given set of examples
and is also referred to as the Maximum Agreement problem for the class of functions C.
The Maximum Agreement problem for halfspaces is equivalent to the so-called Hemisphere
problem and is long known to be NP-complete [23, 17]. Amaldi and Kann [1] showed that Maximum
Agreement for halfspaces is NP-hard to approximate within a factor of 261/262. This was later improved
by Ben-David et al. [5], and Bshouty and Burroughs [9], to approximation factors of 415/418 and 85/84, respectively.
to a very mild amount of adversarial noise [16, 4, 18]. Our result implies that these positive results
will not hold when the adversarial noise rate is ε for any constant ε > 0.
Kalai et al. gave the first non-trivial algorithm for agnostic learning of monomials, running in time 2^{Õ(√n)}
[24]. They also gave a breakthrough result for agnostic learning of halfspaces with respect to the
uniform distribution on the hypercube up to any constant accuracy (and analogous results for a
number of other settings). Their algorithms output linear thresholds of parities as hypotheses. In
contrast, our hardness result is for algorithms that output a halfspace (which is a linear threshold
of single variables).
Organization of the paper: We sketch the idea of our proof in Section 2. We define some
probability and analytical tools in Section 3. In Section 4 we define the dictatorship test, which is
an important gadget for the hardness reduction. For the purpose of illustration, we also show
why this dictatorship test already suffices to prove Theorem 1.1 assuming the Unique Games
Conjecture [29]. In Section 5, we describe a reduction from a variant of the Label Cover problem
to prove Theorem 1.1 under the assumption that P ≠ NP.
Notation: We use 0 to encode “False” and 1 to encode “True”. We denote by pos : R → {0, 1}
the indicator function of whether t ≥ 0; i.e., pos(t) = 1 when t ≥ 0 and pos(t) = 0 when t < 0.
For x = (x_1, x_2, . . . , x_n) ∈ {0, 1}^n, w ∈ R^n, and θ ∈ R, a halfspace h(x) is a Boolean function of
the form pos(w · x − θ); a monomial (conjunction) is a function of the form ∧_{i∈S} s_i, where S ⊆ [n]
and s_i is a literal of x_i, i.e., either x_i or ¬x_i; a disjunction is a function of the form
∨_{i∈S} s_i. One special case of monomials is the function f(x) = x_i for some i ∈ [n], also referred to
as the i-th dictator function.
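As a concrete illustration of this notation, the following small Python sketch (our own illustration; the function names are not from the paper) evaluates a halfspace, a monomial, a disjunction, and a dictator function on a point of the hypercube, assuming the convention pos(t) = 1 for t ≥ 0:

def pos(t):
    # pos(t) = 1 iff t >= 0, as in the notation above.
    return 1 if t >= 0 else 0

def halfspace(w, theta, x):
    # h(x) = pos(w . x - theta)
    return pos(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def monomial(literals, x):
    # literals: list of (index, sign) pairs; sign=True stands for x_i, sign=False for NOT x_i.
    return int(all((x[i] == 1) == sign for i, sign in literals))

def disjunction(literals, x):
    return int(any((x[i] == 1) == sign for i, sign in literals))

def dictator(i, x):
    # The i-th dictator function f(x) = x_i, a special case of a monomial.
    return x[i]

x = [1, 0, 1]
print(halfspace([1, 1, 1], 2, x), monomial([(0, True), (2, True)], x), dictator(0, x))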
2 Proof Overview
We prove Theorem 1.1 by exhibiting a reduction from the k-Label Cover problem, which is
a particular variant of the Label Cover problem. The k-Label Cover problem is defined as
follows:
Definition 2.1. For positive integers M, N with M ≥ N and k ≥ 2, an instance of k-Label
Cover L(G(V, E), M, N, {π^{v,e} | e ∈ E, v ∈ e}) consists of a k-uniform connected (multi-)hypergraph
G(V, E) with vertex set V and an edge multiset E, together with a set of projection functions: every hyperedge
e = (v_1, . . . , v_k) is associated with a k-tuple of projections {π^{v_i,e}}_{i=1}^k, where π^{v_i,e} : [M] → [N].
A vertex labeling Λ is an assignment of labels to vertices, Λ : V → [M]. A labeling Λ is said to
strongly satisfy an edge e if π^{v_i,e}(Λ(v_i)) = π^{v_j,e}(Λ(v_j)) for every v_i, v_j ∈ e. A labeling Λ weakly
satisfies edge e if π^{v_i,e}(Λ(v_i)) = π^{v_j,e}(Λ(v_j)) for some v_i, v_j ∈ e, v_i ≠ v_j.
The goal in Label Cover is to find a vertex labeling that satisfies as many edges (projection
constraints) as possible.
2.1 Hardness assuming the Unique Games conjecture
For the sake of clarity, we first sketch the proof of Theorem 1.1 with a reduction from the k-
Unique Label Cover problem which is a special case of k-Label Cover where M = N and all
the projection functions {π^{v,e} | v ∈ e, e ∈ E} are bijections. The following inapproximability result
[?] for k-Unique Label Cover is equivalent to the Unique Games Conjecture of Khot [29].
Conjecture 2.2. For every constant η > 0 and every positive integer k, there exists an integer R_0
such that for all positive integers R ≥ R_0, given an instance L(G(V, E), R, R, {π^{v,e} | e ∈ E, v ∈ e})
it is NP-hard to distinguish between:
• strongly satisfiable instances: there exists a labeling Λ : V → [R] that strongly satisfies a 1 − kη
fraction of the edges E;
• almost unsatisfiable instances: there is no labeling that weakly satisfies a 2k²/R^{η/4} fraction of the
edges.
Given an instance L of k-Unique Label Cover, we will produce a distribution D over labeled
examples such that the following holds: if L is a strongly satisfiable instance, then there is a
disjunction that agrees with the label on a randomly chosen example with probability at least 1 − ε,
while if L is an almost unsatisfiable instance then no halfspace agrees with the label on a random
example from D with probability more than 1/2 + ε. Clearly, such a reduction implies Theorem 1.1
assuming the Unique Games Conjecture but with disjunctions in place of conjunctions. De Morgan’s
law and the fact that a negation of a halfspace is a halfspace then imply that the statement is also
true for monomials (we use disjunctions only for convenience).
Let L be an instance of k-Unique Label Cover on hypergraph G = (V, E) and a set of labels
[R]. The examples we generate will have |V| × R coordinates, i.e., belong to {0, 1}^{|V|×R}. These
coordinates are to be thought of as one block of R coordinates for every vertex v ∈ V. We will
index the coordinates of x ∈ {0, 1}^{|V|×R} as x = (x_v^{(r)})_{v∈V, r∈[R]}.
For every labeling Λ : V → [R] of the instance, there is a corresponding disjunction over
{0, 1}^{|V|×R} given by
    h(x) = ∨_{v∈V} x_v^{(Λ(v))}.
Thus, using a label r for a vertex v is encoded as including the literal x_v^{(r)} in the disjunction. Notice
that an arbitrary halfspace over {0, 1}|V |×R need not correspond to any labeling at all. The idea
would be to construct a distribution on examples which ensures that any halfspace agreeing with at
least a 1/2 + ε fraction of random examples somehow corresponds to a labeling Λ weakly satisfying
a constant fraction of the edges in L.
Fix an edge e = (v_1, . . . , v_k). For the sake of exposition, let us assume π^{v_i,e} is the identity
permutation for every i ∈ [k]; the general case is no more complicated.
For the edge e, we will construct a distribution on examples De with the following properties:
• All coordinates x_v^{(r)} for a vertex v ∉ e are fixed to be zero. Restricted to these examples, the
halfspace h can be written as h(x) = pos(Σ_{i∈[k]} ⟨w_{v_i}, x_{v_i}⟩ − θ).
• For any label r ∈ [R], the labeling Λ(v_1) = . . . = Λ(v_k) = r strongly satisfies the edge e.
Hence, the corresponding disjunction ∨_{i∈[k]} x_{v_i}^{(r)} needs to have agreement ≥ 1 − ε with the
examples from D_e.
• There exists a decoding procedure that, given a halfspace h, outputs a labeling Λ_h for L such
that, if h has agreement ≥ 1/2 + ε with the examples from D_e, then Λ_h weakly satisfies the edge
e with non-negligible probability.
For conceptual clarity, let us rephrase the above requirement as a testing problem. Given
a halfspace h, consider a randomized procedure that samples an example (x, b) from the distri-
bution De , and accepts if h(x) = b. This amounts to a test that checks if the function h cor-
responds to a consistent labeling. Further, let us suppose the halfspace h is given by
h(x) = pos(Σ_{v∈V} ⟨w_v, x_v⟩ − θ). Define the linear function f_v : {0, 1}^R → R as f_v(x_v) = ⟨w_v, x_v⟩. Then
we have h(x) = pos(Σ_{v∈V} f_v(x_v) − θ).
For a halfspace h corresponding to a labeling Λ, we will have f_v(x_v) = x_v^{(Λ(v))} – a dictator
function. Thus, in the intended solution every linear function f_v associated with the halfspace h is
a dictator function.
Now, let us again restate the above testing problem in terms of these linear functions. For
succinctness, we write f_i for the linear function f_{v_i}. We need a randomized procedure that does
the following:
• (Completeness) If each of the linear functions f_i is the r-th dictator function for some r ∈ [R],
then the test accepts with probability 1 − ε.
• (Soundness) If the test accepts with probability at least 1/2 + ε, then at least two of the linear
functions f_i and f_j must be “close” to the same dictator function.
A testing problem of the above nature is referred to as a dictatorship test and is a recurring
theme in hardness of approximation.
Notice that the notion of a linear function being close to a dictator function is not formally
defined yet. In most applications, a function is said to be close to a dictator if it has influential
coordinates. It is easy to see that this notion is not sufficient by itself here. For example, in the
halfspace pos(10^{100} x_1 + x_2 − 0.5), although the coordinate x_2 has little influence on the linear
function, it has significant influence on the halfspace.
We resolve this problem by using the notion of critical index (Definition 3.1) that was introduced
in [41] and has found numerous applications in the analysis of halfspaces [35, 38, 13]. Roughly
speaking, given a linear function f , the idea is to recursively delete its influential coordinates until
there are none left. The total number of coordinates so deleted is referred to as the critical index
of f . Let cτ (wi ) denote the critical index of wi , and let Cτ (wi ) denote the set of cτ (wi ) largest
coordinates of w_i. The linear function f_i is said to be close to the j-th dictator function for every j
in C_τ(w_i). A function is far from every dictator if it has critical index 0 – no influential coordinate
to delete.
An important issue is that the critical index of a linear function can be much larger than the
number of influential coordinates and cannot be appropriately bounded. In other words, a linear
function can be close to a large number of dictator functions, as per the definition above. To counter
this, we employ a structural lemma about halfspaces that was used in the recent work on fooling
halfspaces with limited independence [13]. Using this lemma, we are able to prove that if the critical
index is large, then one can in fact zero out the coordinates of wi outside the t largest coordinates
for some large enough t, and the agreement of the halfspace h only changes by a negligible amount!
Thus, we first carry out the zeroing operation for all linear functions with large critical index.
We now describe the above construction and analysis of the dictatorship test in some more
detail. It is convenient to think of the k queries x1 , . . . , xk as the rows of a k × R matrix with {0, 1}
entries. Henceforth, we will refer to matrices {0, 1}^{k×R} and their rows and columns.
We construct two distributions D_0, D_1 on {0, 1}^k such that for s ∈ {0, 1}, we have Pr_{x∼D_s}[∨_{i=1}^k x_i =
s] ≥ 1 − ε/2 for ε = o_k(1) (this will ensure the completeness of the reduction, i.e., certain disjunc-
tions pass with high probability). Further, the distributions D_0, D_1 will be carefully chosen to have
matching first four moments. This will be used in the soundness analysis where we will use an
invariance principle to infer structural properties of halfspaces that pass the test with probability
noticeably greater than 1/2.
We define the distribution D̃_s^R on matrices {0, 1}^{k×R} by sampling R columns independently
according to D_s, and then perturbing each bit with a small probability ε/2. We define the following
test (or equivalently, distribution on examples): given a halfspace h on {0, 1}^{k×R}, with probability
1/2 we check h(x) = 0 for a sample x ∼ D̃_0^R, and with probability 1/2 we check h(x) = 1 for a
sample x ∼ D̃_1^R.
Completeness: By construction, each of the R disjunctions OR_j(x) = ∨_{i=1}^k x_i^{(j)} passes the test
with probability at least 1 − ε (here x_i^{(j)} denotes the entry in the i-th row and j-th column of x).
Soundness: For the soundness analysis, suppose h(x) = pos(⟨w, x⟩ − θ) is a halfspace that
passes the test with probability at least 1/2 + ε. The halfspace h can be written in two ways by
expanding the inner product ⟨w, x⟩ along rows and columns, i.e., h(x) = pos(Σ_{i=1}^k ⟨w_i, x_i⟩ − θ) =
pos(Σ_{j=1}^R ⟨w^{(j)}, x^{(j)}⟩ − θ). Let us denote f_i(x_i) = ⟨w_i, x_i⟩.
First, let us see why the linear functions hwi , xi i must be close to some dictator. Note that we
need to show that two of the linear functions are close to the same dictator.
Suppose each of the linear functions fi is not close to any dictator. In other words, for each
i, no single coordinate of the vector w_i is too large (contains more than a τ fraction of the ℓ2 mass
‖w_i‖_2 of the vector w_i). Clearly, this implies that no single column of the matrix w is too large.
Recall that the halfspace is given by h(x) = pos(Σ_{j∈[R]} ⟨w^{(j)}, x^{(j)}⟩ − θ). Here l(x) = Σ_{j∈[R]} ⟨w^{(j)}, x^{(j)}⟩ −
θ is a degree-1 polynomial into which we are substituting values from two product distributions D_0^R
and D_1^R. Further, the distributions D_0 and D_1 have matching moments up to order 4 by design.
Using the invariance principle, the distribution of l(x) is roughly the same whether x is drawn from D_0^R
or D_1^R. Thus, by the invariance principle, the halfspace h is unable to distinguish between the
distributions D_0^R and D_1^R with a noticeable advantage.
Further, suppose no two linear functions f_i are close to the same dictator, i.e., C_τ(w_i) ∩ C_τ(w_j) =
∅ for all i ≠ j. In this case, we condition on the values of x_i^{(j)} for j ∈ C_τ(w_i). Since C_τ(w_i) ∩ C_τ(w_j) = ∅, this
conditions at most one value in each column. Therefore, the conditional distributions on each column
in the cases D_0 and D_1 still have matching first three moments. We thus apply the invariance principle,
using the fact that after deleting the coordinates in C_τ(w_i), all the remaining coefficients of the
weight vector w are small (by definition of the critical index), and conclude again that h cannot distinguish
D_0^R from D_1^R. This implies that C_τ(w_i) ∩ C_τ(w_j) ≠ ∅
for some two rows i, j, which finishes the proof of the soundness claim.
The above consistency-enforcing test almost immediately yields the Unique Games hardness of
weak learning disjunctions by halfspaces via standard methods.
Unlike previous invariance principle based proofs which are only known to give Unique-Games
hardness, we are able to reduce from a version of the Label Cover problem, based on unique
on average projections, that can be shown to be NP-hard. It is of great interest to find other
applications where a weak uniqueness property like the smoothness condition mentioned above
can be used to convert a Unique-Games hardness result to an unconditional NP-hardness result.
Indeed, inspired by the success of this work in avoiding the UGC assumption and using some of
our methods, follow-up work has managed to bypass the Unique Games conjecture in some optimal
geometric inapproximability results [?]. To the best of our knowledge, the results of [?] are the
first NP-hardness proofs showing a tight inapproximability factor that is related to fundamental
parameters of Gaussian space, and among the small handful of results where optimality of a non-
trivial semidefinite programming based algorithm is shown under the assumption P ≠ NP. We hope
that this paper has thus opened the avenue to convert at least some of the many tight Unique-Games
hardness results to NP-hardness results.
3 Preliminaries
In this section, we define two important tools in our analysis: i) critical index, ii) invariance
principle.
The notion of critical index was first introduced by Servedio [41] and plays an important role in
the analysis of halfspaces in [35, 38, 13].
Definition 3.1. Given a real vector w = (w^{(1)}, w^{(2)}, . . . , w^{(n)}) ∈ R^n, reorder the coordinates
by decreasing absolute value, i.e., |w^{(i_1)}| ≥ |w^{(i_2)}| ≥ . . . ≥ |w^{(i_n)}|, and denote σ_t² = Σ_{j=t}^n |w^{(i_j)}|².
For 0 ≤ τ ≤ 1, the τ-critical index of the vector w is defined to be the smallest index k such that
|w^{(i_k)}| ≤ τ σ_k. If no such k exists (i.e., |w^{(i_k)}| > τ σ_k for all k), the τ-critical index is defined to be +∞. The
vector w is said to be τ-regular if its τ-critical index is 1.
A simple observation from [13] is that if the critical index of a sequence is large then the sequence
must contain a geometrically decreasing subsequence.
Lemma 3.2. (Lemma 5.5 in [13]) Given a vector w = (w^{(i)})_{i=1}^n such that |w^{(1)}| ≥ |w^{(2)}| ≥ . . . ≥
|w^{(n)}|, if the τ-critical index of the vector w is larger than l, then for any 1 ≤ i ≤ j ≤ l + 1,
    |w^{(j)}| ≤ σ_j ≤ (√(1 − τ²))^{j−i} σ_i ≤ (√(1 − τ²))^{j−i} |w^{(i)}|/τ.
For a τ-regular weight vector, the following lemma bounds the probability that its weighted
sum falls into a small interval under certain distributions on the points. The proof is in Appendix B.
Lemma 3.3. Let w ∈ R^n be a τ-regular vector with Σ_i (w^{(i)})² = 1, and let D be a distribution over
{0, 1}^n. Define a distribution D̃ on {0, 1}^n as follows: to generate y from D̃, first sample x from
D and, independently for each i, set y^{(i)} = x^{(i)} with probability 1 − γ and set y^{(i)} to a uniformly random bit with
probability γ. Then for any a < b,
    Pr_{y∼D̃}[⟨w, y⟩ ∈ [a, b]] ≤ 4(b − a)/√γ + 4τ/√γ + 2e^{−γ²/(2τ²)}.
Intuitively, by the Berry-Esseen Theorem, ⟨w, y⟩ is τ-close to a Gaussian distribution if each
y^{(i)} is a random bit; in that case we can bound the probability that ⟨w, y⟩ falls into the interval [a, b].
In the above lemma, each y^{(i)} has probability γ of being a random bit, so roughly a γ fraction of the
coordinates are random bits, and we can similarly bound the probability that ⟨w, y⟩ falls into the interval [a, b].
Definition 3.4. For a vector w ∈ R^n, define the set of indices H_t(w) ⊆ [n] as the set of indices of
the t largest coordinates of w in absolute value. If the τ-critical index of w is c_τ, define the
set of indices C_τ(w) = H_{c_τ}(w). In other words, C_τ(w) is the set of indices whose deletion makes
the vector w τ-regular.
Definition 3.5. For a vector w ∈ R^n and a subset of indices S ⊆ [n], define the vector Truncate(w, S) ∈
R^n by
    (Truncate(w, S))^{(i)} = w^{(i)} if i ∈ S, and (Truncate(w, S))^{(i)} = 0 otherwise.
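The following Python sketch (our own illustration; the names are not from the paper) computes the τ-critical index of Definition 3.1 and the sets H_t(w), C_τ(w) and Truncate(w, S) of Definitions 3.4 and 3.5:

import numpy as np

def tau_critical_index(w, tau):
    # Definition 3.1: sort |w| decreasingly; return the smallest t (1-indexed) with
    # |w_(t)| <= tau * sigma_t, where sigma_t^2 sums the squares of the t-th largest
    # coordinate and all smaller ones; +infinity if no such t exists.
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]
    sigma = np.sqrt(np.cumsum((a ** 2)[::-1])[::-1])   # sigma[t-1] = sigma_t
    hits = np.nonzero(a <= tau * sigma)[0]
    return int(hits[0]) + 1 if hits.size else float("inf")

def H_t(w, t):
    # Definition 3.4: indices of the t largest coordinates of w in absolute value.
    return set(np.argsort(-np.abs(np.asarray(w, dtype=float)))[:t].tolist())

def C_tau(w, tau):
    # Definition 3.4: C_tau(w) = H_{c_tau}(w), where c_tau is the tau-critical index.
    c = tau_critical_index(w, tau)
    return H_t(w, len(w) if c == float("inf") else c)

def truncate(w, S):
    # Definition 3.5: keep coordinates with indices in S, zero out the rest.
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    idx = list(S)
    out[idx] = w[idx]
    return out

# A rapidly decaying vector is far from tau-regular: its tau-critical index is large.
w = [2.0 ** (-i) for i in range(20)]
print(tau_critical_index(w, tau=0.1), sorted(H_t(w, 3)))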
As suggested by Lemma 3.2, a weight vector with a large critical index has a geometrically
decreasing subsequence. The following two lemmas use this fact to bound the probability that the
weighted sum of a geometrically decreasing sequence of weights falls into a small interval. First,
we restate Claim 5.7 from [13] here.
Lemma 3.6. [Claim 5.7, [13]] Let w = (w^{(1)}, . . . , w^{(T)}) be such that |w^{(1)}| ≥ |w^{(2)}| ≥ . . . ≥ |w^{(T)}| > 0
and |w^{(i+1)}| ≤ |w^{(i)}|/3 for 1 ≤ i ≤ T − 1. Then for any interval I = [α − |w^{(T)}|/6, α + |w^{(T)}|/6] of length
|w^{(T)}|/3, there is at most one point x ∈ {0, 1}^T such that ⟨w, x⟩ ∈ I.
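As a quick sanity check of Lemma 3.6 (an illustration we add here, not part of the original text), the following brute-force Python snippet counts, for a weight vector decreasing by factors of 3, how many points of {0, 1}^T land in an interval of length |w^{(T)}|/3; the count never exceeds one:

from itertools import product

def points_in_interval(w, alpha):
    # Count x in {0,1}^T with <w, x> inside [alpha - |w_T|/6, alpha + |w_T|/6],
    # the interval of length |w_T|/3 from Lemma 3.6.
    half = abs(w[-1]) / 6.0
    return sum(1 for x in product([0, 1], repeat=len(w))
               if alpha - half <= sum(wi * xi for wi, xi in zip(w, x)) <= alpha + half)

w = [1.0 / 3 ** i for i in range(8)]   # |w^{(i+1)}| <= |w^{(i)}| / 3
print(max(points_in_interval(w, a / 100.0) for a in range(0, 160)))   # prints 1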
Lemma 3.7. Let w = (w^{(1)}, . . . , w^{(T)}) be such that |w^{(1)}| ≥ |w^{(2)}| ≥ . . . ≥ |w^{(T)}| > 0 and |w^{(i+1)}| ≤
|w^{(i)}|/3 for 1 ≤ i ≤ T − 1. Let D be a distribution over {0, 1}^T. Define a distribution D̃ on {0, 1}^T
as follows: to generate y from D̃, sample x from D and, independently for each i, set
    y^{(i)} = x^{(i)} with probability 1 − γ, and y^{(i)} = a random bit with probability γ.
Then for any interval I of length |w^{(T)}|/3,
    Pr_{y∼D̃}[⟨w, y⟩ ∈ I] ≤ (1 − γ/2)^T.
3.2 Invariance Principle
While invariance principles have been shown in various settings by [37, 11, 36], we restate a version of
the principle well suited for our application. We present a self-contained proof for it in Appendix C.
Definition 3.8. A function Ψ : R → R whose fourth-order derivative exists everywhere on
R is said to be K-bounded if |Ψ''''(t)| ≤ K for all t ∈ R.
Definition 3.9. Two ensembles of random variables P = (p_1, . . . , p_k) and Q = (q_1, . . . , q_k) are said
to have matching moments up to degree d if for every multi-set S of elements from [k] with |S| ≤ d, we
have E[Π_{i∈S} p_i] = E[Π_{i∈S} q_i].
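To make Definition 3.9 concrete, here is a small Python helper (our own sketch; it is an empirical check with a tolerance, not an exact verification) that tests whether two ensembles, given as arrays of samples, have approximately matching moments up to degree d:

from itertools import combinations_with_replacement
import numpy as np

def moments_match(P, Q, d, tol=1e-2):
    # P, Q: (num_samples, k) arrays of samples from two ensembles of k random variables.
    # Checks E[prod_{i in S} p_i] ~= E[prod_{i in S} q_i] for every multi-set S with |S| <= d.
    k = P.shape[1]
    for size in range(1, d + 1):
        for S in combinations_with_replacement(range(k), size):
            if abs(np.prod(P[:, list(S)], axis=1).mean()
                   - np.prod(Q[:, list(S)], axis=1).mean()) > tol:
                return False
    return True

rng = np.random.default_rng(0)
P = rng.integers(0, 2, size=(200000, 3)).astype(float)
Q = rng.integers(0, 2, size=(200000, 3)).astype(float)
print(moments_match(P, Q, d=3))   # two samples of the same distribution: True with high probability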
Theorem 3.10. (Invariance Principle) Let A = {A^{{1}}, . . . , A^{{R}}}, B = {B^{{1}}, . . . , B^{{R}}} be fam-
ilies of ensembles of random variables with A^{{i}} = {a_1^{(i)}, . . . , a_{k_i}^{(i)}} and B^{{i}} = {b_1^{(i)}, . . . , b_{k_i}^{(i)}}, satis-
fying the following properties:
• For each i ∈ [R], the ensembles (A^{{i}}, B^{{i}}) have matching moments up
to degree 3. Further, all the random variables in A and B are bounded by 1 in absolute value.
• The ensembles A^{{i}} are all independent of each other; similarly, the ensembles B^{{i}} are inde-
pendent of each other.
Given a set of vectors l = {l^{{1}}, . . . , l^{{R}}} (l^{{i}} ∈ R^{k_i}), define the linear function l : R^{k_1} × · · · × R^{k_R} → R as
    l(x) = Σ_{i∈[R]} ⟨l^{{i}}, x^{{i}}⟩.
Then, for every K-bounded function Ψ and every θ ∈ R,
    |E_A[Ψ(l(A) − θ)] − E_B[Ψ(l(B) − θ)]| ≤ O(K) · Σ_{i∈[R]} ‖l^{{i}}‖_1^4.
Further, define the spread function c(α) corresponding to the ensembles A, B and the linear function l
to be any bound such that, for every θ' ∈ R,
    Pr_A[l(A) ∈ [θ' − α, θ' + α]] ≤ c(α)   and   Pr_B[l(B) ∈ [θ' − α, θ' + α]] ≤ c(α).
Then, for every θ ∈ R,
    |E_A[pos(l(A) − θ)] − E_B[pos(l(B) − θ)]| ≤ O(1/α⁴) · Σ_{i∈[R]} ‖l^{{i}}‖_1^4 + 2c(α).
Roughly speaking, the second part of the theorem states that the pos function can be thought of as
1/α⁴-bounded with error parameter c(α).
4 Construction of the Dictatorship Test
In this section we describe the construction of the dictatorship test which will be the key ingredient
in the hardness reduction from k-Unique Label Cover.
The dictatorship test is based on the following two distributions D_0 and D_1 defined on {0, 1}^k.
Lemma 4.1. For k ∈ N, there exist two probability distributions D_0, D_1 on {0, 1}^k such that for
x = (x_1, . . . , x_k),
    Pr_{x∼D_0}[every x_l is 0] ≥ 1 − 2/√k   and   Pr_{x∼D_1}[every x_l is 0] ≤ 1/√k,
while D_0 and D_1 have matching moments up to degree 4, i.e., for all i, j, m, n ∈ [k],
    E_{D_0}[x_i] = E_{D_1}[x_i],  E_{D_0}[x_i x_j] = E_{D_1}[x_i x_j],  E_{D_0}[x_i x_j x_m] = E_{D_1}[x_i x_j x_m],  E_{D_0}[x_i x_j x_m x_n] = E_{D_1}[x_i x_j x_m x_n].
Proof. Set ε = 1/√k. The distribution D_1 is defined, in particular, so that
1. with probability (1 − ε), exactly one randomly chosen bit is set to 1 and all the others to 0.
The distribution D_0 is parameterized by nonnegative reals ε_1, ε_2, ε_3, ε_4 and sets all bits to 0 with probability at least 1 − (ε_1 + ε_2 + ε_3 + ε_4).
From the definition of D_0, D_1, we know that Pr_{x∼D_0}[every x_i is 0] ≥ 1 − (ε_1 + ε_2 + ε_3 + ε_4) and
Pr_{x∼D_1}[every x_i is 0] ≤ ε = 1/√k.
It remains to determine each ε_i. Notice that the moment matching conditions can be expressed
as a linear system over the parameters ε_1, ε_2, ε_3, ε_4 as follows:
    Σ_{i=1}^4 ε_i (i/k^{1/3}) = (1 − ε)/k + Σ_{i=1}^4 (ε/4)(i/k^{1/3})
    Σ_{i=1}^4 ε_i (i/k^{1/3})² = Σ_{i=1}^4 (ε/4)(i/k^{1/3})²
    Σ_{i=1}^4 ε_i (i/k^{1/3})³ = Σ_{i=1}^4 (ε/4)(i/k^{1/3})³
    Σ_{i=1}^4 ε_i (i/k^{1/3})⁴ = Σ_{i=1}^4 (ε/4)(i/k^{1/3})⁴.
We then show that this linear system has a feasible solution ε_1, ε_2, ε_3, ε_4 ≥ 0 with Σ_{i=1}^4 ε_i ≤ 2/√k.
To prove this, we apply Cramer's rule:
    ε_1 = det(M_1) / det(M),
where M is the coefficient matrix of the system, with rows
    (1/k^{1/3}, 2/k^{1/3}, 3/k^{1/3}, 4/k^{1/3}),
    (1/k^{2/3}, 4/k^{2/3}, 9/k^{2/3}, 16/k^{2/3}),
    (1/k, 8/k, 27/k, 64/k),
    (1/k^{4/3}, 16/k^{4/3}, 81/k^{4/3}, 256/k^{4/3}),
and M_1 is M with its first column replaced by the right-hand-side vector
    ((1 − ε)/k + Σ_{i=1}^4 (ε/4)(i/k^{1/3}),  Σ_{i=1}^4 (ε/4)(i/k^{1/3})²,  Σ_{i=1}^4 (ε/4)(i/k^{1/3})³,  Σ_{i=1}^4 (ε/4)(i/k^{1/3})⁴).
For large enough k, we have 0 ≤ ε_1 ≤ 1/(2√k). By a similar calculation, we can bound each of ε_2, ε_3, ε_4 by 1/(2√k).
Overall, we have ε_1 + ε_2 + ε_3 + ε_4 ≤ 2/√k.
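As a numerical sanity check (our own sketch, assuming the linear system exactly as reconstructed above and ε = 1/√k), the following Python snippet solves the 4 × 4 system for a large k and confirms that the solution is nonnegative with ε_1 + ε_2 + ε_3 + ε_4 ≤ 2/√k:

import numpy as np

k = 10 ** 8                      # k needs to be fairly large for nonnegativity to kick in
eps = 1.0 / np.sqrt(k)
base = np.array([1, 2, 3, 4]) / k ** (1.0 / 3)

# Coefficient matrix: row d has entries (i / k^{1/3})^d for i = 1..4.
M = np.vstack([base ** d for d in range(1, 5)])
# Right-hand side: (eps/4) * sum_i (i/k^{1/3})^d, plus (1 - eps)/k in the degree-1 equation.
rhs = np.array([(eps / 4) * np.sum(base ** d) for d in range(1, 5)])
rhs[0] += (1 - eps) / k

eps_i = np.linalg.solve(M, rhs)
print(eps_i)
print(bool(np.all(eps_i >= 0)), float(eps_i.sum()) <= 2 / np.sqrt(k))   # expected: True True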
Definition 4.2. For b ∈ {0, 1}, define the distribution D̃_b on {0, 1}^k as follows: to generate y from D̃_b, sample x from D_b and, independently for each i ∈ [k], set y_i = x_i with probability 1 − 1/k² and set y_i to a uniformly random bit with probability 1/k².
Observation 4.3. D̃0 and D̃1 also have matching moments up to degree 4.
Proof. Since the noise is an independent uniform random bit, when calculating moments
of y, such as E_{D̃_b}[y_{i_1} y_{i_2} · · · y_{i_d}], we can substitute y_i by (1 − γ)x_i + γ/2, where γ = 1/k² is the noise rate. Therefore, a degree-d
moment of y can be expressed as a weighted sum of moments of x of degree up to d. Since
D_0 and D_1 have matching moments up to degree 4, it follows that D̃_0 and D̃_1 also have the same
property.
The following simple lemma asserts that conditioning the two distributions D̃0 and D̃1 on the
same coordinate xj being fixed to value b results in conditional distributions that still have matching
moments up to degree 3.
Lemma 4.4. Given two distributions P_0, P_1 on {0, 1}^k with matching moments up to degree d, for
any multi-set S of elements from [k] with |S| ≤ d − 1, any j ∈ [k] and any c ∈ {0, 1},
    E_{P_0}[Π_{i∈S} x_i | x_j = c] = E_{P_1}[Π_{i∈S} x_i | x_j = c].
Proof. First consider c = 1. Since |S ∪ {j}| ≤ d and the moments match up to degree d,
    E_{P_0}[Π_{i∈S} x_i | x_j = 1] = E_{P_0}[x_j Π_{i∈S} x_i] / E_{P_0}[x_j] = E_{P_1}[x_j Π_{i∈S} x_i] / E_{P_1}[x_j] = E_{P_1}[Π_{i∈S} x_i | x_j = 1].
For the case c = 0, replace x_j with x'_j = 1 − x_j. It is easy to see that P_0 and P_1 still have
matching moments and conditioning on x_j = 0 is the same as conditioning on x'_j = 1. Hence we
can reduce to the case c = 1.
4.2 The Dictatorship Test
Let R be a positive integer. Based on the distributions D0 and D1 , we define the dictatorship test
as follows:
4. Output the labeled example (y, b). Equivalently, if h denotes the halfspace, ACCEPT
if h(y) = b.
We can also view y as being generated as follows: i) with probability 1/2, generate a negative
example (label 0) from distribution D̃_0^R; ii) with probability 1/2, generate a positive example (label 1) from distribution
D̃_1^R.
The dictatorship test has the following completeness and soundness properties.
Theorem 4.5. (completeness) For any j ∈ [R], the disjunction h(y) = ∨_{i=1}^k y_i^{(j)} passes with probability ≥ 1 − 3/√k.
Theorem 4.6. (soundness) Fix τ = 1/k⁷ and t = (1/τ²)(3 ln(1/τ) + ln R) + ⌈4k² ln k⌉ · ⌈(4/τ²) ln(1/τ)⌉. Let
h(y) = pos(⟨w, y⟩ − θ) be a halfspace such that H_t(w_i) ∩ H_t(w_j) = ∅ for all i ≠ j ∈ [k]. Then the
halfspace h(y) passes the dictatorship test with probability at most 1/2 + O(1/k).
Proof. (Theorem 4.5) If x is generated from D_0^R, we know that with probability at least 1 − 2/√k, all
the bits in {x_1^{(j)}, x_2^{(j)}, . . . , x_k^{(j)}} are set to 0. By a union bound over the noise, with probability at least 1 − 2/√k − 1/k,
{y_1^{(j)}, y_2^{(j)}, . . . , y_k^{(j)}} are all 0, in which case the test passes as ∨_{i=1}^k y_i^{(j)} = 0. If x is generated
from D_1^R, we know that with probability at least 1 − 1/√k, one of the bits in {x_1^{(j)}, x_2^{(j)}, . . . , x_k^{(j)}} is set
to 1, and by a union bound one of {y_1^{(j)}, y_2^{(j)}, . . . , y_k^{(j)}} is set to 1 with probability at least 1 − 1/√k − 1/k,
in which case the test passes since ∨_{i=1}^k y_i^{(j)} = 1. Overall, the test passes with probability at least
1 − 3/√k.
We will prove the contrapositive of Theorem 4.6: if some h(y) passes the above dictator-
ship test with high probability, then we can decode, for each w_i (i ∈ [k]), a small list of coordinates
such that at least two of the lists intersect.
The proof is based on two key lemmas (Lemmas 4.7 and 4.8). The first lemma states that if a
halfspace passes the test with good probability, then two of its critical index sets C_τ(w_i), C_τ(w_j)
must intersect. This would immediately imply Theorem 4.6 if |C_τ(w_i)| were at most t for every i. The second lemma
states that every halfspace can be approximated by another halfspace in which each |C_τ(w_i)| is less than
t; so we can assume that the critical indices are small without loss of generality.
Let h(y) be a halfspace function on {0, 1}^{kR} given by h(y) = pos(⟨w, y⟩ − θ). Equivalently,
h(y) can be written as
    h(y) = pos(Σ_{j∈[R]} ⟨w^{(j)}, y^{(j)}⟩ − θ) = pos(Σ_{i∈[k]} ⟨w_i, y_i⟩ − θ).
Similarly define the ensemble B = B^{{1}}, . . . , B^{{R}} using y chosen randomly from the distribution
D̃_1^R. Further, let us denote l^{{j}} = (l_1^{(j)}, . . . , l_k^{(j)}). Now we apply the invariance principle (Theorem
3.10) to the ensembles A, B and the linear function l. For each j ∈ [R], there is at most one
coordinate i ∈ [k] such that j ∈ C_τ(w_i). Thus, conditioning on y_C amounts to fixing at most
one variable y_i^{(j)} in each column {y_i^{(j)}}_{i∈[k]}. By Lemma 4.4, since D̃_0 and D̃_1 have matching moments
up to degree 4, we get that A^{{j}} and B^{{j}} have matching moments up to degree 3. Also notice
that max_{j∈[R], i∈[k]} |l_i^{(j)}| ≤ τ‖l_i‖_2 ≤ τ‖l‖_2 (as each l_i is τ-regular), and each y_i^{(j)} is set to be a random
unbiased bit with probability 1/k²; by Lemma 3.3, the linear function l and the ensembles A, B
satisfy the following spread property for every θ' ∈ R:
    Pr_A[l(A) ∈ [θ' − α, θ' + α]] ≤ c(α),
    Pr_B[l(B) ∈ [θ' − α, θ' + α]] ≤ c(α),
where c(α) ≤ 8αk + 4τk + 2e^{−1/(2τ²k⁴)} (by setting γ = 1/k² and |b − a| = 2α in Lemma 3.3). Using the
invariance principle (Theorem 3.10), this implies:
    |E_A[pos(⟨s, y_C⟩ + Σ_{j∈[R]} ⟨l^{{j}}, A^{{j}}⟩ − θ) | y_C] −
     E_B[pos(⟨s, y_C⟩ + Σ_{j∈[R]} ⟨l^{{j}}, B^{{j}}⟩ − θ) | y_C]|
    ≤ O(1/α⁴ · Σ_{j∈[R]} ‖l^{{j}}‖_1⁴) + 2c(α).    (1)
By the definition of the critical index, we have max_{j∈[R]} |l_i^{(j)}| ≤ τ‖l_i‖_2. Using this, we can bound
Σ_{j∈[R]} ‖l^{{j}}‖_1⁴ as follows:
    Σ_{j∈[R]} ‖l^{{j}}‖_1⁴ ≤ k⁴ Σ_{j∈[R]} Σ_{i∈[k]} |l_i^{(j)}|⁴ ≤ k⁴ Σ_{i∈[k]} max_{j∈[R]} |l_i^{(j)}|² · ‖l_i‖_2²
    ≤ k⁴ τ² Σ_{i∈[k]} ‖l_i‖_2² ≤ k⁴ τ² ‖l‖_2² ≤ 1/k^{10}.
In the final inequality of the above calculation, we used the fact that τ = 1/k⁷ and ‖l‖_2 ≤ 1. Let us
choose α = 1/k²; the right-hand side of (1) is then bounded by O(1/k) for all settings of y_C. Averaging over all
settings of y_C, we get that
    |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| ≤ O(1/k).
The above lemma asserts that unless some two vectors w_i, w_j have a common influential co-
ordinate, the halfspace h(y) cannot distinguish between D̃_0^R and D̃_1^R. Unlike with the traditional
notion of influence, it is unclear whether the number of coordinates in C_τ(w_i) is small. The following
lemma yields a way to get around this.
Lemma 4.8. (Bounding the number of influential coordinates) Let t be set as in Theorem 4.6.
Given a halfspace h(y) and r ∈ [k] such that |C_τ(w_r)| > t, define h̃(y) = pos(Σ_{i∈[k]} ⟨w̃_i, y_i⟩ − θ)
as follows: w̃_r = Truncate(w_r, H_t(w_r)) and w̃_i = w_i for all i ≠ r. Then,
    |E_{D̃_0^R}[h̃(y)] − E_{D̃_0^R}[h(y)]| ≤ 1/k²   and   |E_{D̃_1^R}[h̃(y)] − E_{D̃_1^R}[h(y)]| ≤ 1/k².
Similarly, define the vectors y_1^G, y_1^H, y_1^{>t}. We now rewrite the halfspace functions h(y) and h̃(y)
as:
    h(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ + ⟨w_1^{>t}, y_1^{>t}⟩ − θ),
    h̃(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ − θ).
Since the τ-critical index of w_1 exceeds t, Lemma 3.2 gives
    |w_1^{(g_T)}|² ≥ (τ²/(1 − τ²)^{t − g_T}) ‖w_1^{>t}‖_2² ≥ (τ²/(1 − τ²)^{(1/τ²)(3 ln(1/τ) + ln R)}) ‖w_1^{>t}‖_2² ≥ (R/τ) ‖w_1^{>t}‖_2².
Using the fact that R‖w_1^{>t}‖_2² ≥ ‖w_1^{>t}‖_1², we get that ‖w_1^{>t}‖_1 ≤ √τ |w_1^{(g_T)}| ≤ (1/6)|w_1^{(g_T)}|. Combin-
ing the above inequality with (2) we see that
    Pr_{D̃_0^R}[h(y) ≠ h̃(y)] ≤ Pr_{D̃_0^R}[|Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ − θ| ≤ |⟨w_1^{>t}, y_1^{>t}⟩|]
    ≤ Pr_{D̃_0^R}[|Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ − θ| ≤ |w_1^{(g_T)}|/6]
    = Pr_{D̃_0^R}[⟨w_1^G, y_1^G⟩ ∈ [θ' − |w_1^{(g_T)}|/6, θ' + |w_1^{(g_T)}|/6]],
where θ' = −Σ_{i=2}^k ⟨w_i, y_i⟩ − ⟨w_1^H, y_1^H⟩ + θ. Any fixing of the remaining coordinates determines θ' ∈ R and induces a
certain distribution on y_1^G. However, the 1/k² noise introduced in y_1^G is completely independent.
This corresponds to the setting of Lemma 3.7, and hence we can bound the above probability by
(1 − 1/(2k²))^T ≤ 1/k². The result follows from averaging over all values of θ'.
With the two lemmas above, we now prove the soundness property.
Proof. (Theorem 4.6) The probability that h(y) passes the test is 1/2 + (1/2)(E_{D̃_1^R}[h(y)] − E_{D̃_0^R}[h(y)]).
Therefore, it suffices to show that |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| = O(1/k). Define I = {r ∈ [k] : |C_τ(w_r)| > t} and consider two cases.
1. I = ∅; i.e., |C_τ(w_i)| ≤ t for all i ∈ [k]. Then for any i ≠ j ∈ [k], H_t(w_i) ∩ H_t(w_j) = ∅ implies
C_τ(w_i) ∩ C_τ(w_j) = ∅, and by Lemma 4.7, |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| ≤ O(1/k).
Figure 4.4: The reduction from k-Unique Label Cover.
1. Sample an edge e = (v_1, . . . , v_k) ∈ E.
2. Generate a random bit b ∈ {0, 1}.
3. Sample x ∈ {0, 1}^{kR} from D̃_b^R.
4. Define y ∈ {0, 1}^{|V|×R} as follows:
   (a) For each v ∉ e, set y_v = 0.
   (b) For each i ∈ [k] and j ∈ [R], set y_{v_i}^{(j)} = x_i^{(π^{v_i,e}(j))}.
5. Output the labeled example (y, b).
2. I ≠ ∅. Then for each r ∈ I, we set w̃_r = Truncate(w_r, H_t(w_r)) and replace w_r with w̃_r in h to
get a new halfspace h'. Since such replacements occur at most k times and by Lemma 4.8 every
replacement changes the output of the halfspace on at most a 1/k² fraction of examples, we can bound
the overall change by k × 1/k² = 1/k. That is,
    |E_{D̃_0^R}[h'(y)] − E_{D̃_0^R}[h(y)]| ≤ 1/k,   |E_{D̃_1^R}[h'(y)] − E_{D̃_1^R}[h(y)]| ≤ 1/k.    (3)
For the halfspace h' we have |C_τ(w̃_r)| ≤ t for all r ∈ [k], so Case 1 applies to h' and gives |E_{D̃_0^R}[h'(y)] − E_{D̃_1^R}[h'(y)]| ≤ O(1/k). Combining this with (3) yields |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| ≤ O(1/k), completing the proof.
With the dictatorship test defined, we now describe briefly a reduction from the k-Unique Label
Cover problem to agnostic learning of monomials, thus showing Theorem 1.1 under the Unique
Games Conjecture (Conjecture 2.2). Although our final hardness result only assumes P ≠ NP, we
describe the reduction from k-Unique Label Cover for the purpose of illustrating the main idea of
our proof.
Let L(G(V, E), R, R, {π^{v,e} | v ∈ V, e ∈ E}) be an instance of k-Unique Label Cover. The re-
duction is defined in Figure 4.4. It produces a distribution over labeled examples (y, b), where
y ∈ {0, 1}^{|V|×R} and the label b ∈ {0, 1}. We index the coordinates of y ∈ {0, 1}^{|V|×R} by y_w^{(i)} (for
w ∈ V, i ∈ [R]) and denote by y_w (for w ∈ V) the vector (y_w^{(1)}, y_w^{(2)}, . . . , y_w^{(R)}).
Proof of Theorem 1.1 assuming the Unique Games Conjecture. Fix k = 10/ε², a sufficiently small constant η > 0 (as a function of ε), and a
positive integer R ≥ ⌈(2k)^{1/η²}⌉ for which Conjecture 2.2 holds.
Completeness: Suppose that Λ : V → [R] is a labeling that strongly satisfies a 1 − kη fraction
of the edges. Consider the disjunction h(y) = ∨_{v∈V} y_v^{(Λ(v))}. For at least a 1 − kη fraction of edges
e = (v_1, v_2, . . . , v_k) ∈ E, we have π^{v_1,e}(Λ(v_1)) = · · · = π^{v_k,e}(Λ(v_k)) = r for some r ∈ [R]. Let us fix such a choice of edge e in
step 1. As all coordinates of y outside of {y_{v_1}, . . . , y_{v_k}} are set to 0 in step 4(a), the disjunction
reduces to ∨_{i∈[k]} y_{v_i}^{(Λ(v_i))} = ∨_{i∈[k]} x_i^{(r)}. By Theorem 4.5, such a disjunction agrees with (y, b)
with probability at least 1 − 3/√k. Therefore h(y) agrees with a random example with probability
at least (1 − 3/√k)(1 − kη) ≥ 1 − 3/√k − kη ≥ 1 − ε.
Soundness: Suppose there exists a halfspace h(y) = pos(Σ_{v∈V} ⟨w_v, y_v⟩ − θ) that agrees with more than
a 1/2 + ε ≥ 1/2 + 1/√k fraction of the examples. Set t = k^{14}(3 ln(k⁷) + ln R) + ⌈4k^{14} ln k⁷⌉ · ⌈4k² ln k⌉ =
O(k^{16} ln R) (same as in Theorem 4.6). Define the labeling Λ using the following strategy: for each
vertex v ∈ V randomly pick a label from H_t(w_v).
By an averaging argument, for at least an ε/2 fraction of the edges e ∈ E generated in step 1 of the
reduction, h(y) agrees with the examples corresponding to e with probability at least 1/2 + ε/2. We
will refer to such edges as good. By Theorem 4.6, for each good edge e ∈ E there exist i, j ∈ [k]
such that π^{v_i,e}(H_t(w_{v_i})) ∩ π^{v_j,e}(H_t(w_{v_j})) ≠ ∅. Therefore the edge e ∈ E is weakly satisfied by the
labeling Λ with probability at least 1/t². Hence, in expectation the labeling Λ weakly satisfies at least an
(ε/2) · (1/t²) = Ω(1/(k^{33} ln² R)) ≥ 2k²/R^{η/4} fraction of the edges (by the choice of R and t), contradicting the soundness guarantee of Conjecture 2.2.
5 The Reduction from Smooth Label Cover
In this section, we describe a reduction from k-Label Cover with an additional smoothness
property to the problem of agnostic learning of disjunctions by halfspaces. This will give us Theo-
rem 1.1 without assuming the Unique Games Conjecture.
Our reduction uses the following hardness result for k-Label Cover (Definition 2.1) with the
additional smoothness property.
Theorem 5.1. There exists a constant γ > 0 such that for any integer parameters J, u ≥ 1, it is NP-
hard to distinguish between the following two types of k-Label Cover L(G(V, E), M, N, {π^{v,e} | e ∈
E, v ∈ e}) instances with M = 7^{(J+1)u} and N = 2^u 7^{Ju}:
1. (Strongly satisfiable instances) There is some labeling that strongly satisfies every hyperedge.
2. (Instances that are not 2k²2^{−γu}-weakly satisfiable) There is no labeling that weakly satisfies
at least a 2k²2^{−γu} fraction of the hyperedges.
Moreover, the instances satisfy the following smoothness property:
• For any mapping π^{v,e} and any number i ∈ [N], we have |(π^{v,e})^{−1}(i)| ≤ d = 4^u; i.e., there are
at most d = 4^u elements in [M] that are mapped to the same number in [N].
Figure 5.2: The reduction from smooth k-Label Cover.
• Pick a hyperedge e = (v_1, v_2, . . . , v_k) ∈ E with corresponding projections π^{v_1,e}, . . . , π^{v_k,e} :
[M] → [N].
1. For each v ∉ e, set y_v = 0.
2. For each i ∈ [k], set y_{v_i} ∈ {0, 1}^M as follows:
    y_{v_i}^{(j)} = x_i^{(π^{v_i,e}(j))} with probability 1 − 1/k², and y_{v_i}^{(j)} = a random bit with probability 1/k².
The starting point is a smooth k-Label Cover instance L(G(V, E), M, N, {π^{v,e} | e ∈ E, v ∈ e}) with M =
7^{(J+1)u} and N = 2^u 7^{Ju} as described in Theorem 5.1. Figure 5.2 illustrates the reduction, which,
given an instance L of k-Label Cover, produces a random labeled example. We refer to the obtained
distribution on examples as E.
We claim that our reduction has the following completeness and soundness properties.
Theorem 5.2. • Completeness: If L is a strongly satisfiable instance of smooth k-Label
Cover, then there is a disjunction that agrees with a random example from E with probability
at least 1 − O(1/√k).
• Soundness: If L is not 2k²2^{−γk}-weakly satisfiable and is smooth with parameters J = 4^{17k}
and d = 4^k, then there is no halfspace that agrees with a random example from E with
probability more than 1/2 + O(1/√k).
Combining the above theorem with Theorem 5.1 and choosing k = O(1/ε²), we obtain our
main result, Theorem 1.1.
It remains to check the correctness of the completeness and soundness claims in Theorem 5.2.
First let us prove the completeness property.
Proof. (Proof of Completeness) Let Λ be a labeling that strongly satisfies L. Consider the disjunction
h(y) = ∨_{v∈V} y_v^{(Λ(v))}. Let e = (v_1, v_2, . . . , v_k) be any hyperedge and let E_e be the distribution E
restricted to the examples generated for e. With probability at least 1 − 1/k, y_{v_i}^{(Λ(v_i))} = x_i^{(π^{v_i,e}(Λ(v_i)))}
for every i ∈ [k]. As e is strongly satisfied by Λ, for all i, j ∈ [k], π^{v_i,e}(Λ(v_i)) = π^{v_j,e}(Λ(v_j)). Therefore,
as in the proof of Theorem 4.5, we obtain that h(y) agrees with a random example from E_e with
probability at least 1 − O(1/√k). The labeling Λ strongly satisfies all edges and therefore we obtain
that h(y) agrees with a random example from E with probability at least 1 − O(1/√k).
The more complicated part is the soundness property which we prove in Section 5.4.
Proof Idea. The main idea is similar to the proof of Theorem 4.6, although it is more technically
involved. Notice that the reduction in Figure 5.2 produces examples such that y_{v_i}^{(j_1)}, y_{v_i}^{(j_2)} are “almost
identical” copies when π^{v_i,e}(j_1) = π^{v_i,e}(j_2). Further, for different edges e the coordinates of y will
be grouped in different ways, such that each group consists of almost identical copies.
To handle these additional complications, the first step of the proof is to show that almost all
the hyperedges in smooth k-Label Cover satisfy a certain “niceness” property. After that we
generalize the proofs of Lemma 4.7 and Lemma 4.8 under the weaker assumption that most of the
hyperedges are “nice”.
The formal definition of “niceness” and the proof that most of the edges are “nice” appear in
Section 5.4.1. The generalization of Lemma 4.7 appears in Section 5.4.2. The generalization of
Lemma 4.8 appears in Section 5.4.3. All these results are put together into a proof of Theorem 5.2
in Section 5.4.4.
Let h(y) be a halfspace that agrees with more than a 1/2 + 1/√k fraction of the examples. Suppose
    h(y) = pos(Σ_{v∈V} ⟨w_v, y_v⟩ − θ).
Let τ = 1/k^{13} and let
    s_v = Truncate(w_v, C_τ(w_v)),   l_v = w_v − s_v.
Definition 5.3. A vertex v ∈ V is said to be β-nice with respect to a hyperedge e ∈ E containing
it if
    Σ_{i∈[N]} (Σ_{j∈π^{−1}(i)} |l_v^{(j)}|)⁴ ≤ β‖l_v‖_2⁴,
where π : [M] → [N] is the projection associated with vertex v and hyperedge e. A hyperedge
e = (v_1, v_2, . . . , v_k) is β-nice if for every i ∈ [k] the vertex v_i is β-nice with respect to e.
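The condition in Definition 5.3 is easy to check directly; the following Python sketch (our own helper, with hypothetical names) tests whether a vector l_v is β-nice with respect to a projection π given as an array mapping [M] to [N]:

import numpy as np
from collections import defaultdict

def is_beta_nice(l_v, pi, beta):
    # Definition 5.3: sum_i ( sum_{j in pi^{-1}(i)} |l_v[j]| )^4 <= beta * ||l_v||_2^4.
    block_sums = defaultdict(float)
    for j, val in enumerate(l_v):
        block_sums[pi[j]] += abs(val)
    lhs = sum(s ** 4 for s in block_sums.values())
    return lhs <= beta * np.linalg.norm(l_v) ** 4

# Example: a projection mapping pairs of coordinates together.
l_v = np.array([0.1, -0.1, 0.1, 0.1, -0.1, 0.1])
pi = [0, 0, 1, 1, 2, 2]
print(is_beta_nice(l_v, pi, beta=2.0))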
Lemma 5.4. All but an O(1/k) fraction of the hyperedges of L are 2τ-nice.
Proof. By definition, l_v is a τ-regular vector. Denote I_v = {i : (l_v^{(i)})²/‖l_v‖_2² ≥ 1/d⁸}. By
definition, |I_v| ≤ d⁸. Notice there are at most d^{16} pairs of values in I_v × I_v. By the smoothness
property of the k-Label Cover instance, for any vertex v, at least a 1 − d^{16}/J fraction of the hyperedges
incident on v have the following property: for any i, j ∈ I_v, π^{v,e}(i) ≠ π^{v,e}(j). If all the vertices in
a hyperedge have this property we call it a good hyperedge. By an averaging argument, we know
that among all hyperedges at least a 1 − kd^{16}/J = 1 − k/4^k ≥ 1 − O(1/k) fraction is good.
We will show that all these good hyperedges are also 2τ-nice. For a given good hyperedge e, a vertex
v ∈ e, π = π^{v,e} and i ∈ [N], there is at most one j ∈ π^{−1}(i) such that (l_v^{(j)})²/‖l_v‖_2² ≥ 1/d⁸.
Based on the above property, we will show that
    Σ_{i∈[N]} (Σ_{j∈π^{−1}(i)} |l_v^{(j)}|)⁴ ≤ 2τ ‖l_v‖_2⁴.
Notice that
    Σ_{i∈[N]} (Σ_{j∈π^{−1}(i)} |l_v^{(j)}|)⁴ = Σ_{i∈[N]} Σ_{j_1,j_2,j_3,j_4 ∈ π^{−1}(i)} |l_v^{(j_1)} l_v^{(j_2)} l_v^{(j_3)} l_v^{(j_4)}|.    (4)
In every term of (4) in which the four indices are not all equal, at least one of the four coordinates satisfies |l_v^{(j)}| ≤ ‖l_v‖_2/d⁴ (since at most one coordinate per block π^{−1}(i) can be larger), so such a term is at most
    (‖l_v‖_2/d⁴) (|l_v^{(j_1)}|³ + |l_v^{(j_2)}|³ + |l_v^{(j_3)}|³ + |l_v^{(j_4)}|³).
Therefore (4) is at most
    ‖l_v‖_4⁴ + (‖l_v‖_2/d⁴) Σ_{i∈[N]} Σ_{j_1,j_2,j_3,j_4 ∈ π^{−1}(i)} (|l_v^{(j_1)}|³ + |l_v^{(j_2)}|³ + |l_v^{(j_3)}|³ + |l_v^{(j_4)}|³)
    ≤ τ²‖l_v‖_2⁴ + 4d³ (‖l_v‖_2/d⁴) Σ_{j∈[M]} |l_v^{(j)}|³    (since |π^{−1}(i)| ≤ d, each l_v^{(j)} appears at most 4d³ times)
    ≤ (τ² + 4τ/d)‖l_v‖_2⁴    (l_v is a τ-regular vector, so |l_v^{(j)}| ≤ τ‖l_v‖_2 for all j ∈ [M])
    ≤ 2τ‖l_v‖_2⁴.
Let us fix a 2τ -nice hyperedge e = (v1 , . . . , vk ). As before let Ee denote the distribution on
examples restricted to those generated for hyperedge e. We will analyze the probability that the
halfspace h(y) agrees with a random example from Ee .
Let π^{v_1,e}, π^{v_2,e}, . . . , π^{v_k,e} : [M] → [N] denote the projections associated with the hyperedge e.
For the sake of brevity, we shall write w_i, y_i, l_i instead of w_{v_i}, y_{v_i}, l_{v_i}. For all j ∈ [N] and i ∈ [k],
define
    y_i^{{j}} = Truncate(y_i, (π^{v_i,e})^{−1}(j)).
Similarly, define the vectors w_i^{{j}}, l_i^{{j}} and s_i^{{j}}.
Notice that for every example (y, b) in the support of E_e, y_v = 0 for every vertex v ∉ e.
Therefore, on restricting to examples from E_e we can write:
    h(y) = pos(Σ_{i∈[k]} ⟨w_i, y_i⟩ − θ).
Lemma 5.5. Let h(y) be a halfspace such that for all i ≠ j ∈ [k] we have π^{v_i,e}(C_τ(w_i)) ∩
π^{v_j,e}(C_τ(w_j)) = ∅. Then
    |E_{E_e}[h(y) | b = 0] − E_{E_e}[h(y) | b = 1]| ≤ O(1/k).    (5)
Similarly define the ensemble B = B^{{1}}, . . . , B^{{N}} for the conditioning b = 1. Now we shall apply
the invariance principle (Theorem 3.10) to the ensembles A, B and the linear function
    l(y) = Σ_{j∈[N]} ⟨l^{{j}}, y^{{j}}⟩.
As we prove in Claim 5.6 below, the ensembles A, B have matching moments up to degree 3.
Furthermore, by Lemma 3.3, the linear function l and the ensembles A, B satisfy the following
spread property for all θ' ∈ R:
    Pr_A[l(A) ∈ [θ' − α, θ' + α]] ≤ c(α),   Pr_B[l(B) ∈ [θ' − α, θ' + α]] ≤ c(α),
where c(α) = 8αk + 4τk + 2e^{−1/(2k⁴τ²)} (by setting γ = 1/k² and |b − a| = 2α in Lemma 3.3).
Using the invariance principle (Theorem 3.10), this implies:
    |E_A[pos(⟨s, y_C⟩ + Σ_{j∈[N]} ⟨l^{{j}}, A^{{j}}⟩ − θ) | y_C] − E_B[pos(⟨s, y_C⟩ + Σ_{j∈[N]} ⟨l^{{j}}, B^{{j}}⟩ − θ) | y_C]|
    ≤ O(1/α⁴) Σ_{j∈[N]} ‖l^{{j}}‖_1⁴ + 2c(α).    (6)
Take α to be 1/k² and recall that τ = 1/k^{13}. In Claim 5.7 below we show that
    Σ_{j∈[N]} ‖l^{{j}}‖_1⁴ ≤ 2τ k⁴.
The above inequality holds for an arbitrary conditioning of the values of y_C. Hence, by averaging
over all settings of y_C we prove (5).
Proof. Since ‖l^{{j}}‖_1 = Σ_{i∈[k]} ‖l_i^{{j}}‖_1, we can write
    Σ_{j∈[N]} ‖l^{{j}}‖_1⁴ ≤ k⁴ Σ_{j∈[N]} Σ_{i∈[k]} ‖l_i^{{j}}‖_1⁴ = k⁴ Σ_{i∈[k]} Σ_{j∈[N]} ‖l_i^{{j}}‖_1⁴.    (7)
As e = (v_1, . . . , v_k) is a 2τ-nice hyperedge, we have Σ_{j∈[N]} ‖l_i^{{j}}‖_1⁴ ≤ 2τ‖l_i‖_2⁴. By the normalization of
l, we know Σ_{i∈[k]} ‖l_i‖_2² = 1. Substituting this into inequality (7) we get the claimed bound.
Lemma 5.8. Given a halfspace h(y) and r ∈ [k] such that |C_τ(w_r)| > t, define the halfspace h̃(y) by modifying the weights of h(y)
as follows:
• w̃_r = Truncate(w_r, H_t(w_r)) and w̃_i = w_i for all i ≠ r.
Then,
    |E_{E_e}[h̃(y) | b = 0] − E_{E_e}[h(y) | b = 0]| ≤ 1/k²   and   |E_{E_e}[h̃(y) | b = 1] − E_{E_e}[h(y) | b = 1]| ≤ 1/k².
Proof. It is easy to see that the matching moments condition implies that
Let us show the inequality for the case b = 0; the other inequality can be derived in an identical
way. Let E_{e,0} denote the distribution E_e conditioned on b = 0. Without loss of generality, we may
assume that r = 1 and |w_1^{(1)}| ≥ |w_1^{(2)}| ≥ . . . ≥ |w_1^{(M)}|. In particular, this implies H_t(w_1) = {1, . . . , t}.
Define
Let us set T = ⌈4k² ln(2k)⌉ and define the subset G = {g_1, . . . , g_T} of H_t(w_1) as follows:
Similarly, define the vectors y_1^G, y_1^H, y_1^{>t}. By definition, we have a_1 = w_1^{>t}. Rewriting the halfspace
functions h(y), h̃(y):
    h(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ + ⟨a_1, y_1^{>t}⟩ − θ),
    h̃(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ + μ_1 − θ).
Claim 5.9.
    Pr_{E_{e,0}}[|⟨a_1, y_1⟩ − μ_1| ≥ d⁴ ‖a_1‖_2] ≤ 1/d.
Proof. Write [M] as the union of disjoint sets R_1 ∪ R_2 ∪ · · · ∪ R_N where R_i = (π^{v_1,e})^{−1}(i). Notice
every R_i has size at most d; therefore
    Var_{E_{e,0}}[⟨a_1, y_1⟩] = Σ_{i∈[N]} Var_{E_{e,0}}[⟨a_1^{R_i}, y_1^{R_i}⟩] ≤ Σ_{i∈[N]} d‖a_1^{R_i}‖_2² = d‖a_1‖_2².
The claim now follows from Chebyshev's inequality (Theorem A.3).
Lemma 5.11. Let e be a 2τ-nice hyperedge and let h(y) = pos(Σ_{i∈[k]} ⟨w_i, y_i⟩ − θ) be a halfspace such that
π^{v_i,e}(H_t(w_i)) ∩ π^{v_j,e}(H_t(w_j)) = ∅ for all i ≠ j ∈ [k]. Then h(y) agrees with a random example from E_e with probability at most 1/2 + O(1/k).
Proof. The proof is similar to the proof of Theorem 4.6. Define I = {r | |C_τ(w_r)| > t}. We divide
the argument into the following two cases.
1. I = ∅; i.e., |C_τ(w_i)| ≤ t for all i ∈ [k]. Then for any i ≠ j ∈ [k], π^{v_i,e}(H_t(w_i)) ∩ π^{v_j,e}(H_t(w_j)) = ∅ implies
π^{v_i,e}(C_τ(w_i)) ∩ π^{v_j,e}(C_τ(w_j)) = ∅. By Lemma 5.5, we have
    |E_{E_e}[h(y) | b = 0] − E_{E_e}[h(y) | b = 1]| ≤ O(1/k).
2. I ≠ ∅. Then for each r ∈ I, we set w̃_r = Truncate(w_r, H_t(w_r)) and define a new halfspace h' by
replacing w_r with w̃_r in h. Since such replacements occur at most k times and, by Lemma
5.8, every replacement changes the output of the halfspace on at most a 1/k² fraction of examples
from E_e, we can bound the overall change by k × 1/k² = 1/k. That is,
    |E_{E_{e,0}}[h'(y)] − E_{E_{e,0}}[h(y)]| ≤ 1/k,   |E_{E_{e,1}}[h'(y)] − E_{E_{e,1}}[h(y)]| ≤ 1/k.    (8)
For the halfspace h' and for all r ∈ [k], we have |C_τ(w̃_r)| ≤ t, thus reducing to Case 1.
Therefore,
    |E_{E_{e,0}}[h'(y)] − E_{E_{e,1}}[h'(y)]| ≤ O(1/k).    (9)
Combining (8) and (9), we get
    |E_{E_{e,0}}[h(y)] − E_{E_{e,1}}[h(y)]| ≤ O(1/k).
In other words, the probability that the halfspace h(y) agrees with a random example from E_e is at
most 1/2 + O(1/k).
Proof. (Proof of Soundness) The proof is by contradiction. We define the following labeling strategy: for each vertex
v, uniformly at random pick a label from H_t(w_v). We know that the size of H_t(w_{v_i}) is t = O(k^{29}).
Suppose there exists a halfspace that agrees with a random example from E with probability
more than 1/2 + 1/√k. Then, by an averaging argument, for at least a 1/(2√k)-fraction of the hyperedges e,
h(y) agrees with a random example from E_e with probability at least 1/2 + 1/(2√k). We refer to these
edges as good.
Since at most an O(1/k)-fraction of the hyperedges are not 2τ-nice, we know that
at least a 1/(4√k)-fraction of the hyperedges are 2τ-nice and good. By Lemma 5.11, for each 2τ-nice
and good hyperedge e there exist two vertices v_i, v_j ∈ e such that π^{v_i,e}(H_t(w_{v_i})) and π^{v_j,e}(H_t(w_{v_j}))
intersect. Then there is at least a 1/t² probability that the labeling strategy we defined weakly satisfies
hyperedge e.
Overall, this strategy is expected to weakly satisfy at least a (1/(4√k)) · (1/t²) = Ω(1/k^{59}) fraction of the
hyperedges. This is a contradiction since L is not 2k²2^{−γk}-weakly satisfiable.
References
[1] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied
relations in linear systems. Theoretical Computer Science, 109:237–260, 1998. 3
[2] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
3
[3] S. Arora, L. Babai, J. Stern, and Z. Sweedyk. The hardness of approximate optima in lattices,
codes, and systems of linear equations. J. Comput. Syst. Sci., 54(2):317–331, 1997. 3
[4] P. Auer and M. K. Warmuth. Tracking the best disjunction. Machine Learning, 32(2):127–150,
1998. 4
[6] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning
noisy linear threshold functions. Algorithmica, 22(1-2):35–52, 1998. 3
[7] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam’s razor. Inf. Process.
Lett., 24(6):377–380, 1987. 2
[8] N. Bshouty and L. Burroughs. Bounds for the minimum disagreement problem with applica-
tions to learning theory. In Proceedings of COLT, pages 271–286, 2002. 3
[9] N. Bshouty and L. Burroughs. Maximizing agreements and coagnostic learning. Theoretical
Computer Science, 350(1):24–39, 2006. 3
[10] T. Bylander. Learning linear threshold functions in the presence of classification noise. In
Proceedings of COLT, pages 340–347, 1994. 3
[12] E. Cohen. Learning noisy perceptrons by a perceptron in polynomial time. In IEEE FOCS,
pages 514–523, 1997. 3
[14] V. Feldman. Optimal hardness results for maximizing agreements with monomials. In IEEE
CCC, pages 226–236, 2006. 3
[16] S. Galant. Perceptron based learning algorithms. IEEE Trans. on Neural Networks, 1(2),
1990. 4
[18] C. Gentile and M. K. Warmuth. Linear hinge loss and average margin. In Proceedings of NIPS,
pages 225–231, 1998. 4
[19] P. Gopalan, S. Khot, and R. Saket. Hardness of reconstructing multivariate polynomials over
finite fields. SIAM J. Comput., 39(6):2598–2621, 2010. 36
[20] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM J.
Comput., 39(2):742–765, 2009. 3
[21] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other
learning applications. Information and Computation, 100(1):78–150, 1992. 2
[22] K. Hoffgen, K. van Horn, and H. U. Simon. Robust trainability of single neurons. J. Comput.
Syst. Sci., 50(1):114–125, 1995. 3
[23] D. S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoretical Computer
Science, 6:93–107, 1978. 3
[24] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM
Journal on Computing, 37(6):1777–1805, 2008. 3, 4
[25] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM,
45(6):983–1006, 1998. 3
[26] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning,
17:115–141, 1994. 2, 3
[27] M. J. Kearns and M. Li. Learning in the presence of malicious errors. SIAM J. Comput.,
22(4):807–837, 1993. 3
[29] S. Khot. On the power of unique 2-Prover 1-Round games. In ACM STOC, pages 767–775,
May 19–21 2002. 4, 5
[30] S. Khot. New techniques for probabilistically checkable proofs and inapproximability results
(thesis). Princeton University Technical Reports, TR-673-03, 2003. 8, 36
[31] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for
MAX-CUT and other 2-variable CSPs? SIAM J. Comput, 37(1):319–357, 2007. 8
[32] S. Khot and R. Saket. Hardness of minimizing and learning DNF expressions. In IEEE FOCS,
pages 231–240, 2008. 2, 3
[34] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Machine Learning, 2:285–318, 1987. 2
[36] E. Mossel. Gaussian bounds for noise correlation of functions. IEEE FOCS, 2008. 11
[37] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences:
Invariance and optimality. In IEEE FOCS, 2005. 11, 35
[38] R. O’Donnell and R. A. Servedio. The chow parameters problem. In ACM STOC, pages
517–526, 2008. 6, 9
[39] R. Rivest. Learning decision lists. Machine Learning, 2(3):229–246, 1987. 2
[40] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization
in the brain. Psychological Review, 65:386–407, 1958. 2
[41] R. A. Servedio. Every linear threshold function has a low-weight approximator. Comput.
Complex., 16(2):180–209, 2007. 6, 9
[42] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
2
Appendix
A Probabilistic Inequalities
In the discussion below we will make use of the following well-known inequalities.
Theorem A.3. (Chebyshev's Inequality) Let X be a random variable with expected value μ and
variance σ². Then for any real number t > 0, Pr[|X − μ| ≥ tσ] ≤ 1/t².
B Proof of Lemma 3.3
Let us define a random vector z ∈ {0, 1}^n based on y: if y^{(i)} is generated as a
copy of x^{(i)} in (10), then z^{(i)} = 0; if y^{(i)} is generated as a random bit in (10), then z^{(i)} = 1. Let us
write S = Σ_{i=1}^n w^{(i)} y^{(i)}. Our proof is based on two claims.
Claim B.1. For a τ-regular vector w, Pr[Σ_{i=1}^n (w^{(i)})² z^{(i)} ≥ γ/2] ≥ 1 − 2e^{−γ²/(2τ²)}.
Claim B.2. For a τ-regular vector w, given any a' < b' ∈ R and any fixing of z^{(1)}, z^{(2)}, . . . , z^{(n)},
if Σ_{i=1}^n (w^{(i)})² z^{(i)} = σ² > 0, then Pr[S ∈ [a', b']] ≤ 2|b' − a'|/σ + 2τ/σ.
Given the above two claims, define the event $V$ to be $\big\{\sum_{i=1}^n (w^{(i)})^2 z^{(i)} > \gamma/2\big\}$ and use $\mathbf{1}_{[a,b]}(x) : \mathbb{R} \to \{0,1\}$ to denote the indicator function of whether $x$ falls into the interval $[a,b]$. Then
\[ \Pr[S \in [a,b]] = \mathbb{E}\big[\mathbf{1}_{[a,b]}(S)\big] = \Pr[V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid V\big] + \Pr[\neg V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid \neg V\big]. \]
By Claim B.1,
\[ \Pr[\neg V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid \neg V\big] \;\le\; \Pr[\neg V] \;\le\; 2e^{-\gamma^2/(2\tau^2)}. \]
By Claim B.2 (conditioned on $V$ we have $\sigma^2 > \gamma/2$, hence $\sigma > \sqrt{\gamma/2}$ and $2/\sigma < 2\sqrt{2}/\sqrt{\gamma} \le 4/\sqrt{\gamma}$),
\[ \Pr[V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid V\big] \;\le\; \frac{4(b-a)}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}}. \]
Overall,
\[ \Pr[S \in [a,b]] \;\le\; \frac{4(b-a)}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}} + 2e^{-\gamma^2/(2\tau^2)}. \]
It remains to prove the two claims. Therefore, with probability at least $1 - 2e^{-\gamma^2/(2\tau^2)}$, $\sum_{i=1}^n (w^{(i)})^2 z^{(i)} > \gamma/2$; this establishes Claim B.1.
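The combined bound above lends itself to a quick numerical sanity check. The following sketch uses a simplified stand-in for the sampling rule (10) (each coordinate of $y$ is an independent fresh uniform bit with probability $p$ and a fixed bit otherwise), the uniform unit vector as $w$, and $\gamma$ taken to be $\mathbb{E}[\sum_i (w^{(i)})^2 z^{(i)}]$; all parameter choices are illustrative and not taken from the text.

# Monte Carlo sanity check of
#   Pr[S in [a,b]] <= 4(b-a)/sqrt(gamma) + 4*tau/sqrt(gamma) + 2*exp(-gamma^2/(2*tau^2))
# under a toy version of the sampling rule (10): each coordinate is a fresh uniform
# bit with probability p (z_i = 1) and a fixed bit x_i otherwise.
import numpy as np

rng = np.random.default_rng(1)

n, p = 4000, 0.5
w = np.full(n, 1.0 / np.sqrt(n))            # tau-regular with tau = 1/sqrt(n), ||w||_2 = 1
tau = 1.0 / np.sqrt(n)
gamma = p                                    # taken as E[ sum_i w_i^2 z_i ] = p * ||w||_2^2
x_fixed = rng.integers(0, 2, size=n, dtype=np.int8)   # the bits that are merely copied
a, b = 31.5, 31.52                           # an arbitrary short interval near the bulk of S

hits, trials, chunk = 0, 50_000, 2_000
for _ in range(trials // chunk):
    z = rng.random((chunk, n)) < p                              # z_i = 1: fresh random bit
    fresh = rng.integers(0, 2, size=(chunk, n), dtype=np.int8)
    y = np.where(z, fresh, x_fixed)
    S = y.astype(np.float64) @ w
    hits += int(np.sum((S >= a) & (S <= b)))

empirical = hits / trials
bound = (4 * (b - a) / np.sqrt(gamma) + 4 * tau / np.sqrt(gamma)
         + 2 * np.exp(-gamma ** 2 / (2 * tau ** 2)))
print(f"empirical Pr[S in [a,b]] ~ {empirical:.4f}   claimed bound ~ {bound:.4f}")

On such a run the empirical probability sits well below the claimed bound; shrinking the interval $[a,b]$ or increasing $n$ (hence decreasing $\tau$) tightens both sides in the way the bound predicts.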
To prove Claim B.2, we need to use the Berry–Esseen theorem (see Theorem A.2). Let us split $S$ into two parts: $S' = \sum_{i : z^{(i)}=1} w^{(i)} y^{(i)}$ and $S'' = \sum_{i : z^{(i)}=0} w^{(i)} y^{(i)}$. Since $S = S' + S''$ and $S'$ is independent of $S''$, it suffices to show that $\Pr[S' \in [a', b']] \le \frac{2|b'-a'|}{\sigma} + \frac{2\tau}{\sigma}$ for any $a', b' \in \mathbb{R}$. Define $y'^{(i)} = 2y^{(i)} - 1$ and note that $y'^{(i)}$ is a $\{-1,1\}$ variable. Rewriting $S'$ using this definition, we have
\[ S' = \sum_{z^{(i)}=1} w^{(i)} y^{(i)} = \sum_{z^{(i)}=1} w^{(i)} \cdot \frac{1 + y'^{(i)}}{2}. \]
Then
\[ \Pr\big[S' \in [a', b']\big] = \Pr\Big[\sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \in [a'', b'']\Big], \qquad (11) \]
where $a'' = 2a' - \sum_{z^{(i)}=1} w^{(i)}$ and $b'' = 2b' - \sum_{z^{(i)}=1} w^{(i)}$. We can further rewrite the above term as
\[ \Pr\Big[\sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \le b''\Big] - \Pr\Big[\sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \le a''\Big]
= \Pr\Bigg[\frac{\sum_{z^{(i)}=1} w^{(i)} y'^{(i)}}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \le \frac{b''}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}}\Bigg] - \Pr\Bigg[\frac{\sum_{z^{(i)}=1} w^{(i)} y'^{(i)}}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \le \frac{a''}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}}\Bigg]. \]
We can now apply the Berry–Esseen theorem. Notice that for all $i$ such that $z^{(i)} = 1$, $y'^{(i)}$ is distributed as an independent unbiased random $\{-1,1\}$ variable. Also
\[ \max_{z^{(i)}=1} \frac{|w^{(i)}|}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \;\le\; \frac{\tau}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}}. \]
Using the fact that a unit Gaussian variable falls in any interval of length $\lambda$ with probability at most $\lambda$, and noticing that $b'' - a'' = 2(b' - a')$, we can bound the above quantity by
\[ \frac{2|b' - a'|}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} + \frac{2\tau}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \;=\; \frac{2|b' - a'|}{\sigma} + \frac{2\tau}{\sigma}. \]
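For concreteness, the application of Theorem A.2 in the last step can be written out as follows (a sketch, taking the Berry–Esseen constant to be 1, which is how the two error terms are accounted for above): writing $T = \sum_{z^{(i)}=1} w^{(i)} y'^{(i)}$ and $\sigma^2 = \sum_{z^{(i)}=1} (w^{(i)})^2$, each of the two cumulative probabilities is within $\tau/\sigma$ of the corresponding standard Gaussian probability, so
\[ \Pr\big[T \in [a'', b'']\big] \;\le\; \Phi\!\Big(\frac{b''}{\sigma}\Big) - \Phi\!\Big(\frac{a''}{\sigma}\Big) + \frac{2\tau}{\sigma} \;\le\; \frac{b'' - a''}{\sigma} + \frac{2\tau}{\sigma} \;=\; \frac{2(b' - a')}{\sigma} + \frac{2\tau}{\sigma}, \]
where $\Phi$ denotes the standard Gaussian CDF; the middle inequality uses the stated fact that a unit Gaussian puts mass at most $\lambda$ on any interval of length $\lambda$.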
Theorem 3.10 restated (Invariance Principle). Let $\mathcal{A} = \{A^{\{1\}}, \ldots, A^{\{R\}}\}$, $\mathcal{B} = \{B^{\{1\}}, \ldots, B^{\{R\}}\}$ be families of ensembles of random variables with $A^{\{i\}} = \{a^{(i)}_1, \ldots, a^{(i)}_{k_i}\}$ and $B^{\{i\}} = \{b^{(i)}_1, \ldots, b^{(i)}_{k_i}\}$, satisfying the following properties:

• For each $i \in [R]$, the random variables in the ensembles $(A^{\{i\}}, B^{\{i\}})$ have matching moments up to degree 3. Further, all the random variables in $\mathcal{A}$ and $\mathcal{B}$ are bounded by 1.

• The ensembles $A^{\{i\}}$ are all independent of each other; similarly, the ensembles $B^{\{i\}}$ are independent of each other.

Given a set of vectors $l = \{l^{\{1\}}, \ldots, l^{\{R\}}\}$ ($l^{\{i\}} \in \mathbb{R}^{k_i}$), define the linear function $l : \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_R} \to \mathbb{R}$ as
\[ l(x) = \sum_{i \in [R]} \langle l^{\{i\}}, x^{\{i\}} \rangle \]
for all $\theta > 0$. Further, define the spread function $c(\alpha)$ corresponding to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear function $l$ as follows,
Proof. Let us prove equation (12) first. Let $X_i = \{B^{\{1\}}, \ldots, B^{\{i-1\}}, B^{\{i\}}, A^{\{i+1\}}, \ldots, A^{\{R\}}\}$, so that $X_0 = \mathcal{A}$ and $X_R = \mathcal{B}$. We know that
\[ \mathbb{E}\big[\Psi(l(\mathcal{A}) - \theta)\big] - \mathbb{E}\big[\Psi(l(\mathcal{B}) - \theta)\big] = \sum_{i=1}^{R} \Big( \mathbb{E}\big[\Psi(l(X_{i-1}) - \theta)\big] - \mathbb{E}\big[\Psi(l(X_i) - \theta)\big] \Big). \]
Let $Y_i = \{B^{\{1\}}, \ldots, B^{\{i-1\}}, A^{\{i+1\}}, \ldots, A^{\{R\}}\}$, so that $X_i = \{Y_i, B^{\{i\}}\}$ and $X_{i-1} = \{Y_i, A^{\{i\}}\}$. Then
\[ \mathbb{E}_{X_{i-1}}\big[\Psi(l(X_{i-1}) - \theta)\big] - \mathbb{E}_{X_i}\big[\Psi(l(X_i) - \theta)\big] = \mathbb{E}_{Y_i}\Big[ \mathbb{E}_{A^{\{i\}}}\big[\Psi(l(X_{i-1}) - \theta)\big] - \mathbb{E}_{B^{\{i\}}}\big[\Psi(l(X_i) - \theta)\big] \Big]. \qquad (15) \]
Notice that
\[ l(X_{i-1}) - \theta = \langle l^{\{i\}}, A^{\{i\}} \rangle + \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta \]
and
\[ l(X_i) - \theta = \langle l^{\{i\}}, B^{\{i\}} \rangle + \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta. \]
Take $\theta' = \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta$. We can further rewrite equation (15) as
\[ \mathbb{E}_{Y_i}\Big[ \mathbb{E}_{A^{\{i\}}}\big[\Psi(\langle l^{\{i\}}, A^{\{i\}} \rangle + \theta')\big] - \mathbb{E}_{B^{\{i\}}}\big[\Psi(\langle l^{\{i\}}, B^{\{i\}} \rangle + \theta')\big] \Big]. \qquad (16) \]
Using the Taylor expansion of $\Psi$, the inner expectation of equation (16) is equal to
\[ \mathbb{E}_{A^{\{i\}}}\Big[\Psi(\theta') + \Psi'(\theta')\langle l^{\{i\}}, A^{\{i\}} \rangle + \frac{\Psi''(\theta')}{2}\langle l^{\{i\}}, A^{\{i\}} \rangle^2 + \frac{\Psi'''(\theta')}{6}\langle l^{\{i\}}, A^{\{i\}} \rangle^3 + \frac{\Psi''''(\delta_1)}{24}\langle l^{\{i\}}, A^{\{i\}} \rangle^4\Big] \]
\[ - \;\mathbb{E}_{B^{\{i\}}}\Big[\Psi(\theta') + \Psi'(\theta')\langle l^{\{i\}}, B^{\{i\}} \rangle + \frac{\Psi''(\theta')}{2}\langle l^{\{i\}}, B^{\{i\}} \rangle^2 + \frac{\Psi'''(\theta')}{6}\langle l^{\{i\}}, B^{\{i\}} \rangle^3 + \frac{\Psi''''(\delta_2)}{24}\langle l^{\{i\}}, B^{\{i\}} \rangle^4\Big] \qquad (17) \]
for some $\delta_1, \delta_2 \in \mathbb{R}$.

Using the fact that $A^{\{i\}}$ and $B^{\{i\}}$ have matching moments up to degree 3, we can upper bound equation (17) by
\[ \mathbb{E}_{A^{\{i\}}}\Big[\frac{\Psi''''(\delta_1)}{24}\langle l^{\{i\}}, A^{\{i\}} \rangle^4\Big] - \mathbb{E}_{B^{\{i\}}}\Big[\frac{\Psi''''(\delta_2)}{24}\langle l^{\{i\}}, B^{\{i\}} \rangle^4\Big] \;\le\; \frac{K}{12}\,\|l^{\{i\}}\|_1^4. \]
In the last inequality, we use the fact that $\Psi$ is $K$-bounded and $\langle l^{\{i\}}, A^{\{i\}} \rangle \le \|l^{\{i\}}\|_1$, $\langle l^{\{i\}}, B^{\{i\}} \rangle \le \|l^{\{i\}}\|_1$, since all random variables in $\mathcal{A}, \mathcal{B}$ are bounded by 1.
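The cancellation of the lower-order terms deserves one spelled-out step (added here for readability, under the standard reading that "matching moments up to degree 3" means every monomial of total degree at most 3 in the coordinates has the same expectation under the two ensembles): for $r \in \{1, 2, 3\}$,
\[ \mathbb{E}_{A^{\{i\}}}\big[\langle l^{\{i\}}, A^{\{i\}}\rangle^r\big]
= \sum_{j_1, \ldots, j_r} l^{\{i\}}_{j_1} \cdots l^{\{i\}}_{j_r}\, \mathbb{E}\big[a^{(i)}_{j_1} \cdots a^{(i)}_{j_r}\big]
= \sum_{j_1, \ldots, j_r} l^{\{i\}}_{j_1} \cdots l^{\{i\}}_{j_r}\, \mathbb{E}\big[b^{(i)}_{j_1} \cdots b^{(i)}_{j_r}\big]
= \mathbb{E}_{B^{\{i\}}}\big[\langle l^{\{i\}}, B^{\{i\}}\rangle^r\big], \]
so the terms of equation (17) involving $\Psi'(\theta')$, $\Psi''(\theta')$ and $\Psi'''(\theta')$ are identical in the two expectations and cancel.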
Overall, we bound the inner expectation of equation (16) by $\frac{K}{12}\|l^{\{i\}}\|_1^4$. This implies equation (12).
By the above lemma, we can find a $\frac{C}{\alpha^4}$-bounded function $\Phi_\alpha$ such that $\Phi_\alpha(l(\mathcal{A}) - \theta)$ is equal to $\mathrm{pos}(l(\mathcal{A}) - \theta)$ except when $l(\mathcal{A}) \in [\theta - \alpha, \theta + \alpha]$, and $\Phi_\alpha(l(\mathcal{B}) - \theta)$ is equal to $\mathrm{pos}(l(\mathcal{B}) - \theta)$ except when $l(\mathcal{B}) \in [\theta - \alpha, \theta + \alpha]$. Also, for any $x \in \mathbb{R}$, $|\mathrm{pos}(x) - \Phi_\alpha(x)| \le 1$, as $\mathrm{pos}(x)$ and $\Phi_\alpha(x)$ are both in $[0,1]$.

Overall, we have
\[ \mathbb{E}_{\mathcal{A}}\big[\mathrm{pos}(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{pos}(l(\mathcal{B}) - \theta)\big] \;\le\; \Big( \mathbb{E}_{\mathcal{A}}\big[\mathrm{pos}(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{A}}\big[\Phi_\alpha(l(\mathcal{A}) - \theta)\big] \Big) \]
\[ + \Big( \mathbb{E}_{\mathcal{A}}\big[\Phi_\alpha(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\Phi_\alpha(l(\mathcal{B}) - \theta)\big] \Big) + \Big( \mathbb{E}_{\mathcal{B}}\big[\Phi_\alpha(l(\mathcal{B}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{pos}(l(\mathcal{B}) - \theta)\big] \Big) \]
\[ \le\; \frac{C}{\alpha^4} \sum_{i \in [R]} \|l^{\{i\}}\|_1^4 + 2c(\alpha). \]
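Since the argument for equation (12) is entirely elementary, it can be checked numerically on a toy instance. The sketch below uses scalar ensembles, $\Psi = \sin$ (so the fourth derivative is bounded by $K = 1$, reading "$K$-bounded" as a bound on $\Psi''''$ as in the proof), coefficients $l_i = 1/\sqrt{R}$, and two illustrative coordinate distributions that match moments up to degree 3; none of these specific choices are taken from the paper. Both expectations are computed exactly by convolving the coordinate distributions, and their gap is compared with the bound $\frac{K}{12}\sum_i \|l^{\{i\}}\|_1^4$ that the telescoping argument yields.

# Exact check of the invariance bound on a toy pair of moment-matched ensembles.
#   A-coordinates: uniform on {-1/2, +1/2}           moments: 0, 1/4, 0, 1/16
#   B-coordinates: +-1 w.p. 1/8 each, 0 w.p. 3/4     moments: 0, 1/4, 0, 1/4
# First three moments match, the fourth does not, and both are bounded by 1.
import numpy as np

R, theta = 200, 0.1
scale = 1.0 / (2.0 * np.sqrt(R))        # l_i = 1/sqrt(R); work on the lattice scale * Z

# single-coordinate pmfs of l_i * (coordinate), on the integer lattice {-2,...,2}
pmf_a = np.array([0.0, 0.5, 0.0, 0.5, 0.0])       # +-1      -> +-1/(2 sqrt(R))
pmf_b = np.array([0.125, 0.0, 0.75, 0.0, 0.125])  # {-2,0,2} -> {-1, 0, 1}/sqrt(R)

def expectation_of_psi(pmf_one, R, theta, scale):
    """Exactly compute E[Psi(sum of R iid scaled coordinates - theta)] for Psi = sin."""
    pmf = np.array([1.0])
    for _ in range(R):
        pmf = np.convolve(pmf, pmf_one)             # distribution of the running sum
    support = scale * (np.arange(pmf.size) - (pmf.size - 1) / 2.0)
    return float(np.dot(pmf, np.sin(support - theta)))

diff = abs(expectation_of_psi(pmf_a, R, theta, scale)
           - expectation_of_psi(pmf_b, R, theta, scale))
# K = 1 for Psi = sin, and sum_i ||l^{i}||_1^4 = R * (1/sqrt(R))^4 = 1/R.
bound = (1.0 / 12.0) * R * (1.0 / np.sqrt(R)) ** 4
print(f"|E Psi(l(A)-theta) - E Psi(l(B)-theta)| = {diff:.2e}  <=  {bound:.2e}")

On this instance the exact gap is roughly an order of magnitude below the bound, and it shrinks as $R$ grows, in line with the $\sum_i \|l^{\{i\}}\|_1^4 = 1/R$ scaling.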
D Hardness of Smooth k-Label Cover
First we state the bipartite smooth Label Cover given by Khot [30]. Our reduction is similar to
the one in [19] but in addition requires proving the smoothness property.
Definition D.1. A Label Cover problem $L(G(V, W, E), M, N, \{\pi^{v,w} \mid (w,v) \in E\})$ consists of a bipartite graph $G(V, W, E)$ with bipartition $V$ and $W$. $M, N$ are two positive integers such that $M \ge N$. There is a projection function $\pi^{v,w} : [M] \to [N]$ associated with each edge $(w,v) \in E$, where $v \in V$, $w \in W$. All vertices in $W$ have the same degree (i.e., the instance is $W$-side regular). For any labeling $\Lambda : V \to [M]$ and $\Lambda : W \to [N]$, an edge is said to be satisfied if $\pi^{v,w}(\Lambda(v)) = \Lambda(w)$. We define $\mathrm{Opt}(L)$ to be the maximum fraction of edges satisfied by any labeling.
Theorem D.2. There is an absolute constant $\gamma > 0$ such that for all integer parameters $u$ and $J$, it is NP-hard to distinguish between the following two cases for a Label Cover problem $L(G(V, W, E), M, N, \{\pi^{v,w} \mid (w,v) \in E\})$ with $M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$:

• $\mathrm{Opt}(L) = 1$, or

• $\mathrm{Opt}(L) \le 2^{-2\gamma u}$.

Moreover, for each $\pi^{v,w}$ and any $i \in [N]$, we have $|(\pi^{v,w})^{-1}(i)| \le 4^u$.
Proof. Given an instance of bipartite Label Cover $L(G(V, W, E), M, N, \{\pi^{v,w} \mid (w,v) \in E\})$, we can convert it to a smooth $k$-Label Cover instance $L'$ as follows. The vertex set of $L'$ is $V$, and we generate the hyperedge set $E'$ and the projections associated with the hyperedges in the following way:

1. pick a vertex $w \in W$;
2. pick $k$ neighbors $v_1, v_2, \ldots, v_k \in V$ of $w$ independently and uniformly at random;
3. include the hyperedge $e = (v_1, v_2, \ldots, v_k)$ in $E'$, with associated projections $\pi^{v_i, e} := \pi^{v_i, w}$ for each $i \in [k]$; we say that such a hyperedge $e$ is generated by $w$. A schematic rendering of this sampling procedure in code follows.
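The sketch below mirrors the three steps above. The data representation (the dictionaries encoding neighborhoods and projections, the uniform choice of $w$, and sampling a fixed number of hyperedges rather than ranging over all tuples) is a hypothetical encoding chosen for illustration, not an interface defined in the paper.

# A schematic rendering of the hyperedge-sampling procedure above.
import random

def make_smooth_k_label_cover(neighbors, projections, k, num_hyperedges, seed=0):
    """
    neighbors:   dict  w -> list of neighbors v in V   (a W-side regular instance)
    projections: dict  (v, w) -> function [M] -> [N], the projection pi^{v,w}
    Returns a list of hyperedges; each hyperedge is (vertices, pis) with
    vertices = (v_1, ..., v_k) and pis[i] = pi^{v_i, e} := pi^{v_i, w}.
    """
    rng = random.Random(seed)
    W = list(neighbors.keys())
    hyperedges = []
    for _ in range(num_hyperedges):
        w = rng.choice(W)                                        # step 1: pick w in W
        vs = tuple(rng.choice(neighbors[w]) for _ in range(k))   # step 2: k random neighbors
        pis = tuple(projections[(v, w)] for v in vs)             # step 3: inherit pi^{v, w}
        hyperedges.append((vs, pis))
    return hyperedges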
Completeness: If $\mathrm{Opt}(L) = 1$, then there exists a labeling $\Lambda$ such that for every edge $(w,v) \in E$, $\pi^{v,w}(\Lambda(v)) = \Lambda(w)$. We can simply take the restriction of the labeling $\Lambda$ to $V$ for the smooth $k$-Label Cover instance $L'$. For any hyperedge $e = (v_1, v_2, \ldots, v_k)$ generated by $w \in W$, we know $\pi^{v_i, e}(\Lambda(v_i)) = \Lambda(w) = \pi^{v_j, e}(\Lambda(v_j))$ for any $i, j \in [k]$.
Soundness: If $\mathrm{Opt}(L) \le 2^{-2\gamma u}$, then any labeling weakly satisfies at most a $2k^2 2^{-\gamma u}$-fraction of the hyperedges in $L'$. This can be proved via a contrapositive argument. Suppose there is a labeling $\Lambda$ (defined on $V$) for the smooth $k$-Label Cover instance that weakly satisfies an $\alpha \ge 2k^2 2^{-\gamma u}$ fraction of the hyperedges. Extend the labeling to $W$ as follows: for each vertex $w \in W$ and a neighbor $v \in V$, let $\pi^{v,w}(\Lambda(v))$ be the label recommended by $v$ to $w$. Simply assign to every vertex $w \in W$ the label most recommended by its neighbours.

Since $\Lambda$ weakly satisfies an $\alpha$-fraction of the hyperedges in $L'$, we know that if we pick a vertex $w$ and randomly pick two of its neighbors $v_1, v_2$, then
\[ \Pr\big[\pi^{v_1, w}(\Lambda(v_1)) = \pi^{v_2, w}(\Lambda(v_2))\big] \;\ge\; \frac{\alpha}{\binom{k}{2}} \;\ge\; \frac{2\alpha}{k^2}. \]
By an averaging argument, at least an $\frac{\alpha}{k^2}$-fraction of the vertices $w \in W$ have the following property: among all the possible pairs of $w$'s neighbors, at least an $\frac{\alpha}{k^2}$-fraction of the pairs recommend the same label for $w$. Let us call such a $w$ nice. It is easy to see that for every nice $w$, the most recommended label is actually recommended by at least an $\frac{\alpha}{k^2}$ fraction of its neighbours. Hence, the extended labeling satisfies at least an $\frac{\alpha}{k^2}$ fraction of the edges incident at each nice $w \in W$. Using $W$-side regularity, we conclude that the extended labeling satisfies at least an $\frac{\alpha^2}{k^4} \ge 4 \cdot 2^{-2\gamma u}$-fraction of the edges of $L$ -- a contradiction.
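The final parameter arithmetic can be spelled out: since $\alpha \ge 2k^2 2^{-\gamma u}$,
\[ \frac{\alpha^2}{k^4} \;\ge\; \frac{\big(2k^2 2^{-\gamma u}\big)^2}{k^4} \;=\; 4 \cdot 2^{-2\gamma u} \;>\; 2^{-2\gamma u} \;\ge\; \mathrm{Opt}(L), \]
which is exactly the contradiction with the assumed bound on $\mathrm{Opt}(L)$.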
Smoothness of $L'$: For any given vertex $v$ in $L'$, we want to show that if we randomly pick a hyperedge $e'$ containing $v$, then for the projection $\pi^{v,e'}$ as defined in $L'$,
\[ \forall\, i \ne j \in [M], \quad \Pr\big[\pi^{v,e'}(i) = \pi^{v,e'}(j)\big] \;\le\; \frac{1}{J}. \]
To see this, notice that all vertices in $W$ have the same degree; picking a projection $\pi^{v,e'}$ using the above procedure is the same as randomly picking a neighbor $w$ of $v$ and using the projection $\pi^{v,w}$ defined in $L$. Therefore,
\[ \forall\, i \ne j \in [M], \quad \Pr\big[\pi^{v,e'}(i) = \pi^{v,e'}(j)\big] = \Pr\big[\pi^{v,w}(i) = \pi^{v,w}(j)\big] \;\le\; \frac{1}{J}. \]