Agnostic Learning of Monomials by Halfspaces Is Hard
December 1, 2010
Abstract
We prove the following strong hardness result for learning: Given a distribution of labeled
examples from the hypercube such that there exists a monomial consistent with a (1 − ε) fraction of
the examples, it is NP-hard to find a halfspace that is correct on a (1/2 + ε) fraction of the examples,
for an arbitrary constant ε > 0. In learning theory terms, weak agnostic learning of monomials
is hard, even if one is allowed to output a hypothesis from the much bigger concept class of
halfspaces. This hardness result subsumes a long line of previous results, including two recent
hardness results for the proper learning of monomials and halfspaces. As an immediate corollary
of our result we show that weak agnostic learning of decision lists is NP-hard.
Our techniques are quite different from previous hardness proofs for learning. We define
distributions on positive and negative examples for monomials whose first few moments match.
We use the invariance principle to argue that regular halfspaces (all of whose coefficients have
small absolute value relative to the total ℓ2 norm) cannot distinguish between distributions
whose first few moments match. For highly non-regular halfspaces, we use a structural lemma
from recent work on fooling halfspaces to argue that they are “junta-like” and one can zero
out all but the top few coefficients without affecting the performance of the halfspace. The
top few coefficients form the natural list decoding of a halfspace in the context of dictatorship
tests/Label Cover reductions.
We note that unlike previous invariance principle based proofs which are only known to give
Unique-Games hardness, we are able to reduce from a version of Label Cover problem that
is known to be NP-hard. This has inspired follow-up work on bypassing the Unique Games
conjecture in some optimal geometric inapproximability results.
∗ An extended abstract appeared in the Proceedings of the 50th IEEE Symposium on Foundations of Computer Science, 2009.
† IBM Almaden Research Center, San Jose, CA. [email protected].
‡ Computer Science Department, Carnegie Mellon University, Pittsburgh, PA. [email protected].
§ College of Computing, Georgia Institute of Technology, Atlanta, GA. [email protected]. Some of this work was done when visiting Carnegie Mellon University.
¶ IBM Almaden Research Center, San Jose, CA. [email protected]. Most of this work was done when the author was at Carnegie Mellon University.
1 Introduction
Boolean conjunctions (or monomials), decision lists, and halfspaces are among the most basic
concept classes in learning theory. They are all long-known to be efficiently PAC learnable, when
the given examples are guaranteed to be consistent with a function from any of these concept
classes [42, 7, 39]. However, in practice data is often noisy or too complex to be consistently
explained by a simple concept. A common practical approach to such problems is to find a predictor
in a certain space of hypotheses that best fits the given examples. A general model for learning that
addresses this scenario is the agnostic learning model [21, 26]. An agnostic learning algorithm for
a class of functions C using a hypothesis space H is required to perform the following task: Given
examples drawn from some unknown distribution, the algorithm must find a hypothesis in H that
classifies the examples nearly as well as is possible by a hypothesis from C. The algorithm is said
to be a proper learning algorithm if C = H.
In this work we address the complexity of agnostic learning of monomials by algorithms that
output a halfspace as a hypothesis. Learning methods that output a halfspace as a hypothesis such
as Perceptron [40], Winnow [34], Support Vector Machines [43] as well as most boosting algorithms
are well-studied in theory and widely used in practical prediction systems. These classifiers are
often applied to labeled data sets which are not linearly separable. Hence it is of great interest to
determine the classes of problems that can be solved by such methods in the agnostic setting. In
this work we demonstrate a strong negative result on agnostic learning by halfspaces. We prove
that non-trivial agnostic learning of even the relatively simple class of monomials by halfspaces is
an NP-hard problem.
Theorem 1.1. For any constant ε > 0, it is NP-hard to find a halfspace that correctly labels a
(1/2 + ε)-fraction of given examples over {0, 1}^n even when there exists a monomial that agrees
with a (1 − ε)-fraction of the examples.
Note that this hardness result is essentially optimal since it is trivial to find a hypothesis with
agreement rate 1/2 — output either the function that is always 0 or the function that is always
1. Also note that Theorem 1.1 measures agreement of a halfspace and a monomial with the given
set of examples rather than the probability of agreement of h with an example drawn randomly
from an unknown distribution. Uniform convergence results based on the VC dimension imply that
these settings are essentially equivalent (see for example [21, 26]).
The class of monomials is a subset of the class of decision lists which in turn is a subset of the
class of halfspaces. Therefore our result immediately implies an optimal hardness result for proper
agnostic learning of decision lists.
Previous work
Before describing the details of the prior body of work on hardness results for learning, we note that
our result subsumes all these results with just one exception (the hardness of learning monomials
by t-CNFs [32]). This is because we obtain the optimal inapproximability factor and allow learning
of monomials by the much richer class of halfspaces.
The results of the paper are noteworthy in the broader context of hardness of approximation.
Previously, hardness proofs based on the invariance principle were only known to give Unique-Games
hardness. In this work, we are able to harness invariance principles to show an NP-hardness result by
working with a version of Label Cover whose projection functions are only required to be unique-
on-average. This could be one potential approach to revisit the many strong inapproximability
results conditioned on the Unique Games conjecture (UGC), with an eye towards bypassing the
UGC assumption. Such a goal was achieved for some geometric problems recently [?]; see Section
2.3.
Agnostic learning of monomials, decision lists and halfspaces has been studied in a number of
previous works. Proper agnostic learning of a class of functions C is equivalent to the ability to
come up with a function in C which has the optimal agreement rate with the given set of examples
and is also referred to as the Maximum Agreement problem for the class of functions C.
The Maximum Agreement problem for halfspaces is equivalent to the so-called Hemisphere
problem and is long known to be NP-complete [23, 17]. Amaldi and Kann [1] showed that Maximum
Agreement for halfspaces is NP-hard to approximate within a factor of 261/262. This was later improved
by Ben-David et al. [5], and Bshouty and Burroughs [9], to approximation factors of 415/418 and 85/84, respectively.
to a very mild amount of adversarial noise [16, 4, 18]. Our result implies that these positive results
will not hold when the adversarial noise rate is ε for any constant ε > 0.
Kalai et al. gave the first non-trivial algorithm for agnostic learning of monomials, running in time 2^{Õ(√n)}
[24]. They also gave a breakthrough result for agnostic learning of halfspaces with respect to the
uniform distribution on the hypercube up to any constant accuracy (and analogous results for a
number of other settings). Their algorithms output linear thresholds of parities as hypotheses. In
contrast, our hardness result is for algorithms that output a halfspace (which is a linear threshold
of single variables).
Organization of the paper: We sketch the idea of our proof in Section 2. We define some
probability and analytical tools in Section 3. In Section 4 we define the dictatorship test, which is
an important gadget for the hardness reduction. For the purpose of illustration, we also show
why this dictatorship test already suffices to prove Theorem 1.1 assuming the Unique Games
Conjecture [29]. In Section 5, we describe a reduction from a variant of the Label Cover problem
to prove Theorem 1.1 under the assumption that P ≠ NP.
Notation: We use 0 to encode “False” and 1 to encode “True”. We denote by pos : R → {0, 1}
the indicator function of whether t ≥ 0; i.e., pos(t) = 1 when t ≥ 0 and pos(t) = 0 when t < 0.
For x = (x_1, x_2, . . . , x_n) ∈ {0, 1}^n, w ∈ R^n, and θ ∈ R, a halfspace h(x) is a Boolean function of
the form pos(w · x − θ); a monomial (conjunction) is a function of the form ∧_{i∈S} s_i, where S ⊆ [n]
and s_i is a literal of x_i, i.e., either x_i or ¬x_i; a disjunction is a function of the form
∨_{i∈S} s_i. One special case of monomials is the function f(x) = x_i for some i ∈ [n], also referred to
as the i-th dictator function.
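As a concrete illustration of this notation, the following small Python sketch (our own illustration; the function names are not from the paper) evaluates a halfspace, a monomial, a disjunction, and a dictator function on a point of the hypercube, assuming the convention pos(t) = 1 for t ≥ 0:

def pos(t):
    # pos(t) = 1 iff t >= 0, as in the notation above.
    return 1 if t >= 0 else 0

def halfspace(w, theta, x):
    # h(x) = pos(w . x - theta)
    return pos(sum(wi * xi for wi, xi in zip(w, x)) - theta)

def monomial(literals, x):
    # literals: list of (index, sign) pairs; sign=True stands for x_i, sign=False for NOT x_i.
    return int(all((x[i] == 1) == sign for i, sign in literals))

def disjunction(literals, x):
    return int(any((x[i] == 1) == sign for i, sign in literals))

def dictator(i, x):
    # The i-th dictator function f(x) = x_i, a special case of a monomial.
    return x[i]

x = [1, 0, 1]
print(halfspace([1, 1, 1], 2, x), monomial([(0, True), (2, True)], x), dictator(0, x))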
2 Proof Overview
We prove Theorem 1.1 by exhibiting a reduction from the k-Label Cover problem, which is
a particular variant of the Label Cover problem. The k-Label Cover problem is defined as
follows:
Definition 2.1. For positive integers M, N with M ≥ N and k ≥ 2, an instance of k-Label
Cover L(G(V, E), M, N, {π^{v,e} | e ∈ E, v ∈ e}) consists of a k-uniform connected (multi-)hypergraph
G(V, E) with vertex set V and an edge multiset E, together with a set of projection functions: every hyperedge
e = (v_1, . . . , v_k) is associated with a k-tuple of projections {π^{v_i,e}}_{i=1}^k, where π^{v_i,e} : [M] → [N].
A vertex labeling Λ is an assignment of labels to vertices, Λ : V → [M]. A labeling Λ is said to
strongly satisfy an edge e if π^{v_i,e}(Λ(v_i)) = π^{v_j,e}(Λ(v_j)) for every v_i, v_j ∈ e. A labeling Λ weakly
satisfies edge e if π^{v_i,e}(Λ(v_i)) = π^{v_j,e}(Λ(v_j)) for some v_i, v_j ∈ e, v_i ≠ v_j.
The goal in Label Cover is to find a vertex labeling that satisfies as many edges (projection
constraints) as possible.
2.1 Hardness assuming the Unique Games conjecture
For the sake of clarity, we first sketch the proof of Theorem 1.1 with a reduction from the k-
Unique Label Cover problem which is a special case of k-Label Cover where M = N and all
the projection functions {π^{v,e} | v ∈ e, e ∈ E} are bijections. The following inapproximability result
[?] for k-Unique Label Cover is equivalent to the Unique Games Conjecture of Khot [29].
Conjecture 2.2. For every constant η > 0 and every positive integer k, there exists an integer R_0
such that for all positive integers R ≥ R_0, given an instance L(G(V, E), R, R, {π^{v,e} | e ∈ E, v ∈ e})
it is NP-hard to distinguish between:
• strongly satisfiable instances: there exists a labeling Λ : V → [R] that strongly satisfies a 1 − kη
fraction of the edges E;
• almost unsatisfiable instances: there is no labeling that weakly satisfies a 2k²/R^{η/4} fraction of the
edges.
Given an instance L of k-Unique Label Cover, we will produce a distribution D over labeled
examples such that the following holds: if L is a strongly satisfiable instance, then there is a
disjunction that agrees with the label on a randomly chosen example with probability at least 1 − ε,
while if L is an almost unsatisfiable instance then no halfspace agrees with the label on a random
example from D with probability more than 1/2 + ε. Clearly, such a reduction implies Theorem 1.1
assuming the Unique Games Conjecture but with disjunctions in place of conjunctions. De Morgan’s
law and the fact that a negation of a halfspace is a halfspace then imply that the statement is also
true for monomials (we use disjunctions only for convenience).
Let L be an instance of k-Unique Label Cover on hypergraph G = (V, E) and a set of labels
[R]. The examples we generate will have |V| × R coordinates, i.e., belong to {0, 1}^{|V|×R}. These
coordinates are to be thought of as one block of R coordinates for every vertex v ∈ V. We will
index the coordinates of x ∈ {0, 1}^{|V|×R} as x = (x_v^{(r)})_{v∈V, r∈[R]}.
For every labeling Λ : V → [R] of the instance, there is a corresponding disjunction over
{0, 1}^{|V|×R} given by
    h(x) = ∨_{v∈V} x_v^{(Λ(v))}.
Thus, using a label r for a vertex v is encoded as including the literal x_v^{(r)} in the disjunction. Notice
that an arbitrary halfspace over {0, 1}|V |×R need not correspond to any labeling at all. The idea
would be to construct a distribution on examples which ensures that any halfspace agreeing with at
least a 1/2 + ε fraction of random examples somehow corresponds to a labeling Λ weakly satisfying
a constant fraction of the edges in L.
Fix an edge e = (v_1, . . . , v_k). For the sake of exposition, let us assume π^{v_i,e} is the identity
permutation for every i ∈ [k]; the general case is no more complicated.
For the edge e, we will construct a distribution on examples De with the following properties:
• All coordinates x_v^{(r)} for a vertex v ∉ e are fixed to be zero. Restricted to these examples, the
halfspace h can be written as h(x) = pos(Σ_{i∈[k]} ⟨w_{v_i}, x_{v_i}⟩ − θ).
• For any label r ∈ [R], the labeling Λ(v_1) = . . . = Λ(v_k) = r strongly satisfies the edge e.
Hence, the corresponding disjunction ∨_{i∈[k]} x_{v_i}^{(r)} needs to have agreement ≥ 1 − ε with the
examples from D_e.
• There exists a decoding procedure that, given a halfspace h, outputs a labeling Λ_h for L such
that, if h has agreement ≥ 1/2 + ε with the examples from D_e, then Λ_h weakly satisfies the edge
e with non-negligible probability.
For conceptual clarity, let us rephrase the above requirement as a testing problem. Given
a halfspace h, consider a randomized procedure that samples an example (x, b) from the distri-
bution De , and accepts if h(x) = b. This amounts to a test that checks if the function h cor-
responds to a consistent labeling. Further, let us suppose the halfspace h is given by
h(x) = pos(Σ_{v∈V} ⟨w_v, x_v⟩ − θ). Define the linear function f_v : {0, 1}^R → R as f_v(x_v) = ⟨w_v, x_v⟩. Then
we have h(x) = pos(Σ_{v∈V} f_v(x_v) − θ).
For a halfspace h corresponding to a labeling Λ, we will have f_v(x_v) = x_v^{(Λ(v))} – a dictator
function. Thus, in the intended solution every linear function f_v associated with the halfspace h is
a dictator function.
Now, let us again restate the above testing problem in terms of these linear functions. For
succinctness, we write f_i for the linear function f_{v_i}. We need a randomized procedure that does
the following:
• (Completeness) If each of the linear functions f_i is the r-th dictator function for some r ∈ [R],
then the test accepts with probability 1 − ε.
• (Soundness) If the test accepts with probability at least 1/2 + ε, then at least two of the linear
functions f_i and f_j must be “close” to the same dictator function.
A testing problem of the above nature is referred to as a dictatorship test and is a recurring
theme in hardness of approximation.
Notice that the notion of a linear function being close to a dictator function is not formally
defined yet. In most applications, a function is said to be close to a dictator if it has influential
coordinates. It is easy to see that this notion is not sufficient by itself here. For example, in the
halfspace pos(10^{100} x_1 + x_2 − 0.5), although the coordinate x_2 has little influence on the linear
function, it has significant influence on the halfspace.
We resolve this problem by using the notion of critical index (Definition 3.1) that was introduced
in [41] and has found numerous applications in the analysis of halfspaces [35, 38, 13]. Roughly
speaking, given a linear function f , the idea is to recursively delete its influential coordinates until
there are none left. The total number of coordinates so deleted is referred to as the critical index
of f . Let cτ (wi ) denote the critical index of wi , and let Cτ (wi ) denote the set of cτ (wi ) largest
coordinates of w_i. The linear function f_i is said to be close to the j-th dictator function for every j
in C_τ(w_i). A function is far from every dictator if it has critical index 0 – no influential coordinate
to delete.
An important issue is that the critical index of a linear function can be much larger than the
number of influential coordinates and cannot be appropriately bounded. In other words, a linear
function can be close to a large number of dictator functions, as per the definition above. To counter
this, we employ a structural lemma about halfspaces that was used in the recent work on fooling
halfspaces with limited independence [13]. Using this lemma, we are able to prove that if the critical
index is large, then one can in fact zero out the coordinates of wi outside the t largest coordinates
for some large enough t, and the agreement of the halfspace h only changes by a negligible amount!
Thus, we first carry out the zeroing operation for all linear functions with large critical index.
We now describe the above construction and analysis of the dictatorship test in some more
detail. It is convenient to think of the k queries x1 , . . . , xk as the rows of a k × R matrix with {0, 1}
entries. Henceforth, we will refer to matrices {0, 1}^{k×R} and their rows and columns.
We construct two distributions D_0, D_1 on {0, 1}^k such that for s ∈ {0, 1}, we have Pr_{x∼D_s}[∨_{i=1}^k x_i =
s] ≥ 1 − ε/2 for ε = o_k(1) (this will ensure the completeness of the reduction, i.e., certain disjunc-
tions pass with high probability). Further, the distributions D_0, D_1 will be carefully chosen to have
matching first four moments. This will be used in the soundness analysis where we will use an
invariance principle to infer structural properties of halfspaces that pass the test with probability
noticeably greater than 1/2.
We define the distribution D̃_s^R on matrices {0, 1}^{k×R} by sampling R columns independently
according to D_s, and then perturbing each bit with a small probability ε/2. We define the following
test (or equivalently, distribution on examples): given a halfspace h on {0, 1}^{k×R}, with probability
1/2 we check h(x) = 0 for a sample x ∼ D̃_0^R, and with probability 1/2 we check h(x) = 1 for a
sample x ∼ D̃_1^R.
Completeness: By construction, each of the R disjunctions OR_j(x) = ∨_{i=1}^k x_i^{(j)} passes the test
with probability at least 1 − ε (here x_i^{(j)} denotes the entry in the i-th row and j-th column of x).
Soundness: For the soundness analysis, suppose h(x) = pos(⟨w, x⟩ − θ) is a halfspace that
passes the test with probability at least 1/2 + ε. The halfspace h can be written in two ways by
expanding the inner product ⟨w, x⟩ along rows and columns, i.e., h(x) = pos(Σ_{i=1}^k ⟨w_i, x_i⟩ − θ) =
pos(Σ_{j=1}^R ⟨w^{(j)}, x^{(j)}⟩ − θ). Let us denote f_i(x_i) = ⟨w_i, x_i⟩.
First, let us see why the linear functions hwi , xi i must be close to some dictator. Note that we
need to show that two of the linear functions are close to the same dictator.
Suppose each of the linear functions fi is not close to any dictator. In other words, for each
i, no single coordinate of the vector w_i is too large (contains more than a τ fraction of the ℓ2 mass
‖w_i‖_2 of the vector w_i). Clearly, this implies that no single column of the matrix w is too large.
Recall that the halfspace is given by h(x) = pos(Σ_{j∈[R]} ⟨w^{(j)}, x^{(j)}⟩ − θ). Here l(x) = Σ_{j∈[R]} ⟨w^{(j)}, x^{(j)}⟩ −
θ is a degree-1 polynomial into which we are substituting values from two product distributions D_0^R
and D_1^R. Further, the distributions D_0 and D_1 have matching moments up to order 4 by design.
Using the invariance principle, the distribution of l(x) is roughly the same whether x is drawn from D_0^R
or D_1^R. Thus, by the invariance principle, the halfspace h is unable to distinguish between the
distributions D_0^R and D_1^R with a noticeable advantage.
Further, suppose no two linear functions f_i are close to the same dictator, i.e., C_τ(w_i) ∩ C_τ(w_j) =
∅ for all i ≠ j. In this case, we condition on the values of x_i^{(j)} for j ∈ C_τ(w_i). Since C_τ(w_i) ∩ C_τ(w_j) = ∅, this
conditions at most one value in each column. Therefore, the conditional distributions on each column
in the cases D_0 and D_1 still have matching first three moments. We thus apply the invariance principle,
using the fact that after deleting the coordinates in C_τ(w_i), all the remaining coefficients of the
weight vector w are small (by definition of the critical index), and conclude again that h cannot distinguish
D_0^R from D_1^R. This implies that C_τ(w_i) ∩ C_τ(w_j) ≠ ∅
for some two rows i, j, which finishes the proof of the soundness claim.
The above consistency-enforcing test almost immediately yields the Unique Games hardness of
weak learning disjunctions by halfspaces via standard methods.
Unlike previous invariance principle based proofs which are only known to give Unique-Games
hardness, we are able to reduce from a version of the Label Cover problem, based on unique
on average projections, that can be shown to be NP-hard. It is of great interest to find other
applications where a weak uniqueness property like the smoothness condition mentioned above
can be used to convert a Unique-Games hardness result to an unconditional NP-hardness result.
Indeed, inspired by the success of this work in avoiding the UGC assumption and using some of
our methods, follow-up work has managed to bypass the Unique Games conjecture in some optimal
geometric inapproximability results [?]. To the best of our knowledge, the results of [?] are the
first NP-hardness proofs showing a tight inapproximability factor that is related to fundamental
parameters of Gaussian space, and among the small handful of results where optimality of a non-
trivial semidefinite programming based algorithm is shown under the assumption P ≠ NP. We hope
that this paper has thus opened the avenue to convert at least some of the many tight Unique-Games
hardness results to NP-hardness results.
3 Preliminaries
In this section, we define two important tools in our analysis: i) critical index, ii) invariance
principle.
The notion of critical index was first introduced by Servedio [41] and plays an important role in
the analysis of halfspaces in [35, 38, 13].
Definition 3.1. Given a real vector w = (w^{(1)}, w^{(2)}, . . . , w^{(n)}) ∈ R^n, reorder the coordinates
by decreasing absolute value, i.e., |w^{(i_1)}| ≥ |w^{(i_2)}| ≥ . . . ≥ |w^{(i_n)}|, and denote σ_t² = Σ_{j=t}^n |w^{(i_j)}|².
For 0 ≤ τ ≤ 1, the τ-critical index of the vector w is defined to be the smallest index k such that
|w^{(i_k)}| ≤ τ σ_k. If no such k exists (i.e., |w^{(i_k)}| > τ σ_k for all k), the τ-critical index is defined to be +∞. The
vector w is said to be τ-regular if its τ-critical index is 1.
A simple observation from [13] is that if the critical index of a sequence is large then the sequence
must contain a geometrically decreasing subsequence.
Lemma 3.2. (Lemma 5.5 in [13]) Given a vector w = (w^{(i)})_{i=1}^n such that |w^{(1)}| ≥ |w^{(2)}| ≥ . . . ≥
|w^{(n)}|, if the τ-critical index of the vector w is larger than l, then for any 1 ≤ i ≤ j ≤ l + 1,
    |w^{(j)}| ≤ σ_j ≤ (√(1 − τ²))^{j−i} σ_i ≤ (√(1 − τ²))^{j−i} |w^{(i)}|/τ.
For a τ-regular weight vector, the following lemma bounds the probability that its weighted
sum falls into a small interval under certain distributions on the points. The proof is in Appendix B.
Lemma 3.3. Let w ∈ R^n be a τ-regular vector with Σ_i (w^{(i)})² = 1, and let D be a distribution over
{0, 1}^n. Define a distribution D̃ on {0, 1}^n as follows: to generate y from D̃, first sample x from
D and, independently for each i, set y^{(i)} = x^{(i)} with probability 1 − γ and set y^{(i)} to a uniformly random bit with
probability γ. Then for any a < b,
    Pr_{y∼D̃}[⟨w, y⟩ ∈ [a, b]] ≤ 4(b − a)/√γ + 4τ/√γ + 2e^{−γ²/(2τ²)}.
Intuitively, by the Berry-Esseen Theorem, ⟨w, y⟩ is τ-close to a Gaussian distribution if each
y^{(i)} is a random bit; in that case we can bound the probability that ⟨w, y⟩ falls into the interval [a, b].
In the above lemma, each y^{(i)} has probability γ of being a random bit, so roughly a γ fraction of the
coordinates are random bits, and we can similarly bound the probability that ⟨w, y⟩ falls into the interval [a, b].
Definition 3.4. For a vector w ∈ R^n, define the set of indices H_t(w) ⊆ [n] as the set of indices of
the t largest coordinates of w in absolute value. If the τ-critical index of w is c_τ, define the
set of indices C_τ(w) = H_{c_τ}(w). In other words, C_τ(w) is the set of indices whose deletion makes
the vector w τ-regular.
Definition 3.5. For a vector w ∈ R^n and a subset of indices S ⊆ [n], define the vector Truncate(w, S) ∈
R^n by
    (Truncate(w, S))^{(i)} = w^{(i)} if i ∈ S, and (Truncate(w, S))^{(i)} = 0 otherwise.
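The following Python sketch (our own illustration; the names are not from the paper) computes the τ-critical index of Definition 3.1 and the sets H_t(w), C_τ(w) and Truncate(w, S) of Definitions 3.4 and 3.5:

import numpy as np

def tau_critical_index(w, tau):
    # Definition 3.1: sort |w| decreasingly; return the smallest t (1-indexed) with
    # |w_(t)| <= tau * sigma_t, where sigma_t^2 sums the squares of the t-th largest
    # coordinate and all smaller ones; +infinity if no such t exists.
    a = np.sort(np.abs(np.asarray(w, dtype=float)))[::-1]
    sigma = np.sqrt(np.cumsum((a ** 2)[::-1])[::-1])   # sigma[t-1] = sigma_t
    hits = np.nonzero(a <= tau * sigma)[0]
    return int(hits[0]) + 1 if hits.size else float("inf")

def H_t(w, t):
    # Definition 3.4: indices of the t largest coordinates of w in absolute value.
    return set(np.argsort(-np.abs(np.asarray(w, dtype=float)))[:t].tolist())

def C_tau(w, tau):
    # Definition 3.4: C_tau(w) = H_{c_tau}(w), where c_tau is the tau-critical index.
    c = tau_critical_index(w, tau)
    return H_t(w, len(w) if c == float("inf") else c)

def truncate(w, S):
    # Definition 3.5: keep coordinates with indices in S, zero out the rest.
    w = np.asarray(w, dtype=float)
    out = np.zeros_like(w)
    idx = list(S)
    out[idx] = w[idx]
    return out

# A rapidly decaying vector is far from tau-regular: its tau-critical index is large.
w = [2.0 ** (-i) for i in range(20)]
print(tau_critical_index(w, tau=0.1), sorted(H_t(w, 3)))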
As suggested by Lemma 3.2, a weight vector with a large critical index has a geometrically
decreasing subsequence. The following two lemmas use this fact to bound the probability that the
weighted sum of a geometrically decreasing sequence of weights falls into a small interval. First,
we restate Claim 5.7 from [13] here.
Lemma 3.6. [Claim 5.7, [13]] Let w = (w^{(1)}, . . . , w^{(T)}) be such that |w^{(1)}| ≥ |w^{(2)}| ≥ . . . ≥ |w^{(T)}| > 0
and |w^{(i+1)}| ≤ |w^{(i)}|/3 for 1 ≤ i ≤ T − 1. Then for any interval I = [α − |w^{(T)}|/6, α + |w^{(T)}|/6] of length
|w^{(T)}|/3, there is at most one point x ∈ {0, 1}^T such that ⟨w, x⟩ ∈ I.
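As a quick sanity check of Lemma 3.6 (an illustration we add here, not part of the original text), the following brute-force Python snippet counts, for a weight vector decreasing by factors of 3, how many points of {0, 1}^T land in an interval of length |w^{(T)}|/3; the count never exceeds one:

from itertools import product

def points_in_interval(w, alpha):
    # Count x in {0,1}^T with <w, x> inside [alpha - |w_T|/6, alpha + |w_T|/6],
    # the interval of length |w_T|/3 from Lemma 3.6.
    half = abs(w[-1]) / 6.0
    return sum(1 for x in product([0, 1], repeat=len(w))
               if alpha - half <= sum(wi * xi for wi, xi in zip(w, x)) <= alpha + half)

w = [1.0 / 3 ** i for i in range(8)]   # |w^{(i+1)}| <= |w^{(i)}| / 3
print(max(points_in_interval(w, a / 100.0) for a in range(0, 160)))   # prints 1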
Lemma 3.7. Let w = (w^{(1)}, . . . , w^{(T)}) be such that |w^{(1)}| ≥ |w^{(2)}| ≥ . . . ≥ |w^{(T)}| > 0 and |w^{(i+1)}| ≤
|w^{(i)}|/3 for 1 ≤ i ≤ T − 1. Let D be a distribution over {0, 1}^T. Define a distribution D̃ on {0, 1}^T
as follows: to generate y from D̃, sample x from D and, independently for each i, set
    y^{(i)} = x^{(i)} with probability 1 − γ, and y^{(i)} = a random bit with probability γ.
Then for any interval I of length |w^{(T)}|/3,
    Pr_{y∼D̃}[⟨w, y⟩ ∈ I] ≤ (1 − γ/2)^T.
3.2 Invariance Principle
While invariance principles have been shown in various settings by [37, 11, 36], we restate a version of
the principle well suited for our application. We present a self-contained proof for it in Appendix C.
Definition 3.8. A function Ψ : R → R whose fourth-order derivative exists everywhere on
R is said to be K-bounded if |Ψ''''(t)| ≤ K for all t ∈ R.
Definition 3.9. Two ensembles of random variables P = (p_1, . . . , p_k) and Q = (q_1, . . . , q_k) are said
to have matching moments up to degree d if for every multi-set S of elements from [k] with |S| ≤ d, we
have E[Π_{i∈S} p_i] = E[Π_{i∈S} q_i].
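To make Definition 3.9 concrete, here is a small Python helper (our own sketch; it is an empirical check with a tolerance, not an exact verification) that tests whether two ensembles, given as arrays of samples, have approximately matching moments up to degree d:

from itertools import combinations_with_replacement
import numpy as np

def moments_match(P, Q, d, tol=1e-2):
    # P, Q: (num_samples, k) arrays of samples from two ensembles of k random variables.
    # Checks E[prod_{i in S} p_i] ~= E[prod_{i in S} q_i] for every multi-set S with |S| <= d.
    k = P.shape[1]
    for size in range(1, d + 1):
        for S in combinations_with_replacement(range(k), size):
            if abs(np.prod(P[:, list(S)], axis=1).mean()
                   - np.prod(Q[:, list(S)], axis=1).mean()) > tol:
                return False
    return True

rng = np.random.default_rng(0)
P = rng.integers(0, 2, size=(200000, 3)).astype(float)
Q = rng.integers(0, 2, size=(200000, 3)).astype(float)
print(moments_match(P, Q, d=3))   # two samples of the same distribution: True with high probability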
Theorem 3.10. (Invariance Principle) Let A = {A^{{1}}, . . . , A^{{R}}}, B = {B^{{1}}, . . . , B^{{R}}} be fam-
ilies of ensembles of random variables with A^{{i}} = {a_1^{(i)}, . . . , a_{k_i}^{(i)}} and B^{{i}} = {b_1^{(i)}, . . . , b_{k_i}^{(i)}}, satis-
fying the following properties:
• For each i ∈ [R], the ensembles (A^{{i}}, B^{{i}}) have matching moments up
to degree 3. Further, all the random variables in A and B are bounded by 1 in absolute value.
• The ensembles A^{{i}} are all independent of each other; similarly, the ensembles B^{{i}} are inde-
pendent of each other.
Given a set of vectors l = {l^{{1}}, . . . , l^{{R}}} (l^{{i}} ∈ R^{k_i}), define the linear function l : R^{k_1} × · · · × R^{k_R} → R as
    l(x) = Σ_{i∈[R]} ⟨l^{{i}}, x^{{i}}⟩.
Then, for every K-bounded function Ψ and every θ ∈ R,
    |E_A[Ψ(l(A) − θ)] − E_B[Ψ(l(B) − θ)]| ≤ O(K) · Σ_{i∈[R]} ‖l^{{i}}‖_1^4.
Further, define the spread function c(α) corresponding to the ensembles A, B and the linear function l
to be any bound such that, for every θ' ∈ R,
    Pr_A[l(A) ∈ [θ' − α, θ' + α]] ≤ c(α)   and   Pr_B[l(B) ∈ [θ' − α, θ' + α]] ≤ c(α).
Then, for every θ ∈ R,
    |E_A[pos(l(A) − θ)] − E_B[pos(l(B) − θ)]| ≤ O(1/α⁴) · Σ_{i∈[R]} ‖l^{{i}}‖_1^4 + 2c(α).
Roughly speaking, the second part of the theorem states that the pos function can be thought of as
1/α⁴-bounded with error parameter c(α).
4 Construction of the Dictatorship Test
In this section we describe the construction of the dictatorship test which will be the key ingredient
in the hardness reduction from k-Unique Label Cover.
The dictatorship test is based on the following two distributions D_0 and D_1 defined on {0, 1}^k.
Lemma 4.1. For k ∈ N, there exist two probability distributions D_0, D_1 on {0, 1}^k such that for
x = (x_1, . . . , x_k),
    Pr_{x∼D_0}[every x_l is 0] ≥ 1 − 2/√k   and   Pr_{x∼D_1}[every x_l is 0] ≤ 1/√k,
while D_0 and D_1 have matching moments up to degree 4, i.e., for all i, j, m, n ∈ [k],
    E_{D_0}[x_i] = E_{D_1}[x_i],  E_{D_0}[x_i x_j] = E_{D_1}[x_i x_j],  E_{D_0}[x_i x_j x_m] = E_{D_1}[x_i x_j x_m],  E_{D_0}[x_i x_j x_m x_n] = E_{D_1}[x_i x_j x_m x_n].
Proof. Set ε = 1/√k. The distribution D_1 is defined, in particular, so that
1. with probability (1 − ε), exactly one randomly chosen bit is set to 1 and all the others to 0.
The distribution D_0 is parameterized by nonnegative reals ε_1, ε_2, ε_3, ε_4 and sets all bits to 0 with probability at least 1 − (ε_1 + ε_2 + ε_3 + ε_4).
From the definition of D_0, D_1, we know that Pr_{x∼D_0}[every x_i is 0] ≥ 1 − (ε_1 + ε_2 + ε_3 + ε_4) and
Pr_{x∼D_1}[every x_i is 0] ≤ ε = 1/√k.
It remains to determine each ε_i. Notice that the moment matching conditions can be expressed
as a linear system over the parameters ε_1, ε_2, ε_3, ε_4 as follows:
    Σ_{i=1}^4 ε_i (i/k^{1/3}) = (1 − ε)/k + Σ_{i=1}^4 (ε/4)(i/k^{1/3})
    Σ_{i=1}^4 ε_i (i/k^{1/3})² = Σ_{i=1}^4 (ε/4)(i/k^{1/3})²
    Σ_{i=1}^4 ε_i (i/k^{1/3})³ = Σ_{i=1}^4 (ε/4)(i/k^{1/3})³
    Σ_{i=1}^4 ε_i (i/k^{1/3})⁴ = Σ_{i=1}^4 (ε/4)(i/k^{1/3})⁴.
We then show that this linear system has a feasible solution ε_1, ε_2, ε_3, ε_4 ≥ 0 with Σ_{i=1}^4 ε_i ≤ 2/√k.
To prove this, we apply Cramer's rule:
    ε_1 = det(M_1) / det(M),
where M is the coefficient matrix of the system, with rows
    (1/k^{1/3}, 2/k^{1/3}, 3/k^{1/3}, 4/k^{1/3}),
    (1/k^{2/3}, 4/k^{2/3}, 9/k^{2/3}, 16/k^{2/3}),
    (1/k, 8/k, 27/k, 64/k),
    (1/k^{4/3}, 16/k^{4/3}, 81/k^{4/3}, 256/k^{4/3}),
and M_1 is M with its first column replaced by the right-hand-side vector
    ((1 − ε)/k + Σ_{i=1}^4 (ε/4)(i/k^{1/3}),  Σ_{i=1}^4 (ε/4)(i/k^{1/3})²,  Σ_{i=1}^4 (ε/4)(i/k^{1/3})³,  Σ_{i=1}^4 (ε/4)(i/k^{1/3})⁴).
For large enough k, we have 0 ≤ ε_1 ≤ 1/(2√k). By a similar calculation, we can bound each of ε_2, ε_3, ε_4 by 1/(2√k).
Overall, we have ε_1 + ε_2 + ε_3 + ε_4 ≤ 2/√k.
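As a numerical sanity check (our own sketch, assuming the linear system exactly as reconstructed above and ε = 1/√k), the following Python snippet solves the 4 × 4 system for a large k and confirms that the solution is nonnegative with ε_1 + ε_2 + ε_3 + ε_4 ≤ 2/√k:

import numpy as np

k = 10 ** 8                      # k needs to be fairly large for nonnegativity to kick in
eps = 1.0 / np.sqrt(k)
base = np.array([1, 2, 3, 4]) / k ** (1.0 / 3)

# Coefficient matrix: row d has entries (i / k^{1/3})^d for i = 1..4.
M = np.vstack([base ** d for d in range(1, 5)])
# Right-hand side: (eps/4) * sum_i (i/k^{1/3})^d, plus (1 - eps)/k in the degree-1 equation.
rhs = np.array([(eps / 4) * np.sum(base ** d) for d in range(1, 5)])
rhs[0] += (1 - eps) / k

eps_i = np.linalg.solve(M, rhs)
print(eps_i)
print(bool(np.all(eps_i >= 0)), float(eps_i.sum()) <= 2 / np.sqrt(k))   # expected: True True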
Definition 4.2. For b ∈ {0, 1}, define the distribution D̃_b on {0, 1}^k as follows: to generate y from D̃_b, sample x from D_b and, independently for each i ∈ [k], set y_i = x_i with probability 1 − 1/k² and set y_i to a uniformly random bit with probability 1/k².
Observation 4.3. D̃0 and D̃1 also have matching moments up to degree 4.
Proof. Since the noise is an independent uniform random bit, when calculating moments
of y, such as E_{D̃_b}[y_{i_1} y_{i_2} · · · y_{i_d}], we can substitute y_i by (1 − γ)x_i + γ/2, where γ = 1/k² is the noise rate. Therefore, a degree-d
moment of y can be expressed as a weighted sum of moments of x of degree up to d. Since
D_0 and D_1 have matching moments up to degree 4, it follows that D̃_0 and D̃_1 also have the same
property.
The following simple lemma asserts that conditioning the two distributions D̃0 and D̃1 on the
same coordinate xj being fixed to value b results in conditional distributions that still have matching
moments up to degree 3.
Lemma 4.4. Given two distributions P_0, P_1 on {0, 1}^k with matching moments up to degree d, for
any multi-set S of elements from [k] with |S| ≤ d − 1, any j ∈ [k] and any c ∈ {0, 1},
    E_{P_0}[Π_{i∈S} x_i | x_j = c] = E_{P_1}[Π_{i∈S} x_i | x_j = c].
Proof. First consider c = 1. Since |S ∪ {j}| ≤ d and the moments match up to degree d,
    E_{P_0}[Π_{i∈S} x_i | x_j = 1] = E_{P_0}[x_j Π_{i∈S} x_i] / E_{P_0}[x_j] = E_{P_1}[x_j Π_{i∈S} x_i] / E_{P_1}[x_j] = E_{P_1}[Π_{i∈S} x_i | x_j = 1].
For the case c = 0, replace x_j with x'_j = 1 − x_j. It is easy to see that P_0 and P_1 still have
matching moments and conditioning on x_j = 0 is the same as conditioning on x'_j = 1. Hence we
can reduce to the case c = 1.
4.2 The Dictatorship Test
Let R be a positive integer. Based on the distributions D0 and D1 , we define the dictatorship test
as follows:
4. Output the labeled example (y, b). Equivalently, if h denotes the halfspace, ACCEPT
if h(y) = b.
We can also view y as being generated as follows: i) with probability 1/2, generate a negative
example (label 0) from distribution D̃_0^R; ii) with probability 1/2, generate a positive example (label 1) from distribution
D̃_1^R.
The dictatorship test has the following completeness and soundness properties.
Theorem 4.5. (completeness) For any j ∈ [R], the disjunction h(y) = ∨_{i=1}^k y_i^{(j)} passes with probability ≥ 1 − 3/√k.
Theorem 4.6. (soundness) Fix τ = 1/k⁷ and t = (1/τ²)(3 ln(1/τ) + ln R) + ⌈4k² ln k⌉ · ⌈(4/τ²) ln(1/τ)⌉. Let
h(y) = pos(⟨w, y⟩ − θ) be a halfspace such that H_t(w_i) ∩ H_t(w_j) = ∅ for all i ≠ j ∈ [k]. Then the
halfspace h(y) passes the dictatorship test with probability at most 1/2 + O(1/k).
Proof. (Theorem 4.5) If x is generated from D_0^R, we know that with probability at least 1 − 2/√k, all
the bits in {x_1^{(j)}, x_2^{(j)}, . . . , x_k^{(j)}} are set to 0. By a union bound over the noise, with probability at least 1 − 2/√k − 1/k,
{y_1^{(j)}, y_2^{(j)}, . . . , y_k^{(j)}} are all 0, in which case the test passes as ∨_{i=1}^k y_i^{(j)} = 0. If x is generated
from D_1^R, we know that with probability at least 1 − 1/√k, one of the bits in {x_1^{(j)}, x_2^{(j)}, . . . , x_k^{(j)}} is set
to 1, and by a union bound one of {y_1^{(j)}, y_2^{(j)}, . . . , y_k^{(j)}} is set to 1 with probability at least 1 − 1/√k − 1/k,
in which case the test passes since ∨_{i=1}^k y_i^{(j)} = 1. Overall, the test passes with probability at least
1 − 3/√k.
We will prove the contrapositive of Theorem 4.6: if some h(y) passes the above dictator-
ship test with high probability, then we can decode, for each w_i (i ∈ [k]), a small list of coordinates
such that at least two of the lists intersect.
The proof is based on two key lemmas (Lemmas 4.7 and 4.8). The first lemma states that if a
halfspace passes the test with good probability, then two of its critical index sets C_τ(w_i), C_τ(w_j)
must intersect. This would immediately imply Theorem 4.6 if |C_τ(w_i)| were at most t for every i. The second lemma
states that every halfspace can be approximated by another halfspace in which each |C_τ(w_i)| is less than
t; so we can assume that the critical indices are small without loss of generality.
Let h(y) be a halfspace function on {0, 1}^{kR} given by h(y) = pos(⟨w, y⟩ − θ). Equivalently,
h(y) can be written as
    h(y) = pos(Σ_{j∈[R]} ⟨w^{(j)}, y^{(j)}⟩ − θ) = pos(Σ_{i∈[k]} ⟨w_i, y_i⟩ − θ).
Similarly define the ensemble B = B^{{1}}, . . . , B^{{R}} using y chosen randomly from the distribution
D̃_1^R. Further, let us denote l^{{j}} = (l_1^{(j)}, . . . , l_k^{(j)}). Now we apply the invariance principle (Theorem
3.10) to the ensembles A, B and the linear function l. For each j ∈ [R], there is at most one
coordinate i ∈ [k] such that j ∈ C_τ(w_i). Thus, conditioning on y_C amounts to fixing at most
one variable y_i^{(j)} in each column {y_i^{(j)}}_{i∈[k]}. By Lemma 4.4, since D̃_0 and D̃_1 have matching moments
up to degree 4, we get that A^{{j}} and B^{{j}} have matching moments up to degree 3. Also notice
that max_{j∈[R], i∈[k]} |l_i^{(j)}| ≤ τ‖l_i‖_2 ≤ τ‖l‖_2 (as each l_i is τ-regular), and each y_i^{(j)} is set to be a random
unbiased bit with probability 1/k²; by Lemma 3.3, the linear function l and the ensembles A, B
satisfy the following spread property for every θ' ∈ R:
    Pr_A[l(A) ∈ [θ' − α, θ' + α]] ≤ c(α),
    Pr_B[l(B) ∈ [θ' − α, θ' + α]] ≤ c(α),
where c(α) ≤ 8αk + 4τk + 2e^{−1/(2τ²k⁴)} (by setting γ = 1/k² and |b − a| = 2α in Lemma 3.3). Using the
invariance principle (Theorem 3.10), this implies:
    |E_A[pos(⟨s, y_C⟩ + Σ_{j∈[R]} ⟨l^{{j}}, A^{{j}}⟩ − θ) | y_C] −
     E_B[pos(⟨s, y_C⟩ + Σ_{j∈[R]} ⟨l^{{j}}, B^{{j}}⟩ − θ) | y_C]|
    ≤ O(1/α⁴ · Σ_{j∈[R]} ‖l^{{j}}‖_1⁴) + 2c(α).    (1)
By the definition of the critical index, we have max_{j∈[R]} |l_i^{(j)}| ≤ τ‖l_i‖_2. Using this, we can bound
Σ_{j∈[R]} ‖l^{{j}}‖_1⁴ as follows:
    Σ_{j∈[R]} ‖l^{{j}}‖_1⁴ ≤ k⁴ Σ_{j∈[R]} Σ_{i∈[k]} |l_i^{(j)}|⁴ ≤ k⁴ Σ_{i∈[k]} max_{j∈[R]} |l_i^{(j)}|² · ‖l_i‖_2²
    ≤ k⁴ τ² Σ_{i∈[k]} ‖l_i‖_2² ≤ k⁴ τ² ‖l‖_2² ≤ 1/k^{10}.
In the final inequality of the above calculation, we used the fact that τ = 1/k⁷ and ‖l‖_2 ≤ 1. Let us
choose α = 1/k²; the right-hand side of (1) is then bounded by O(1/k) for all settings of y_C. Averaging over all
settings of y_C, we get that
    |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| ≤ O(1/k).
The above lemma asserts that unless some two vectors w_i, w_j have a common influential co-
ordinate, the halfspace h(y) cannot distinguish between D̃_0^R and D̃_1^R. Unlike with the traditional
notion of influence, it is unclear whether the number of coordinates in C_τ(w_i) is small. The following
lemma yields a way to get around this.
Lemma 4.8. (Bounding the number of influential coordinates) Let t be set as in Theorem 4.6.
Given a halfspace h(y) and r ∈ [k] such that |C_τ(w_r)| > t, define h̃(y) = pos(Σ_{i∈[k]} ⟨w̃_i, y_i⟩ − θ)
as follows: w̃_r = Truncate(w_r, H_t(w_r)) and w̃_i = w_i for all i ≠ r. Then,
    |E_{D̃_0^R}[h̃(y)] − E_{D̃_0^R}[h(y)]| ≤ 1/k²   and   |E_{D̃_1^R}[h̃(y)] − E_{D̃_1^R}[h(y)]| ≤ 1/k².
Similarly, define the vectors y_1^G, y_1^H, y_1^{>t}. We now rewrite the halfspace functions h(y) and h̃(y)
as:
    h(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ + ⟨w_1^{>t}, y_1^{>t}⟩ − θ),
    h̃(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ − θ).
Since the τ-critical index of w_1 exceeds t, Lemma 3.2 gives
    |w_1^{(g_T)}|² ≥ (τ²/(1 − τ²)^{t − g_T}) ‖w_1^{>t}‖_2² ≥ (τ²/(1 − τ²)^{(1/τ²)(3 ln(1/τ) + ln R)}) ‖w_1^{>t}‖_2² ≥ (R/τ) ‖w_1^{>t}‖_2².
Using the fact that R‖w_1^{>t}‖_2² ≥ ‖w_1^{>t}‖_1², we get that ‖w_1^{>t}‖_1 ≤ √τ |w_1^{(g_T)}| ≤ (1/6)|w_1^{(g_T)}|. Combin-
ing the above inequality with (2) we see that
    Pr_{D̃_0^R}[h(y) ≠ h̃(y)] ≤ Pr_{D̃_0^R}[|Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ − θ| ≤ |⟨w_1^{>t}, y_1^{>t}⟩|]
    ≤ Pr_{D̃_0^R}[|Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ − θ| ≤ |w_1^{(g_T)}|/6]
    = Pr_{D̃_0^R}[⟨w_1^G, y_1^G⟩ ∈ [θ' − |w_1^{(g_T)}|/6, θ' + |w_1^{(g_T)}|/6]],
where θ' = −Σ_{i=2}^k ⟨w_i, y_i⟩ − ⟨w_1^H, y_1^H⟩ + θ. Any fixing of the remaining coordinates determines θ' ∈ R and induces a
certain distribution on y_1^G. However, the 1/k² noise introduced in y_1^G is completely independent.
This corresponds to the setting of Lemma 3.7, and hence we can bound the above probability by
(1 − 1/(2k²))^T ≤ 1/k². The result follows from averaging over all values of θ'.
With the two lemmas above, we now prove the soundness property.
Proof. (Theorem 4.6) The probability that h(y) passes the test is 1/2 + (1/2)(E_{D̃_1^R}[h(y)] − E_{D̃_0^R}[h(y)]).
Therefore, it suffices to show that |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| = O(1/k). Define I = {r ∈ [k] : |C_τ(w_r)| > t} and consider two cases.
1. I = ∅; i.e., |C_τ(w_i)| ≤ t for all i ∈ [k]. Then for any i ≠ j ∈ [k], H_t(w_i) ∩ H_t(w_j) = ∅ implies
C_τ(w_i) ∩ C_τ(w_j) = ∅, and by Lemma 4.7, |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| ≤ O(1/k).
Figure 4.4: The reduction from k-Unique Label Cover.
1. Sample an edge e = (v_1, . . . , v_k) ∈ E.
2. Generate a random bit b ∈ {0, 1}.
3. Sample x ∈ {0, 1}^{kR} from D̃_b^R.
4. Define y ∈ {0, 1}^{|V|×R} as follows:
   (a) For each v ∉ e, set y_v = 0.
   (b) For each i ∈ [k] and j ∈ [R], set y_{v_i}^{(j)} = x_i^{(π^{v_i,e}(j))}.
5. Output the labeled example (y, b).
2. I ≠ ∅. Then for each r ∈ I, we set w̃_r = Truncate(w_r, H_t(w_r)) and replace w_r with w̃_r in h to
get a new halfspace h'. Since such replacements occur at most k times and by Lemma 4.8 every
replacement changes the output of the halfspace on at most a 1/k² fraction of examples, we can bound
the overall change by k × 1/k² = 1/k. That is,
    |E_{D̃_0^R}[h'(y)] − E_{D̃_0^R}[h(y)]| ≤ 1/k,   |E_{D̃_1^R}[h'(y)] − E_{D̃_1^R}[h(y)]| ≤ 1/k.    (3)
For the halfspace h' we have |C_τ(w̃_r)| ≤ t for all r ∈ [k], so Case 1 applies to h' and gives |E_{D̃_0^R}[h'(y)] − E_{D̃_1^R}[h'(y)]| ≤ O(1/k). Combining this with (3) yields |E_{D̃_0^R}[h(y)] − E_{D̃_1^R}[h(y)]| ≤ O(1/k), completing the proof.
With the dictatorship test defined, we now describe briefly a reduction from the k-Unique Label
Cover problem to agnostic learning of monomials, thus showing Theorem 1.1 under the Unique
Games Conjecture (Conjecture 2.2). Although our final hardness result only assumes P ≠ NP, we
describe the reduction from k-Unique Label Cover for the purpose of illustrating the main idea of
our proof.
Let L(G(V, E), R, R, {π^{v,e} | v ∈ V, e ∈ E}) be an instance of k-Unique Label Cover. The re-
duction is defined in Figure 4.4. It produces a distribution over labeled examples (y, b), where
y ∈ {0, 1}^{|V|×R} and the label b ∈ {0, 1}. We index the coordinates of y ∈ {0, 1}^{|V|×R} by y_w^{(i)} (for
w ∈ V, i ∈ [R]) and denote by y_w (for w ∈ V) the vector (y_w^{(1)}, y_w^{(2)}, . . . , y_w^{(R)}).
Proof of Theorem 1.1 assuming the Unique Games Conjecture. Fix k = 10/ε², a sufficiently small constant η > 0 (as a function of ε), and a
positive integer R ≥ ⌈(2k)^{1/η²}⌉ for which Conjecture 2.2 holds.
Completeness: Suppose that Λ : V → [R] is a labeling that strongly satisfies a 1 − kη fraction
of the edges. Consider the disjunction h(y) = ∨_{v∈V} y_v^{(Λ(v))}. For at least a 1 − kη fraction of edges
e = (v_1, v_2, . . . , v_k) ∈ E, we have π^{v_1,e}(Λ(v_1)) = · · · = π^{v_k,e}(Λ(v_k)) = r for some r ∈ [R]. Let us fix such a choice of edge e in
step 1. As all coordinates of y outside of {y_{v_1}, . . . , y_{v_k}} are set to 0 in step 4(a), the disjunction
reduces to ∨_{i∈[k]} y_{v_i}^{(Λ(v_i))} = ∨_{i∈[k]} x_i^{(r)}. By Theorem 4.5, such a disjunction agrees with (y, b)
with probability at least 1 − 3/√k. Therefore h(y) agrees with a random example with probability
at least (1 − 3/√k)(1 − kη) ≥ 1 − 3/√k − kη ≥ 1 − ε.
Soundness: Suppose there exists a halfspace h(y) = pos(Σ_{v∈V} ⟨w_v, y_v⟩ − θ) that agrees with more than
a 1/2 + ε ≥ 1/2 + 1/√k fraction of the examples. Set t = k^{14}(3 ln(k⁷) + ln R) + ⌈4k^{14} ln k⁷⌉ · ⌈4k² ln k⌉ =
O(k^{16} ln R) (same as in Theorem 4.6). Define the labeling Λ using the following strategy: for each
vertex v ∈ V randomly pick a label from H_t(w_v).
By an averaging argument, for at least an ε/2 fraction of the edges e ∈ E generated in step 1 of the
reduction, h(y) agrees with the examples corresponding to e with probability at least 1/2 + ε/2. We
will refer to such edges as good. By Theorem 4.6, for each good edge e ∈ E there exist i, j ∈ [k]
such that π^{v_i,e}(H_t(w_{v_i})) ∩ π^{v_j,e}(H_t(w_{v_j})) ≠ ∅. Therefore the edge e ∈ E is weakly satisfied by the
labeling Λ with probability at least 1/t². Hence, in expectation the labeling Λ weakly satisfies at least an
(ε/2) · (1/t²) = Ω(1/(k^{33} ln² R)) ≥ 2k²/R^{η/4} fraction of the edges (by the choice of R and t), contradicting the soundness guarantee of Conjecture 2.2.
5 The Reduction from Smooth Label Cover
In this section, we describe a reduction from k-Label Cover with an additional smoothness
property to the problem of agnostic learning of disjunctions by halfspaces. This will give us Theo-
rem 1.1 without assuming the Unique Games Conjecture.
Our reduction uses the following hardness result for k-Label Cover (Definition 2.1) with the
additional smoothness property.
Theorem 5.1. There exists a constant γ > 0 such that for any integer parameters J, u ≥ 1, it is NP-
hard to distinguish between the following two types of k-Label Cover L(G(V, E), M, N, {π^{v,e} | e ∈
E, v ∈ e}) instances with M = 7^{(J+1)u} and N = 2^u 7^{Ju}:
1. (Strongly satisfiable instances) There is some labeling that strongly satisfies every hyperedge.
2. (Instances that are not 2k²2^{−γu}-weakly satisfiable) There is no labeling that weakly satisfies
at least a 2k²2^{−γu} fraction of the hyperedges.
Moreover, the instances satisfy the following smoothness property:
• For any mapping π^{v,e} and any number i ∈ [N], we have |(π^{v,e})^{−1}(i)| ≤ d = 4^u; i.e., there are
at most d = 4^u elements in [M] that are mapped to the same number in [N].
Figure 5.2: The reduction from smooth k-Label Cover.
• Pick a hyperedge e = (v_1, v_2, . . . , v_k) ∈ E with corresponding projections π^{v_1,e}, . . . , π^{v_k,e} :
[M] → [N].
1. For each v ∉ e, set y_v = 0.
2. For each i ∈ [k], set y_{v_i} ∈ {0, 1}^M as follows:
    y_{v_i}^{(j)} = x_i^{(π^{v_i,e}(j))} with probability 1 − 1/k², and y_{v_i}^{(j)} = a random bit with probability 1/k².
The starting point is a smooth k-Label Cover instance L(G(V, E), M, N, {π^{v,e} | e ∈ E, v ∈ e}) with M =
7^{(J+1)u} and N = 2^u 7^{Ju} as described in Theorem 5.1. Figure 5.2 illustrates the reduction, which,
given an instance L of k-Label Cover, produces a random labeled example. We refer to the obtained
distribution on examples as E.
We claim that our reduction has the following completeness and soundness properties.
Theorem 5.2. • Completeness: If L is a strongly satisfiable instance of smooth k-Label
Cover, then there is a disjunction that agrees with a random example from E with probability
at least 1 − O(1/√k).
• Soundness: If L is not 2k²2^{−γk}-weakly satisfiable and is smooth with parameters J = 4^{17k}
and d = 4^k, then there is no halfspace that agrees with a random example from E with
probability more than 1/2 + O(1/√k).
Combining the above theorem with Theorem 5.1 and choosing k = O(1/ε²), we obtain our
main result, Theorem 1.1.
It remains to check the correctness of the completeness and soundness claims in Theorem 5.2.
First let us prove the completeness property.
Proof. (Proof of Completeness) Let Λ be a labeling that strongly satisfies L. Consider the disjunction
h(y) = ∨_{v∈V} y_v^{(Λ(v))}. Let e = (v_1, v_2, . . . , v_k) be any hyperedge and let E_e be the distribution E
restricted to the examples generated for e. With probability at least 1 − 1/k, y_{v_i}^{(Λ(v_i))} = x_i^{(π^{v_i,e}(Λ(v_i)))}
for every i ∈ [k]. As e is strongly satisfied by Λ, for all i, j ∈ [k], π^{v_i,e}(Λ(v_i)) = π^{v_j,e}(Λ(v_j)). Therefore,
as in the proof of Theorem 4.5, we obtain that h(y) agrees with a random example from E_e with
probability at least 1 − O(1/√k). The labeling Λ strongly satisfies all edges and therefore we obtain
that h(y) agrees with a random example from E with probability at least 1 − O(1/√k).
The more complicated part is the soundness property which we prove in Section 5.4.
Proof Idea. The main idea is similar to the proof of Theorem 4.6, although it is more technically
involved. Notice that the reduction in Figure 5.2 produces examples such that y_{v_i}^{(j_1)}, y_{v_i}^{(j_2)} are “almost
identical” copies when π^{v_i,e}(j_1) = π^{v_i,e}(j_2). Further, for different edges e the coordinates of y will
be grouped in different ways, such that each group consists of almost identical copies.
To handle these additional complications, the first step of the proof is to show that almost all
the hyperedges in smooth k-Label Cover satisfy a certain “niceness” property. After that we
generalize the proofs of Lemma 4.7 and Lemma 4.8 under the weaker assumption that most of the
hyperedges are “nice”.
The formal definition of “niceness” and the proof that most of the edges are “nice” appear in
Section 5.4.1. The generalization of Lemma 4.7 appears in Section 5.4.2. The generalization of
Lemma 4.8 appears in Section 5.4.3. All these results are put together into a proof of Theorem 5.2
in Section 5.4.4.
Let h(y) be a halfspace that agrees with more than a 1/2 + 1/√k fraction of the examples. Suppose
    h(y) = pos(Σ_{v∈V} ⟨w_v, y_v⟩ − θ).
Let τ = 1/k^{13} and let
    s_v = Truncate(w_v, C_τ(w_v)),   l_v = w_v − s_v.
Definition 5.3. A vertex v ∈ V is said to be β-nice with respect to a hyperedge e ∈ E containing
it if
    Σ_{i∈[N]} (Σ_{j∈π^{−1}(i)} |l_v^{(j)}|)⁴ ≤ β‖l_v‖_2⁴,
where π : [M] → [N] is the projection associated with vertex v and hyperedge e. A hyperedge
e = (v_1, v_2, . . . , v_k) is β-nice if for every i ∈ [k] the vertex v_i is β-nice with respect to e.
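The condition in Definition 5.3 is easy to check directly; the following Python sketch (our own helper, with hypothetical names) tests whether a vector l_v is β-nice with respect to a projection π given as an array mapping [M] to [N]:

import numpy as np
from collections import defaultdict

def is_beta_nice(l_v, pi, beta):
    # Definition 5.3: sum_i ( sum_{j in pi^{-1}(i)} |l_v[j]| )^4 <= beta * ||l_v||_2^4.
    block_sums = defaultdict(float)
    for j, val in enumerate(l_v):
        block_sums[pi[j]] += abs(val)
    lhs = sum(s ** 4 for s in block_sums.values())
    return lhs <= beta * np.linalg.norm(l_v) ** 4

# Example: a projection mapping pairs of coordinates together.
l_v = np.array([0.1, -0.1, 0.1, 0.1, -0.1, 0.1])
pi = [0, 0, 1, 1, 2, 2]
print(is_beta_nice(l_v, pi, beta=2.0))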
Lemma 5.4. All but an O(1/k) fraction of the hyperedges of L are 2τ-nice.
Proof. By definition, l_v is a τ-regular vector. Denote I_v = {i : (l_v^{(i)})²/‖l_v‖_2² ≥ 1/d⁸}. By
definition, |I_v| ≤ d⁸. Notice there are at most d^{16} pairs of values in I_v × I_v. By the smoothness
property of the k-Label Cover instance, for any vertex v, at least a 1 − d^{16}/J fraction of the hyperedges
incident on v have the following property: for any i, j ∈ I_v, π^{v,e}(i) ≠ π^{v,e}(j). If all the vertices in
a hyperedge have this property we call it a good hyperedge. By an averaging argument, we know
that among all hyperedges at least a 1 − kd^{16}/J = 1 − k/4^k ≥ 1 − O(1/k) fraction is good.
We will show that all these good hyperedges are also 2τ-nice. For a given good hyperedge e, a vertex
v ∈ e, π = π^{v,e} and i ∈ [N], there is at most one j ∈ π^{−1}(i) such that (l_v^{(j)})²/‖l_v‖_2² ≥ 1/d⁸.
Based on the above property, we will show that
    Σ_{i∈[N]} (Σ_{j∈π^{−1}(i)} |l_v^{(j)}|)⁴ ≤ 2τ ‖l_v‖_2⁴.
Notice that
    Σ_{i∈[N]} (Σ_{j∈π^{−1}(i)} |l_v^{(j)}|)⁴ = Σ_{i∈[N]} Σ_{j_1,j_2,j_3,j_4 ∈ π^{−1}(i)} |l_v^{(j_1)} l_v^{(j_2)} l_v^{(j_3)} l_v^{(j_4)}|.    (4)
In every term of (4) in which the four indices are not all equal, at least one of the four coordinates satisfies |l_v^{(j)}| ≤ ‖l_v‖_2/d⁴ (since at most one coordinate per block π^{−1}(i) can be larger), so such a term is at most
    (‖l_v‖_2/d⁴) (|l_v^{(j_1)}|³ + |l_v^{(j_2)}|³ + |l_v^{(j_3)}|³ + |l_v^{(j_4)}|³).
Therefore (4) is at most
    ‖l_v‖_4⁴ + (‖l_v‖_2/d⁴) Σ_{i∈[N]} Σ_{j_1,j_2,j_3,j_4 ∈ π^{−1}(i)} (|l_v^{(j_1)}|³ + |l_v^{(j_2)}|³ + |l_v^{(j_3)}|³ + |l_v^{(j_4)}|³)
    ≤ τ²‖l_v‖_2⁴ + 4d³ (‖l_v‖_2/d⁴) Σ_{j∈[M]} |l_v^{(j)}|³    (since |π^{−1}(i)| ≤ d, each l_v^{(j)} appears at most 4d³ times)
    ≤ (τ² + 4τ/d)‖l_v‖_2⁴    (l_v is a τ-regular vector, so |l_v^{(j)}| ≤ τ‖l_v‖_2 for all j ∈ [M])
    ≤ 2τ‖l_v‖_2⁴.
Let us fix a 2τ -nice hyperedge e = (v1 , . . . , vk ). As before let Ee denote the distribution on
examples restricted to those generated for hyperedge e. We will analyze the probability that the
halfspace h(y) agrees with a random example from Ee .
Let π^{v_1,e}, π^{v_2,e}, . . . , π^{v_k,e} : [M] → [N] denote the projections associated with the hyperedge e.
For the sake of brevity, we shall write w_i, y_i, l_i instead of w_{v_i}, y_{v_i}, l_{v_i}. For all j ∈ [N] and i ∈ [k],
define
    y_i^{{j}} = Truncate(y_i, (π^{v_i,e})^{−1}(j)).
Similarly, define the vectors w_i^{{j}}, l_i^{{j}} and s_i^{{j}}.
Notice that for every example (y, b) in the support of E_e, y_v = 0 for every vertex v ∉ e.
Therefore, on restricting to examples from E_e we can write:
    h(y) = pos(Σ_{i∈[k]} ⟨w_i, y_i⟩ − θ).
Lemma 5.5. Let h(y) be a halfspace such that for all i ≠ j ∈ [k] we have π^{v_i,e}(C_τ(w_i)) ∩
π^{v_j,e}(C_τ(w_j)) = ∅. Then
    |E_{E_e}[h(y) | b = 0] − E_{E_e}[h(y) | b = 1]| ≤ O(1/k).    (5)
Similarly define the ensemble B = B^{{1}}, . . . , B^{{N}} for the conditioning b = 1. Now we shall apply
the invariance principle (Theorem 3.10) to the ensembles A, B and the linear function
    l(y) = Σ_{j∈[N]} ⟨l^{{j}}, y^{{j}}⟩.
As we prove in Claim 5.6 below, the ensembles A, B have matching moments up to degree 3.
Furthermore, by Lemma 3.3, the linear function l and the ensembles A, B satisfy the following
spread property for all θ' ∈ R:
    Pr_A[l(A) ∈ [θ' − α, θ' + α]] ≤ c(α),   Pr_B[l(B) ∈ [θ' − α, θ' + α]] ≤ c(α),
where c(α) = 8αk + 4τk + 2e^{−1/(2k⁴τ²)} (by setting γ = 1/k² and |b − a| = 2α in Lemma 3.3).
Using the invariance principle (Theorem 3.10), this implies:
    |E_A[pos(⟨s, y_C⟩ + Σ_{j∈[N]} ⟨l^{{j}}, A^{{j}}⟩ − θ) | y_C] − E_B[pos(⟨s, y_C⟩ + Σ_{j∈[N]} ⟨l^{{j}}, B^{{j}}⟩ − θ) | y_C]|
    ≤ O(1/α⁴) Σ_{j∈[N]} ‖l^{{j}}‖_1⁴ + 2c(α).    (6)
Take α to be 1/k² and recall that τ = 1/k^{13}. In Claim 5.7 below we show that
    Σ_{j∈[N]} ‖l^{{j}}‖_1⁴ ≤ 2τ k⁴.
The above inequality holds for an arbitrary conditioning of the values of y_C. Hence, by averaging
over all settings of y_C we prove (5).
Proof. Since ‖l^{{j}}‖_1 = Σ_{i∈[k]} ‖l_i^{{j}}‖_1, we can write
    Σ_{j∈[N]} ‖l^{{j}}‖_1⁴ ≤ k⁴ Σ_{j∈[N]} Σ_{i∈[k]} ‖l_i^{{j}}‖_1⁴ = k⁴ Σ_{i∈[k]} Σ_{j∈[N]} ‖l_i^{{j}}‖_1⁴.    (7)
As e = (v_1, . . . , v_k) is a 2τ-nice hyperedge, we have Σ_{j∈[N]} ‖l_i^{{j}}‖_1⁴ ≤ 2τ‖l_i‖_2⁴. By the normalization of
l, we know Σ_{i∈[k]} ‖l_i‖_2² = 1. Substituting this into inequality (7) we get the claimed bound.
Lemma 5.8. Given a halfspace h(y) and r ∈ [k] such that |C_τ(w_r)| > t, define the halfspace h̃(y) by modifying the weights of h(y)
as follows:
• w̃_r = Truncate(w_r, H_t(w_r)) and w̃_i = w_i for all i ≠ r.
Then,
    |E_{E_e}[h̃(y) | b = 0] − E_{E_e}[h(y) | b = 0]| ≤ 1/k²   and   |E_{E_e}[h̃(y) | b = 1] − E_{E_e}[h(y) | b = 1]| ≤ 1/k².
Proof. It is easy to see that the matching moments condition implies that
Let us show the inequality for the case b = 0; the other inequality can be derived in an identical
way. Let E_{e,0} denote the distribution E_e conditioned on b = 0. Without loss of generality, we may
assume that r = 1 and |w_1^{(1)}| ≥ |w_1^{(2)}| ≥ . . . ≥ |w_1^{(M)}|. In particular, this implies H_t(w_1) = {1, . . . , t}.
Define
Let us set T = ⌈4k² ln(2k)⌉ and define the subset G = {g_1, . . . , g_T} of H_t(w_1) as follows:
Similarly, define the vectors y_1^G, y_1^H, y_1^{>t}. By definition, we have a_1 = w_1^{>t}. Rewriting the halfspace
functions h(y), h̃(y):
    h(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ + ⟨a_1, y_1^{>t}⟩ − θ),
    h̃(y) = pos(Σ_{i=2}^k ⟨w_i, y_i⟩ + ⟨w_1^G, y_1^G⟩ + ⟨w_1^H, y_1^H⟩ + μ_1 − θ).
Claim 5.9.
    Pr_{E_{e,0}}[|⟨a_1, y_1⟩ − μ_1| ≥ d⁴ ‖a_1‖_2] ≤ 1/d.
Proof. Write [M] as the union of disjoint sets R_1 ∪ R_2 ∪ · · · ∪ R_N where R_i = (π^{v_1,e})^{−1}(i). Notice
every R_i has size at most d; therefore
    Var_{E_{e,0}}[⟨a_1, y_1⟩] = Σ_{i∈[N]} Var_{E_{e,0}}[⟨a_1^{R_i}, y_1^{R_i}⟩] ≤ Σ_{i∈[N]} d‖a_1^{R_i}‖_2² = d‖a_1‖_2².
The claim now follows from Chebyshev's inequality (Theorem A.3).
Lemma 5.11. Let e be a 2τ-nice hyperedge and let h(y) = pos(Σ_{i∈[k]} ⟨w_i, y_i⟩ − θ) be a halfspace such that
π^{v_i,e}(H_t(w_i)) ∩ π^{v_j,e}(H_t(w_j)) = ∅ for all i ≠ j ∈ [k]. Then h(y) agrees with a random example from E_e with probability at most 1/2 + O(1/k).
Proof. The proof is similar to the proof of Theorem 4.6. Define I = {r | |C_τ(w_r)| > t}. We divide
the argument into the following two cases.
1. I = ∅; i.e., |C_τ(w_i)| ≤ t for all i ∈ [k]. Then for any i ≠ j ∈ [k], π^{v_i,e}(H_t(w_i)) ∩ π^{v_j,e}(H_t(w_j)) = ∅ implies
π^{v_i,e}(C_τ(w_i)) ∩ π^{v_j,e}(C_τ(w_j)) = ∅. By Lemma 5.5, we have
    |E_{E_e}[h(y) | b = 0] − E_{E_e}[h(y) | b = 1]| ≤ O(1/k).
2. I ≠ ∅. Then for each r ∈ I, we set w̃_r = Truncate(w_r, H_t(w_r)) and define a new halfspace h' by
replacing w_r with w̃_r in h. Since such replacements occur at most k times and, by Lemma
5.8, every replacement changes the output of the halfspace on at most a 1/k² fraction of examples
from E_e, we can bound the overall change by k × 1/k² = 1/k. That is,
    |E_{E_{e,0}}[h'(y)] − E_{E_{e,0}}[h(y)]| ≤ 1/k,   |E_{E_{e,1}}[h'(y)] − E_{E_{e,1}}[h(y)]| ≤ 1/k.    (8)
For the halfspace h' and for all r ∈ [k], we have |C_τ(w̃_r)| ≤ t, thus reducing to Case 1.
Therefore,
    |E_{E_{e,0}}[h'(y)] − E_{E_{e,1}}[h'(y)]| ≤ O(1/k).    (9)
Combining (8) and (9), we get
    |E_{E_{e,0}}[h(y)] − E_{E_{e,1}}[h(y)]| ≤ O(1/k).
In other words, the probability that the halfspace h(y) agrees with a random example from E_e is at
most 1/2 + O(1/k).
Proof. (Proof of Soundness) The proof is by contradiction. We define the following labeling strategy: for each vertex
v, uniformly at random pick a label from H_t(w_v). We know that the size of H_t(w_{v_i}) is t = O(k^{29}).
Suppose there exists a halfspace that agrees with a random example from E with probability
more than 1/2 + 1/√k. Then, by an averaging argument, for at least a 1/(2√k)-fraction of the hyperedges e,
h(y) agrees with a random example from E_e with probability at least 1/2 + 1/(2√k). We refer to these
edges as good.
Since at most an O(1/k)-fraction of the hyperedges are not 2τ-nice, we know that
at least a 1/(4√k)-fraction of the hyperedges are 2τ-nice and good. By Lemma 5.11, for each 2τ-nice
and good hyperedge e there exist two vertices v_i, v_j ∈ e such that π^{v_i,e}(H_t(w_{v_i})) and π^{v_j,e}(H_t(w_{v_j}))
intersect. Then there is at least a 1/t² probability that the labeling strategy we defined weakly satisfies
hyperedge e.
Overall, this strategy is expected to weakly satisfy at least a (1/(4√k)) · (1/t²) = Ω(1/k^{59}) fraction of the
hyperedges. This is a contradiction since L is not 2k²2^{−γk}-weakly satisfiable.
References
[1] E. Amaldi and V. Kann. On the approximability of minimizing nonzero variables or unsatisfied
relations in linear systems. Theoretical Computer Science, 109:237–260, 1998. 3
[2] D. Angluin and P. Laird. Learning from noisy examples. Machine Learning, 2:343–370, 1988.
3
[3] S. Arora, L. Babai, J. Stern, and Z. Sweedyk. The hardness of approximate optima in lattices,
codes, and systems of linear equations. J. Comput. Syst. Sci., 54(2):317–331, 1997. 3
[4] P. Auer and M. K. Warmuth. Tracking the best disjunction. Machine Learning, 32(2):127–150,
1998. 4
[6] A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning
noisy linear threshold functions. Algorithmica, 22(1-2):35–52, 1998. 3
[7] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Occam’s razor. Inf. Process.
Lett., 24(6):377–380, 1987. 2
[8] N. Bshouty and L. Burroughs. Bounds for the minimum disagreement problem with applica-
tions to learning theory. In Proceedings of COLT, pages 271–286, 2002. 3
[9] N. Bshouty and L. Burroughs. Maximizing agreements and coagnostic learning. Theoretical
Computer Science, 350(1):24–39, 2006. 3
[10] T. Bylander. Learning linear threshold functions in the presence of classification noise. In
Proceedings of COLT, pages 340–347, 1994. 3
[12] E. Cohen. Learning noisy perceptrons by a perceptron in polynomial time. In IEEE FOCS,
pages 514–523, 1997. 3
[14] V. Feldman. Optimal hardness results for maximizing agreements with monomials. In IEEE
CCC, pages 226–236, 2006. 3
[16] S. Galant. Perceptron based learning algorithms. IEEE Trans. on Neural Networks, 1(2),
1990. 4
[18] C. Gentile and M. K. Warmuth. Linear hinge loss and average margin. In Proceedings of NIPS,
pages 225–231, 1998. 4
[19] P. Gopalan, S. Khot, and R. Saket. Hardness of reconstructing multivariate polynomials over
finite fields. SIAM J. Comput., 39(6):2598–2621, 2010. 36
[20] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. SIAM J.
Comput., 39(2):742–765, 2009. 3
[21] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other
learning applications. Information and Computation, 100(1):78–150, 1992. 2
[22] K. Hoffgen, K. van Horn, and H. U. Simon. Robust trainability of single neurons. J. Comput.
Syst. Sci., 50(1):114–125, 1995. 3
[23] D. S. Johnson and F. P. Preparata. The densest hemisphere problem. Theoretical Computer
Science, 6:93–107, 1978. 3
[24] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM
Journal on Computing, 37(6):1777–1805, 2008. 3, 4
[25] M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM,
45(6):983–1006, 1998. 3
[26] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning,
17:115–141, 1994. 2, 3
[27] M. J. Kearns and M. Li. Learning in the presence of malicious errors. SIAM J. Comput.,
22(4):807–837, 1993. 3
[29] S. Khot. On the power of unique 2-Prover 1-Round games. In ACM STOC, pages 767–775,
May 19–21 2002. 4, 5
[30] S. Khot. New techniques for probabilistically checkable proofs and inapproximability results
(thesis). Princeton University Technical Reports, TR-673-03, 2003. 8, 36
[31] S. Khot, G. Kindler, E. Mossel, and R. O’Donnell. Optimal inapproximability results for
MAX-CUT and other 2-variable CSPs? SIAM J. Comput, 37(1):319–357, 2007. 8
[32] S. Khot and R. Saket. Hardness of minimizing and learning DNF expressions. In IEEE FOCS,
pages 231–240, 2008. 2, 3
[34] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold
algorithm. Machine Learning, 2:285–318, 1987. 2
[36] E. Mossel. Gaussian bounds for noise correlation of functions. IEEE FOCS, 2008. 11
[37] E. Mossel, R. O’Donnell, and K. Oleszkiewicz. Noise stability of functions with low influences:
Invariance and optimality. In IEEE FOCS, 2005. 11, 35
[38] R. O’Donnell and R. A. Servedio. The chow parameters problem. In ACM STOC, pages
517–526, 2008. 6, 9
[39] R. Rivest. Learning decision lists. Machine Learning, 2(3):229–246, 1987. 2
[40] F. Rosenblatt. The perceptron: a probabilistic model for information storage and organization
in the brain. Psychological Review, 65:386–407, 1958. 2
[41] R. A. Servedio. Every linear threshold function has a low-weight approximator. Comput.
Complex., 16(2):180–209, 2007. 6, 9
[42] L. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
2
Appendix
A Probabilistic Inequalities
In the discussion below we will make use of the following well-known inequalities.
Theorem A.3. (Chebyshev's Inequality) Let X be a random variable with expected value μ and
variance σ². Then for any real number t > 0, Pr[|X − μ| ≥ tσ] ≤ 1/t².
B Proof of Lemma 3.3
Let us define a random vector z ∈ {0, 1}^n based on y: if y^{(i)} is generated as a
copy of x^{(i)} in (10), then z^{(i)} = 0; if y^{(i)} is generated as a random bit in (10), then z^{(i)} = 1. Let us
write S = Σ_{i=1}^n w^{(i)} y^{(i)}. Our proof is based on two claims.
Claim B.1. For a τ-regular vector w, Pr[Σ_{i=1}^n (w^{(i)})² z^{(i)} ≥ γ/2] ≥ 1 − 2e^{−γ²/(2τ²)}.
Claim B.2. For a τ-regular vector w, given any a' < b' ∈ R and any fixing of z^{(1)}, z^{(2)}, . . . , z^{(n)},
if Σ_{i=1}^n (w^{(i)})² z^{(i)} = σ² > 0, then Pr[S ∈ [a', b']] ≤ 2|b' − a'|/σ + 2τ/σ.
Given the above two claims, define the event $V$ to be $\big\{\sum_{i=1}^n (w^{(i)})^2 z^{(i)} > \gamma/2\big\}$ and use $\mathbf{1}_{[a,b]}(x) : \mathbb{R} \to \{0,1\}$ to denote the indicator function of whether $x$ falls into the interval $[a,b]$. Then
\[ \Pr[S \in [a,b]] = \mathbb{E}\big[\mathbf{1}_{[a,b]}(S)\big] = \Pr[V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid V\big] + \Pr[\neg V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid \neg V\big]. \]
By Claim B.1,
\[ \Pr[\neg V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid \neg V\big] \;\le\; \Pr[\neg V] \;\le\; 2e^{-\gamma^2/(2\tau^2)}. \]
By Claim B.2 (conditioned on $V$ we have $\sigma^2 > \gamma/2$, hence $\sigma > \sqrt{\gamma/2}$ and $2/\sigma < 2\sqrt{2}/\sqrt{\gamma} \le 4/\sqrt{\gamma}$),
\[ \Pr[V]\,\mathbb{E}\big[\mathbf{1}_{[a,b]}(S) \mid V\big] \;\le\; \frac{4(b-a)}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}}. \]
Overall,
\[ \Pr[S \in [a,b]] \;\le\; \frac{4(b-a)}{\sqrt{\gamma}} + \frac{4\tau}{\sqrt{\gamma}} + 2e^{-\gamma^2/(2\tau^2)}. \]
It remains to prove the two claims. Therefore, with probability at least $1 - 2e^{-\gamma^2/(2\tau^2)}$, $\sum_{i=1}^n (w^{(i)})^2 z^{(i)} > \gamma/2$; this establishes Claim B.1.
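The combined bound above lends itself to a quick numerical sanity check. The following sketch uses a simplified stand-in for the sampling rule (10) (each coordinate of $y$ is an independent fresh uniform bit with probability $p$ and a fixed bit otherwise), the uniform unit vector as $w$, and $\gamma$ taken to be $\mathbb{E}[\sum_i (w^{(i)})^2 z^{(i)}]$; all parameter choices are illustrative and not taken from the text.

# Monte Carlo sanity check of
#   Pr[S in [a,b]] <= 4(b-a)/sqrt(gamma) + 4*tau/sqrt(gamma) + 2*exp(-gamma^2/(2*tau^2))
# under a toy version of the sampling rule (10): each coordinate is a fresh uniform
# bit with probability p (z_i = 1) and a fixed bit x_i otherwise.
import numpy as np

rng = np.random.default_rng(1)

n, p = 4000, 0.5
w = np.full(n, 1.0 / np.sqrt(n))            # tau-regular with tau = 1/sqrt(n), ||w||_2 = 1
tau = 1.0 / np.sqrt(n)
gamma = p                                    # taken as E[ sum_i w_i^2 z_i ] = p * ||w||_2^2
x_fixed = rng.integers(0, 2, size=n, dtype=np.int8)   # the bits that are merely copied
a, b = 31.5, 31.52                           # an arbitrary short interval near the bulk of S

hits, trials, chunk = 0, 50_000, 2_000
for _ in range(trials // chunk):
    z = rng.random((chunk, n)) < p                              # z_i = 1: fresh random bit
    fresh = rng.integers(0, 2, size=(chunk, n), dtype=np.int8)
    y = np.where(z, fresh, x_fixed)
    S = y.astype(np.float64) @ w
    hits += int(np.sum((S >= a) & (S <= b)))

empirical = hits / trials
bound = (4 * (b - a) / np.sqrt(gamma) + 4 * tau / np.sqrt(gamma)
         + 2 * np.exp(-gamma ** 2 / (2 * tau ** 2)))
print(f"empirical Pr[S in [a,b]] ~ {empirical:.4f}   claimed bound ~ {bound:.4f}")

On such a run the empirical probability sits well below the claimed bound; shrinking the interval $[a,b]$ or increasing $n$ (hence decreasing $\tau$) tightens both sides in the way the bound predicts.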
To prove Claim B.2, we need to use the Berry–Esseen theorem (see Theorem A.2). Let us split $S$ into two parts: $S' = \sum_{i : z^{(i)}=1} w^{(i)} y^{(i)}$ and $S'' = \sum_{i : z^{(i)}=0} w^{(i)} y^{(i)}$. Since $S = S' + S''$ and $S'$ is independent of $S''$, it suffices to show that $\Pr[S' \in [a', b']] \le \frac{2|b'-a'|}{\sigma} + \frac{2\tau}{\sigma}$ for any $a', b' \in \mathbb{R}$. Define $y'^{(i)} = 2y^{(i)} - 1$ and note that $y'^{(i)}$ is a $\{-1,1\}$ variable. Rewriting $S'$ using this definition, we have
\[ S' = \sum_{z^{(i)}=1} w^{(i)} y^{(i)} = \sum_{z^{(i)}=1} w^{(i)} \cdot \frac{1 + y'^{(i)}}{2}. \]
Then
\[ \Pr\big[S' \in [a', b']\big] = \Pr\Big[\sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \in [a'', b'']\Big], \qquad (11) \]
where $a'' = 2a' - \sum_{z^{(i)}=1} w^{(i)}$ and $b'' = 2b' - \sum_{z^{(i)}=1} w^{(i)}$. We can further rewrite the above term as
\[ \Pr\Big[\sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \le b''\Big] - \Pr\Big[\sum_{z^{(i)}=1} w^{(i)} y'^{(i)} \le a''\Big]
= \Pr\Bigg[\frac{\sum_{z^{(i)}=1} w^{(i)} y'^{(i)}}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \le \frac{b''}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}}\Bigg] - \Pr\Bigg[\frac{\sum_{z^{(i)}=1} w^{(i)} y'^{(i)}}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \le \frac{a''}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}}\Bigg]. \]
We can now apply the Berry–Esseen theorem. Notice that for all $i$ such that $z^{(i)} = 1$, $y'^{(i)}$ is distributed as an independent unbiased random $\{-1,1\}$ variable. Also
\[ \max_{z^{(i)}=1} \frac{|w^{(i)}|}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \;\le\; \frac{\tau}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}}. \]
Using the fact that a unit Gaussian variable falls in any interval of length $\lambda$ with probability at most $\lambda$, and noticing that $b'' - a'' = 2(b' - a')$, we can bound the above quantity by
\[ \frac{2|b' - a'|}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} + \frac{2\tau}{\sqrt{\sum_{z^{(i)}=1} (w^{(i)})^2}} \;=\; \frac{2|b' - a'|}{\sigma} + \frac{2\tau}{\sigma}. \]
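For concreteness, the application of Theorem A.2 in the last step can be written out as follows (a sketch, taking the Berry–Esseen constant to be 1, which is how the two error terms are accounted for above): writing $T = \sum_{z^{(i)}=1} w^{(i)} y'^{(i)}$ and $\sigma^2 = \sum_{z^{(i)}=1} (w^{(i)})^2$, each of the two cumulative probabilities is within $\tau/\sigma$ of the corresponding standard Gaussian probability, so
\[ \Pr\big[T \in [a'', b'']\big] \;\le\; \Phi\!\Big(\frac{b''}{\sigma}\Big) - \Phi\!\Big(\frac{a''}{\sigma}\Big) + \frac{2\tau}{\sigma} \;\le\; \frac{b'' - a''}{\sigma} + \frac{2\tau}{\sigma} \;=\; \frac{2(b' - a')}{\sigma} + \frac{2\tau}{\sigma}, \]
where $\Phi$ denotes the standard Gaussian CDF; the middle inequality uses the stated fact that a unit Gaussian puts mass at most $\lambda$ on any interval of length $\lambda$.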
Theorem 3.10 restated (Invariance Principle). Let $\mathcal{A} = \{A^{\{1\}}, \ldots, A^{\{R\}}\}$, $\mathcal{B} = \{B^{\{1\}}, \ldots, B^{\{R\}}\}$ be families of ensembles of random variables with $A^{\{i\}} = \{a^{(i)}_1, \ldots, a^{(i)}_{k_i}\}$ and $B^{\{i\}} = \{b^{(i)}_1, \ldots, b^{(i)}_{k_i}\}$, satisfying the following properties:

• For each $i \in [R]$, the random variables in the ensembles $(A^{\{i\}}, B^{\{i\}})$ have matching moments up to degree 3. Further, all the random variables in $\mathcal{A}$ and $\mathcal{B}$ are bounded by 1.

• The ensembles $A^{\{i\}}$ are all independent of each other; similarly, the ensembles $B^{\{i\}}$ are independent of each other.

Given a set of vectors $l = \{l^{\{1\}}, \ldots, l^{\{R\}}\}$ ($l^{\{i\}} \in \mathbb{R}^{k_i}$), define the linear function $l : \mathbb{R}^{k_1} \times \cdots \times \mathbb{R}^{k_R} \to \mathbb{R}$ as
\[ l(x) = \sum_{i \in [R]} \langle l^{\{i\}}, x^{\{i\}} \rangle \]
for all $\theta > 0$. Further, define the spread function $c(\alpha)$ corresponding to the ensembles $\mathcal{A}, \mathcal{B}$ and the linear function $l$ as follows,
Proof. Let us prove equation (12) first. Let $X_i = \{B^{\{1\}}, \ldots, B^{\{i-1\}}, B^{\{i\}}, A^{\{i+1\}}, \ldots, A^{\{R\}}\}$, so that $X_0 = \mathcal{A}$ and $X_R = \mathcal{B}$. We know that
\[ \mathbb{E}\big[\Psi(l(\mathcal{A}) - \theta)\big] - \mathbb{E}\big[\Psi(l(\mathcal{B}) - \theta)\big] = \sum_{i=1}^{R} \Big( \mathbb{E}\big[\Psi(l(X_{i-1}) - \theta)\big] - \mathbb{E}\big[\Psi(l(X_i) - \theta)\big] \Big). \]
Let $Y_i = \{B^{\{1\}}, \ldots, B^{\{i-1\}}, A^{\{i+1\}}, \ldots, A^{\{R\}}\}$, so that $X_i = \{Y_i, B^{\{i\}}\}$ and $X_{i-1} = \{Y_i, A^{\{i\}}\}$. Then
\[ \mathbb{E}_{X_{i-1}}\big[\Psi(l(X_{i-1}) - \theta)\big] - \mathbb{E}_{X_i}\big[\Psi(l(X_i) - \theta)\big] = \mathbb{E}_{Y_i}\Big[ \mathbb{E}_{A^{\{i\}}}\big[\Psi(l(X_{i-1}) - \theta)\big] - \mathbb{E}_{B^{\{i\}}}\big[\Psi(l(X_i) - \theta)\big] \Big]. \qquad (15) \]
Notice that
\[ l(X_{i-1}) - \theta = \langle l^{\{i\}}, A^{\{i\}} \rangle + \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta \]
and
\[ l(X_i) - \theta = \langle l^{\{i\}}, B^{\{i\}} \rangle + \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta. \]
Take $\theta' = \sum_{1 \le j \le i-1} \langle l^{\{j\}}, B^{\{j\}} \rangle + \sum_{i+1 \le j \le R} \langle l^{\{j\}}, A^{\{j\}} \rangle - \theta$. We can further rewrite equation (15) as
\[ \mathbb{E}_{Y_i}\Big[ \mathbb{E}_{A^{\{i\}}}\big[\Psi(\langle l^{\{i\}}, A^{\{i\}} \rangle + \theta')\big] - \mathbb{E}_{B^{\{i\}}}\big[\Psi(\langle l^{\{i\}}, B^{\{i\}} \rangle + \theta')\big] \Big]. \qquad (16) \]
Using the Taylor expansion of $\Psi$, the inner expectation of equation (16) is equal to
\[ \mathbb{E}_{A^{\{i\}}}\Big[\Psi(\theta') + \Psi'(\theta')\langle l^{\{i\}}, A^{\{i\}} \rangle + \frac{\Psi''(\theta')}{2}\langle l^{\{i\}}, A^{\{i\}} \rangle^2 + \frac{\Psi'''(\theta')}{6}\langle l^{\{i\}}, A^{\{i\}} \rangle^3 + \frac{\Psi''''(\delta_1)}{24}\langle l^{\{i\}}, A^{\{i\}} \rangle^4\Big] \]
\[ - \;\mathbb{E}_{B^{\{i\}}}\Big[\Psi(\theta') + \Psi'(\theta')\langle l^{\{i\}}, B^{\{i\}} \rangle + \frac{\Psi''(\theta')}{2}\langle l^{\{i\}}, B^{\{i\}} \rangle^2 + \frac{\Psi'''(\theta')}{6}\langle l^{\{i\}}, B^{\{i\}} \rangle^3 + \frac{\Psi''''(\delta_2)}{24}\langle l^{\{i\}}, B^{\{i\}} \rangle^4\Big] \qquad (17) \]
for some $\delta_1, \delta_2 \in \mathbb{R}$.

Using the fact that $A^{\{i\}}$ and $B^{\{i\}}$ have matching moments up to degree 3, we can upper bound equation (17) by
\[ \mathbb{E}_{A^{\{i\}}}\Big[\frac{\Psi''''(\delta_1)}{24}\langle l^{\{i\}}, A^{\{i\}} \rangle^4\Big] - \mathbb{E}_{B^{\{i\}}}\Big[\frac{\Psi''''(\delta_2)}{24}\langle l^{\{i\}}, B^{\{i\}} \rangle^4\Big] \;\le\; \frac{K}{12}\,\|l^{\{i\}}\|_1^4. \]
In the last inequality, we use the fact that $\Psi$ is $K$-bounded and $\langle l^{\{i\}}, A^{\{i\}} \rangle \le \|l^{\{i\}}\|_1$, $\langle l^{\{i\}}, B^{\{i\}} \rangle \le \|l^{\{i\}}\|_1$, since all random variables in $\mathcal{A}, \mathcal{B}$ are bounded by 1.
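The cancellation of the lower-order terms deserves one spelled-out step (added here for readability, under the standard reading that "matching moments up to degree 3" means every monomial of total degree at most 3 in the coordinates has the same expectation under the two ensembles): for $r \in \{1, 2, 3\}$,
\[ \mathbb{E}_{A^{\{i\}}}\big[\langle l^{\{i\}}, A^{\{i\}}\rangle^r\big]
= \sum_{j_1, \ldots, j_r} l^{\{i\}}_{j_1} \cdots l^{\{i\}}_{j_r}\, \mathbb{E}\big[a^{(i)}_{j_1} \cdots a^{(i)}_{j_r}\big]
= \sum_{j_1, \ldots, j_r} l^{\{i\}}_{j_1} \cdots l^{\{i\}}_{j_r}\, \mathbb{E}\big[b^{(i)}_{j_1} \cdots b^{(i)}_{j_r}\big]
= \mathbb{E}_{B^{\{i\}}}\big[\langle l^{\{i\}}, B^{\{i\}}\rangle^r\big], \]
so the terms of equation (17) involving $\Psi'(\theta')$, $\Psi''(\theta')$ and $\Psi'''(\theta')$ are identical in the two expectations and cancel.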
Overall, we bound the inner expectation of equation (16) by $\frac{K}{12}\|l^{\{i\}}\|_1^4$. This implies equation (12).
By the above lemma, we can find a $\frac{C}{\alpha^4}$-bounded function $\Phi_\alpha$ such that $\Phi_\alpha(l(\mathcal{A}) - \theta)$ is equal to $\mathrm{pos}(l(\mathcal{A}) - \theta)$ except when $l(\mathcal{A}) \in [\theta - \alpha, \theta + \alpha]$, and $\Phi_\alpha(l(\mathcal{B}) - \theta)$ is equal to $\mathrm{pos}(l(\mathcal{B}) - \theta)$ except when $l(\mathcal{B}) \in [\theta - \alpha, \theta + \alpha]$. Also, for any $x \in \mathbb{R}$, $|\mathrm{pos}(x) - \Phi_\alpha(x)| \le 1$, as $\mathrm{pos}(x)$ and $\Phi_\alpha(x)$ are both in $[0,1]$.

Overall, we have
\[ \mathbb{E}_{\mathcal{A}}\big[\mathrm{pos}(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{pos}(l(\mathcal{B}) - \theta)\big] \;\le\; \Big( \mathbb{E}_{\mathcal{A}}\big[\mathrm{pos}(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{A}}\big[\Phi_\alpha(l(\mathcal{A}) - \theta)\big] \Big) \]
\[ + \Big( \mathbb{E}_{\mathcal{A}}\big[\Phi_\alpha(l(\mathcal{A}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\Phi_\alpha(l(\mathcal{B}) - \theta)\big] \Big) + \Big( \mathbb{E}_{\mathcal{B}}\big[\Phi_\alpha(l(\mathcal{B}) - \theta)\big] - \mathbb{E}_{\mathcal{B}}\big[\mathrm{pos}(l(\mathcal{B}) - \theta)\big] \Big) \]
\[ \le\; \frac{C}{\alpha^4} \sum_{i \in [R]} \|l^{\{i\}}\|_1^4 + 2c(\alpha). \]
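Since the argument for equation (12) is entirely elementary, it can be checked numerically on a toy instance. The sketch below uses scalar ensembles, $\Psi = \sin$ (so the fourth derivative is bounded by $K = 1$, reading "$K$-bounded" as a bound on $\Psi''''$ as in the proof), coefficients $l_i = 1/\sqrt{R}$, and two illustrative coordinate distributions that match moments up to degree 3; none of these specific choices are taken from the paper. Both expectations are computed exactly by convolving the coordinate distributions, and their gap is compared with the bound $\frac{K}{12}\sum_i \|l^{\{i\}}\|_1^4$ that the telescoping argument yields.

# Exact check of the invariance bound on a toy pair of moment-matched ensembles.
#   A-coordinates: uniform on {-1/2, +1/2}           moments: 0, 1/4, 0, 1/16
#   B-coordinates: +-1 w.p. 1/8 each, 0 w.p. 3/4     moments: 0, 1/4, 0, 1/4
# First three moments match, the fourth does not, and both are bounded by 1.
import numpy as np

R, theta = 200, 0.1
scale = 1.0 / (2.0 * np.sqrt(R))        # l_i = 1/sqrt(R); work on the lattice scale * Z

# single-coordinate pmfs of l_i * (coordinate), on the integer lattice {-2,...,2}
pmf_a = np.array([0.0, 0.5, 0.0, 0.5, 0.0])       # +-1      -> +-1/(2 sqrt(R))
pmf_b = np.array([0.125, 0.0, 0.75, 0.0, 0.125])  # {-2,0,2} -> {-1, 0, 1}/sqrt(R)

def expectation_of_psi(pmf_one, R, theta, scale):
    """Exactly compute E[Psi(sum of R iid scaled coordinates - theta)] for Psi = sin."""
    pmf = np.array([1.0])
    for _ in range(R):
        pmf = np.convolve(pmf, pmf_one)             # distribution of the running sum
    support = scale * (np.arange(pmf.size) - (pmf.size - 1) / 2.0)
    return float(np.dot(pmf, np.sin(support - theta)))

diff = abs(expectation_of_psi(pmf_a, R, theta, scale)
           - expectation_of_psi(pmf_b, R, theta, scale))
# K = 1 for Psi = sin, and sum_i ||l^{i}||_1^4 = R * (1/sqrt(R))^4 = 1/R.
bound = (1.0 / 12.0) * R * (1.0 / np.sqrt(R)) ** 4
print(f"|E Psi(l(A)-theta) - E Psi(l(B)-theta)| = {diff:.2e}  <=  {bound:.2e}")

On this instance the exact gap is roughly an order of magnitude below the bound, and it shrinks as $R$ grows, in line with the $\sum_i \|l^{\{i\}}\|_1^4 = 1/R$ scaling.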
D Hardness of Smooth k-Label Cover
First we state the bipartite smooth Label Cover given by Khot [30]. Our reduction is similar to
the one in [19] but in addition requires proving the smoothness property.
Definition D.1. A Label Cover problem $L(G(V, W, E), M, N, \{\pi^{v,w} \mid (w,v) \in E\})$ consists of a bipartite graph $G(V, W, E)$ with bipartition $V$ and $W$. $M, N$ are two positive integers such that $M \ge N$. There is a projection function $\pi^{v,w} : [M] \to [N]$ associated with each edge $(w,v) \in E$, where $v \in V$, $w \in W$. All vertices in $W$ have the same degree (i.e., the instance is $W$-side regular). For any labeling $\Lambda : V \to [M]$ and $\Lambda : W \to [N]$, an edge is said to be satisfied if $\pi^{v,w}(\Lambda(v)) = \Lambda(w)$. We define $\mathrm{Opt}(L)$ to be the maximum fraction of edges satisfied by any labeling.
Theorem D.2. There is an absolute constant $\gamma > 0$ such that for all integer parameters $u$ and $J$, it is NP-hard to distinguish between the following two cases for a Label Cover problem $L(G(V, W, E), M, N, \{\pi^{v,w} \mid (w,v) \in E\})$ with $M = 7^{(J+1)u}$ and $N = 2^u 7^{Ju}$:

• $\mathrm{Opt}(L) = 1$, or

• $\mathrm{Opt}(L) \le 2^{-2\gamma u}$.

Moreover, for each $\pi^{v,w}$ and any $i \in [N]$, we have $|(\pi^{v,w})^{-1}(i)| \le 4^u$.
Proof. Given an instance of bipartite Label Cover $L(G(V, W, E), M, N, \{\pi^{v,w} \mid (w,v) \in E\})$, we can convert it to a smooth $k$-Label Cover instance $L'$ as follows. The vertex set of $L'$ is $V$, and we generate the hyperedge set $E'$ and the projections associated with the hyperedges in the following way:

1. pick a vertex $w \in W$;
2. pick $k$ neighbors $v_1, v_2, \ldots, v_k \in V$ of $w$ independently and uniformly at random;
3. include the hyperedge $e = (v_1, v_2, \ldots, v_k)$ in $E'$, with associated projections $\pi^{v_i, e} := \pi^{v_i, w}$ for each $i \in [k]$; we say that such a hyperedge $e$ is generated by $w$. A schematic rendering of this sampling procedure in code follows.
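The sketch below mirrors the three steps above. The data representation (the dictionaries encoding neighborhoods and projections, the uniform choice of $w$, and sampling a fixed number of hyperedges rather than ranging over all tuples) is a hypothetical encoding chosen for illustration, not an interface defined in the paper.

# A schematic rendering of the hyperedge-sampling procedure above.
import random

def make_smooth_k_label_cover(neighbors, projections, k, num_hyperedges, seed=0):
    """
    neighbors:   dict  w -> list of neighbors v in V   (a W-side regular instance)
    projections: dict  (v, w) -> function [M] -> [N], the projection pi^{v,w}
    Returns a list of hyperedges; each hyperedge is (vertices, pis) with
    vertices = (v_1, ..., v_k) and pis[i] = pi^{v_i, e} := pi^{v_i, w}.
    """
    rng = random.Random(seed)
    W = list(neighbors.keys())
    hyperedges = []
    for _ in range(num_hyperedges):
        w = rng.choice(W)                                        # step 1: pick w in W
        vs = tuple(rng.choice(neighbors[w]) for _ in range(k))   # step 2: k random neighbors
        pis = tuple(projections[(v, w)] for v in vs)             # step 3: inherit pi^{v, w}
        hyperedges.append((vs, pis))
    return hyperedges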
Completeness: If $\mathrm{Opt}(L) = 1$, then there exists a labeling $\Lambda$ such that for every edge $(w,v) \in E$, $\pi^{v,w}(\Lambda(v)) = \Lambda(w)$. We can simply take the restriction of the labeling $\Lambda$ to $V$ for the smooth $k$-Label Cover instance $L'$. For any hyperedge $e = (v_1, v_2, \ldots, v_k)$ generated by $w \in W$, we know $\pi^{v_i, e}(\Lambda(v_i)) = \Lambda(w) = \pi^{v_j, e}(\Lambda(v_j))$ for any $i, j \in [k]$.
Soundness: If $\mathrm{Opt}(L) \le 2^{-2\gamma u}$, then any labeling weakly satisfies at most a $2k^2 2^{-\gamma u}$-fraction of the hyperedges in $L'$. This can be proved via a contrapositive argument. Suppose there is a labeling $\Lambda$ (defined on $V$) for the smooth $k$-Label Cover instance that weakly satisfies an $\alpha \ge 2k^2 2^{-\gamma u}$ fraction of the hyperedges. Extend the labeling to $W$ as follows: for each vertex $w \in W$ and a neighbor $v \in V$, let $\pi^{v,w}(\Lambda(v))$ be the label recommended by $v$ to $w$. Simply assign to every vertex $w \in W$ the label most recommended by its neighbours.

Since $\Lambda$ weakly satisfies an $\alpha$-fraction of the hyperedges in $L'$, we know that if we pick a vertex $w$ and randomly pick two of its neighbors $v_1, v_2$, then
\[ \Pr\big[\pi^{v_1, w}(\Lambda(v_1)) = \pi^{v_2, w}(\Lambda(v_2))\big] \;\ge\; \frac{\alpha}{\binom{k}{2}} \;\ge\; \frac{2\alpha}{k^2}. \]
By an averaging argument, at least an $\frac{\alpha}{k^2}$-fraction of the vertices $w \in W$ have the following property: among all the possible pairs of $w$'s neighbors, at least an $\frac{\alpha}{k^2}$-fraction of the pairs recommend the same label for $w$. Let us call such a $w$ nice. It is easy to see that for every nice $w$, the most recommended label is actually recommended by at least an $\frac{\alpha}{k^2}$ fraction of its neighbours. Hence, the extended labeling satisfies at least an $\frac{\alpha}{k^2}$ fraction of the edges incident at each nice $w \in W$. Using $W$-side regularity, we conclude that the extended labeling satisfies at least an $\frac{\alpha^2}{k^4} \ge 4 \cdot 2^{-2\gamma u}$-fraction of the edges of $L$ -- a contradiction.
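The final parameter arithmetic can be spelled out: since $\alpha \ge 2k^2 2^{-\gamma u}$,
\[ \frac{\alpha^2}{k^4} \;\ge\; \frac{\big(2k^2 2^{-\gamma u}\big)^2}{k^4} \;=\; 4 \cdot 2^{-2\gamma u} \;>\; 2^{-2\gamma u} \;\ge\; \mathrm{Opt}(L), \]
which is exactly the contradiction with the assumed bound on $\mathrm{Opt}(L)$.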
Smoothness of $L'$: For any given vertex $v$ in $L'$, we want to show that if we randomly pick a hyperedge $e'$ containing $v$, then for the projection $\pi^{v,e'}$ as defined in $L'$,
\[ \forall\, i \ne j \in [M], \quad \Pr\big[\pi^{v,e'}(i) = \pi^{v,e'}(j)\big] \;\le\; \frac{1}{J}. \]
To see this, notice that all vertices in $W$ have the same degree; picking a projection $\pi^{v,e'}$ using the above procedure is the same as randomly picking a neighbor $w$ of $v$ and using the projection $\pi^{v,w}$ defined in $L$. Therefore,
\[ \forall\, i \ne j \in [M], \quad \Pr\big[\pi^{v,e'}(i) = \pi^{v,e'}(j)\big] = \Pr\big[\pi^{v,w}(i) = \pi^{v,w}(j)\big] \;\le\; \frac{1}{J}. \]