Lect 0407
Co-training Example: classifying webpages
[Figure: example webpage screenshots, "My Advisor: Prof. Avrim Blum".]

Actually, many problems have a similar characteristic.
• Examples x can be written in two parts (x1, x2).
• Either part alone is in principle sufficient to produce a good classifier.
• E.g., speech + video, image and context, web pages.

Co-training: Agreement between two parts
• Examples contain two sets of features, i.e. an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are sufficient and consistent, i.e. ∃ c1, c2 such that c1(x1) = c2(x2) = c(x). (A small algorithm sketch follows.)
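The co-training procedure itself is not written out on the slide; below is a minimal sketch in the spirit of [BM98]: each view's classifier labels the unlabeled examples it is most confident about, and those examples are added to the labeled pool so the other view can learn from them. The classifier choice (logistic regression), the per-round count k, and the number of rounds are illustrative assumptions, not from the lecture.

```python
# Minimal co-training sketch: two feature "views" X1_*, X2_* (numpy arrays),
# a small labeled set (y_lab is assumed to contain both classes), and a pool
# of unlabeled examples.  All hyperparameters here are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

def cotrain(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, rounds=10, k=5):
    unlab = np.arange(len(X1_unlab))            # indices still unlabeled
    for _ in range(rounds):
        # Train one classifier per view on the current labeled pool.
        h1 = LogisticRegression().fit(X1_lab, y_lab)
        h2 = LogisticRegression().fit(X2_lab, y_lab)
        # Each view labels its k most confident unlabeled examples; those
        # examples augment the shared pool, so the *other* view learns
        # from them on the next round.
        for h, X in ((h1, X1_unlab), (h2, X2_unlab)):
            if len(unlab) == 0:
                break
            conf = h.predict_proba(X[unlab]).max(axis=1)
            pick = unlab[np.argsort(-conf)[:k]]
            y_new = h.predict(X[pick])
            X1_lab = np.vstack([X1_lab, X1_unlab[pick]])
            X2_lab = np.vstack([X2_lab, X2_unlab[pick]])
            y_lab = np.concatenate([y_lab, y_new])
            unlab = np.setdiff1d(unlab, pick)
    return h1, h2                                # one hypothesis per view
```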
Co-Training Theorems
• [BM98] If x1, x2 are independent given the label, i.e. D = p(D1+ × D2+) + (1-p)(D1- × D2-), and if C is SQ-learnable, then one can learn from an initial "weakly-useful" h1 plus unlabeled data.
• [BB05] In some cases (e.g., LTFs), you can use this to learn from a single labeled example!
  – Repeat the process multiple times.
  – Get 4 kinds of hypotheses: {close to c, close to ¬c, close to 1 (all-positive), close to 0 (all-negative)}.

A really simple learning algorithm
Claim: if the data has a separator of margin γ, there is a reasonable chance that a random hyperplane will have error ≤ ½ - γ/4. [all hyperplanes through the origin]
Proof:
• Pick a (positive) example x. Consider the 2-d plane defined by x and the target w*.
• Pr_h(h·x ≤ 0 | h·w* ≥ 0) ≤ (π/2 - γ)/π = ½ - γ/π.
• So, E_h[err(h) | h·w* ≥ 0] ≤ ½ - γ/π.
• Since err(h) is bounded between 0 and 1, this bound on the expectation implies there must be a reasonable chance that err(h) ≤ ½ - γ/4.
QED
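As a quick sanity check of the claim (my own illustrative simulation, not part of the lecture), one can sample random hyperplanes through the origin, flip each so that h·w* ≥ 0 to match the conditioning in the proof, and measure their error on margin-γ data. The dimension, sample sizes, and γ below are arbitrary choices.

```python
# Monte Carlo check: random hyperplanes through the origin on margin-gamma data.
import numpy as np

rng = np.random.default_rng(0)
d, gamma, n_pts, n_hyps = 20, 0.1, 2000, 1000

w_star = rng.normal(size=d); w_star /= np.linalg.norm(w_star)

# Sample unit-length examples, keeping only those with margin >= gamma.
X = rng.normal(size=(5 * n_pts, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X = X[np.abs(X @ w_star) >= gamma][:n_pts]
y = np.sign(X @ w_star)

# Random hyperplanes through the origin, flipped so that h.w* >= 0.
H = rng.normal(size=(n_hyps, d))
H /= np.linalg.norm(H, axis=1, keepdims=True)
H *= np.sign(H @ w_star)[:, None]

errs = (np.sign(X @ H.T) != y[:, None]).mean(axis=0)
print("mean error:", errs.mean())                       # roughly <= 1/2 - gamma/pi
print("P[error <= 1/2 - gamma/4]:", (errs <= 0.5 - gamma / 4).mean())
```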
[Figure: separator found by an SVM using labeled data only vs. a Transductive SVM that also uses the unlabeled data.]
Graph-based methods
• Suppose we believe that very similar examples probably have the same label.
• If you have a lot of labeled data, this suggests a Nearest-Neighbor type of algorithm.
• If you have a lot of unlabeled data, this suggests a graph-based method.

Graph-based methods
• Transductive approach (given L + U, output predictions on U).
• Construct a graph with edges between very similar examples (see the sketch below).
• Solve for:
  – minimum cut
  – minimum "soft-cut" [ZGL]
  – spectral partitioning
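A minimal sketch of the "construct a graph" step, assuming a k-nearest-neighbor graph with Gaussian edge weights; the choice of k, the bandwidth sigma, and the function name are illustrative assumptions, not from the lecture.

```python
# Build a k-NN similarity graph over labeled + unlabeled examples: each point
# is connected to its k nearest neighbors, with Gaussian weights so that very
# similar examples get heavy edges.
import numpy as np

def knn_graph(X, k=10, sigma=1.0):
    n = len(X)
    # Pairwise squared Euclidean distances.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sq[i])[1:k + 1]          # skip the point itself
        W[i, nbrs] = np.exp(-sq[i, nbrs] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                      # symmetrize
```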
Graph-based methods
• Suppose just two labels: 0 & 1.
• Solve for labels f(x) for the unlabeled examples x to minimize:
  – ∑e=(u,v) |f(u) - f(v)|    [soln = minimum cut]
  – ∑e=(u,v) (f(u) - f(v))²   [soln = electric potentials; see the sketch below]
[Figure: similarity graph with a few labeled + and - nodes.]
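The quadratic "soft-cut" objective has a closed-form minimizer: fixing f to its 0/1 values on the labeled nodes and setting the gradient to zero yields a linear system over the unlabeled nodes (the harmonic / electric-potential solution of [ZGL]). Below is a minimal numpy sketch assuming a symmetric nonnegative similarity matrix W; the function name and the toy graph are my own illustration.

```python
# Harmonic ("soft-cut") solution: minimize sum over edges of w_uv*(f(u)-f(v))^2
# with f fixed on the labeled nodes.
import numpy as np

def harmonic_labels(W, labeled_idx, labeled_y):
    """W: (n,n) symmetric nonnegative similarity matrix; labeled_y in {0,1}."""
    n = W.shape[0]
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    unlab = np.setdiff1d(np.arange(n), labeled_idx)
    # Block form of the stationarity condition:  L_uu f_u = W_ul f_l.
    L_uu = L[np.ix_(unlab, unlab)]
    W_ul = W[np.ix_(unlab, labeled_idx)]
    f_u = np.linalg.solve(L_uu, W_ul @ np.asarray(labeled_y, float))
    f = np.zeros(n)
    f[labeled_idx] = labeled_y
    f[unlab] = f_u
    return f                                    # threshold at 1/2 for hard labels

# Tiny path graph 0-1-2-3, node 0 labeled 1 and node 3 labeled 0:
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
print(harmonic_labels(W, [0, 3], [1, 0]))       # -> [1, 2/3, 1/3, 0]
```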
How can we think about these approaches to using unlabeled data in a PAC-style model?
Proposed Model [BB05]
• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
• "Learn C" becomes "learn (C, χ)" (i.e. learn class C under compatibility notion χ).
• The compatibility notion should express relationships that one hopes the target function and underlying distribution will possess.
• Require that the degree of compatibility be something that can be estimated from a finite sample.
• To do this, need unlabeled data to allow us to uniformly estimate compatibilities well.
• Idea: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.

Margins, Compatibility
• Margins: the belief is that there should exist a large-margin separator.
[Figure: + and - points separated by a wide margin, labeled "highly compatible".]
• A legal (and natural) notion of compatibility for co-training, the compatibility of ⟨h1, h2⟩ with D: Pr_{⟨x1,x2⟩~D}[h1(x1) = h2(x2)]. (Estimation sketch below.)
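Both compatibility notions above can be estimated from an unlabeled sample, which is what makes "reduce C to the highly compatible functions" operational. The sketch below is my own illustration; the function names, the threshold tau, and the pruning loop are assumptions, not from the lecture.

```python
# Estimating compatibility from an unlabeled sample, then pruning a finite
# pool of candidate hypotheses down to the highly compatible ones.
import numpy as np

def margin_compat(w, X_unlab, gamma):
    """Fraction of unlabeled points at distance >= gamma from hyperplane w."""
    w = np.asarray(w, float)
    dists = np.abs(X_unlab @ w) / np.linalg.norm(w)
    return (dists >= gamma).mean()

def cotrain_compat(h1, h2, X1_unlab, X2_unlab):
    """Fraction of unlabeled pairs <x1,x2> on which h1 and h2 agree."""
    return (h1(X1_unlab) == h2(X2_unlab)).mean()

def prune(C, X_unlab, gamma, tau=0.95):
    """Keep only the hypotheses in the pool C that are highly compatible."""
    return [w for w in C if margin_compat(w, X_unlab, gamma) >= tau]
```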
Finite Hypothesis Spaces – c* not fully compatible
• Target in C, but not fully compatible.
[Figure: + and - examples.]

Infinite Hypothesis Spaces / VC-dimension
• Assume χ(h,x) ∈ {0,1} and χ(C) = {χh : h ∈ C}, where χh(x) = χ(h,x).
• Theorem: stated in terms of C[m,D], the expected number of splits of m points drawn from D using concepts in C (a small Monte Carlo illustration follows).
• Can result in a much better bound than uniform convergence!
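To make C[m,D] concrete, here is a rough Monte Carlo illustration (entirely my own; the distribution D, the hypothesis pool, and the sizes are arbitrary assumptions): sample m points from D and count the distinct labelings induced by a finite pool of halfspaces, which lower-bounds the number of splits achievable by the full class.

```python
# Rough Monte Carlo estimate of C[m,D]: expected number of distinct labelings
# ("splits") of m points from D induced by concepts in C.  Using a finite pool
# of random halfspaces only lower-bounds the count for the full class.
import numpy as np

rng = np.random.default_rng(1)
d, m, n_hyps, n_trials = 2, 10, 5000, 20

def splits_of_sample(X, H):
    labelings = {tuple(np.sign(X @ h).astype(int)) for h in H}
    return len(labelings)

H = rng.normal(size=(n_hyps, d))            # pool of halfspaces through the origin
counts = []
for _ in range(n_trials):
    X = rng.normal(size=(m, d))             # m points from D (standard Gaussian)
    counts.append(splits_of_sample(X, H))
print("estimated C[m,D] (lower bound):", np.mean(counts))
```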