
15-859(B) Machine Learning Theory
Semi-Supervised Learning
Avrim Blum
04/07/08

Semi-Supervised Learning

• The main models we have been studying (PAC, mistake-bound) are for supervised learning.
  – Given labeled examples S = {(xi,yi)}, try to learn a good prediction rule.
• But often labeled data is rare or expensive.
• On the other hand, often unlabeled data is plentiful and cheap.
  – Documents, images, OCR, web-pages, protein sequences, …
• Can we use unlabeled data to help?

Semi-Supervised Learning

Can we use unlabeled data to help?
• Unlabeled data is missing the most important info! But maybe it still has useful regularities that we can use. E.g., OCR.

Semi-Supervised Learning

Can we use unlabeled data to help?
• This is a question a lot of people in ML have been interested in. A number of interesting methods have been developed.

Today:
• Discuss several methods for trying to use unlabeled data to help.
• Extension of the PAC model to make sense of what's going on.

Plan for today

Methods:
• Co-training
• Transductive SVM
• Graph-based methods

Model:
• Augmented PAC model for SSL.

There's also a book, "Semi-Supervised Learning", on the topic.

Co-training

[Blum&Mitchell'98], motivated by [Yarowsky'95]

Yarowsky's Problem & Idea:
• Some words have multiple meanings (e.g., "plant"). Want to identify which meaning was intended in any given instance.
  "…nuclear power plant generated…"
• Standard approach: learn a function from local context to the desired meaning, from labeled data.
• Idea: use the fact that in most documents, multiple uses of a word have the same meaning. Use this to transfer confident predictions over.

Co-training

• Examples x can be written in two parts, x = ⟨x1, x2⟩.
• Either part alone is in principle sufficient to produce a good classifier.
• E.g., speech + video, image and context, web page contents and links.
• So if confident about the label for x1, can use it to impute a label for x2, and vice versa. Use each classifier to help train the other.

Example: classifying webpages

Actually, many problems have a similar characteristic.
• Co-training: agreement between two parts.
  – Examples contain two sets of features, i.e., an example is x = ⟨x1, x2⟩, and the belief is that the two parts of the example are sufficient and consistent, i.e., ∃ c1, c2 such that c1(x1) = c2(x2) = c(x).
  – E.g., for a webpage (such as "Prof. Avrim Blum" linked to as "My Advisor"): x = link info & text info, with x1 = link info and x2 = text info.
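A minimal sketch of the co-training loop just described, assuming two generic views and scikit-learn-style classifiers; the round count, the number k of confident examples added per round, and the helper name co_train are illustrative choices, not part of the original [BM98] procedure.

```python
# Sketch of a co-training loop: each view's classifier labels the unlabeled
# examples it is most confident about, and those imputed labels become
# training data for the other view as well.
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, k=5):
    """X1_l, X2_l: the two views of the labeled data; y_l: labels;
    X1_u, X2_u: the two views of the unlabeled data."""
    L1, L2, y = list(X1_l), list(X2_l), list(y_l)
    U = list(range(len(X1_u)))              # indices of still-unlabeled examples
    h1, h2 = LogisticRegression(), LogisticRegression()
    for _ in range(rounds):
        if not U:
            break
        h1.fit(np.array(L1), np.array(y))
        h2.fit(np.array(L2), np.array(y))
        for h, X_u in ((h1, X1_u), (h2, X2_u)):
            if not U:
                break
            probs = h.predict_proba(np.array([X_u[i] for i in U]))
            conf = probs.max(axis=1)        # confidence of each prediction
            picks = np.argsort(-conf)[:k]   # k most confident unlabeled examples
            for p in sorted(picks, reverse=True):
                i = U[p]
                L1.append(X1_u[i])
                L2.append(X2_u[i])
                y.append(h.classes_[probs[p].argmax()])  # imputed label feeds both views
                del U[p]
    return h1, h2
```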

Example: intervals

Suppose x1 ∈ R, x2 ∈ R, c1 = [a1,b1], c2 = [a2,b2].
[Figure: unlabeled points ⟨x1,x2⟩ with a few positives marked +.]

Co-Training Theorems

• [BM98] if x1, x2 are independent given the label: D = p(D1+ × D2+) + (1-p)(D1- × D2-), and if C is SQ-learnable, then can learn from an initial "weakly-useful" h1 plus unlabeled data.
• Def: h is weakly-useful if Pr[h(x)=1 | c(x)=1] > Pr[h(x)=1 | c(x)=0] + ε.
  (same as weak hyp if target c is balanced)
• E.g., say "syllabus" appears on 1/3 of course pages but only 1/6 of non-course pages.

Co-Training Theorems

• Use the weakly-useful h1's predictions as noisy labels. Like classification noise with potentially asymmetric noise rates α, β.
• Can learn so long as α + β < 1 - ε.
  (helpful trick: balance the data so observed labels are 50/50)
• [BB05] in some cases (e.g., LTFs), you can use this to learn from a single labeled example!
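One way to read the noise rates, filling in notation the slide leaves implicit (my assignment of α and β, not necessarily the lecture's):

$$\alpha = \Pr[h_1(x_1)=0 \mid c(x)=1], \qquad \beta = \Pr[h_1(x_1)=1 \mid c(x)=0].$$

With this reading, weak usefulness says Pr[h1(x1)=1 | c(x)=1] > Pr[h1(x1)=1 | c(x)=0] + ε, i.e., (1 − α) > β + ε, which is exactly the condition α + β < 1 − ε stated above.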

Co-Training Theorems

• [BB05] in some cases (e.g., LTFs), you can use this to learn from a single labeled example!
  – Repeat the process (below) multiple times.
  – Get 4 kinds of hypotheses: {close to c, close to ¬c, close to 1, close to 0}.

A really simple learning algorithm

Claim: if the data has a separator of margin γ, there's a reasonable chance a random hyperplane will have error ≤ ½ - γ/4. [all hyperplanes through origin]

Proof:
• Pick a (positive) example x. Consider the 2-d plane defined by x and the target w*.
• Pr_h(h⋅x ≤ 0 | h⋅w* ≥ 0) ≤ (π/2 - γ)/π = ½ - γ/π.
• So, E_h[err(h) | h⋅w* ≥ 0] ≤ ½ - γ/π.
• Since err(h) is bounded between 0 and 1, there must be a reasonable chance of success. QED
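The final "reasonable chance" step is only asserted on the slide; one standard way to fill it in (my reconstruction, not necessarily the argument intended in lecture) is a reverse-Markov calculation:

$$\Pr_h\!\bigl[\mathrm{err}(h) > \tfrac12 - \tfrac{\gamma}{4} \,\big|\, h\cdot w^* \ge 0\bigr] \;\le\; \frac{\tfrac12 - \tfrac{\gamma}{\pi}}{\tfrac12 - \tfrac{\gamma}{4}} \quad \text{(Markov, since } \mathrm{err}(h)\ge 0\text{)},$$

$$\text{so}\quad \Pr_h\!\bigl[\mathrm{err}(h) \le \tfrac12 - \tfrac{\gamma}{4} \,\big|\, h\cdot w^* \ge 0\bigr] \;\ge\; \frac{\tfrac{\gamma}{\pi} - \tfrac{\gamma}{4}}{\tfrac12 - \tfrac{\gamma}{4}} \;\ge\; \frac{\gamma(4-\pi)}{2\pi} \;=\; \Omega(\gamma).$$

Since Pr_h[h⋅w* ≥ 0] = ½ for a random hyperplane through the origin, the unconditional probability of drawing a hypothesis with error at most ½ − γ/4 is also Ω(γ).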

Co-Training Theorems

• [BM98] independence given the label (D = p(D1+ × D2+) + (1-p)(D1- × D2-)) plus SQ-learnability of C: can learn from an initial weakly-useful h1 plus unlabeled data.
• [BB05] in some cases (e.g., LTFs), you can use this to learn from a single labeled example.
• [BBY04] if you don't want to assume independence, and C is learnable from positive data only, then it suffices for D+ to have expansion.

Co-training and expansion

Want the initial sample to expand to the full set of positives after a limited number of iterations.
[Figure: bipartite graph between the two views X1 (link info) and X2 (text info), with a few nodes marked +.]

Transductive SVM [Joachims98]

• Suppose we believe the target separator goes through low-density regions of the space / has large margin.
• Aim for a separator with large margin w.r.t. the labeled and unlabeled data (L+U).
  [Figure: SVM on the labeled data only vs. Transductive SVM using the unlabeled points.]
• Unfortunately, the optimization problem is now NP-hard. The algorithm instead does local optimization.
  – Start with a large-margin separator over the labeled data. This induces labels on U.
  – Then try flipping labels in a greedy fashion.
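A minimal sketch of that local-search idea, not Joachims' actual TSVM implementation: fit a large-margin separator on the labeled set, use it to label U, then greedily swap pairs of opposite induced labels whenever that lowers a soft-margin objective. Labels are assumed to be ±1, and the function name, candidate-set size, and objective are illustrative stand-ins.

```python
# Local-search sketch in the spirit of a transductive SVM (illustrative only).
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_local_search(X_l, y_l, X_u, C=1.0, max_passes=5, cand=20):
    """y_l must be in {-1, +1}."""
    y_u = LinearSVC(C=C).fit(X_l, y_l).predict(X_u)   # labels induced on U by the labeled-only SVM

    def objective(labels_u):
        # Soft-margin objective over L+U with the current induced labels (smaller is better).
        X = np.vstack([X_l, X_u])
        y = np.concatenate([y_l, labels_u])
        m = LinearSVC(C=C).fit(X, y)
        margins = y * m.decision_function(X)
        return 0.5 * np.sum(m.coef_ ** 2) + C * np.sum(np.maximum(0.0, 1.0 - margins))

    best = objective(y_u)
    for _ in range(max_passes):
        improved = False
        pos = np.where(y_u == 1)[0][:cand]            # small candidate sets keep the sketch cheap
        neg = np.where(y_u == -1)[0][:cand]
        for i in pos:
            for j in neg:
                y_try = y_u.copy()
                y_try[i], y_try[j] = -1, 1            # swap a pair so the class balance stays fixed
                val = objective(y_try)
                if val < best:
                    y_u, best, improved = y_try, val, True
        if not improved:
            break
    return LinearSVC(C=C).fit(np.vstack([X_l, X_u]), np.concatenate([y_l, y_u]))
```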

Graph-based methods

• Suppose we believe that very similar examples probably have the same label.
• If you have a lot of labeled data, this suggests a Nearest-Neighbor type of algorithm.
• If you have a lot of unlabeled data, it suggests a graph-based method.

Graph-based methods

• Transductive approach. (Given L + U, output predictions on U.)
• Construct a graph with edges between very similar examples.
• Solve for:
  – minimum cut
  – minimum "soft-cut" [ZGL]
  – spectral partitioning

Graph-based methods

• Suppose just two labels: 0 & 1.
• Solve for labels f(x) for the unlabeled examples x to minimize:
  – ∑_{e=(u,v)} |f(u) − f(v)|     [soln = minimum cut]
  – ∑_{e=(u,v)} (f(u) − f(v))²    [soln = electric potentials]
[Figure: similarity graph with a labeled + node, a labeled − node, and unlabeled nodes.]

How can we think about these approaches to using unlabeled data in a PAC-style model?
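A small sketch of the quadratic ("soft-cut") version: with the labeled values clamped, minimizing ∑_{e=(u,v)} (f(u)−f(v))² gives a linear system whose solution is the harmonic function f_U = (D_UU − W_UU)⁻¹ W_UL f_L, as in [ZGL]. The helper name and the toy graph below are illustrative.

```python
# Harmonic-function / "soft-cut" label propagation on a similarity graph (sketch).
import numpy as np

def harmonic_labels(W, labeled_idx, y_l):
    """W: symmetric similarity (adjacency) matrix over all n examples.
    labeled_idx: indices of labeled examples; y_l: their labels in {0, 1}.
    Returns soft labels f in [0, 1] for every node (labeled nodes keep their labels)."""
    n = W.shape[0]
    unlabeled_idx = np.setdiff1d(np.arange(n), labeled_idx)
    D = np.diag(W.sum(axis=1))                        # degree matrix
    L = D - W                                         # graph Laplacian; f minimizes f^T L f
    L_uu = L[np.ix_(unlabeled_idx, unlabeled_idx)]
    W_ul = W[np.ix_(unlabeled_idx, labeled_idx)]
    f = np.zeros(n)
    f[labeled_idx] = y_l                              # clamp the labeled values
    f[unlabeled_idx] = np.linalg.solve(L_uu, W_ul @ np.asarray(y_l, dtype=float))
    return f

# Example: a 4-node path graph with the ends labeled 0 and 1.
W = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
print(harmonic_labels(W, np.array([0, 3]), np.array([0, 1])))
```

On this path the interior nodes come out at 1/3 and 2/3: each unlabeled node takes the average of its neighbors' values, which is exactly the electric-potentials interpretation.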

Proposed Model [BB05]

• Augment the notion of a concept class C with a notion of compatibility χ between a concept and the data distribution.
• "Learn C" becomes "learn (C,χ)" (i.e., learn class C under compatibility notion χ).
• Express relationships that one hopes the target function and underlying distribution will possess.
• Idea: use unlabeled data & the belief that the target is compatible to reduce C down to just {the highly compatible functions in C}.
• To do this, need unlabeled data to allow us to uniformly estimate compatibilities well.
• Require that the degree of compatibility be something that can be estimated from a finite sample.

Proposed Model [BB05]

• Require χ to be an expectation over individual examples:
  – χ(h,D) = E_{x ∈ D}[χ(h,x)]   (compatibility of h with D, where χ(h,x) ∈ [0,1])
  – err_unl(h) = 1 − χ(h,D)      (incompatibility of h with D; the unlabeled error rate of h)

Margins, Compatibility

• Margins: the belief is that there should exist a large-margin separator.
  [Figure: labeled + and − points with a highly compatible (large-margin) separator.]
• Incompatibility of h and D (the unlabeled error rate of h) = the probability mass within distance γ of h.
• Can be written as an expectation over individual examples, χ(h,D) = E_{x ∈ D}[χ(h,x)], where:
  – χ(h,x) = 0 if dist(x,h) < γ
  – χ(h,x) = 1 if dist(x,h) ≥ γ
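Because χ is an average over individual examples, the unlabeled error rate can be estimated from an unlabeled sample; written out explicitly (standard empirical-average notation, not taken from the slide):

$$\widehat{\mathrm{err}}_{unl}(h) \;=\; \frac{1}{m_u}\sum_{i=1}^{m_u}\bigl(1-\chi(h,x_i)\bigr), \qquad x_1,\dots,x_{m_u} \sim D,$$

which, for the margin notion above, is just the fraction of unlabeled points that fall within distance γ of the separator h.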

Margins, Compatibility

• If you do not want to commit to γ in advance, define χ(h,x) to be a smooth function of dist(x,h).

Co-Training, Compatibility

• Co-training: examples come as pairs ⟨x1, x2⟩ and the goal is to learn a pair of functions ⟨h1, h2⟩.
• The hope is that the two parts of the example are consistent.
• Legal (and natural) notion of compatibility:
  – the compatibility of ⟨h1, h2⟩ and D,
  – which can be written as an expectation over examples.
• Illegal notion of compatibility: the largest γ s.t. D has probability mass exactly zero within distance γ of h.
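The formula for this compatibility notion did not survive the extraction; based on the [BB05] paper, it should be (my reconstruction):

$$\chi\bigl(\langle h_1,h_2\rangle, D\bigr) \;=\; \Pr_{\langle x_1,x_2\rangle \sim D}\bigl[h_1(x_1)=h_2(x_2)\bigr] \;=\; \mathbb{E}_{\langle x_1,x_2\rangle \sim D}\bigl[\mathbf{1}\{h_1(x_1)=h_2(x_2)\}\bigr],$$

so the unlabeled error rate of ⟨h1, h2⟩ is the probability that the two views disagree.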

Sample Complexity – Uniform convergence bounds

Finite hypothesis spaces, doubly realizable case:
• Define C_{D,χ}(ε) = {h ∈ C : err_unl(h) ≤ ε}.
• Theorem:
• E.g., can think of ln|C| as the # of bits to describe the target without knowing D, and ln|C_{D,χ}(ε)| as the # of bits to describe the target knowing a good approximation to D, given the assumption that the target has low unlabeled error rate.
• Bound the # of labeled examples as a measure of the helpfulness of D with respect to χ:
  – a helpful distribution is one in which C_{D,χ}(ε) is small.

Semi-Supervised Learning: Natural Formalization (PAC_χ)

• We will say an algorithm "PAC_χ-learns" if it runs in poly time using samples poly in the respective bounds.
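The statement of the theorem on the sample-complexity slide above was lost in extraction; as best I can reconstruct it from [BB05], it is roughly of the following form (constants and exact conditions may differ from the lecture's version):

$$\text{If } m_u \ge \frac{1}{\varepsilon}\Bigl[\ln|C| + \ln\frac{2}{\delta}\Bigr] \ \text{ and } \ m_l \ge \frac{1}{\varepsilon}\Bigl[\ln|C_{D,\chi}(\varepsilon)| + \ln\frac{2}{\delta}\Bigr],$$

then with probability at least 1 − δ, every h ∈ C that is consistent with the labeled sample and has zero empirical unlabeled error rate satisfies err(h) ≤ ε.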

Finite Hypothesis Spaces – target c* in C, but not fully compatible

Theorem:

Infinite Hypothesis Spaces / VC-dimension

• Assume χ(h,x) ∈ {0,1} and let χ(C) = {χ_h : h ∈ C}, where χ_h(x) = χ(h,x).
• C[m,D] = expected # of splits of m points from D with concepts in C.

ε-Cover-based bounds

• For algorithms that behave in a specific way:
  – first use the unlabeled data to choose a representative set of compatible hypotheses,
  – then use the labeled sample to choose among these.
• Theorem:
• Can result in a much better bound than uniform convergence!
• E.g., in the case of co-training linear separators with the independence assumption:
  – ε-cover of the compatible set = {0, 1, c*, ¬c*}
• E.g., Transductive SVM when the data is in two blobs.
  [Figure: two well-separated blobs of + and − points.]

Ways unlabeled data can help in this model

• If the target is highly compatible with D and we have enough unlabeled data to estimate χ over all h ∈ C, then we can reduce the search space (from C down to just those h ∈ C whose estimated unlabeled error rate is low).

• By providing an estimate of D, unlabeled data can allow a more refined distribution-specific notion of hypothesis space size (such as Annealed VC-entropy or the size of the smallest ε-cover).

• If D is nice, so that the set of compatible h ∈ C has a small ε-cover and the elements of the cover are far apart, then we can learn from even fewer labeled examples than the 1/ε needed just to verify a good hypothesis.
