
Machine Learning COMS-4771

Machine Learning Theory
The Mistake-Bound Model
Lecture 6

Based entirely on Avrim Blum's notes (see the link on the web page).

Goals of ML theory:
- Develop and analyze models of learning that capture the key aspects of
  machine learning.
- Help understand what types of guarantees we can hope to achieve, and
  what types of learning problems we can hope to solve.

Two main learning models:

1. Distributional (PAC) setting (a "batch" model):
   - Assumption: training and test examples come (independently) from some
     fixed probability distribution D over the instance space, labeled by
     an unknown target function (from a known hypothesis class).
   - Basic question: how much data do I need to see so that if I do well
     on it, I can expect to do well on new points drawn from D?

2. On-line setting: no distributional assumptions. An adversary selects
   the order in which examples are presented. There is no separate
   training set: the learner attempts to predict on every example seen,
   and we count the number of mistakes.

We are going to start with (2).


Mistake Bound Model

Learning proceeds in stages:
- The learner gets an unlabeled example x ∈ X.
- The learner predicts its classification.
- The learner is told the correct label f(x).

Goal: minimize the number of mistakes.

Definition
Algorithm A learns a class of functions C with mistake bound M if A makes
at most M mistakes on any sequence of examples consistent with some f ∈ C.

In this model we can't talk about past performance predicting future
performance, or about the number of samples needed to learn (e.g., we may
end up seeing the same example over and over again, so we learn nothing
new about f; but that's OK, because we won't be making many mistakes
either).

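To make the protocol concrete, here is a minimal Python sketch of the
mistake-counting loop (the learner interface with predict and update
methods is illustrative, not from the notes):

    def run_online(learner, examples, f):
        """Run the mistake-bound protocol: for each unlabeled x the learner
        commits to a prediction, is then told the correct label f(x), and
        we count how often it was wrong."""
        # hypothetical learner interface: predict(x) -> bool, update(x, y)
        mistakes = 0
        for x in examples:
            y_hat = learner.predict(x)   # prediction before seeing the label
            y = f(x)                     # correct label is revealed
            if y_hat != y:
                mistakes += 1
            learner.update(x, y)         # learner may adjust its state
        return mistakes
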
Example: Disjunctions

X = {0,1}^n. Assumption: examples are consistent with a disjunction (an OR
over a subset of the features).

- Start with h(x) = x_1 ∨ x_2 ∨ ··· ∨ x_n.
- Get x, answer according to h.
- We can't make a mistake on a positive example. On a mistake on a
  negative example: throw out all variables of h that are set to 1 in x.
- Each mistake removes at least one variable, so the total number of
  mistakes is at most n. (See the sketch after this list.)
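A sketch of this elimination learner, assuming examples arrive as
(x, label) pairs with the label revealed only after the prediction:

    def learn_disjunction(examples, n):
        """Online learner for monotone disjunctions over {0,1}^n. Start with
        h = x_1 OR ... OR x_n; on each mistake (necessarily on a negative
        example) discard every variable set to 1 in that example."""
        relevant = set(range(n))      # variable indices currently kept in h
        mistakes = 0
        for x, y in examples:         # x is a 0/1 tuple, y the true label
            prediction = any(x[i] for i in relevant)
            if prediction != y:       # on consistent data this means y == 0
                mistakes += 1
                relevant = {i for i in relevant if x[i] == 0}
        return relevant, mistakes     # mistakes <= n
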
Example: Disjunctions

No deterministic algorithm can do better:

1 0 0 0 0 0
0 1 0 0 0 0
0 0 1 0 0 0
0 0 0 1 0 0
0 0 0 0 1 0
0 0 0 0 0 1

Any labeling of these n unit-vector examples is consistent with some
disjunction (label the i-th example positive iff x_i is included in the
disjunction). So regardless of what the algorithm predicts, there is
always a consistent disjunction that disagrees with the algorithm on
every single example, forcing n mistakes.
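A sketch of that adversary argument (using the same hypothetical learner
interface as above):

    def force_n_mistakes(learner, n):
        """Present the unit vectors e_1, ..., e_n and always answer the
        opposite of the learner's prediction. The resulting labels are
        realized by the disjunction of exactly the coordinates labeled
        positive, so the sequence is consistent with the class."""
        positives = []
        for i in range(n):
            e_i = tuple(1 if j == i else 0 for j in range(n))
            y = not learner.predict(e_i)   # adversary flips the prediction
            if y:
                positives.append(i)
            learner.update(e_i, y)
        return positives                   # n mistakes were forced
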
Halving Algorithm

- Learns an arbitrary concept class C with at most log_2(|C|) mistakes.
  (The number of monotone disjunctions on n variables is 2^n, so this
  makes at most n mistakes: one mistake per bit.)
- Take a majority vote over all h ∈ C consistent with all examples seen
  so far.
- Proof: each mistake cuts down the number of remaining concepts at least
  in half. (A sketch follows this list.)
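A minimal sketch of the halving algorithm, representing each hypothesis
as a callable h(x) -> bool (an illustrative assumption):

    def halving_predict(consistent, x):
        """Majority vote over the hypotheses still consistent with all
        previously seen labeled examples."""
        votes = sum(1 if h(x) else -1 for h in consistent)
        return votes >= 0             # break ties toward positive

    def halving_update(consistent, x, y):
        """Discard every hypothesis that disagrees with the revealed label.
        On a mistake, at least half of the hypotheses voted wrongly, so the
        set at least halves: at most log_2(|C|) mistakes in total."""
        return [h for h in consistent if h(x) == y]
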
What if our concept class has functions of different sizes (e.g., decision
trees)?
- Let C_b be the set of functions in C that take at most b bits to
  represent, so |C_b| ≤ 2^b.
- "Guess and double" on b. If the target function takes s bits and
  2^b ≤ s ≤ 2^(b+1), the number of mistakes is at most
  1 + 2 + 4 + ··· + 2s < 4s.
- We can get rid of the factor of 4:
  - Give each h a weight of (1/2)^size(h). If our description language is
    a prefix code (no concept's encoding is a prefix of another's), the
    total sum of weights is at most 1.
  - Take a weighted vote. Each mistake removes at least 1/2 of the total
    weight left. So if the target has size s, the number of mistakes is at
    most log_2(2^s) = s. We are paying one mistake per bit of the
    description of the target function (if you don't care about
    computation time). (A sketch follows.)
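A sketch of the weighted vote, again treating hypotheses as callables and
assuming a size(h) function supplied by the description language:

    def init_weights(C, size):
        """Weight each hypothesis by (1/2)^size(h); with a prefix-free
        description language these weights sum to at most 1 (Kraft)."""
        return {h: 0.5 ** size(h) for h in C}

    def weighted_predict(weights, x):
        # weighted vote over the remaining (still-consistent) hypotheses
        pos = sum(w for h, w in weights.items() if h(x))
        neg = sum(w for h, w in weights.items() if not h(x))
        return pos >= neg

    def weighted_update(weights, x, y):
        """Remove hypotheses that got x wrong. On a mistake the losing side
        held at least half the remaining weight, so each mistake halves the
        total; the target keeps weight (1/2)^s, giving at most s mistakes."""
        return {h: w for h, w in weights.items() if h(x) == y}
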
Standard Optimal Algorithm

Online learning as a 2-player game between Algorithm and Adversary:

- The adversary selects an x which splits C into two sets: C_0(x), the
  functions that label x negative, and C_1(x), the functions that label x
  positive.
- The algorithm gets to pick one of C_0(x) and C_1(x) to throw away (by
  predicting positive and making a mistake it throws out C_1(x); by
  predicting negative and making a mistake it throws out C_0(x)).
- Repeat until only one function is left.

The value of this game (to the adversary) is the number of rounds the
game is played.

- This gives a well-defined number opt(C): the optimal mistake bound for
  concept class C (the minimum over all algorithms).
- It also gives a well-defined optimal strategy for each player: given an
  example x, we "just" calculate opt(C_0(x)) and opt(C_1(x)) (by applying
  this idea recursively), and throw out whichever set has the larger
  mistake bound. (A recursive sketch follows.)
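A recursive sketch of opt, assuming C is a frozenset of callable
hypotheses and X a tuple of instances (exponential time; purely to pin
down the definition):

    from functools import lru_cache

    @lru_cache(maxsize=None)
    def opt(C, X):
        """Optimal worst-case mistake bound for a finite class C over X,
        computed via the game recursion."""
        if len(C) <= 1:
            return 0                  # one candidate left: no forced mistakes
        best = 0
        for x in X:
            C1 = frozenset(h for h in C if h(x))   # functions labeling x positive
            C0 = C - C1                            # functions labeling x negative
            if not C0 or not C1:
                continue              # x doesn't split C; no mistake forced
            # the algorithm throws out the side with the larger bound,
            # paying one mistake and continuing on the smaller side
            best = max(best, 1 + min(opt(C0, X), opt(C1, X)))
        return best
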
Is the Halving Algorithm optimal?

- Halving algorithm: throw out the larger set.
- Optimal algorithm: throw out the set with the larger mistake bound.
- These are not always the same (so halving is not always optimal)! It is
  possible for the mistake bound of a set of functions C to be much
  smaller than log_2(|C|).
- You will come up with an example in your next homework.

What if there is no perfect function?

- Think of the functions in C as experts giving advice. You want to do as
  well as the best expert in hindsight (regret bounds).
- Next lecture: a flexible version of the halving algorithm (instead of
  throwing out inconsistent functions, reduce their weight): Weighted
  Majority.
- Next lecture: efficient algorithms that can handle a large number of
  irrelevant features (Winnow).
