
Introduction to Information Retrieval

Vector Space Classification

Chris Manning, Pandu Nayak and Prabhakar Raghavan
Vector Space Classification - Topics

• Vector Space Classification
• Rocchio Classification
• k-Nearest Neighbour (kNN) Classification


Classification Using Vector Spaces

• In vector space classification, the training set corresponds to a labeled set of points (equivalently, vectors) (see the sketch below)
• Premise 1: Documents in the same class form a contiguous region of space
• Premise 2: Documents from different classes don't overlap (much)
• Learning a classifier: build surfaces to delineate classes in the space
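A minimal Python sketch of this setup (an illustration, not part of the original slides): each training document is mapped to a unit-length term vector and paired with its class label. The tf_vector helper and the example documents are made-up assumptions.

from collections import Counter
import math

def tf_vector(text):
    """Map a document to a unit-length term-frequency vector (term -> weight)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {term: c / norm for term, c in counts.items()}

# Hypothetical labeled training set: each document is a point in term space.
training_set = [
    (tf_vector("the minister announced a new budget policy"), "Government"),
    (tf_vector("the experiment confirmed the quantum theory"), "Science"),
    (tf_vector("the gallery opened a new painting exhibition"), "Arts"),
]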
Sec.14.1

Documents in a Vector Space

[Figure: documents plotted as points in a vector space, with separate regions for the Government, Science and Arts classes]
Sec.14.1

Test Document of what class?

[Figure: the same vector space with an unlabeled test document plotted among the Government, Science and Arts regions]
Sec.14.1

Test Document = Government

Is this similarity hypothesis true in general?

[Figure: the test document falls within the Government region and is classified as Government]

Our focus: how to find good separators


Sec.14.2

Definition of centroid

• The centroid of class c is the average of its document vectors:

  μ(c) = (1/|Dc|) Σ_{d ∈ Dc} v(d)

  where Dc is the set of all documents that belong to class c and v(d) is the vector space representation of d.

• Note that the centroid will in general not be a unit vector even when the inputs are unit vectors.
Sec.14.2

Rocchio classification

• Rocchio forms a simple representative for each class: the centroid/prototype
• Classification: assign to the nearest prototype/centroid
• It does not guarantee that classifications are consistent with the given training data
Rocchio - Pseudo code

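The pseudocode figure from the original slide is not reproduced in this text version. Below is a minimal Python sketch of the same idea, assuming documents are sparse dict vectors as in the earlier sketch (it reuses that hypothetical tf_vector helper and training_set): training computes one centroid per class, and classification assigns a document to the class with the nearest centroid.

from collections import defaultdict
import math

def train_rocchio(training_set):
    """Compute one centroid per class: the average of that class's document vectors."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for vec, label in training_set:
        counts[label] += 1
        for term, weight in vec.items():
            sums[label][term] += weight
    return {label: {term: w / counts[label] for term, w in terms.items()}
            for label, terms in sums.items()}

def euclidean_distance(u, v):
    """Euclidean distance between two sparse vectors represented as dicts."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))

def apply_rocchio(centroids, doc_vec):
    """Assign the document to the class whose centroid is nearest."""
    return min(centroids, key=lambda label: euclidean_distance(centroids[label], doc_vec))

centroids = train_rocchio(training_set)  # training_set from the earlier sketch
# The nearest centroid for this toy query is the Government class.
print(apply_rocchio(centroids, tf_vector("minister announced new budget policy vote")))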
Sec.14.2

Rocchio classification

• Little used outside text classification
• It has been used quite effectively for text classification
• But in general worse than Naïve Bayes
• Again, cheap to train and to apply to test documents
Sec.14.3

k Nearest Neighbor Classification

• kNN = k Nearest Neighbor

• To classify a document d:
  • Define the k-neighborhood as the k nearest neighbors of d
  • Pick the majority class label in the k-neighborhood
Sec.14.3

Example: k=6 (6NN)

P(science | test document)?

[Figure: a test document plotted with its 6 nearest neighbors drawn from the Government, Science and Arts classes]
Sec.14.3

Nearest-Neighbor Learning

• Learning: just store the labeled training examples D
• Testing instance x (under 1NN):
  • Compute the similarity between x and all examples in D
  • Assign x the category of the most similar example in D
• Also called:
  • Case-based learning
  • Memory-based learning
  • Lazy learning
• Rationale of kNN: contiguity hypothesis
Sec.14.3

k Nearest Neighbor

• Using only the closest example (1NN) is subject to errors due to:
  • A single atypical example
  • Noise (i.e., an error) in the category label of a single training example
• More robust: find the k nearest examples and return the majority category of these k
• k is typically odd to avoid ties; 3 and 5 are most common
kNN - Pseudo code

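As with Rocchio, the pseudocode figure is not reproduced here. The following is a minimal Python sketch of the idea (store the training vectors; at test time, rank them by similarity to the test document and take a majority vote over the k most similar), again reusing the hypothetical tf_vector helper and training_set from the earlier sketches.

from collections import Counter

def dot(u, v):
    """Dot product of two sparse vectors; for unit-length vectors this equals cosine similarity."""
    return sum(weight * v.get(term, 0.0) for term, weight in u.items())

def knn_classify(training_set, doc_vec, k=3):
    """Rank training documents by similarity to doc_vec and return the majority label of the top k."""
    neighbours = sorted(training_set, key=lambda item: dot(item[0], doc_vec), reverse=True)[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# 1NN on the toy set: the Science document is the most similar, so the label is Science.
print(knn_classify(training_set, tf_vector("a new experiment tested the quantum theory"), k=1))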
Sec.14.3

kNN decision boundaries

Boundaries are in principle arbitrary surfaces – but usually polyhedra

[Figure: kNN decision boundaries between the Government, Science and Arts classes]

kNN gives locally defined decision boundaries between classes – far away points do not influence each classification decision (unlike in Naïve Bayes, Rocchio, etc.)
Sec.14.3

kNN: Discussion

• No feature selection necessary
• No training necessary
• Scales well with a large number of classes
  • Don't need to train n classifiers for n classes
• Classes can influence each other
  • Small changes to one class can have a ripple effect
• May be expensive at test time
• In most cases it's more accurate than NB or Rocchio
