Vector Space Classification
Introduction to Information Retrieval
Rocchio Classification
[Figure: classes Government, Science, Arts]
Sec.14.1
Is this similarity hypothesis true in general?
Definition of centroid
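For reference, the centroid of a class c is usually written as the mean of the vectors of its training documents. The symbols D_c (the set of training documents in class c) and \vec{v}(d) (the vector representing document d) are notation assumed for this sketch:

```latex
\vec{\mu}(c) = \frac{1}{|D_c|} \sum_{d \in D_c} \vec{v}(d)
```

Note that a centroid need not be a unit vector even when every document vector is length-normalized.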
Sec.14.2
Rocchio classification
Rocchio forms a simple representative for each class: the centroid/prototype
Classification: nearest prototype/centroid
It does not guarantee that classifications are consistent with the given training data
Rocchio - Pseudo code
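A minimal sketch of Rocchio training and classification in plain Python. It assumes documents are already equal-length numeric vectors and uses Euclidean distance to the centroid; the function names `train_rocchio` and `apply_rocchio` and the toy data are illustrative, not taken from the slides:

```python
import math
from collections import defaultdict

def train_rocchio(docs, labels):
    """Compute one centroid per class from equal-length document vectors."""
    by_class = defaultdict(list)
    for vec, label in zip(docs, labels):
        by_class[label].append(vec)
    # The centroid is the component-wise mean of the class's vectors.
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_class.items()
    }

def apply_rocchio(centroids, doc):
    """Return the class whose centroid is closest to doc (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(centroids[label], doc))

# Toy 2-D vectors; the data and labels are made up for illustration.
docs = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]
labels = ["Government", "Government", "Arts", "Arts"]
centroids = train_rocchio(docs, labels)
prediction = apply_rocchio(centroids, [0.9, 0.1])
```

Training is a single pass over the data and classification compares the document against one vector per class, which is why both steps are cheap.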
Rocchio classification
Little used outside text classification
It has been used quite effectively for text classification
But in general it is worse than Naïve Bayes
Again, cheap to train and cheap to classify test documents
Sec.14.3
To classify a document d:
Define the k-neighborhood as the k nearest neighbors of d
Pick the majority class label in the k-neighborhood
P(Science | d)?
[Figure: classes Government, Science, Arts]
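One common way to answer this question, stated here as an assumption rather than something the slide spells out, is to estimate the probability as the fraction of the k nearest neighbors of d that belong to the class. The symbols S_k(d) (the k-neighborhood of d) and I_c (an indicator for membership in class c) are notation introduced for this sketch:

```latex
\hat{P}(c \mid d) = \frac{1}{k} \sum_{d' \in S_k(d)} I_c(d'),
\qquad
I_c(d') =
\begin{cases}
1 & \text{if } d' \text{ is in class } c \\
0 & \text{otherwise}
\end{cases}
```

Under this estimate, picking the majority label in the k-neighborhood is the same as picking the class with the highest estimated probability.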
Nearest-Neighbor Learning
Learning: just store the labeled training examples D
Testing instance x (under 1NN):
Compute similarity between x and all examples in D.
Assign x the category of the most similar example in D.
Also called:
Case-based learning
Memory-based learning
Lazy learning
Rationale of kNN: contiguity hypothesis
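The 1NN procedure above can be sketched as follows. This is a minimal illustration that assumes cosine similarity as the similarity measure and represents D as a list of (vector, label) pairs; the names `cosine` and `classify_1nn` and the toy data are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def classify_1nn(D, x):
    """Assign x the label of the most similar stored example in D."""
    _, label = max(D, key=lambda example: cosine(example[0], x))
    return label

# "Learning" is just storing the labeled examples; the data is made up.
D = [([1.0, 0.1], "Government"), ([0.1, 1.0], "Arts"), ([0.7, 0.7], "Science")]
nearest_label = classify_1nn(D, [0.9, 0.2])
```

All the work happens at test time, when x is compared against every stored example, which is where the name "lazy learning" comes from.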
k Nearest Neighbor
Using only the closest example (1NN) is subject to errors due to:
A single atypical example.
Noise (i.e., an error) in the category label of a single training example.
More robust: find the k most similar examples and return the majority category of these k
k is typically odd to avoid ties; 3 and 5 are most common
KNN - Pseudo code
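A minimal sketch of kNN classification by majority vote. It assumes Euclidean distance over equal-length vectors and stores training data as (vector, label) pairs; the function name `classify_knn` and the toy data are illustrative, and no special tie-breaking rule is implemented:

```python
import math
from collections import Counter

def classify_knn(D, x, k=3):
    """Return the majority label among the k training examples nearest to x."""
    # Sort by Euclidean distance and keep the k closest (vector, label) pairs.
    neighbors = sorted(D, key=lambda example: math.dist(example[0], x))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy 2-D training set; the data and labels are made up for illustration.
D = [
    ([1.0, 0.0], "Government"), ([0.9, 0.2], "Government"),
    ([0.0, 1.0], "Arts"), ([0.1, 0.9], "Arts"),
    ([0.6, 0.5], "Science"),
]
label_1nn = classify_knn(D, [0.7, 0.4], k=1)
label_3nn = classify_knn(D, [0.7, 0.4], k=3)
```

In this toy set the single closest example is labeled Science, while two of the three closest are labeled Government, so 1NN and 3NN disagree. Sorting the whole training set is the simplest approach but costs time proportional to its size for every test document.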
kNN: Discussion
No feature selection necessary
No training necessary
Scales well with large number of classes
Don’t need to train n classifiers for n classes
Classes can influence each other
Small changes to one class can have a ripple effect
May be expensive at test time
In most cases it’s more accurate than NB or Rocchio