Unit III PPT Slides
Chapter 8
Text Classification
Introduction
A Characterization of Text Classification
Unsupervised Algorithms
Supervised Algorithms
Feature Selection or Dimensionality Reduction
Evaluation Metrics
Organizing the Classes - Taxonomies
Text classification
key technology in modern enterprises
Unsupervised learning
no training data is provided
Examples:
neural network models
independent component analysis
clustering
Semi-supervised learning
a small amount of labeled training data
combined with a larger amount of unlabeled data
1. Initial Step.
select K documents at random as the initial centroids: \vec{\Delta}_p = \vec{d}_j
2. Assignment Step.
assign each document to cluster with closest centroid
distance function computed as inverse of the similarity
for the similarity between d_j and c_p, use the cosine formula:
sim(d_j, c_p) = \frac{\vec{\Delta}_p \cdot \vec{d}_j}{|\vec{\Delta}_p| \times |\vec{d}_j|}
3. Update Step.
recompute each centroid as the average of the documents in its cluster:
\vec{\Delta}_p = \frac{1}{size(c_p)} \sum_{\vec{d}_j \in c_p} \vec{d}_j
4. Final Step.
repeat assignment and update steps until no centroid changes
3. Selection Step.
if the stop criteria are satisfied (e.g., no cluster larger than a
pre-defined size), stop execution
otherwise, go back to the Split Step
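The clustering steps above can be sketched in a few lines of Python. This is a minimal illustration, not the book's code; the function names, the cosine similarity helper, and the toy vectors are assumptions made for the example:

```python
import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two document vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def kmeans(docs, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initial Step: K documents chosen at random serve as the initial centroids
    centroids = docs[rng.choice(len(docs), size=K, replace=False)]
    assign = np.zeros(len(docs), dtype=int)
    for _ in range(max_iters):
        # 2. Assignment Step: each document goes to the cluster with the closest
        #    (i.e., most similar) centroid
        assign = np.array([max(range(K), key=lambda p: cosine_sim(d, centroids[p]))
                           for d in docs])
        # 3. Update Step: recompute each centroid as the average of its documents
        new_centroids = np.array([docs[assign == p].mean(axis=0) if np.any(assign == p)
                                  else centroids[p] for p in range(K)])
        # 4. Final Step: stop once no centroid changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids

# toy term-weight vectors for four documents
docs = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]])
print(kmeans(docs, K=2))
```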
Hierarchical Clustering
Goal: to create a hierarchy of clusters by either
decomposing a large cluster into smaller ones, or
agglomerating previously defined clusters into larger ones
Complete-Link Algorithm
dist(c_p, c_r) = \max_{d_j \in c_p,\ d_l \in c_r} dist(d_j, d_l)
Average-Link Algorithm
dist(c_p, c_r) = \frac{1}{n_p + n_r} \sum_{d_j \in c_p} \sum_{d_l \in c_r} dist(d_j, d_l)
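A small Python sketch of the two inter-cluster distance functions above; the Euclidean document distance and the toy clusters are assumptions for illustration:

```python
import numpy as np

def doc_dist(dj, dl):
    # distance between two document vectors (Euclidean, for illustration)
    return np.linalg.norm(dj - dl)

def complete_link(cluster_p, cluster_r):
    # dist(cp, cr) = maximum distance over all pairs (dj in cp, dl in cr)
    return max(doc_dist(dj, dl) for dj in cluster_p for dl in cluster_r)

def average_link(cluster_p, cluster_r):
    # dist(cp, cr) = sum of all pairwise distances divided by (np + nr), as on the slide
    total = sum(doc_dist(dj, dl) for dj in cluster_p for dl in cluster_r)
    return total / (len(cluster_p) + len(cluster_r))

cp = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
cr = [np.array([4.0, 0.0]), np.array([5.0, 0.0])]
print(complete_link(cp, cr), average_link(cp, cr))
```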
sim(d_j, c_p) = \frac{\vec{d}_j \cdot \vec{c}_p}{|\vec{d}_j| \times |\vec{c}_p|}
associate with d_j the class(es) c_p having the highest values of sim(d_j, c_p)
Solution:
delay construction of tree until new document is presented for
classification
build the tree based only on the features present in this document, avoiding
the problem
score(d_j, c_p) = \sum_{d_t \in N_k(d_j)} similarity(d_j, d_t) \times T(d_t, c_p)
where
N_k(d_j): set of the k nearest neighbors of d_j in the training set
similarity(d_j, d_t): cosine formula of the Vector model (for instance)
T(d_t, c_p): training set function that returns
1, if d_t belongs to class c_p
0, otherwise
The classifier assigns to d_j the class(es) c_p with the highest score(s)
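A minimal Python sketch of this kNN scoring rule; the toy training vectors, labels, and the choice of k are illustrative assumptions:

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def knn_scores(dj, train_docs, train_labels, classes, k):
    # Nk(dj): indices of the k most similar training documents
    sims = np.array([cosine_sim(dj, dt) for dt in train_docs])
    nearest = np.argsort(-sims)[:k]
    # score(dj, cp) = sum over dt in Nk(dj) of similarity(dj, dt) * T(dt, cp)
    return {cp: sum(sims[t] for t in nearest if train_labels[t] == cp)
            for cp in classes}

train_docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
train_labels = ["sports", "sports", "politics", "politics"]
scores = knn_scores(np.array([0.8, 0.2]), train_docs, train_labels,
                    classes={"sports", "politics"}, k=3)
print(max(scores, key=scores.get), scores)  # assign the class(es) with the highest score
```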
\vec{c}_p = \frac{\beta}{n_p} \sum_{d_j \in c_p} \vec{d}_j - \frac{\gamma}{N_t - n_p} \sum_{d_l \notin c_p} \vec{d}_l
where
np : number of documents in class cp
Nt : total number of documents in the training set
terms of training docs in class cp : positive weights
terms of docs outside class cp : negative weights
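A Python sketch of the Rocchio-style centroid above, combined with the cosine assignment rule shown earlier; the values of beta and gamma and the toy data are assumptions, not values from the slides:

```python
import numpy as np

def rocchio_centroid(class_docs, other_docs, beta=16.0, gamma=4.0):
    # cp = (beta/np) * sum of docs inside cp  -  (gamma/(Nt - np)) * sum of docs outside cp
    return (beta / len(class_docs)) * class_docs.sum(axis=0) \
         - (gamma / len(other_docs)) * other_docs.sum(axis=0)

def classify(dj, centroids):
    # assign dj to the class whose centroid is most similar under the cosine formula
    def sim(cp):
        return dj @ cp / (np.linalg.norm(dj) * np.linalg.norm(cp) + 1e-12)
    return max(centroids, key=lambda name: sim(centroids[name]))

docs = {"sports":   np.array([[1.0, 0.0], [0.9, 0.2]]),
        "politics": np.array([[0.1, 1.0], [0.0, 0.8]])}
centroids = {cp: rocchio_centroid(docs[cp],
                                  np.vstack([docs[cq] for cq in docs if cq != cp]))
             for cp in docs}
print(classify(np.array([0.7, 0.3]), centroids))
```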
S(d_j, c_p) = \frac{P(c_p | \vec{d}_j)}{P(\overline{c_p} | \vec{d}_j)}
P(c_p | \vec{d}_j): probability that document d_j belongs to class c_p
P(\overline{c_p} | \vec{d}_j): probability that document d_j does not belong to c_p
P(c_p | \vec{d}_j) + P(\overline{c_p} | \vec{d}_j) = 1
S(d_j, c_p) \sim \frac{P(\vec{d}_j | c_p)}{P(\vec{d}_j | \overline{c_p})}
Independence assumption
P(\vec{d}_j | c_p) = \prod_{k_i \in \vec{d}_j} P(k_i | c_p) \times \prod_{k_i \notin \vec{d}_j} P(\overline{k_i} | c_p)
P(\vec{d}_j | \overline{c_p}) = \prod_{k_i \in \vec{d}_j} P(k_i | \overline{c_p}) \times \prod_{k_i \notin \vec{d}_j} P(\overline{k_i} | \overline{c_p})
p_{iP} = \frac{1 + \sum_{d_j | d_j \in D_t \wedge k_i \in d_j} P(c_p | d_j)}{2 + \sum_{d_j \in D_t} P(c_p | d_j)} = \frac{1 + n_{i,p}}{2 + n_p}
q_{iP} = \frac{1 + \sum_{d_j | d_j \in D_t \wedge k_i \in d_j} P(\overline{c_p} | d_j)}{2 + \sum_{d_j \in D_t} P(\overline{c_p} | d_j)} = \frac{1 + (n_i - n_{i,p})}{2 + (N_t - n_p)}
where n_{i,p} is the number of training docs of class c_p that contain k_i, and n_i is the number of training docs that contain k_i
P(c_p | \vec{d}_j) = \frac{P(c_p) \times P(\vec{d}_j | c_p)}{P(\vec{d}_j)}
P (d~j ): prior document probability
P (cp ): prior class probability
P(c_p) = \frac{\sum_{d_j \in D_t} P(c_p | d_j)}{N_t} = \frac{n_p}{N_t}
P (cp |dj ) ∈ {0, 1}: given by training set of size Nt
P(\vec{d}_j) = \sum_{p=1}^{L} P(\vec{d}_j | c_p) \times P(c_p)
where L is the number of classes
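A Python sketch of this binary (term-occurrence) Naive Bayes formulation, using the smoothed estimates p_iP and q_iP above; the tiny vocabulary, documents, and labels are invented for illustration:

```python
import math

def train_binary_nb(train_docs, train_labels, vocab, cp):
    # counts used by the smoothed estimates:
    #   Nt: training docs   np: docs in class cp
    #   ni: docs containing ki   nip: docs of class cp containing ki
    Nt = len(train_docs)
    n_p = sum(1 for lab in train_labels if lab == cp)
    p, q = {}, {}
    for ki in vocab:
        n_i = sum(1 for d in train_docs if ki in d)
        n_ip = sum(1 for d, lab in zip(train_docs, train_labels) if lab == cp and ki in d)
        p[ki] = (1 + n_ip) / (2 + n_p)                  # piP ~ P(ki | cp)
        q[ki] = (1 + (n_i - n_ip)) / (2 + (Nt - n_p))   # qiP ~ P(ki | not cp)
    return p, q

def log_odds(doc, p, q, vocab):
    # log S(dj, cp): terms present contribute log(pi/qi),
    # absent terms contribute log((1-pi)/(1-qi))
    return sum(math.log(p[ki] / q[ki]) if ki in doc
               else math.log((1 - p[ki]) / (1 - q[ki])) for ki in vocab)

vocab = {"ball", "goal", "vote"}
train_docs = [{"ball", "goal"}, {"ball"}, {"vote"}, {"vote", "goal"}]
train_labels = ["sports", "sports", "politics", "politics"]
p, q = train_binary_nb(train_docs, train_labels, vocab, cp="sports")
print(log_odds({"ball", "goal"}, p, q, vocab))  # positive values favour class "sports"
```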
Delimiting hyperplanes: parallel dashed lines that delimit the region where to look for a solution
Support vectors: documents that belong to, and define, the delimiting hyperplanes
Hyperplane s: y + x - 7 = 0
has margin equal to 3\sqrt{2}
the maximum for this case
s is the decision hyperplane
a generic point Z is represented as \vec{z} = (z_1, z_2, \ldots, z_n)
z_i, 1 \leq i \leq n, are real variables
similar notation is used to refer to specific fixed points such as A, B, H, P, and Q
s: \vec{z} = t\vec{w} + \vec{p}
H_w: (\vec{z} - \vec{h}) \cdot \vec{w} = 0
can be rewritten as
H_w: \vec{z} \cdot \vec{w} + k = 0
line(AP): \vec{z} = t\vec{w} + \vec{a}
\vec{p} = t_p \vec{w} + \vec{a}
where t_p is the value of t for point P
Since P \in H_w,
(t_p \vec{w} + \vec{a}) \cdot \vec{w} + k = 0
Solving for t_p,
t_p = - \frac{\vec{a} \cdot \vec{w} + k}{|\vec{w}|^2}
where |\vec{w}| is the vector norm
\vec{a} - \vec{p} = \frac{\vec{a} \cdot \vec{w} + k}{|\vec{w}|} \times \frac{\vec{w}}{|\vec{w}|}
Since \vec{w}/|\vec{w}| is a unit vector,
AP = |\vec{a} - \vec{p}| = \frac{\vec{a} \cdot \vec{w} + k}{|\vec{w}|}
H_w is determined by a point H (represented by \vec{h}) and by a perpendicular vector \vec{w}
neither \vec{h} nor \vec{w} are known a priori
AP = \frac{\vec{a} \cdot \vec{w} + k}{|\vec{w}|}
BQ = - \frac{\vec{b} \cdot \vec{w} + k}{|\vec{w}|}
m = AP + BQ
is independent of the size of \vec{w}
Vectors \vec{w} of varying sizes maximize m
Impose restrictions on |\vec{w}|:
\vec{a} \cdot \vec{w} + k = +1
\vec{b} \cdot \vec{w} + k = -1
m = \frac{1}{|\vec{w}|} + \frac{1}{|\vec{w}|}
m = \frac{2}{|\vec{w}|}
Then,
Optimization problem:
maximize m = 2/|\vec{w}|
subject to
\vec{w} \cdot (5, 5) + b = +1
\vec{w} \cdot (1, 3) + b = -1
whose solution is the decision hyperplane
y + x - 7 = 0
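A short numerical check of this example (assuming, as the geometry of the figure suggests, that the optimal \vec{w} points along (1, 1)); it recovers \vec{w} = (1/3, 1/3), b = -7/3 and the margin 3\sqrt{2}:

```python
import numpy as np

# Support vectors from the figure: A = (5, 5) with label +1 and B = (1, 3) with label -1.
# Writing w = (w1, w1), the two constraints become:
#   w . (5, 5) + b = +1  ->  10*w1 + b = +1
#   w . (1, 3) + b = -1  ->   4*w1 + b = -1
A = np.array([[10.0, 1.0],
              [ 4.0, 1.0]])
y = np.array([1.0, -1.0])
w1, b = np.linalg.solve(A, y)

w = np.array([w1, w1])
print("w =", w)                              # ~ [0.333, 0.333]
print("b =", b)                              # ~ -2.333 (i.e., -7/3)
print("margin =", 2.0 / np.linalg.norm(w))   # ~ 4.243 = 3 * sqrt(2)
# decision hyperplane: w . z + b = 0  <=>  (x + y - 7)/3 = 0  <=>  y + x - 7 = 0
```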
f(\vec{z}_j) = sign(\vec{w} \cdot \vec{z}_j + b)
minimize \frac{1}{2} |\vec{w}|^2 (equivalent to maximizing m = 2/|\vec{w}|)
subject to
f(\vec{w}, \vec{z}_j) + k \geq +1, if c_j = c_a
f(\vec{w}, \vec{z}_j) + k \leq -1, if c_j = c_b
Conventional SVM case
f(\vec{w}, \vec{z}_j) = \vec{w} \cdot \vec{z}_j, the kernel, is the dot product of the input vectors
Transformed SVM case
the kernel is a modified map function
polynomial kernel: f(\vec{w}, \vec{x}_j) = (\vec{w} \cdot \vec{x}_j + 1)^d
radial basis function: f(\vec{w}, \vec{x}_j) = \exp(\lambda * |\vec{w} - \vec{x}_j|^2), \lambda > 0
sigmoid: f(\vec{w}, \vec{x}_j) = \tanh(\rho(\vec{w} \cdot \vec{x}_j) + c), for \rho > 0 and c < 0
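The kernels written out as Python functions; the parameter values (d, lambda, rho, c) are arbitrary examples, and the radial basis kernel is coded with the conventional negative exponent:

```python
import numpy as np

def polynomial_kernel(w, x, d=2):
    # polynomial kernel: (w . x + 1) ** d
    return (w @ x + 1) ** d

def rbf_kernel(w, x, lam=0.5):
    # radial basis function kernel on |w - x|^2 (conventionally with a negative exponent)
    return np.exp(-lam * np.linalg.norm(w - x) ** 2)

def sigmoid_kernel(w, x, rho=1.0, c=-1.0):
    # sigmoid kernel: tanh(rho * (w . x) + c), with rho > 0 and c < 0
    return np.tanh(rho * (w @ x) + c)

u, v = np.array([1.0, 2.0]), np.array([0.5, 1.0])
print(polynomial_kernel(u, v), rbf_kernel(u, v), sigmoid_kernel(u, v))
```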
That is,
MI(k_i, C) = \sum_{p=1}^{L} P(c_p) \, I(k_i, c_p)
           = \sum_{p=1}^{L} \frac{n_p}{N_t} \log \frac{n_{i,p}/N_t}{(n_i/N_t) \times (n_p/N_t)}
IG(k_i, C) = - \sum_{p=1}^{L} P(c_p) \log P(c_p)
             + \sum_{p=1}^{L} P(k_i, c_p) \log P(c_p | k_i)
             + \sum_{p=1}^{L} P(\overline{k_i}, c_p) \log P(c_p | \overline{k_i})
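A Python sketch that evaluates MI(k_i, C) from document counts (n_{i,p}, n_i, n_p, N_t as used above); the counts in the example are invented:

```python
import math

def mutual_information(n_ip, n_p, n_i, Nt):
    # MI(ki, C) = sum over classes of (np/Nt) * log[(nip/Nt) / ((ni/Nt) * (np/Nt))]
    mi = 0.0
    for nip, np_ in zip(n_ip, n_p):
        if nip > 0:  # skip empty cells to avoid log(0)
            mi += (np_ / Nt) * math.log((nip / Nt) / ((n_i / Nt) * (np_ / Nt)))
    return mi

# term ki occurs in 80 of 100 "sports" docs and in 5 of 100 "politics" docs,
# so ni = 85 and Nt = 200
print(mutual_information(n_ip=[80, 5], n_p=[100, 100], n_i=85, Nt=200))
```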
Further let
T : Dt × C → [0, 1]: training set function
nt : number of docs from training set Dt in class cp
F : D × C → [0, 1]: text classifier function
nf : number of docs from training set assigned to class cp by the
classifier
nf,t : number of docs that both the training and classifier functions
assigned to class cp
n_t − n_{f,t}: number of training docs in class c_p that were
misclassified
The remaining quantities are calculated analogously
P(c_p) = \frac{n_{f,t}}{n_f} \qquad R(c_p) = \frac{n_{f,t}}{n_t}
F_1(c_p) = \frac{2 P(c_p) R(c_p)}{P(c_p) + R(c_p)}
                 T(d_j, c_p) = 1   T(d_j, c_p) = 0   all docs
F(d_j, c_p) = 1               10                 0         10
F(d_j, c_p) = 0               10               980        990
all docs                      20               980      1,000
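For instance, applying these definitions to the counts in the table: P(c_p) = n_{f,t}/n_f = 10/10 = 1.0, R(c_p) = n_{f,t}/n_t = 10/20 = 0.5, so F_1(c_p) = (2 × 1.0 × 0.5)/(1.0 + 0.5) ≈ 0.67.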
micF_1 = \frac{2PR}{P + R}
P = \frac{\sum_{c_p \in C} n_{f,t}}{\sum_{c_p \in C} n_f}
R = \frac{\sum_{c_p \in C} n_{f,t}}{\sum_{c_p \in C} n_t}
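A Python sketch of the micro-averaged measures from per-class counts (n_{f,t}, n_f, n_t); the two-class counts below are invented, with the first class taken from the table above:

```python
def micro_f1(per_class_counts):
    # per_class_counts: one (nft, nf, nt) triple per class cp
    sum_nft = sum(nft for nft, nf, nt in per_class_counts)
    sum_nf = sum(nf for nft, nf, nt in per_class_counts)
    sum_nt = sum(nt for nft, nf, nt in per_class_counts)
    P = sum_nft / sum_nf   # micro-averaged precision
    R = sum_nft / sum_nt   # micro-averaged recall
    return 2 * P * R / (P + R)

# class c1 uses the counts from the table above (nft=10, nf=10, nt=20);
# class c2 counts are invented (nft=30, nf=40, nt=35)
print(micro_f1([(10, 10, 20), (30, 40, 35)]))
```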
classifier Ψ_i
training, or tuning, is done on D_t minus the i-th fold
testing is done on the i-th fold
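A Python sketch of this k-fold protocol; the fold construction and the 5-fold choice in the example are assumptions for illustration:

```python
def k_fold_indices(n_docs, k=5):
    # split the training set Dt into k folds of (nearly) equal size
    folds = [list(range(i, n_docs, k)) for i in range(k)]
    for i in range(k):
        test_idx = folds[i]                                             # the i-th fold
        train_idx = [j for f in folds if f is not folds[i] for j in f]  # Dt minus the i-th fold
        yield train_idx, test_idx

# classifier Psi_i is trained (or tuned) on train_idx and tested on test_idx
for i, (train_idx, test_idx) in enumerate(k_fold_indices(n_docs=25, k=5)):
    print(f"fold {i}: {len(train_idx)} training docs, {len(test_idx)} test docs")
```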