Intro Text Mining
in collaboration with: David D. Lewis
Text Mining
• Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
• First (?) was Thomas C. Mendenhall in 1887
Mendenhall
• Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later, President of Worcester Polytechnic Institute
[Photo: Mendenhall Glacier, Juneau, Alaska]
$\chi^2 = 127.2$, df = 12
• Hamilton versus Madison
• Used Naïve Bayes with Poisson and Negative Binomial models
• Out-of-sample predictive performance
Today
• StaBsBcal
methods
rouBnely
used
for
textual
analyses
of
all
kinds
• Machine
translaBon,
part-‐of-‐speech
tagging,
informaBon
extracBon,
quesBon-‐answering,
text
categorizaBon,
disputed
authorship
(stylometry),
etc.
• Not
reported
in
the
staBsBcal
literature
(no
staBsBcians?)
To: [email protected]
Dear Sir or Madam, My drier made smoke
and a big whoooshie noise when I started
it! Was the problem drying my new
Australik raincoat? It is made of oilcloth.
I guess it was my fault.
• With p > 1, the decision boundary is linear, e.g. $0.5 = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ (see the sketch below)
[Figure: fitted probability y versus x, x from -3 to 3 (data: zeroOneR.txt), and the linear decision boundary in the (x1, x2) plane]
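To make the boundary concrete, here is a minimal numpy sketch that solves the boundary equation for $x_2$; the coefficient values are invented for illustration.

```python
import numpy as np

# With p = 2 predictors, the points where the fitted value crosses 0.5
# satisfy 0.5 = b0 + b1*x1 + b2*x2. Coefficients below are hypothetical.
b0, b1, b2 = -0.3, 0.8, 0.6

x1 = np.linspace(-3, 3, 7)
x2 = (0.5 - b0 - b1 * x1) / b2   # solve the boundary equation for x2

for a, b in zip(x1, x2):
    print(f"x1 = {a:5.2f}  ->  boundary at x2 = {b:5.2f}")
```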
Naïve Bayes via a Toy Spam Filter Example
• Naïve Bayes is a generative model that makes drastic simplifying assumptions
• Consider a small training data set for spam along with a bag of words representation (see the sketch below)
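As a concrete illustration, here is a minimal Python sketch of a bag-of-words representation over a made-up toy spam set (the documents below are my own, not the lecture's):

```python
# Hypothetical toy training set: (document text, label), 1 = spam, 0 = ham.
docs = [
    ("free money now", 1),
    ("meeting agenda attached", 0),
    ("free meeting invite", 0),
    ("win money free", 1),
]

# Vocabulary: the set of all words seen in training, in a fixed order.
vocab = sorted({w for text, _ in docs for w in text.split()})

def bag_of_words(text):
    """Map a document to a vector of word counts over the vocabulary."""
    words = text.split()
    return [words.count(v) for v in vocab]

for text, label in docs:
    print(label, bag_of_words(text), text)
```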
Naïve Bayes Machinery
• We need a way to estimate $p(y)$ and $p(x_{ij} \mid y)$, leading to:
$p(y \mid x_i) \propto p(y) \prod_j p(x_{ij} \mid y)$
Maximum Likelihood Estimation
weights of evidence
Naïve Bayes Prediction
$\log \dfrac{p(y=1 \mid x_i)}{p(y=-1 \mid x_i)} = \sum_j \beta_j x_{ij} = \beta^\top x_i$
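Continuing the toy sketch above, here is one way the weights of evidence could be estimated from counts and summed at prediction time; the add-one smoothing is an assumption for illustration, not necessarily what the lecture used:

```python
import math

# Toy spam data as above; documents and add-one smoothing are assumptions.
docs = [
    ("free money now", 1),
    ("meeting agenda attached", 0),
    ("free meeting invite", 0),
    ("win money free", 1),
]

def train_nb(docs):
    """Count word occurrences per class and class frequencies."""
    counts = {1: {}, 0: {}}
    totals = {1: 0, 0: 0}
    n_docs = {1: 0, 0: 0}
    vocab = set()
    for text, y in docs:
        n_docs[y] += 1
        for w in text.split():
            vocab.add(w)
            counts[y][w] = counts[y].get(w, 0) + 1
            totals[y] += 1
    return counts, totals, n_docs, vocab

def log_odds(text, counts, totals, n_docs, vocab):
    """Sum of per-word weights of evidence plus the prior log-odds."""
    V = len(vocab)
    score = math.log(n_docs[1] / n_docs[0])
    for w in text.split():
        p1 = (counts[1].get(w, 0) + 1) / (totals[1] + V)  # add-one smoothing
        p0 = (counts[0].get(w, 0) + 1) / (totals[0] + V)
        score += math.log(p1 / p0)
    return score

model = train_nb(docs)
print(log_odds("free money", *model))  # > 0: classified as spam
```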
Maximum Likelihood Training
• Tends to overfit
• Not defined if d > n
• Feature selection
Shrinkage/Regularization/Bayes
• Avoid combinatorial challenge of feature selection
• L1 shrinkage: regularization + feature selection
• Expanding theoretical understanding
• Large scale
• Empirical performance
Ridge Logistic Regression
Maximize the log-likelihood subject to $\sum_{j=1}^{p} \beta_j^2 \le s$ (ridge); the lasso variant instead constrains $\sum_{j=1}^{p} |\beta_j| \le s$. The bound $s$ controls the amount of shrinkage; in the penalized form the penalty weight grows roughly as $1/s$. A fitting sketch follows below.
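A minimal scikit-learn sketch of ridge (L2) and lasso (L1) logistic regression; the library choice and toy data are assumptions, not from the slides:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: 20 examples, 5 features (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=20) > 0).astype(int)

# Ridge (L2) shrinkage: smaller C = stronger penalty (roughly, smaller s).
ridge = LogisticRegression(penalty="l2", C=0.5).fit(X, y)

# Lasso (L1) shrinkage: also performs feature selection (exact zeros).
lasso = LogisticRegression(penalty="l1", C=0.5, solver="liblinear").fit(X, y)

print("ridge coefs:", ridge.coef_.round(2))
print("lasso coefs:", lasso.coef_.round(2))  # some may be exactly 0
```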
Bayesian Perspective
Polytomous Logistic Regression (PLR)
$P(y_i = k \mid x_i) = \dfrac{\exp(\beta_k^\top x_i)}{\sum_{k'} \exp(\beta_{k'}^\top x_i)}$
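A small numpy sketch of the PLR class-probability computation; the coefficient matrix is made up for illustration:

```python
import numpy as np

def plr_probs(B, x):
    """P(y = k | x) = exp(B[k] @ x) / sum_k' exp(B[k'] @ x)  (softmax)."""
    scores = B @ x
    scores -= scores.max()           # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

B = np.array([[0.5, -1.0], [0.0, 0.0], [-0.5, 1.0]])  # 3 classes, 2 features
x = np.array([1.0, 2.0])
print(plr_probs(B, x))  # probabilities over the 3 classes, summing to 1
```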
Test Data

                actual outcome
                  1      0
predicted   1     7      3
outcome     0     0     10
More generally…

                actual outcome
                  1      0
predicted   1     a      b
outcome     0     c      d

misclassification rate: (b + c) / (a + b + c + d)
sensitivity (aka recall): a / (a + c)
specificity: d / (b + d)
predictive value positive (aka precision): a / (a + b)
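A small sketch computing these rates from the four cell counts (the function name is my own):

```python
def rates(a, b, c, d):
    """Confusion-matrix rates; a, b, c, d as laid out in the table above."""
    return {
        "misclassification": (b + c) / (a + b + c + d),
        "sensitivity (recall)": a / (a + c),
        "specificity": d / (b + d),
        "predictive value positive (precision)": a / (a + b),
    }

print(rates(a=7, b=3, c=0, d=10))  # the cutoff-0.5 example below
```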
Suppose we use a cutoff of 0.5…

                actual outcome
                  1      0
predicted   1     7      3
outcome     0     0     10

sensitivity: 7 / (7 + 0) = 100%
specificity: 10 / (10 + 3) = 77%
Suppose we use a cutoff of 0.8…

                actual outcome
                  1      0
predicted   1     5      2
outcome     0     2     11

sensitivity: 5 / (5 + 2) = 71%
specificity: 11 / (11 + 2) = 85%
• Note there are 20 possible thresholds (one per test example)
• Note if threshold = minimum score, every example is predicted 1: sensitivity = 100%, specificity = 0% (see the sketch below)
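A hedged sketch of sweeping thresholds to trace out such operating points; the scores and labels below are invented, not the actual 20 test cases:

```python
import numpy as np

# Hypothetical predicted probabilities and true labels (not the real test set).
scores = np.array([0.9, 0.85, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,    0,   1,   0,   1,    0,   0,   1,   0])

# One candidate threshold per distinct score; classify 1 if score >= t.
for t in sorted(set(scores)):
    pred = (scores >= t).astype(int)
    tp = ((pred == 1) & (labels == 1)).sum()
    fn = ((pred == 0) & (labels == 1)).sum()
    tn = ((pred == 0) & (labels == 0)).sum()
    fp = ((pred == 1) & (labels == 0)).sum()
    print(f"t={t:.2f}  sensitivity={tp/(tp+fn):.2f}  specificity={tn/(tn+fp):.2f}")
```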
Berlin Chen
POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted rules
• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM
[Figure: HMM tagger — hidden tags t1, t2, t3 (tag1 → tag2 → tag3) emit words w1, w2, w3]
(For unknown words: the number of times a word that had never been seen with tag i gets tag i.)
• Brown test set accuracy = 95.97%
Morphological Features
• Knowledge that "quickly" ends in "ly" should help identify the word as an adverb
• "randomizing" -> "ing"
• Split each word into a root ("quick") and a suffix ("ly") — see the sketch below
[Figure: HMM with tags t1, t2, t3; each word replaced by a root/suffix pair (r1, s1), (r2, s2)]
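A minimal sketch of such a root/suffix split using a fixed suffix list; the suffix inventory and the longest-match rule are assumptions for illustration:

```python
# Hypothetical suffix inventory; a real morphological analyzer is richer.
SUFFIXES = ["ing", "ly", "ed", "s"]

def split_word(word):
    """Return (root, suffix), taking the longest matching suffix, if any."""
    for suf in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)], suf
    return word, ""

print(split_word("quickly"))      # ('quick', 'ly')
print(split_word("randomizing"))  # ('randomiz', 'ing')
print(split_word("dog"))          # ('dog', '')
```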
Morphological Features
• Typical morphological analyzers produce multiple possible splits
• "Gastroenteritis" ???
[Figure: HMM lattice over observations o1 … o(t-1), o(t), o(t+1), … oT, with state-transition and emission probabilities]

$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\ o_1 \ldots o_{t-1},\ x_t = j,\ o_t)$

Recursive computation:
$\delta_j(t+1) = \max_i \, \delta_i(t)\, a_{ij}\, b_{j\,o_{t+1}}$

where $a_{ij}$ is the state-transition probability and $b_{j\,o}$ is the emission probability.
Viterbi Small Example

[Figure: two-state HMM, hidden x1 → x2, emissions o1, o2]

Pr(x1=T) = 0.2
Pr(x2=T|x1=T) = 0.7
Pr(x2=T|x1=F) = 0.1
Pr(o=T|x=T) = 0.4
Pr(o=T|x=F) = 0.9

Observed: o1=T; o2=F

Brute Force
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648
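A compact Python sketch that reproduces the brute-force table and runs the Viterbi recursion on this two-step model (variable names are mine):

```python
from itertools import product

# Model from the slide: states T/F, observations T/F.
prior = {"T": 0.2, "F": 0.8}                       # Pr(x1)
trans = {"T": {"T": 0.7, "F": 0.3},                # Pr(x2 | x1)
         "F": {"T": 0.1, "F": 0.9}}
emit  = {"T": {"T": 0.4, "F": 0.6},                # Pr(o | x)
         "F": {"T": 0.9, "F": 0.1}}
obs = ["T", "F"]                                   # o1 = T, o2 = F

# Brute force: joint probability of every hidden-state path.
for x1, x2 in product("TF", repeat=2):
    p = prior[x1] * emit[x1][obs[0]] * trans[x1][x2] * emit[x2][obs[1]]
    print(f"Pr(x1={x1}, x2={x2}, o1=T, o2=F) = {p:.4f}")

# Viterbi: delta_j(t+1) = max_i delta_i(t) * a_ij * b_j(o_{t+1}).
delta = {j: prior[j] * emit[j][obs[0]] for j in "TF"}
for o in obs[1:]:
    delta = {j: max(delta[i] * trans[i][j] for i in "TF") * emit[j][o]
             for j in "TF"}
print("best path probability:", max(delta.values()))  # 0.0648 (x1=F, x2=F)
```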
• Idea: using the "next" tag as well as the "previous" tag should improve tagging performance
Named-Entity Classification
• "Mrs. Frank" is a person
• "Steptoe and Johnson" is a company
• "Honduras" is a location
• etc.