
Text Mining: An Overview

David Madigan
[email protected]
http://www.stat.columbia.edu/~madigan

in collaboration with:
David D. Lewis
Text Mining
• Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
• First (?) is Thomas C. Mendenhall in 1887

Mendenhall
• Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later, President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska

X² = 127.2, df = 12
• Hamilton versus Madison
• Used Naïve Bayes with Poisson and Negative Binomial models
• Out-of-sample predictive performance
Today
• Statistical methods routinely used for textual analyses of all kinds
• Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, disputed authorship (stylometry), etc.
• Not reported in the statistical literature (no statisticians?)

Mosteller, Wallace, Efron, Thisted


Text Mining's Connections with Language Processing
• Linguistics
• Computational linguistics
• Information retrieval
• Content analysis
• Stylistics
• Others
Why Are We Mining the Text?
Are we trying to understand:
1. The texts themselves?
2. The writer (or speakers) of the texts?
   a. The writer as a writer?
   b. The writer as an entity in the world?
3. Things in the world?
   a. Directly linked to texts?
   b. Described by texts?
Example: a customer message, annotated with what different text mining tasks might pull from it:

To: [email protected]
Dear Sir or Madam, My drier made smoke and a big whoooshie noise when I started it! Was the problem drying my new Australik raincoat? It is made of oilcloth. I guess it was my fault.

• Stylistic clues to author identity and demographics.
• Text known to be linked to a particular product.
• Important terms for searching a database of such messages.
• Another entity that information could be extracted on.
• The customer probably won't make a fuss.
Granularity of Text Linked to Entity?
• Morphemes, words, simple noun phrases
• Clauses, sentences
• Paragraphs, sections, documents
• Corpora, networks of documents
Moving down this list, size and complexity increase (and so does the variance in these); moving up, linguistic agreement on structure and representations increases.
Ambiguity
• Correct interpretation of an utterance is rarely explicit!
  – People use massive knowledge of language and the world to understand NL
  – Information readily inferred by the speaker/reader will be left out
• The core task in NLP is resolving the resulting ambiguity

"I made her duck"
[after Jurafsky & Martin]

• I cooked waterfowl for her.
• I cooked waterfowl belonging to her.
• I created the (plaster?) duck she owns.
• I caused her to quickly lower her head.
• I magically converted her into roast fowl.

• These vary in morphology, syntax, semantics, pragmatics.

....Plus, Phonetic and Visual Ambiguity
• Aye, made her duck!
• I made her....duck!
• Aye, maid. Her duck.
Synonymy
• Can use different words to communicate the same meaning
  – (Of course also true for sentences, ...)
• Synonymy in all contexts is rare:
  – a big plane = a large plane
  – a big sister ≠ a large sister
• And choice among synonyms may be a clue to topic, writer's background, etc.
  – rich vs. high SES
Metaphor & Metonymy
• Metaphor: use of a construct with one meaning to convey a very different one:
  – Television is eating away at the moral fiber of our country.
• Metonymy: mentioning one concept to convey a closely related one:
  – "On the way downtown I stopped at a bar and had a couple of double Scotches. They didn't do me any good. All they did was make me think of Silver Wig, and I never saw her again."
    (Raymond Chandler, The Big Sleep)
Attribute Vectors
• Most text mining is based on
  – Breaking language into symbols
  – Treating each symbol as an attribute
• But what value should each attribute have for a unit of text?

Term Weighting
• How strongly does a particular word indicate the content of a document?
• Some clues:
  – Number of times the word occurs in this document
  – Number of times the word occurs in other documents
  – Length of the document
The raw weight of term t_j in document d_i combines TF (term frequency) and IDF (inverse document frequency):

$$w_{ij}^{\mathrm{raw}} = \begin{cases} (1 + \ln f_{ij})\,\ln\dfrac{N}{n_j}, & \text{if } t_j \text{ present in } d_i \\ 0, & \text{otherwise} \end{cases}$$

where f_ij is the count of term t_j in document d_i, N is the number of documents, and n_j is the number of documents containing t_j. Cosine normalization then sets the L2-norm of each document vector to 1.0:

$$w_{ij} = \frac{w_{ij}^{\mathrm{raw}}}{\sqrt{\sum_{j'=1}^{d} w_{ij'}^{\mathrm{raw}} \times w_{ij'}^{\mathrm{raw}}}}$$

• "Cosine-normalized TFIDF weighting"
  – Many minor variants on this theme
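The following is a minimal Python sketch of this cosine-normalized TF-IDF weighting (not from the slides): the whitespace tokenizer, the toy corpus, and the function name are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Cosine-normalized TF-IDF following the formula above:
    w_raw = (1 + ln f_ij) * ln(N / n_j) for terms present in the document."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # n_j: number of documents containing term j
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard against all-zero weights
        vectors.append({t: w / norm for t, w in raw.items()})       # L2-norm set to 1.0
    return vectors

docs = ["the quick brown fox", "the lazy dog", "the quick dog barks"]  # invented toy corpus
for vec in tfidf_vectors(docs):
    print(vec)
```

Note that a term appearing in every document gets IDF = ln(N/n_j) = 0 and therefore weight zero, which is the intended behavior of the IDF factor.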
Case Study: Representation for Authorship Attribution
• Statistical methods for authorship attribution
• Represent documents with attribute vectors
• Then use regression-type methods
• Bag of words?
• Stylistic features? (e.g., passive voice)
• Topic free?
1-of-K Sample Results: brittany-l

Feature Set                          % errors   Number of Features
"Argamon" function words, raw tf     74.8       380
POS                                  75.1       44
1suff                                64.2       121
1suff*POS                            50.9       554
2suff                                40.6       1849
2suff*POS                            34.9       3655
3suff                                28.7       8676
3suff*POS                            27.9       12976
3suff+POS+3suff*POS+Argamon          27.6       22057
All words                            23.9       52492

(The largest feature set corresponds to roughly 4.6 million parameters.)
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
The Federalist
• Mosteller and Wallace attributed all 12 disputed papers to Madison
• Historical evidence is more muddled
• Our results suggest attribution is highly dependent on the document representation
Table 1. Authorship of the Federalist Papers
Paper Number   Author
1              Hamilton
2-5            Jay
6-9            Hamilton
10             Madison
11-13          Hamilton
14             Madison
15-17          Hamilton
18-20          Joint: Hamilton and Madison
21-36          Hamilton
37-48          Madison
49-58          Disputed
59-61          Hamilton
62-63          Disputed
64             Jay
65-85          Hamilton

Table 3. The feature sets
Features                                                        Name in Short
The length of each character                                    Charcount
Parts of speech                                                 POS
Two-letter suffix                                               Suffix2
Three-letter suffix                                             Suffix3
Words, numbers, signs, punctuation                              Words
The length of each character plus the parts of speech           Charcount+POS
Two-letter suffix plus the parts of speech                      Suffix2+POS
Three-letter suffix plus the parts of speech                    Suffix3+POS
Words, numbers, signs, punctuation plus the parts of speech     Words+POS
The 484 function words in Koppel's paper                        484 features
The feature set in the Mosteller and Wallace paper              Wallace features
Words appearing at least twice                                  Words(>=2)
Each word shown in the Federalist papers                        Each word

Feature Set        10-fold Error Rate
Charcount          0.21
POS                0.19
Suffix2            0.12
Suffix3            0.09
Words              0.10
Charcount+POS      0.12
Suffix2+POS        0.08
Suffix3+POS        0.04
Words+POS          0.08
484 features       0.05
Wallace features   0.05
Words (>=2)        0.05
Each Word          0.05

…four papers to Hamilton
Supervised Learning for Text Classification

Predictive Modeling
Goal: learn a mapping: y = f(x; θ)
Need:
1. A model structure
2. A score function
3. An optimization strategy

Categorical y ∈ {c1, …, cm}: classification
Real-valued y: regression
Note: usually assume {c1, …, cm} are mutually exclusive and exhaustive
Classifier Types

Discriminative: model p(ck | x)
  – e.g. linear regression, logistic regression, CART

Generative: model p(x | ck, θk)
  – e.g. "Bayesian classifiers", LDA
Regression for Binary Classification
• Can fit a linear regression model to a 0/1 response
• Predicted values are not necessarily between zero and one
• With p > 1, the decision boundary is linear, e.g. 0.5 = b0 + b1 x1 + b2 x2

(Figures: a fitted linear regression of a 0/1 response y on a single predictor, data file zeroOneR.txt, and the resulting linear decision boundary in the (x1, x2) plane.)
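As a small illustration of these points (not from the slides; the data below are simulated), this sketch fits ordinary least squares to a 0/1 response and shows that the fitted values can fall outside [0, 1] while the implied decision boundary is still linear.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = (x + rng.normal(scale=0.8, size=x.size) > 0).astype(float)  # simulated 0/1 response

# Ordinary least squares on the 0/1 response: y ≈ b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

print("fitted coefficients:", b)
print("min/max fitted values:", yhat.min(), yhat.max())  # typically below 0 and above 1
# The implied decision boundary is where b0 + b1 * x = 0.5
print("decision boundary at x =", (0.5 - b[0]) / b[1])
```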
Naïve Bayes via a Toy Spam Filter Example
• Naïve Bayes is a generative model that makes drastic simplifying assumptions
• Consider a small training data set for spam along with a bag-of-words representation

Naïve Bayes Machinery
• We need a way to estimate:

• Via Bayes' theorem we have:

or, on the log-odds scale:

Naïve Bayes Machinery
• Naïve Bayes assumes:

and

leading to:
Maximum Likelihood Estimation
(weights of evidence)
Naïve Bayes Prediction
• Usually add a small constant (e.g. 0.5) to avoid divide-by-zero problems and to reduce bias

• New message: "the quick rabbit rests"
• Predicted log odds:
  0.51 + 0.51 + 0.51 + 0.51 + 1.10 + 0 = 3.04
• Corresponds to a spam probability of 0.95
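The slides' training data and weight tables are not reproduced above, so the Python sketch below only illustrates the same machinery under stated assumptions: a made-up toy training set, smoothed relative frequencies with the 0.5 constant, per-word weights of evidence, and the prior log odds added in. The exact form of the slides' weights may differ; this is a simple multinomial variant.

```python
import math
from collections import Counter

# Invented toy training data (the slides' actual example data are not shown here).
spam = ["the quick rabbit runs", "quick quick buy now"]
ham = ["the slow turtle rests", "the rabbit rests today"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.lower().split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
n_spam_words, n_ham_words = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def weight_of_evidence(word):
    # Smoothed relative frequencies; 0.5 added to avoid divide-by-zero and reduce bias.
    p_spam = (spam_counts[word] + 0.5) / (n_spam_words + 0.5 * len(vocab))
    p_ham = (ham_counts[word] + 0.5) / (n_ham_words + 0.5 * len(vocab))
    return math.log(p_spam / p_ham)

def spam_log_odds(message):
    prior = math.log(len(spam) / len(ham))  # prior log odds of spam
    return prior + sum(weight_of_evidence(w) for w in message.lower().split() if w in vocab)

log_odds = spam_log_odds("the quick rabbit rests")
print("log odds:", round(log_odds, 2))
print("spam probability:", round(1 / (1 + math.exp(-log_odds)), 2))
```

With the slides' own data the summed weights are 3.04, giving a spam probability of 1/(1 + e^-3.04) ≈ 0.95; the toy data above will give different numbers.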


A Close Look at Logistic Regression for Text Classification

Logistic Regression
• Linear model for log odds of category membership:

$$\log \frac{p(y=1 \mid x_i)}{p(y=-1 \mid x_i)} = \sum_j b_j x_{ij} = b^{\top} x_i$$
Maximum Likelihood Training
• Choose parameters (bj's) that maximize the probability (likelihood) of the class labels (yi's) given the documents (xi's)
• Tends to overfit
• Not defined if d > n
• Feature selection
Shrinkage/Regularization/Bayes
• Avoid combinatorial challenge of feature selection
• L1 shrinkage: regularization + feature selection
• Expanding theoretical understanding
• Large scale
• Empirical performance
Ridge Logistic Regression

Maximum likelihood plus a constraint:

$$\sum_{j=1}^{p} \beta_j^2 \le s$$
Lasso Logistic Regression

Maximum likelihood plus a constraint:

$$\sum_{j=1}^{p} |\beta_j| \le s$$
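As a stand-in illustration of the dense-versus-sparse contrast (this is not the authors' BMR software), scikit-learn's logistic regression with L2 and L1 penalties can be used; its penalty strength C plays the role of the constraint bound s in Lagrangian form. The tiny corpus and labels below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy corpus and labels, purely for illustration.
docs = ["wheat corn grain exports rise", "oil prices fall sharply",
        "corn and rice harvest grows", "crude oil output cut again"]
labels = [1, 0, 1, 0]  # 1 = grain-related, 0 = not

X = TfidfVectorizer().fit_transform(docs)

ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, labels)

# Lasso (L1) drives many coefficients exactly to zero; ridge (L2) keeps them all nonzero.
print("nonzero ridge coefficients:", (ridge.coef_ != 0).sum())
print("nonzero lasso coefficients:", (lasso.coef_ != 0).sum())
```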
Bayesian Perspective
Polytomous Logistic Regression (PLR)

$$P(y_i = k \mid x_i) = \frac{\exp(\beta_k^{\top} x_i)}{\sum_{k'} \exp(\beta_{k'}^{\top} x_i)}$$

• Elegant approach to multiclass problems
• Also known as polychotomous LR, multinomial LR, and, ambiguously, multiple LR and multivariate LR
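A minimal numeric sketch of this softmax form (the per-class scores below are invented; subtracting the maximum score is a standard numerical-stability trick):

```python
import numpy as np

def plr_probabilities(scores):
    """Softmax over per-class linear scores beta_k . x_i (invented values below)."""
    scores = np.asarray(scores, dtype=float)
    scores -= scores.max()          # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()

print(plr_probabilities([2.0, 0.5, -1.0]))  # probabilities over three classes; they sum to 1
```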
Why LR is Interesting
• Parameters have a meaning
  – How the log odds increase with feature values
• Lets you
  – Look at the model and see if it is sensible
  – Use domain knowledge to guide parameter fitting (more later)
  – Build some parts of the model by hand
• Caveat: realistically, a lot can (and does) complicate this interpretation
Measuring the Performance of a Binary Classifier

Suppose we use a cutoff of 0.5 (test data):

                       actual outcome
                       1       0
predicted outcome  1   7       3
                   0   0       10
More generally…

                       actual outcome
                       1       0
predicted outcome  1   a       b
                   0   c       d

misclassification rate: (b + c) / (a + b + c + d)
sensitivity: a / (a + c)   (aka recall)
specificity: d / (b + d)
predictive value positive: a / (a + b)   (aka precision)
Suppose we use a cutoff of 0.5…

                       actual outcome
                       1       0
predicted outcome  1   7       3
                   0   0       10

sensitivity: 7 / (7 + 0) = 100%
specificity: 10 / (10 + 3) = 77%
Suppose we use a cutoff of 0.8…

                       actual outcome
                       1       0
predicted outcome  1   5       2
                   0   2       11

sensitivity: 5 / (5 + 2) = 71%
specificity: 11 / (11 + 2) = 85%
• Note there are 20 possible thresholds
• ROC analysis computes sensitivity and specificity for all possible thresholds and plots them
• If the threshold is the minimum score, then c = d = 0, so sensitivity = 1 and specificity = 0
• If the threshold is the maximum score, then a = b = 0, so sensitivity = 0 and specificity = 1
• "Area under the curve" (AUC) is a common measure of predictive performance
• So is squared error, $\sum_i (y_i - \hat{y}_i)^2$, also known as the "Brier score"
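A small Python sketch computing these quantities from the confusion-matrix cells; the two calls reuse the counts from the cutoff-0.5 and cutoff-0.8 examples above, and the helper name is made up.

```python
def binary_metrics(a, b, c, d):
    """Metrics from confusion-matrix cells in the slides' notation:
    rows = predicted 1 then predicted 0; columns = actual 1 then actual 0."""
    return {
        "misclassification rate": (b + c) / (a + b + c + d),
        "sensitivity (recall)": a / (a + c),
        "specificity": d / (b + d),
        "predictive value positive (precision)": a / (a + b),
    }

# Cutoff 0.5 example: a=7, b=3, c=0, d=10 -> sensitivity 1.00, specificity ~0.77
print(binary_metrics(7, 3, 0, 10))
# Cutoff 0.8 example: a=5, b=2, c=2, d=11 -> sensitivity ~0.71, specificity ~0.85
print(binary_metrics(5, 2, 2, 11))
```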
Text Classification Example
• ModApte subset of Reuters-21578
– 90 categories; 9603 training docs; 18978 features
• Reuters RCV1-v2
– 103 cats; 23149 training docs; 47152 features
• OHSUMED heart disease categories
– 77 cats; 83944 training docs; 122076 features
• Cosine normalized TFxIDF weights
Dense vs. Sparse Models (Macroaveraged F1)

            ModApte   RCV1-v2   OHSUMED
Lasso       52.03     56.54     51.30
Ridge       39.71     51.40     42.99
Ridge/500   38.82     46.27     36.93
Ridge/50    45.80     41.61     42.59
Ridge/5     46.20     28.54     41.33
SVM         53.75     57.23     50.58
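For reference, macroaveraged F1 computes F1 for each category separately and averages with equal weight, so rare categories count as much as common ones. A minimal sketch with invented per-category counts:

```python
def f1(tp, fp, fn):
    """Per-category F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented counts for three categories; macroaveraging weights each category equally.
per_category = [f1(50, 10, 5), f1(8, 2, 6), f1(1, 0, 3)]
macro_f1 = sum(per_category) / len(per_category)
print(per_category, macro_f1)
```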
An Example Model (category "grain")

Word          Beta      Word          Beta
corn          29.78     formal        -1.15
wheat         20.56     holder        -1.43
rice          11.33     hungarian     -6.15
sindt         10.56     rubber        -7.12
madagascar     6.83     special       -7.25
import         6.79     …             …
grain          6.77     beet          -13.24
contract       3.08     rockwood      -13.61
Text Sequence Modeling

Introduction
• Textual data comprise sequences of words: "The quick brown fox…"
• Many tasks can put this sequence information to good use:
  – Part-of-speech tagging
  – Named entity extraction
  – Text chunking
  – Author identification

Part-of-Speech Tagging
• Assign grammatical tags to words
• Basic task in the analysis of natural language data
• Phrase identification, entity extraction, etc.
• Ambiguity: "tag" could be a noun or a verb
• "a tag is a part-of-speech label" – context resolves the ambiguity
The Penn Treebank POS Tag Set

POS Tagging Process
(figure: Berlin Chen)

POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted rules
• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM:

  tag1     tag2     tag3
  word1    word2    word3

The Brown Corpus
• Comprises about 1 million English words
• HMMs were first used for tagging on the Brown Corpus
• 1967. Somewhat dated now.
• The British National Corpus has 100 million words
Simple Charniak Model

  w1    w2    w3
  t1    t2    t3

• What about words that have never been seen before?
• Clever tricks for smoothing the number of parameters (aka priors)

Some details…
  – number of times word j appears with tag i
  – number of times word j appears
  – number of times a word that had never been seen with tag i gets tag i
  – number of such occurrences in total

Test data accuracy on the Brown Corpus = 91.51%
HMM

  t1    t2    t3
  w1    w2    w3

• Brown test set accuracy = 95.97%
Morphological Features
• Knowledge that "quickly" ends in "ly" should help identify the word as an adverb
• "randomizing" -> "ing"
• Split each word into a root ("quick") and a suffix ("ly")

  t1         t2         t3
  r1  s1     r2  s2

Morphological Features
• Typical morphological analyzers produce multiple possible splits
• "Gastroenteritis" ???

• Achieves 96.45% on the Brown Corpus
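A toy sketch of the root/suffix splitting idea (the suffix list and function below are invented; a real morphological analyzer uses a lexicon and, as noted above, typically returns multiple candidate splits):

```python
# Invented suffix list for illustration only.
SUFFIXES = ["ing", "ly", "ed", "itis", "s"]

def candidate_splits(word):
    """Return (root, suffix) candidates, longest suffixes first, plus the no-split option."""
    word = word.lower()
    splits = [(word[:-len(s)], s) for s in sorted(SUFFIXES, key=len, reverse=True)
              if word.endswith(s) and len(word) > len(s)]
    return splits + [(word, "")]

for w in ["quickly", "randomizing", "gastroenteritis"]:
    print(w, "->", candidate_splits(w))   # "gastroenteritis" yields more than one split
```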


Inference in an HMM
• Compute the probability of a given observation sequence
• Given an observation sequence, compute the most likely hidden state sequence
• Given an observation sequence and a set of possible models, which model most closely fits the data?

(David Blei)
Viterbi Algorithm

(Trellis: hidden states x1 … xt-1, j; observations o1 … ot-1, ot, ot+1, …, oT)

$$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\; o_1 \ldots o_{t-1},\; x_t = j,\; o_t)$$

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm

$$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\; o_1 \ldots o_{t-1},\; x_t = j,\; o_t)$$

Recursive computation:

$$\delta_j(t+1) = \max_i \; \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

where a_ij is the state transition probability and b_{j o_{t+1}} is the "emission" probability.
Viterbi Small Example

Two hidden states x1, x2 with observations o1, o2:
Pr(x1=T) = 0.2
Pr(x2=T | x1=T) = 0.7
Pr(x2=T | x1=F) = 0.1
Pr(o=T | x=T) = 0.4
Pr(o=T | x=F) = 0.9
Observed: o1=T; o2=F

Brute force:
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648

Pr(X1, X2 | o1=T, o2=F) ∝ Pr(X1, X2, o1=T, o2=F)
Viterbi Small Example

$$\hat{X}_2 = \operatorname*{argmax}_j \delta_j(2), \qquad \delta_j(2) = \max_i \; \delta_i(1)\, a_{ij}\, b_{j o_2}$$

δT(1) = Pr(x1=T) Pr(o1=T | x1=T) = 0.2 × 0.4 = 0.08
δF(1) = Pr(x1=F) Pr(o1=T | x1=F) = 0.8 × 0.9 = 0.72
δT(2) = max(δF(1) × Pr(x2=T | x1=F) Pr(o2=F | x2=T), δT(1) × Pr(x2=T | x1=T) Pr(o2=F | x2=T))
      = max(0.72 × 0.1 × 0.6, 0.08 × 0.7 × 0.6) = 0.0432
δF(2) = max(δF(1) × Pr(x2=F | x1=F) Pr(o2=F | x2=F), δT(1) × Pr(x2=F | x1=T) Pr(o2=F | x2=F))
      = max(0.72 × 0.9 × 0.1, 0.08 × 0.3 × 0.1) = 0.0648
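A short Python sketch of this two-step Viterbi recursion, using the same probabilities as above (it tracks only the δ values, not backpointers; the printed numbers should match 0.08, 0.72, 0.0432, 0.0648):

```python
# Model from the small example above.
prior = {"T": 0.2, "F": 0.8}                                  # Pr(x1)
trans = {("T", "T"): 0.7, ("T", "F"): 0.3,                    # Pr(x2 | x1), keyed (x1, x2)
         ("F", "T"): 0.1, ("F", "F"): 0.9}
emit = {("T", "T"): 0.4, ("T", "F"): 0.6,                     # Pr(o | x), keyed (x, o)
        ("F", "T"): 0.9, ("F", "F"): 0.1}
obs = ["T", "F"]                                              # o1=T, o2=F
states = ["T", "F"]

# delta_j(1) = Pr(x1=j) Pr(o1 | x1=j), then delta_j(t+1) = max_i delta_i(t) * a_ij * b_j(o_{t+1})
delta = {j: prior[j] * emit[(j, obs[0])] for j in states}
print("delta(1):", delta)                                     # {'T': 0.08, 'F': 0.72}
for o in obs[1:]:
    delta = {j: max(delta[i] * trans[(i, j)] * emit[(j, o)] for i in states) for j in states}
print("delta(2):", delta)                                     # {'T': 0.0432, 'F': 0.0648}
print("most likely final state:", max(delta, key=delta.get))  # 'F'
```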
Other Developments
• Toutanova et al. (2003) use a "dependency network" and a richer feature set
• Idea: using the "next" tag as well as the "previous" tag should improve tagging performance
Named-Entity Classification
• "Mrs. Frank" is a person
• "Steptoe and Johnson" is a company
• "Honduras" is a location
• etc.

• Bikel et al. (1998) from BBN: "Nymble", a statistical approach using HMMs

  nc1     nc2     nc3
  word1   word2   word3

$$[\,w_i \mid w_{i-1}, nc_i, nc_{i-1}\,] = \begin{cases} [\,w_i \mid w_{i-1}, nc_i\,] & \text{if } nc_i = nc_{i-1} \\ [\,w_i \mid nc_i, nc_{i-1}\,] & \text{if } nc_i \ne nc_{i-1} \end{cases}$$

• "Name classes": Not-A-Name, Person, Location, etc.
• Smoothing for sparse training data + word features
• Training = 100,000 words from WSJ
• Accuracy = 93%
• 450,000 words → same accuracy
training-development-test
