
Text Mining: An Overview

David Madigan
[email protected]
http://www.stat.columbia.edu/~madigan

in collaboration with:
David D. Lewis
Text Mining
• Statistical text analysis has a long history in literary analysis and in solving disputed authorship problems
• First (?) is Thomas C. Mendenhall in 1887

Mendenhall
• Mendenhall was Professor of Physics at Ohio State and at the University of Tokyo, Superintendent of the U.S. Coast and Geodetic Survey, and later, President of Worcester Polytechnic Institute

Mendenhall Glacier, Juneau, Alaska

X² = 127.2, df = 12
• Hamilton versus Madison
• Used Naïve Bayes with Poisson and Negative Binomial models
• Out-of-sample predictive performance
Today
• Statistical methods routinely used for textual analyses of all kinds
• Machine translation, part-of-speech tagging, information extraction, question-answering, text categorization, disputed authorship (stylometry), etc.
• Not reported in the statistical literature (no statisticians?)

Mosteller, Wallace, Efron, Thisted


Text Mining's Connections with Language Processing
• Linguistics
• Computational linguistics
• Information retrieval
• Content analysis
• Stylistics
• Others
Why Are We Mining the Text?
Are we trying to understand:
1. The texts themselves?
2. The writer (or speakers) of the texts?
   a. The writer as a writer?
   b. The writer as an entity in the world?
3. Things in the world?
   a. Directly linked to texts?
   b. Described by texts?
Example: a customer message, annotated with what different text mining tasks might pull from it:

To: [email protected]
Dear Sir or Madam, My drier made smoke and a big whoooshie noise when I started it! Was the problem drying my new Australik raincoat? It is made of oilcloth. I guess it was my fault.

• Stylistic clues to author identity and demographics.
• Text known to be linked to a particular product.
• Important terms for searching a database of such messages.
• Another entity that information could be extracted on.
• The customer probably won't make a fuss.
Granularity of Text Linked to Entity?
• Morphemes, words, simple noun phrases
• Clauses, sentences
• Paragraphs, sections, documents
• Corpora, networks of documents
Moving down this list, size and complexity increase (and so does the variance in these); moving up, linguistic agreement on structure and representations increases.
Ambiguity
• Correct interpretation of an utterance is rarely explicit!
  – People use massive knowledge of language and the world to understand NL
  – Information readily inferred by the speaker/reader will be left out
• The core task in NLP is resolving the resulting ambiguity

"I made her duck"
[after Jurafsky & Martin]

• I cooked waterfowl for her.
• I cooked waterfowl belonging to her.
• I created the (plaster?) duck she owns.
• I caused her to quickly lower her head.
• I magically converted her into roast fowl.

• These vary in morphology, syntax, semantics, pragmatics.

....Plus, Phonetic and Visual Ambiguity
• Aye, made her duck!
• I made her....duck!
• Aye, maid. Her duck.
Synonymy
• Can use different words to communicate the same meaning
  – (Of course also true for sentences, ...)
• Synonymy in all contexts is rare:
  – a big plane = a large plane
  – a big sister ≠ a large sister
• And choice among synonyms may be a clue to topic, writer's background, etc.
  – rich vs. high SES
Metaphor & Metonymy
• Metaphor: use of a construct with one meaning to convey a very different one:
  – Television is eating away at the moral fiber of our country.
• Metonymy: mentioning one concept to convey a closely related one:
  – "On the way downtown I stopped at a bar and had a couple of double Scotches. They didn't do me any good. All they did was make me think of Silver Wig, and I never saw her again."
    (Raymond Chandler, The Big Sleep)
Attribute Vectors
• Most text mining is based on
  – Breaking language into symbols
  – Treating each symbol as an attribute
• But what value should each attribute have for a unit of text?

Term Weighting
• How strongly does a particular word indicate the content of a document?
• Some clues:
  – Number of times the word occurs in this document
  – Number of times the word occurs in other documents
  – Length of the document
The raw weight of term t_j in document d_i combines TF (term frequency) and IDF (inverse document frequency):

$$w_{ij}^{\mathrm{raw}} = \begin{cases} (1 + \ln f_{ij})\,\ln\dfrac{N}{n_j}, & \text{if } t_j \text{ present in } d_i \\ 0, & \text{otherwise} \end{cases}$$

where f_ij is the count of term t_j in document d_i, N is the number of documents, and n_j is the number of documents containing t_j. Cosine normalization then sets the L2-norm of each document vector to 1.0:

$$w_{ij} = \frac{w_{ij}^{\mathrm{raw}}}{\sqrt{\sum_{j'=1}^{d} w_{ij'}^{\mathrm{raw}} \times w_{ij'}^{\mathrm{raw}}}}$$

• "Cosine-normalized TFIDF weighting"
  – Many minor variants on this theme
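The following is a minimal Python sketch of this cosine-normalized TF-IDF weighting (not from the slides): the whitespace tokenizer, the toy corpus, and the function name are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Cosine-normalized TF-IDF following the formula above:
    w_raw = (1 + ln f_ij) * ln(N / n_j) for terms present in the document."""
    N = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # n_j: number of documents containing term j
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {t: (1 + math.log(f)) * math.log(N / df[t]) for t, f in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0  # guard against all-zero weights
        vectors.append({t: w / norm for t, w in raw.items()})       # L2-norm set to 1.0
    return vectors

docs = ["the quick brown fox", "the lazy dog", "the quick dog barks"]  # invented toy corpus
for vec in tfidf_vectors(docs):
    print(vec)
```

Note that a term appearing in every document gets IDF = ln(N/n_j) = 0 and therefore weight zero, which is the intended behavior of the IDF factor.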
Case Study: Representation for Authorship Attribution
• Statistical methods for authorship attribution
• Represent documents with attribute vectors
• Then use regression-type methods
• Bag of words?
• Stylistic features? (e.g., passive voice)
• Topic free?
1-of-K Sample Results: brittany-l

Feature Set                          % errors   Number of Features
"Argamon" function words, raw tf     74.8       380
POS                                  75.1       44
1suff                                64.2       121
1suff*POS                            50.9       554
2suff                                40.6       1849
2suff*POS                            34.9       3655
3suff                                28.7       8676
3suff*POS                            27.9       12976
3suff+POS+3suff*POS+Argamon          27.6       22057
All words                            23.9       52492

(The largest feature set corresponds to roughly 4.6 million parameters.)
89 authors with at least 50 postings. 10,076 training documents, 3,322 test documents.
BMR-Laplace classification, default hyperparameter.
The Federalist
• Mosteller and Wallace attributed all 12 disputed papers to Madison
• Historical evidence is more muddled
• Our results suggest attribution is highly dependent on the document representation
Table 1. Authorship of the Federalist Papers
Paper Number   Author
1              Hamilton
2-5            Jay
6-9            Hamilton
10             Madison
11-13          Hamilton
14             Madison
15-17          Hamilton
18-20          Joint: Hamilton and Madison
21-36          Hamilton
37-48          Madison
49-58          Disputed
59-61          Hamilton
62-63          Disputed
64             Jay
65-85          Hamilton

Table 3. The feature sets
Features                                                        Name in Short
The length of each character                                    Charcount
Parts of speech                                                 POS
Two-letter suffix                                               Suffix2
Three-letter suffix                                             Suffix3
Words, numbers, signs, punctuation                              Words
The length of each character plus the parts of speech           Charcount+POS
Two-letter suffix plus the parts of speech                      Suffix2+POS
Three-letter suffix plus the parts of speech                    Suffix3+POS
Words, numbers, signs, punctuation plus the parts of speech     Words+POS
The 484 function words in Koppel's paper                        484 features
The feature set in the Mosteller and Wallace paper              Wallace features
Words appearing at least twice                                  Words(>=2)
Each word shown in the Federalist papers                        Each word

Feature Set        10-fold Error Rate
Charcount          0.21
POS                0.19
Suffix2            0.12
Suffix3            0.09
Words              0.10
Charcount+POS      0.12
Suffix2+POS        0.08
Suffix3+POS        0.04
Words+POS          0.08
484 features       0.05
Wallace features   0.05
Words (>=2)        0.05
Each Word          0.05

…four papers to Hamilton
Supervised Learning for Text Classification

Predictive Modeling
Goal: learn a mapping: y = f(x; θ)
Need:
1. A model structure
2. A score function
3. An optimization strategy

Categorical y ∈ {c1, …, cm}: classification
Real-valued y: regression
Note: usually assume {c1, …, cm} are mutually exclusive and exhaustive
Classifier Types

Discriminative: model p(ck | x)
  – e.g. linear regression, logistic regression, CART

Generative: model p(x | ck, θk)
  – e.g. "Bayesian classifiers", LDA
Regression for Binary Classification
• Can fit a linear regression model to a 0/1 response
• Predicted values are not necessarily between zero and one
• With p > 1, the decision boundary is linear, e.g. 0.5 = b0 + b1 x1 + b2 x2

(Figures: a fitted linear regression of a 0/1 response y on a single predictor, data file zeroOneR.txt, and the resulting linear decision boundary in the (x1, x2) plane.)
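As a small illustration of these points (not from the slides; the data below are simulated), this sketch fits ordinary least squares to a 0/1 response and shows that the fitted values can fall outside [0, 1] while the implied decision boundary is still linear.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 60)
y = (x + rng.normal(scale=0.8, size=x.size) > 0).astype(float)  # simulated 0/1 response

# Ordinary least squares on the 0/1 response: y ≈ b0 + b1 * x
X = np.column_stack([np.ones_like(x), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

print("fitted coefficients:", b)
print("min/max fitted values:", yhat.min(), yhat.max())  # typically below 0 and above 1
# The implied decision boundary is where b0 + b1 * x = 0.5
print("decision boundary at x =", (0.5 - b[0]) / b[1])
```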
Naïve Bayes via a Toy Spam Filter Example
• Naïve Bayes is a generative model that makes drastic simplifying assumptions
• Consider a small training data set for spam along with a bag-of-words representation

Naïve Bayes Machinery
• We need a way to estimate:

• Via Bayes' theorem we have:

or, on the log-odds scale:

Naïve Bayes Machinery
• Naïve Bayes assumes:

and

leading to:
Maximum Likelihood Estimation
(weights of evidence)
Naïve Bayes Prediction
• Usually add a small constant (e.g. 0.5) to avoid divide-by-zero problems and to reduce bias

• New message: "the quick rabbit rests"
• Predicted log odds:
  0.51 + 0.51 + 0.51 + 0.51 + 1.10 + 0 = 3.04
• Corresponds to a spam probability of 0.95
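The slides' training data and weight tables are not reproduced above, so the Python sketch below only illustrates the same machinery under stated assumptions: a made-up toy training set, smoothed relative frequencies with the 0.5 constant, per-word weights of evidence, and the prior log odds added in. The exact form of the slides' weights may differ; this is a simple multinomial variant.

```python
import math
from collections import Counter

# Invented toy training data (the slides' actual example data are not shown here).
spam = ["the quick rabbit runs", "quick quick buy now"]
ham = ["the slow turtle rests", "the rabbit rests today"]

def word_counts(docs):
    return Counter(w for d in docs for w in d.lower().split())

spam_counts, ham_counts = word_counts(spam), word_counts(ham)
n_spam_words, n_ham_words = sum(spam_counts.values()), sum(ham_counts.values())
vocab = set(spam_counts) | set(ham_counts)

def weight_of_evidence(word):
    # Smoothed relative frequencies; 0.5 added to avoid divide-by-zero and reduce bias.
    p_spam = (spam_counts[word] + 0.5) / (n_spam_words + 0.5 * len(vocab))
    p_ham = (ham_counts[word] + 0.5) / (n_ham_words + 0.5 * len(vocab))
    return math.log(p_spam / p_ham)

def spam_log_odds(message):
    prior = math.log(len(spam) / len(ham))  # prior log odds of spam
    return prior + sum(weight_of_evidence(w) for w in message.lower().split() if w in vocab)

log_odds = spam_log_odds("the quick rabbit rests")
print("log odds:", round(log_odds, 2))
print("spam probability:", round(1 / (1 + math.exp(-log_odds)), 2))
```

With the slides' own data the summed weights are 3.04, giving a spam probability of 1/(1 + e^-3.04) ≈ 0.95; the toy data above will give different numbers.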


A Close Look at Logistic Regression for Text Classification

Logistic Regression
• Linear model for log odds of category membership:

$$\log \frac{p(y=1 \mid x_i)}{p(y=-1 \mid x_i)} = \sum_j b_j x_{ij} = b^{\top} x_i$$
Maximum Likelihood Training
• Choose parameters (bj's) that maximize the probability (likelihood) of the class labels (yi's) given the documents (xi's)
• Tends to overfit
• Not defined if d > n
• Feature selection
Shrinkage/Regularization/Bayes
• Avoid combinatorial challenge of feature selection
• L1 shrinkage: regularization + feature selection
• Expanding theoretical understanding
• Large scale
• Empirical performance
Ridge Logistic Regression

Maximum likelihood plus a constraint:

$$\sum_{j=1}^{p} \beta_j^2 \le s$$
Lasso Logistic Regression

Maximum likelihood plus a constraint:

$$\sum_{j=1}^{p} |\beta_j| \le s$$
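As a stand-in illustration of the dense-versus-sparse contrast (this is not the authors' BMR software), scikit-learn's logistic regression with L2 and L1 penalties can be used; its penalty strength C plays the role of the constraint bound s in Lagrangian form. The tiny corpus and labels below are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up toy corpus and labels, purely for illustration.
docs = ["wheat corn grain exports rise", "oil prices fall sharply",
        "corn and rice harvest grows", "crude oil output cut again"]
labels = [1, 0, 1, 0]  # 1 = grain-related, 0 = not

X = TfidfVectorizer().fit_transform(docs)

ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, labels)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, labels)

# Lasso (L1) drives many coefficients exactly to zero; ridge (L2) keeps them all nonzero.
print("nonzero ridge coefficients:", (ridge.coef_ != 0).sum())
print("nonzero lasso coefficients:", (lasso.coef_ != 0).sum())
```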
Bayesian Perspective
Polytomous Logistic Regression (PLR)

$$P(y_i = k \mid x_i) = \frac{\exp(\beta_k^{\top} x_i)}{\sum_{k'} \exp(\beta_{k'}^{\top} x_i)}$$

• Elegant approach to multiclass problems
• Also known as polychotomous LR, multinomial LR, and, ambiguously, multiple LR and multivariate LR
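A minimal numeric sketch of this softmax form (the per-class scores below are invented; subtracting the maximum score is a standard numerical-stability trick):

```python
import numpy as np

def plr_probabilities(scores):
    """Softmax over per-class linear scores beta_k . x_i (invented values below)."""
    scores = np.asarray(scores, dtype=float)
    scores -= scores.max()          # subtract the max for numerical stability
    expd = np.exp(scores)
    return expd / expd.sum()

print(plr_probabilities([2.0, 0.5, -1.0]))  # probabilities over three classes; they sum to 1
```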
Why LR is Interesting
• Parameters have a meaning
  – How the log odds increase with feature values
• Lets you
  – Look at the model and see if it is sensible
  – Use domain knowledge to guide parameter fitting (more later)
  – Build some parts of the model by hand
• Caveat: realistically, a lot can (and does) complicate this interpretation
Measuring the Performance of a Binary Classifier

Suppose we use a cutoff of 0.5 (test data):

                       actual outcome
                       1       0
predicted outcome  1   7       3
                   0   0       10
More generally…

                       actual outcome
                       1       0
predicted outcome  1   a       b
                   0   c       d

misclassification rate: (b + c) / (a + b + c + d)
sensitivity: a / (a + c)   (aka recall)
specificity: d / (b + d)
predictive value positive: a / (a + b)   (aka precision)
Suppose we use a cutoff of 0.5…

                       actual outcome
                       1       0
predicted outcome  1   7       3
                   0   0       10

sensitivity: 7 / (7 + 0) = 100%
specificity: 10 / (10 + 3) = 77%
Suppose we use a cutoff of 0.8…

                       actual outcome
                       1       0
predicted outcome  1   5       2
                   0   2       11

sensitivity: 5 / (5 + 2) = 71%
specificity: 11 / (11 + 2) = 85%
• Note there are 20 possible thresholds
• ROC analysis computes sensitivity and specificity for all possible thresholds and plots them
• If the threshold is the minimum score, then c = d = 0, so sensitivity = 1 and specificity = 0
• If the threshold is the maximum score, then a = b = 0, so sensitivity = 0 and specificity = 1
• "Area under the curve" (AUC) is a common measure of predictive performance
• So is squared error, $\sum_i (y_i - \hat{y}_i)^2$, also known as the "Brier score"
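A small Python sketch computing these quantities from the confusion-matrix cells; the two calls reuse the counts from the cutoff-0.5 and cutoff-0.8 examples above, and the helper name is made up.

```python
def binary_metrics(a, b, c, d):
    """Metrics from confusion-matrix cells in the slides' notation:
    rows = predicted 1 then predicted 0; columns = actual 1 then actual 0."""
    return {
        "misclassification rate": (b + c) / (a + b + c + d),
        "sensitivity (recall)": a / (a + c),
        "specificity": d / (b + d),
        "predictive value positive (precision)": a / (a + b),
    }

# Cutoff 0.5 example: a=7, b=3, c=0, d=10 -> sensitivity 1.00, specificity ~0.77
print(binary_metrics(7, 3, 0, 10))
# Cutoff 0.8 example: a=5, b=2, c=2, d=11 -> sensitivity ~0.71, specificity ~0.85
print(binary_metrics(5, 2, 2, 11))
```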
Text Classification Example
• ModApte subset of Reuters-21578
– 90 categories; 9603 training docs; 18978 features
• Reuters RCV1-v2
– 103 cats; 23149 training docs; 47152 features
• OHSUMED heart disease categories
– 77 cats; 83944 training docs; 122076 features
• Cosine normalized TFxIDF weights
Dense vs. Sparse Models (Macroaveraged F1)

            ModApte   RCV1-v2   OHSUMED
Lasso       52.03     56.54     51.30
Ridge       39.71     51.40     42.99
Ridge/500   38.82     46.27     36.93
Ridge/50    45.80     41.61     42.59
Ridge/5     46.20     28.54     41.33
SVM         53.75     57.23     50.58
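For reference, macroaveraged F1 computes F1 for each category separately and averages with equal weight, so rare categories count as much as common ones. A minimal sketch with invented per-category counts:

```python
def f1(tp, fp, fn):
    """Per-category F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Invented counts for three categories; macroaveraging weights each category equally.
per_category = [f1(50, 10, 5), f1(8, 2, 6), f1(1, 0, 3)]
macro_f1 = sum(per_category) / len(per_category)
print(per_category, macro_f1)
```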
An Example Model (category "grain")

Word          Beta      Word          Beta
corn          29.78     formal        -1.15
wheat         20.56     holder        -1.43
rice          11.33     hungarian     -6.15
sindt         10.56     rubber        -7.12
madagascar     6.83     special       -7.25
import         6.79     …             …
grain          6.77     beet          -13.24
contract       3.08     rockwood      -13.61
Text Sequence Modeling

Introduction
• Textual data comprise sequences of words: "The quick brown fox…"
• Many tasks can put this sequence information to good use:
  – Part-of-speech tagging
  – Named entity extraction
  – Text chunking
  – Author identification

Part-of-Speech Tagging
• Assign grammatical tags to words
• Basic task in the analysis of natural language data
• Phrase identification, entity extraction, etc.
• Ambiguity: "tag" could be a noun or a verb
• "a tag is a part-of-speech label" – context resolves the ambiguity
The Penn Treebank POS Tag Set

POS Tagging Process
(figure: Berlin Chen)

POS Tagging Algorithms
• Rule-based taggers: large numbers of hand-crafted rules
• Probabilistic taggers: use a tagged corpus to train some sort of model, e.g. an HMM:

  tag1     tag2     tag3
  word1    word2    word3

The Brown Corpus
• Comprises about 1 million English words
• HMMs were first used for tagging on the Brown Corpus
• 1967. Somewhat dated now.
• The British National Corpus has 100 million words
Simple Charniak Model

  w1    w2    w3
  t1    t2    t3

• What about words that have never been seen before?
• Clever tricks for smoothing the number of parameters (aka priors)

Some details…
  – number of times word j appears with tag i
  – number of times word j appears
  – number of times a word that had never been seen with tag i gets tag i
  – number of such occurrences in total

Test data accuracy on the Brown Corpus = 91.51%
HMM

  t1    t2    t3
  w1    w2    w3

• Brown test set accuracy = 95.97%
Morphological Features
• Knowledge that "quickly" ends in "ly" should help identify the word as an adverb
• "randomizing" -> "ing"
• Split each word into a root ("quick") and a suffix ("ly")

  t1         t2         t3
  r1  s1     r2  s2

Morphological Features
• Typical morphological analyzers produce multiple possible splits
• "Gastroenteritis" ???

• Achieves 96.45% on the Brown Corpus
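A toy sketch of the root/suffix splitting idea (the suffix list and function below are invented; a real morphological analyzer uses a lexicon and, as noted above, typically returns multiple candidate splits):

```python
# Invented suffix list for illustration only.
SUFFIXES = ["ing", "ly", "ed", "itis", "s"]

def candidate_splits(word):
    """Return (root, suffix) candidates, longest suffixes first, plus the no-split option."""
    word = word.lower()
    splits = [(word[:-len(s)], s) for s in sorted(SUFFIXES, key=len, reverse=True)
              if word.endswith(s) and len(word) > len(s)]
    return splits + [(word, "")]

for w in ["quickly", "randomizing", "gastroenteritis"]:
    print(w, "->", candidate_splits(w))   # "gastroenteritis" yields more than one split
```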


Inference in an HMM
• Compute the probability of a given observation sequence
• Given an observation sequence, compute the most likely hidden state sequence
• Given an observation sequence and a set of possible models, which model most closely fits the data?

(David Blei)
Viterbi Algorithm

(Trellis: hidden states x1 … xt-1, j; observations o1 … ot-1, ot, ot+1, …, oT)

$$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\; o_1 \ldots o_{t-1},\; x_t = j,\; o_t)$$

The state sequence which maximizes the probability of seeing the observations to time t-1, landing in state j, and seeing the observation at time t.
Viterbi Algorithm

$$\delta_j(t) = \max_{x_1 \ldots x_{t-1}} P(x_1 \ldots x_{t-1},\; o_1 \ldots o_{t-1},\; x_t = j,\; o_t)$$

Recursive computation:

$$\delta_j(t+1) = \max_i \; \delta_i(t)\, a_{ij}\, b_{j o_{t+1}}$$

where a_ij is the state transition probability and b_{j o_{t+1}} is the "emission" probability.
Viterbi Small Example

Two hidden states x1, x2 with observations o1, o2:
Pr(x1=T) = 0.2
Pr(x2=T | x1=T) = 0.7
Pr(x2=T | x1=F) = 0.1
Pr(o=T | x=T) = 0.4
Pr(o=T | x=F) = 0.9
Observed: o1=T; o2=F

Brute force:
Pr(x1=T, x2=T, o1=T, o2=F) = 0.2 × 0.4 × 0.7 × 0.6 = 0.0336
Pr(x1=T, x2=F, o1=T, o2=F) = 0.2 × 0.4 × 0.3 × 0.1 = 0.0024
Pr(x1=F, x2=T, o1=T, o2=F) = 0.8 × 0.9 × 0.1 × 0.6 = 0.0432
Pr(x1=F, x2=F, o1=T, o2=F) = 0.8 × 0.9 × 0.9 × 0.1 = 0.0648

Pr(X1, X2 | o1=T, o2=F) ∝ Pr(X1, X2, o1=T, o2=F)
Viterbi Small Example

$$\hat{X}_2 = \operatorname*{argmax}_j \delta_j(2), \qquad \delta_j(2) = \max_i \; \delta_i(1)\, a_{ij}\, b_{j o_2}$$

δT(1) = Pr(x1=T) Pr(o1=T | x1=T) = 0.2 × 0.4 = 0.08
δF(1) = Pr(x1=F) Pr(o1=T | x1=F) = 0.8 × 0.9 = 0.72
δT(2) = max(δF(1) × Pr(x2=T | x1=F) Pr(o2=F | x2=T), δT(1) × Pr(x2=T | x1=T) Pr(o2=F | x2=T))
      = max(0.72 × 0.1 × 0.6, 0.08 × 0.7 × 0.6) = 0.0432
δF(2) = max(δF(1) × Pr(x2=F | x1=F) Pr(o2=F | x2=F), δT(1) × Pr(x2=F | x1=T) Pr(o2=F | x2=F))
      = max(0.72 × 0.9 × 0.1, 0.08 × 0.3 × 0.1) = 0.0648
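A short Python sketch of this two-step Viterbi recursion, using the same probabilities as above (it tracks only the δ values, not backpointers; the printed numbers should match 0.08, 0.72, 0.0432, 0.0648):

```python
# Model from the small example above.
prior = {"T": 0.2, "F": 0.8}                                  # Pr(x1)
trans = {("T", "T"): 0.7, ("T", "F"): 0.3,                    # Pr(x2 | x1), keyed (x1, x2)
         ("F", "T"): 0.1, ("F", "F"): 0.9}
emit = {("T", "T"): 0.4, ("T", "F"): 0.6,                     # Pr(o | x), keyed (x, o)
        ("F", "T"): 0.9, ("F", "F"): 0.1}
obs = ["T", "F"]                                              # o1=T, o2=F
states = ["T", "F"]

# delta_j(1) = Pr(x1=j) Pr(o1 | x1=j), then delta_j(t+1) = max_i delta_i(t) * a_ij * b_j(o_{t+1})
delta = {j: prior[j] * emit[(j, obs[0])] for j in states}
print("delta(1):", delta)                                     # {'T': 0.08, 'F': 0.72}
for o in obs[1:]:
    delta = {j: max(delta[i] * trans[(i, j)] * emit[(j, o)] for i in states) for j in states}
print("delta(2):", delta)                                     # {'T': 0.0432, 'F': 0.0648}
print("most likely final state:", max(delta, key=delta.get))  # 'F'
```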
Other Developments
• Toutanova et al. (2003) use a "dependency network" and a richer feature set
• Idea: using the "next" tag as well as the "previous" tag should improve tagging performance
Named-Entity Classification
• "Mrs. Frank" is a person
• "Steptoe and Johnson" is a company
• "Honduras" is a location
• etc.

• Bikel et al. (1998) from BBN: "Nymble", a statistical approach using HMMs

  nc1     nc2     nc3
  word1   word2   word3

$$[\,w_i \mid w_{i-1}, nc_i, nc_{i-1}\,] = \begin{cases} [\,w_i \mid w_{i-1}, nc_i\,] & \text{if } nc_i = nc_{i-1} \\ [\,w_i \mid nc_i, nc_{i-1}\,] & \text{if } nc_i \ne nc_{i-1} \end{cases}$$

• "Name classes": Not-A-Name, Person, Location, etc.
• Smoothing for sparse training data + word features
• Training = 100,000 words from WSJ
• Accuracy = 93%
• 450,000 words → same accuracy
training-development-test
