Text Mining
Synthesis of
Information Retrieval: the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web.
Part of Speech Tagging
Phrase Chunking
Deep Parsing
Named Entity Recognition
Information Extraction
Information Retrieval: an example.
[Figure: example text annotated with the tags "Det" and "Noun" and the labels "Brisbane" and "City".]
Text Classification
Text Clustering
Text Summarization
Extracting a summary for a document, based on syntax and semantics.
Each record is defined by a set of attributes. We can measure the similarity between any pair of records.
Example:
Given two documents, how can you compute their similarity? Based on what?
Unstructured => Structured
How do we represent a document structurally? This is the document representation problem.
In other words
Document Representation
Document 1: This is a data mining course.
Document 2: Data mining is important.

Word (term)   Document 1   Document 2   Frequency
This          1            0            1
is            1            1            2
a             1            0            1
data          1            1            2
mining        1            1            2
course        1            0            1
important     0            1            1
In terms of geometry, w_i is the coordinate of document d along dimension i. Conceptually, w_i denotes the importance of word i in d.
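As a minimal sketch of this representation, the snippet below counts word occurrences in the two example sentences and lays the counts out as one vector per document, with one dimension per word. The function names, the lower-casing, and the regular-expression tokenizer are assumptions made for illustration; the slides do not prescribe them.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lower-case the text and keep alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def count_vectors(docs):
    """Return (vocabulary, one count vector per document)."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

docs = ["This is a data mining course.", "Data mining is important."]
vocab, vectors = count_vectors(docs)

# Print each term with its per-document counts and its total frequency.
for term, column in zip(vocab, zip(*vectors)):
    print(term, column, sum(column))
```

Each position in a vector is one dimension of the space, so two documents can be compared coordinate by coordinate.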
An Example of VSM
Document 1: (0.938, 0.346, 0, 0, 0, 0, 0)
Document 2: (0, 0.225, 0, 0, 0.611, 0.611, 0.450)
[Figure: document 1 and document 2 drawn as vectors on the axes "course", "data", and "mine".]
Problems:
1. There are sooooooooooooo many English words!
2. How to determine the importance of the words?
The first problem: too many words
We solve the first problem by:
Stemming
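To make the idea concrete, here is a deliberately tiny suffix-stripping sketch. It is not the Porter algorithm or any stemmer the slides name; the function `toy_stem`, its suffix list, and its minimum-length rule are all assumptions chosen only to show how several surface forms collapse into one term.

```python
def toy_stem(word):
    """Strip a few common English suffixes (illustrative only)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words such as "is" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ["studying", "interesting", "interested", "words", "is"]:
    print(w, "->", toy_stem(w))
```

"interesting" and "interested" both map to "interest", so they share one dimension instead of two. The toy rule turns "mining" into "min", whereas the slides use the stem "mine"; a full stemmer would be needed to match that.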
The second problem: how to determine the importance of the terms
We solve the second problem by:
TF-IDF
TF-IDF
Term frequency-inverse document frequency
Evaluates how important a word is to a document in a collection.
The number of times a term occurs in a document is called its term frequency.
The number of documents a term occurs in is called its document frequency.
Why TF-IDF
Diminish the weight of terms that occur very frequently in the collection. Example: the, a.
Increase the weight of terms that occur rarely in the collection. Example: UQ.
TF-IDF Calculation
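Term Importance:
$$w(\text{word}_i) = \mathrm{TF}(\text{word}_i) \times \mathrm{IDF}(\text{word}_i)$$
Term Frequency:
$$\mathrm{TF}(\text{word}_i) = \text{number of times word}_i \text{ appears in the document}$$
Inverse Document Frequency:
$$\mathrm{IDF}(\text{word}_i) = \log \frac{\text{total documents}}{\text{document frequency}}$$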
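The calculation can be sketched in a few lines. The function names are illustrative, and the base-10 logarithm is an assumption: it is inferred from the values 0.477 and 0.176 used elsewhere in these slides for a three-document collection, since log10(3/1) ≈ 0.477 and log10(3/2) ≈ 0.176.

```python
import math

def idf(total_documents, document_frequency):
    # Assumption: base-10 logarithm, inferred from the slide values.
    return math.log10(total_documents / document_frequency)

def tf_idf(term_frequency, total_documents, document_frequency):
    # Term importance = term frequency times inverse document frequency.
    return term_frequency * idf(total_documents, document_frequency)

print(round(idf(3, 1), 3))        # term in 1 of 3 documents -> 0.477
print(round(idf(3, 2), 3))        # term in 2 of 3 documents -> 0.176
print(round(idf(3, 3), 3))        # term in every document   -> 0.0
print(round(tf_idf(2, 3, 1), 3))  # TF 2, DF 1 of 3          -> 0.954
```

A term that occurs in every document gets weight 0 no matter how often it appears, which is the "diminish very frequent terms" behaviour.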
We are studying text mining. Text mining is a subfield of data mining.
Stems: study, text, mine, text, mine, subfield, data, mine
Counts: data:1, mine:3, study:1, subfield:1, text:2

Mining text is interesting and I am interested in it.
Stems: mine, text, interest, interest
Counts: interest:2, mine:1, text:1
ID   Term       Document frequency   IDF
1    course     1                    0.477
2    data       2                    0.176
3    interest   1                    0.477
4    mine       3                    0
5    study      1                    0.477
6    subfield   1                    0.477
7    text       2                    0.176

Term counts per document, ordered by term ID:
(1, 1, 0, 1, 0, 0, 0)   This is a data mining course.
(0, 1, 0, 3, 1, 1, 2)   We are studying text mining. Text mining is a subfield of data mining.
(0, 0, 2, 1, 0, 0, 1)   Mining text is interesting and I am interested in it.
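The document frequency and IDF columns can be derived from the three count vectors. The sketch below assumes the term order course, data, interest, mine, study, subfield, text (alphabetical, matching IDs 1 to 7) and a base-10 logarithm; the variable and function names are illustrative.

```python
import math

terms = ["course", "data", "interest", "mine", "study", "subfield", "text"]
counts = [
    [1, 1, 0, 1, 0, 0, 0],  # This is a data mining course.
    [0, 1, 0, 3, 1, 1, 2],  # We are studying text mining. Text mining is ...
    [0, 0, 2, 1, 0, 0, 1],  # Mining text is interesting and I am interested in it.
]

def document_frequency(count_vectors):
    # For each term (column), count the documents in which it occurs.
    return [sum(1 for c in column if c > 0) for column in zip(*count_vectors)]

df = document_frequency(counts)
idf = [round(math.log10(len(counts) / d), 3) for d in df]

for term, d, i in zip(terms, df, idf):
    print(term, d, i)
```

Note that "mine" occurs in all three documents, so its IDF is 0 even though it is the most frequent term.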
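$$w(\text{word}_i) = \mathrm{TF}(\text{word}_i) \times \mathrm{IDF}(\text{word}_i)$$
Mining text is interesting, and I am interested in it.
(0, 0, 0.954, 0, 0, 0, 0.176)

Normalization
This is a data mining course.
$$w(\text{course}) = \frac{0.477}{\sqrt{0.477^2 + 0.176^2}} = 0.938$$
(0.938, 0.346, 0, 0, 0, 0, 0)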
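Normalization divides every weight by the Euclidean length of the vector. A minimal sketch follows; the unnormalized weights 0.477 and 0.176 for "This is a data mining course." are an assumption derived as TF × IDF with TF = 1 for "course" and "data", and the function name is illustrative.

```python
import math

def normalize(vector):
    """Scale a vector to unit Euclidean length; all-zero vectors are unchanged."""
    length = math.sqrt(sum(w * w for w in vector))
    if length == 0:
        return list(vector)
    return [w / length for w in vector]

doc1 = [0.477, 0.176, 0, 0, 0, 0, 0]
unit = normalize(doc1)
print([round(w, 3) for w in unit])             # [0.938, 0.346, 0.0, 0.0, 0.0, 0.0, 0.0]
print(round(sum(w * w for w in unit), 6))      # 1.0
```

After this step long and short documents are on the same scale, so similarity depends on direction rather than on document length.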
A Running Example
We are studying text mining. Text mining is a subfield of data mining.
Mining text is interesting, and I am interested in it.

ID                   1  2  3  4  5  6  7
document frequency   1  2  1  3  1  1  2
Query A Document
1. Remove stopwords.
2. Stem every word of the query string.
3. Transform the query string into a vector space model (VSM) vector by using the TF-IDF scheme.
4. Normalize the vector to unit length.
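The four steps can be sketched end to end as follows. Everything named here is hypothetical: the stopword set, the lookup table standing in for a real stemmer, and the query string "interesting text mining", which is not in the slides and was chosen because it yields the weights 0.938 and 0.346 that appear in the worked cosine computation in these slides. The term order and IDF values follow the running example.

```python
import math
import re

# Hypothetical resources for this sketch; a real system would use a full
# stopword list and a real stemmer.
STOPWORDS = {"a", "am", "and", "are", "i", "in", "is", "it", "of", "this", "we"}
STEMS = {"mining": "mine", "studying": "study",
         "interesting": "interest", "interested": "interest"}
TERMS = ["course", "data", "interest", "mine", "study", "subfield", "text"]
IDF = [0.477, 0.176, 0.477, 0.0, 0.477, 0.477, 0.176]

def query_vector(query):
    words = re.findall(r"[a-z]+", query.lower())
    words = [w for w in words if w not in STOPWORDS]                  # step 1
    stems = [STEMS.get(w, w) for w in words]                          # step 2
    weights = [stems.count(t) * idf for t, idf in zip(TERMS, IDF)]    # step 3
    length = math.sqrt(sum(w * w for w in weights))                   # step 4
    return [w / length if length else 0.0 for w in weights]

q = query_vector("interesting text mining")
print([round(w, 3) for w in q])
```

The query is treated exactly like a very short document, which is what makes it comparable with the document vectors.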
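Document Similarity
[Figure: documents D1 and D2 and query Q drawn as vectors on the axes "Term A" and "Term B"; other labels in the figure: "Stop Words", "Rare Words".]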
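$$\mathrm{sim}(Q, D) = \sum_{k=1}^{t} w_{qk} \cdot w_{dk}$$
$$\mathrm{sim}(Q, D) = \frac{\sum_{k=1}^{t} w_{qk} \cdot w_{dk}}{\sqrt{\sum_{k=1}^{t} (w_{qk})^2} \cdot \sqrt{\sum_{k=1}^{t} (w_{dk})^2}}$$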
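$$\mathrm{cosine}(p, q) = \frac{\sum_i p_i q_i}{\sqrt{\sum_i p_i^2} \cdot \sqrt{\sum_i q_i^2}}$$
$$\mathrm{cosine}(D1, Q) = 0$$
$$\mathrm{cosine}(D2, Q) = \frac{0.346 \times 0.450}{\sqrt{0.938^2 + 0.346^2} \cdot \sqrt{0.225^2 + 0.611^2 + 0.611^2 + 0.450^2}} = 0.156$$
$$\mathrm{cosine}(D3, Q) = 0.985$$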
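These three values can be checked with a short cosine function. The vectors below are assumptions reconstructed from the running example: each document weight is TF × IDF with the rounded IDF values 0.477, 0.176, and 0, and the query weights are inferred from the 0.938 and 0.346 that appear in the computation (nonzero only for "interest" and "text"). Cosine is scale-invariant, so unnormalized weights give the same result.

```python
import math

def cosine(p, q):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(pi * qi for pi, qi in zip(p, q))
    norm_p = math.sqrt(sum(pi * pi for pi in p))
    norm_q = math.sqrt(sum(qi * qi for qi in q))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)

# Term order: course, data, interest, mine, study, subfield, text.
q  = [0, 0, 0.477, 0, 0, 0, 0.176]          # assumption: inferred query weights
d1 = [0.477, 0.176, 0, 0, 0, 0, 0]
d2 = [0, 0.176, 0, 0, 0.477, 0.477, 0.352]
d3 = [0, 0, 0.954, 0, 0, 0, 0.176]

for name, d in [("D1", d1), ("D2", d2), ("D3", d3)]:
    print(name, round(cosine(d, q), 3))
```

D3 scores highest because it shares both query terms, D2 shares only "text", and D1 shares none.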
QUIZ!
IDF
Word list   DF   IDF
W1          1    0.477
W2          1    0.477
W3          2    0.176
W4          3    0
W5          2    0.176
W6          1    0.477
W7          1    0.477
W8          1    0.477
W9          1    0.477
W10         1    0.477
VSM
Normalization
D1 = [0.6634 0.6634 0.2448 0 0.2448 0 0 0 0 0]
D2 = [0 0 0 0 0.2525 0.6842 0.6842 0 0 0]
D3 = [0 0 0.2084 0 0 0 0 0.5647 0.5647 0.5647]
Query
Q = (0, 0, 0, 0, 0.176, 0, 0, 0, 0, 0)
Normalized: Q = (0, 0, 0, 0, 1, 0, 0, 0, 0, 0)
Cosine_sim(Q, D1) = 0.2448
Cosine_sim(Q, D2) = 0.2525
Cosine_sim(Q, D3) = 0
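Because the quiz vectors are already unit length, the cosine reduces to a plain dot product, and ranking the documents is a sort on that score. The sketch below uses the normalized vectors from the quiz; the dictionary layout and the helper name `dot` are illustrative.

```python
docs = {
    "D1": [0.6634, 0.6634, 0.2448, 0, 0.2448, 0, 0, 0, 0, 0],
    "D2": [0, 0, 0, 0, 0.2525, 0.6842, 0.6842, 0, 0, 0],
    "D3": [0, 0, 0.2084, 0, 0, 0, 0, 0.5647, 0.5647, 0.5647],
}
query = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]  # normalized query: only W5 is nonzero

def dot(p, q):
    # For unit-length vectors this equals the cosine similarity.
    return sum(pi * qi for pi, qi in zip(p, q))

scores = {name: dot(vec, query) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)  # ['D2', 'D1', 'D3']
```

D2 edges out D1 because W5 carries a larger share of D2's total weight, even though both documents contain the word.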
Simple? Well...
What we have discussed so far is a general framework only. There are still a lot of issues:
"A" is usually regarded as a stopword. However, "Vitamin A" may be an important term in an article.
What to stem and what not to stem?
Spelling errors?
Spelling errors always appear in documents! Should we consider two similar words to be the same word?
Are they the same: "classification" and "classificatiam"? But then, how about "information" and "informatics"?
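One common way to quantify "similar" is Levenshtein edit distance. The slides do not prescribe it; it is swapped in here only to illustrate why the question is hard. The function name is an assumption, and the sketch counts single-character insertions, deletions, and substitutions.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(edit_distance("classification", "classificatiam"))  # 2
print(edit_distance("information", "informatics"))        # 2
```

Both pairs are two edits apart, so a distance threshold alone cannot tell the misspelling from the genuinely different word.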
[Figure: precision and recall diagram with regions A, B, C, and D and a "Retrieved" set, annotated "Precision 100%" and "Recall 100%".]
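For reference, a minimal sketch of the two measures under their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. How the regions A, B, C, and D in the figure map onto these sets is not recoverable from the slide, so the sketch works on plain sets; the document IDs are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Everything retrieved is relevant, but half of the relevant set is missed.
print(precision_recall({"d1", "d2"}, {"d1", "d2", "d3", "d4"}))  # (1.0, 0.5)

# Everything relevant is retrieved, but half of the results are noise.
print(precision_recall({"d1", "d2", "d3", "d4"}, {"d1", "d2"}))  # (0.5, 1.0)
```

The two cases show that 100% precision and 100% recall are reached in opposite ways, which is why both are reported.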
Summary
Next Week:
Web Mining