Text Mining

Text mining is the science of searching for documents and for information within documents. It uses part-of-speech tagging, entity recognition and deep parsing. Given two documents, how can you compute their similarity, and based on what?

Uploaded by

Deepak Kumar

Data Mining -- Text Mining

Text Mining

A synthesis of Information Retrieval and Natural Language Processing.

Information Retrieval
    The science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web.

Natural Language Processing
    Part-of-Speech Tagging
    Phrase Chunking
    Deep Parsing
    Named Entity Recognition
    Information Extraction

An example of part-of-speech tagging:

This   sentence   serves   as   an    example.
Det    Noun       Verb     P    Det   Noun

An example of entity recognition:

The University of Queensland, St. Lucia, Brisbane

    The University of Queensland   University
    St. Lucia                      Suburb
    Brisbane                       City

Text Mining Tasks

Text Classification
    Assigning a document to one of several prespecified classes

Text Clustering
    Unsupervised learning

Text Summarization
    Extracting a summary for a document
    Based on syntax and semantics

Challenge of Text Mining

In traditional data mining, all data are structured.

We usually store the data in a database.

The table structure is very clear.

Every attribute is well defined, so we understand each record very well.

Each record is defined by a set of attributes, so we can measure the similarity between any pair of records.

Challenge of Text Mining

However, in text mining, data are unstructured!

Example:

Given two documents, how can you compute their similarity? Based on what?

So, what we need to do

Unstructured => Structured
How do we represent a document structurally? This is the document representation problem.

In other words

Document Representation

Document: This is a data mining course. Data mining is important.

Word (term)   Frequency
This          1
is            2
a             1
data          2
mining        2
course        1
important     1
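The term-frequency table above can be reproduced with a few lines of Python. This is a minimal sketch, assuming that tokens are runs of letters and that counting is case-insensitive (which is what makes "Data" and "data" count as the same term); the function name `term_frequencies` is illustrative only.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Lowercase the text, split it into alphabetic tokens, and count each term."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)


doc = "This is a data mining course. Data mining is important."
freq = term_frequencies(doc)

# Print the terms with their counts, most frequent first.
for term, count in freq.most_common():
    print(term, count)
```

A `Counter` is used here because it behaves like a dictionary from term to count, which is the document representation the slide describes.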

Vector Space Model

Each word is a dimension

If we have M different words, then we have an M-dimensional vector space.

Each document is regarded as a point in this vector space.

$d = (w_1, w_2, \ldots, w_M)$

In terms of geometry, $w_i$ is the coordinate of d along dimension i. Conceptually, $w_i$ denotes the importance of word i in d.

An Example of VSM

[Figure: document 1 and document 2 drawn as vectors in a space whose axes are the terms "course", "data" and "mine".]

Document 1: (0.938, 0.346, 0, 0, 0, 0, 0)
Document 2: (0, 0.225, 0, 0, 0.611, 0.611, 0.450)

Vector Space Model

Problems:

1. There are so many English words!
2. How do we determine the importance of the words?

Vector Space Model

The first problem: too many words. We solve the first problem by:

Removing stop words
    a, the, this, that

Stemming
    study, studying, studied => study
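Both reductions can be sketched in Python. The stop-word set below contains only the four examples from the slide, and `toy_stem` is a hypothetical two-rule stemmer written just to reproduce the "study" example; it is not the Porter algorithm or any other real stemmer, and it will mishandle many words.

```python
# Only the slide's four examples; a real stop-word list is much longer.
STOP_WORDS = {"a", "the", "this", "that"}


def remove_stop_words(tokens):
    """Drop every token whose lowercase form is in STOP_WORDS."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def toy_stem(word):
    """Strip two common suffixes. Illustrative only, not a real stemmer."""
    if word.endswith("ied"):
        return word[:-3] + "y"        # studied -> study
    if word.endswith("ing") and len(word) > 5:
        return word[:-3]              # studying -> study
    return word                       # study -> study


print(remove_stop_words(["This", "is", "a", "data", "mining", "course"]))
print([toy_stem(w) for w in ["study", "studying", "studied"]])
```

In practice both steps are usually delegated to a library, since good stop-word lists and stemming rules are language-specific.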

Vector Space Model

The second problem: how to determine the importance of the terms. We solve the second problem by:

Using a weighting scheme, the TF-IDF scheme:

$$w(\text{word}_i) = TF(\text{word}_i) \times IDF(\text{word}_i)$$

$$TF(\text{word}_i) = \text{number of times } \text{word}_i \text{ appears in the document}$$

$$IDF(\text{word}_i) = \log \frac{\text{total documents}}{\text{document frequency}}$$

Normalizing the document to unit length

TF-IDF

Term frequency-inverse document frequency.

It evaluates how important a word is to a document in a collection.

The number of times a term occurs in a document is called its term frequency.

The number of documents a term occurs in is called its document frequency.

Why TF-IDF

Can we simply use term frequency?

    Diminish the weight of terms that occur very frequently in the collection. Example: "the", "a".
    Increase the weight of terms that occur rarely in the collection. Example: "UQ".

TF-IDF Calculation
Term Importance:

$$w(\text{word}_i) = TF(\text{word}_i) \times IDF(\text{word}_i)$$

Term Frequency:

$$TF(\text{word}_i) = \text{number of times } \text{word}_i \text{ appears in the document}$$

Inverse Document Frequency:

$$IDF(\text{word}_i) = \log \frac{\text{total documents}}{\text{document frequency}}$$
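The calculation translates directly into code. The sketch below assumes a base-10 logarithm; the slides only write "log", but base 10 is consistent with the IDF values used in the running example. The function names are illustrative.

```python
import math


def idf(total_documents, document_frequency):
    """Inverse document frequency, assuming a base-10 logarithm."""
    return math.log10(total_documents / document_frequency)


def tf_idf(term_count, total_documents, document_frequency):
    """Term importance: term frequency times inverse document frequency."""
    return term_count * idf(total_documents, document_frequency)


# A term that appears twice in a document and in 1 of 3 documents overall.
print(round(tf_idf(2, 3, 1), 3))

# A term that appears in every document gets weight 0, however often it occurs.
print(tf_idf(5, 3, 3))
```

The second call shows why TF-IDF is preferred over raw term frequency: a term found in every document carries no weight at all.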

A Running Example Step 1 Extract text

Document 1: This is a data mining course.
    Extracted: This is a data mining course

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    Extracted: We are studying text mining Text mining is a subfield of data mining

Document 3: Mining text is interesting, and I am interested in it.
    Extracted: Mining text is interesting and I am interested in it

A Running Example Step 2 Remove stopwords

Stopwords removed: this, is, a, we, are, of, and, I, am, in, it

Document 1: This is a data mining course.
    After removal: data mining course

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    After removal: studying text mining Text mining subfield data mining

Document 3: Mining text is interesting, and I am interested in it.
    After removal: Mining text interesting interested

A Running Example Step 3 Convert all words to lowercase

Document 1: This is a data mining course.
    After lowercasing: data mining course

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    After lowercasing: studying text mining text mining subfield data mining

Document 3: Mining text is interesting, and I am interested in it.
    After lowercasing: mining text interesting interested

A Running Example Step 4 Stemming

Stems applied: mining => mine, studying => study, interesting => interest, interested => interest

Document 1: This is a data mining course.
    After stemming: data mine course

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    After stemming: study text mine text mine subfield data mine

Document 3: Mining text is interesting, and I am interested in it.
    After stemming: mine text interest interest

A Running Example Step 5 Count the word frequencies

Document 1: This is a data mining course.
    course:1, data:1, mine:1

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    data:1, mine:3, study:1, subfield:1, text:2

Document 3: Mining text is interesting, and I am interested in it.
    interest:2, mine:1, text:1

A Running Example Step 6 Create an indexing file

Document 1: This is a data mining course.
    course:1, data:1, mine:1

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    data:1, mine:3, study:1, subfield:1, text:2

Document 3: Mining text is interesting, and I am interested in it.
    interest:2, mine:1, text:1

ID   word       document frequency
1    course     1
2    data       2
3    interest   1
4    mine       3
5    study      1
6    subfield   1
7    text       2

A Running Example Step 7 Create the vector space model

Each vector lists the term counts in index order: course, data, interest, mine, study, subfield, text.

Document 1: This is a data mining course.
    course:1, data:1, mine:1
    (1, 1, 0, 1, 0, 0, 0)

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    data:1, mine:3, study:1, subfield:1, text:2
    (0, 1, 0, 3, 1, 1, 2)

Document 3: Mining text is interesting, and I am interested in it.
    interest:2, mine:1, text:1
    (0, 0, 2, 1, 0, 0, 1)
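Building the index and the count vectors can be sketched as follows. The per-document counts are taken from the step above; sorting the vocabulary alphabetically is an assumption that happens to reproduce the ID order used in the index, and the variable names are illustrative.

```python
# Term counts per document, as produced by the counting step.
docs = [
    {"course": 1, "data": 1, "mine": 1},
    {"data": 1, "mine": 3, "study": 1, "subfield": 1, "text": 2},
    {"interest": 2, "mine": 1, "text": 1},
]

# The index: every distinct term, in a fixed (alphabetical) order.
vocabulary = sorted(set().union(*docs))

# Document frequency: in how many documents each term occurs.
document_frequency = {w: sum(w in d for d in docs) for w in vocabulary}

# One count vector per document, with one dimension per indexed term.
vectors = [[d.get(w, 0) for w in vocabulary] for d in docs]

print(vocabulary)
print(document_frequency)
for vector in vectors:
    print(vector)
```

Fixing the term order once is what makes the vectors comparable: dimension 4 means "mine" in every document.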

A Running Example Step 8 Compute the inverse document frequency

$$IDF(\text{word}) = \log \frac{\text{total documents}}{\text{document frequency}}$$

With 3 documents in total and a base-10 logarithm:

ID   word       document frequency   IDF
1    course     1                    0.477
2    data       2                    0.176
3    interest   1                    0.477
4    mine       3                    0
5    study      1                    0.477
6    subfield   1                    0.477
7    text       2                    0.176

Document 1: This is a data mining course.
    (1, 1, 0, 1, 0, 0, 0)

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    (0, 1, 0, 3, 1, 1, 2)

Document 3: Mining text is interesting, and I am interested in it.
    (0, 0, 2, 1, 0, 0, 1)

A Running Example Step 9 Compute the weights of the words

$$w(\text{word}_i) = TF(\text{word}_i) \times IDF(\text{word}_i)$$

$$TF(\text{word}_i) = \text{number of times } \text{word}_i \text{ appears in the document}$$

IDF in index order (course, data, interest, mine, study, subfield, text): 0.477, 0.176, 0.477, 0, 0.477, 0.477, 0.176

Document 1: This is a data mining course.
    Counts:  (1, 1, 0, 1, 0, 0, 0)
    Weights: (0.477, 0.176, 0, 0, 0, 0, 0)

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    Counts:  (0, 1, 0, 3, 1, 1, 2)
    Weights: (0, 0.176, 0, 0, 0.477, 0.477, 0.352)

Document 3: Mining text is interesting, and I am interested in it.
    Counts:  (0, 0, 2, 1, 0, 0, 1)
    Weights: (0, 0, 0.954, 0, 0, 0, 0.176)
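The weighting step is an element-wise product of each count vector with the IDF vector. The sketch below assumes a base-10 logarithm, which is consistent with the IDF values in the index; rounded to three decimals, the results should agree with the weight vectors of this step.

```python
import math

# Count vectors in index order: course, data, interest, mine, study, subfield, text.
counts = [
    [1, 1, 0, 1, 0, 0, 0],
    [0, 1, 0, 3, 1, 1, 2],
    [0, 0, 2, 1, 0, 0, 1],
]
document_frequency = [1, 2, 1, 3, 1, 1, 2]
total_documents = len(counts)

# IDF per term (base-10 logarithm assumed).
idf = [math.log10(total_documents / df) for df in document_frequency]

# Weight = term frequency * inverse document frequency, term by term.
weights = [[tf * w for tf, w in zip(vector, idf)] for vector in counts]

for vector in weights:
    print([round(w, 3) for w in vector])
```

Note that "mine" drops out of every vector: it occurs in all three documents, so its IDF is 0 even though it is the most frequent term.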

A Running Example Step 10 Normalize all documents to unit length

Each weight is divided by the length of its document vector:

$$w(\text{word}_i) \leftarrow \frac{w(\text{word}_i)}{\sqrt{w^2(\text{word}_1) + w^2(\text{word}_2) + \cdots + w^2(\text{word}_n)}}$$

Document 1: This is a data mining course.
    Counts:     (1, 1, 0, 1, 0, 0, 0)
    Normalized: (0.938, 0.346, 0, 0, 0, 0, 0)

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    Counts:     (0, 1, 0, 3, 1, 1, 2)
    Normalized: (0, 0.225, 0, 0, 0.611, 0.611, 0.450)

Document 3: Mining text is interesting, and I am interested in it.
    Counts:     (0, 0, 2, 1, 0, 0, 1)
    Normalized: (0, 0, 0.983, 0, 0, 0, 0.181)

Normalization

This is a data mining course.

Counts:  (1, 1, 0, 1, 0, 0, 0)
Weights: (0.477, 0.176, 0, 0, 0, 0, 0)

$$w(\text{course}) = \frac{0.477}{\sqrt{0.477^2 + 0.176^2 + 0^2 + 0^2 + 0^2 + 0^2 + 0^2}} = 0.938$$

Normalized: (0.938, 0.346, 0, 0, 0, 0, 0)
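The same normalization can be checked in code. This is a minimal sketch; the function name `normalize` is illustrative, and returning an all-zero vector unchanged is an assumption made here to avoid dividing by zero.

```python
import math


def normalize(vector):
    """Scale a weight vector to unit (Euclidean) length."""
    norm = math.sqrt(sum(w * w for w in vector))
    if norm == 0:
        return list(vector)  # an all-zero vector cannot be scaled
    return [w / norm for w in vector]


doc1_weights = [0.477, 0.176, 0, 0, 0, 0, 0]
doc1_unit = normalize(doc1_weights)

print([round(w, 3) for w in doc1_unit])
print(round(sum(w * w for w in doc1_unit), 6))  # squared length of the result
```

After normalization, long and short documents become comparable, because only the direction of the vector is kept.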

A Running Example

Everything becomes structured!

We can perform classification, clustering, etc.!

Document 1: This is a data mining course.
    (0.938, 0.346, 0, 0, 0, 0, 0)

Document 2: We are studying text mining. Text mining is a subfield of data mining.
    (0, 0.225, 0, 0, 0.611, 0.611, 0.450)

Document 3: Mining text is interesting, and I am interested in it.
    (0, 0, 0.983, 0, 0, 0, 0.181)

ID   word       document frequency   IDF
1    course     1                    0.477
2    data       2                    0.176
3    interest   1                    0.477
4    mine       3                    0
5    study      1                    0.477
6    subfield   1                    0.477
7    text       2                    0.176

Query A Document

How can we query the documents?

Simple! The steps are similar to the previous ones:

1. Remove stopwords.
2. Stem every word of the query string.
3. Transform the query string into a vector space model (VSM) using the TF-IDF scheme.
4. Normalize the VSM to unit length.

A Running Example Q = {interested in interesting data and text}

Original query: (interested in interesting data and text)
Step 1: Remove stop words: (interested interesting data text)
Step 2: Stemming: (interest interest data text)
Step 3: Remove duplication: (interest data text)
Step 4: Construct a vector space model: (0, 1, 1, 0, 0, 0, 1)
Step 5: Compute the weight of each word: (0, 0.176, 0.477, 0, 0, 0, 0.176)
Step 6: Normalize the vector space model: (0, 0.327, 0.887, 0, 0, 0, 0.327)

ID   word       document frequency   IDF
1    course     1                    0.477
2    data       2                    0.176
3    interest   1                    0.477
4    mine       3                    0
5    study      1                    0.477
6    subfield   1                    0.477
7    text       2                    0.176

Ranking Document by Similarity

[Figure: Zipfian distribution of term frequencies. Word frequency plotted against words, with stop words at the high-frequency end and rare words at the low-frequency end.]

[Figure: Document similarity. The query Q and the documents D1 and D2 drawn as vectors in a space with axes Term A and Term B.]

Vector similarity (dot product):

$$sim(Q, D) = \sum_{k=1}^{t} w_{qk} \, w_{dk}$$

Cosine vector similarity:

$$sim(Q, D) = \frac{\sum_{k=1}^{t} w_{qk} \, w_{dk}}{\sqrt{\sum_{k=1}^{t} (w_{qk})^2} \, \sqrt{\sum_{k=1}^{t} (w_{dk})^2}}$$

A Running Example The Result

Q: (0, 0.327, 0.887, 0, 0, 0, 0.327)
Document 1: (0.938, 0.346, 0, 0, 0, 0, 0)
Document 2: (0, 0.225, 0, 0, 0.611, 0.611, 0.450)
Document 3: (0, 0, 0.983, 0, 0, 0, 0.181)

$$cosine(P, Q) = \frac{\sum_i p_i q_i}{\sqrt{\sum_i p_i^2} \, \sqrt{\sum_i q_i^2}}$$

$$cosine(D1, Q) = \frac{0.346 \times 0.327}{\sqrt{0.938^2 + 0.346^2} \, \sqrt{0.327^2 + 0.887^2 + 0.327^2}} = 0.113$$

$$cosine(D2, Q) = \frac{0.225 \times 0.327 + 0.450 \times 0.327}{\sqrt{0.225^2 + 0.611^2 + 0.611^2 + 0.450^2} \, \sqrt{0.327^2 + 0.887^2 + 0.327^2}} = 0.221$$

$$cosine(D3, Q) = \frac{0.983 \times 0.887 + 0.181 \times 0.327}{\sqrt{0.983^2 + 0.181^2} \, \sqrt{0.327^2 + 0.887^2 + 0.327^2}} = 0.931$$

Conclusion: Return Document 3
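The whole query procedure can be sketched end to end: weight the query with the same index as the documents, normalize, and rank by cosine similarity. The code assumes a base-10 logarithm and a binary query vector (duplicates removed, as in the query steps); all names are illustrative.

```python
import math

VOCABULARY = ["course", "data", "interest", "mine", "study", "subfield", "text"]
DOCUMENT_FREQUENCY = [1, 2, 1, 3, 1, 1, 2]
COUNTS = {
    1: [1, 1, 0, 1, 0, 0, 0],
    2: [0, 1, 0, 3, 1, 1, 2],
    3: [0, 0, 2, 1, 0, 0, 1],
}


def tf_idf_unit_vector(term_counts, n_docs=3):
    """Weight counts by TF * IDF (base-10 log assumed), then scale to unit length."""
    weights = [tf * math.log10(n_docs / df)
               for tf, df in zip(term_counts, DOCUMENT_FREQUENCY)]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights] if norm else weights


def cosine(p, q):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Query terms after stop-word removal, stemming and de-duplication.
query_terms = {"interest", "data", "text"}
query_counts = [1 if w in query_terms else 0 for w in VOCABULARY]
query = tf_idf_unit_vector(query_counts)

scores = {doc_id: cosine(tf_idf_unit_vector(c), query) for doc_id, c in COUNTS.items()}
best = max(scores, key=scores.get)

for doc_id in sorted(scores, key=scores.get, reverse=True):
    print(doc_id, round(scores[doc_id], 3))
print("Return document", best)
```

Because every vector is already unit length, the cosine reduces to a plain dot product; the full formula is kept in `cosine` so the function also works on unnormalized input.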

QUIZ!

Given a query of W4 W5 and a collection of the following three documents:

Document 1: <W1 W2 W3 W4 W5>
Document 2: <W6 W7 W4 W5>
Document 3: <W8 W3 W9 W4 W10>

Use the Vector Space Model, TF/IDF weighting scheme, and Cosine vector similarity measure to find the most relevant document(s) to the query.

IDF

Word   DF   IDF
W1     1    0.477
W2     1    0.477
W3     2    0.176
W4     3    0
W5     2    0.176
W6     1    0.477
W7     1    0.477
W8     1    0.477
W9     1    0.477
W10    1    0.477

VSM

D1 = (1,1,1,1,1,0,0,0,0,0) => (0.477, 0.477, 0.176, 0, 0.176, 0, 0, 0, 0, 0)
D2 = (0,0,0,1,1,1,1,0,0,0) => (0, 0, 0, 0, 0.176, 0.477, 0.477, 0, 0, 0)
D3 = (0,0,1,1,0,0,0,1,1,1) => (0, 0, 0.176, 0, 0, 0, 0, 0.477, 0.477, 0.477)

Normalization

D1 = [0.6634 0.6634 0.2448 0 0.2448 0 0 0 0 0]
D2 = [0 0 0 0 0.2525 0.6842 0.6842 0 0 0]
D3 = [0 0 0.2084 0 0 0 0 0.5647 0.5647 0.5647]

Query

Q = (0,0,0,0,0.176,0,0,0,0,0) => (0,0,0,0,1,0,0,0,0,0)

Cosine_sim(Q,D1) = 0.2448
Cosine_sim(Q,D2) = 0.2525
Cosine_sim(Q,D3) = 0

Document 2 is therefore the most relevant, followed closely by Document 1.
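The quiz can be verified with a short script that works directly from the token lists. It is a sketch under the same assumptions as the slides' numbers (base-10 logarithm, raw counts as term frequency); vectors are stored as sparse dictionaries, which is an implementation choice made here, not something the slides prescribe.

```python
import math
from collections import Counter

documents = {
    "D1": ["W1", "W2", "W3", "W4", "W5"],
    "D2": ["W6", "W7", "W4", "W5"],
    "D3": ["W8", "W3", "W9", "W4", "W10"],
}
query = ["W4", "W5"]

n_docs = len(documents)
# Document frequency: number of documents that contain each word.
df = Counter(term for tokens in documents.values() for term in set(tokens))


def unit_vector(tokens):
    """Sparse TF-IDF vector (word -> weight), scaled to unit length."""
    tf = Counter(tokens)
    weights = {t: c * math.log10(n_docs / df[t]) for t, c in tf.items() if t in df}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else weights


def cosine(p, q):
    """Dot product of two unit-length sparse vectors, i.e. their cosine."""
    return sum(w * q.get(t, 0.0) for t, w in p.items())


q = unit_vector(query)
scores = {name: cosine(unit_vector(tokens), q) for name, tokens in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 4))
```

W4 contributes nothing to any score because it occurs in all three documents, so the ranking is decided by W5 alone.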

Simple? Well...

What we have discussed so far is only a general framework. There are still a lot of issues:

How to define a stopword?
    "A" is usually regarded as a stopword. However, "Vitamin A" may be an important term in an article.

How to perform stemming?
    What to stem and what not to stem? Should "booking" be converted to "book"?
    How to stem? There are many new words every day!

Spelling errors?
    Spelling errors always appear in documents!
    Should we consider two similar words as the same word?
    Are they the same: "classification" and "classificatiam"? But then, how about "information" and "informatics"?
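One way to make "similar word" concrete is edit distance (Levenshtein distance): the minimum number of single-character insertions, deletions and substitutions needed to turn one word into the other. The slides do not name a method, so this choice is an assumption made purely for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, row by row)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(edit_distance("classification", "classificatiam"))  # a misspelling
print(edit_distance("information", "informatics"))        # two different words
```

Both pairs differ by two substitutions, so they get the same distance. That is exactly the difficulty raised above: a threshold that merges the misspelling would also merge two genuinely different words.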

IR Evaluation: Recall and Precision

               Relevant   NonRelevant
Retrieved      A          C
NotRetrieved   B          D

Precision = A/(A+C)
Recall = A/(A+B)
Fallout = C/(C+D)
Ndatabase = A+B+C+D

[Figure: the retrieved set and the relevant set, illustrating the cases Precision = 100% and Recall = 100%.]
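The three measures follow directly from the four counts. In the sketch below the counts passed to the function are hypothetical numbers chosen only to exercise the formulas, and returning 0.0 for an empty denominator is an assumption made here.

```python
def ir_metrics(a, b, c, d):
    """Return (precision, recall, fallout) from the four contingency counts.

    a: relevant and retrieved        b: relevant but not retrieved
    c: non-relevant but retrieved    d: non-relevant and not retrieved
    """
    precision = a / (a + c) if (a + c) else 0.0
    recall = a / (a + b) if (a + b) else 0.0
    fallout = c / (c + d) if (c + d) else 0.0
    return precision, recall, fallout


# Hypothetical collection of 1000 documents, 50 of them relevant, 40 retrieved.
precision, recall, fallout = ir_metrics(a=30, b=20, c=10, d=940)
print(precision, recall, round(fallout, 4))
```

Precision and recall pull in opposite directions: retrieving every document gives 100% recall but poor precision, while retrieving a single relevant document gives 100% precision but poor recall.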

Summary

Information Retrieval Concepts
    VSM Model
    Similarity Measure
    IR Evaluations

Next Week:

Web Mining
