Module5-Representing and Mining Text

1) Text documents can be represented as "bags of words" where each unique word becomes a feature and the presence of a word in a document is indicated with a 1 or 0. 2) Common pre-processing steps include normalization, stemming, stopword removal and term frequency normalization. 3) TF-IDF weighting takes into account the frequency of words in documents and across the entire corpus to emphasize more important words.

Uploaded by Green Mongor

Module 5

Representing and Mining Text

Dealing with Text
• Data are represented in ways natural to the problems from which they were derived.

• There is a vast amount of text.

• If we want to apply the many data mining tools that we have at our disposal, we must
• either engineer the data representation to match the tools (representation engineering), or
• build new tools to match the data.

Why Text is Difficult
• Text is “unstructured” data.
• Linguistic structure is intended for human communication and
not computers.

• Word order matters sometimes.

• Text can be dirty.

• People write ungrammatically, misspell words, abbreviate unpredictably, and punctuate randomly.
• It may contain synonyms, homographs, abbreviations, etc.

• Context matters.
Text Representation

• Goal: Take a set of documents, each of which is a relatively free-form sequence of words, and turn it into our familiar feature-vector form.

• A collection of documents is called a corpus.

• A document is composed of individual tokens or terms.

• Each document is one instance, but we don’t know in advance what the features will be.
“Bag of Words”
• Treat every document as just a collection of individual words.
• Ignore grammar, word order, sentence structure, and (usually)
punctuation.
• Treat every word in a document as a potentially important keyword of the
document.

• What will be the feature’s value in a given document?

• Each token is a feature whose value is one (if the token is present in the document) or zero (if the token is not present in the document).

• Straightforward representation
• Inexpensive to generate.
• Tends to work well for many tasks.
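
The binary bag-of-words representation described above can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the function name `binary_bag_of_words` and the whitespace tokenization are assumptions made for the example.

```python
def binary_bag_of_words(documents):
    """Return (vocabulary, vectors) with one 0/1 vector per document.

    Assumption for illustration: tokens are lowercase, whitespace-separated words.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # The features are not known in advance: they are the unique tokens in the corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        present = set(tokens)
        # 1 if the token occurs in this document, 0 otherwise.
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors


vocabulary, vectors = binary_bag_of_words(["the cat sat", "the dog sat down"])
print(vocabulary)
for vector in vectors:
    print(vector)
```

Each row has one entry per vocabulary term, so grammar and word order are discarded, as the slide states.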
Pre-processing of Text
The following steps should be performed:

• The case should be normalized

• Every term is in lowercase

• Words should be stemmed

• Suffixes are removed
• E.g., noun plurals are transformed to singular forms

• Stop-words should be removed

• A stop-word is a very common word in English (or whatever language is being parsed).
• Typical stop-words such as “the”, “and”, “of”, and “on” are removed.
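
The pre-processing steps above can be sketched as a small pipeline. This is a rough sketch under stated assumptions: `STOP_WORDS` is a tiny illustrative set rather than a standard list, and `naive_stem` is a hypothetical, deliberately crude suffix stripper that only handles simple plurals. A real project would more likely use an established stemmer such as the Porter stemmer.

```python
import re

# Illustrative stop-word set only; real lists are much longer.
STOP_WORDS = {"the", "and", "of", "on", "in", "a", "an", "who"}


def naive_stem(token):
    """Crude stemming (assumption): turn simple noun plurals into singular forms."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"          # melodies -> melody
    if token.endswith("s") and not token.endswith("ss") and len(token) > 3:
        return token[:-1]                # bands -> band
    return token


def preprocess(text):
    """Lowercase the text, split it into tokens, drop stop-words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [naive_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The Bands and the Melodies of Jazz"))
```

Stop-words are filtered before stemming here so that the stop-word list can contain ordinary unstemmed words; other orderings are possible.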
Term Frequency

• Use the word count (frequency) in the document instead of just a zero or one.

• This distinguishes a word used once from a word used many times.

Normalized Term Frequency

• Documents have various lengths, so raw counts favor long documents.

• Words occur with different frequencies.

• Words should not be too common or too rare.
• Impose both an upper and a lower limit on the number (or fraction) of documents in which a word may occur.
• Feature selection is often employed.

• The raw term frequencies are normalized in some way,

• such as by dividing each by the total number of words in the document
• or by the frequency of the specific term in the corpus.
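
A minimal sketch of these ideas, assuming documents are already tokenized into lists of terms. The function names and the parameters `min_docs` and `max_fraction` are hypothetical, chosen for this illustration. Normalization here divides each count by the document length, which is one of the options listed above.

```python
from collections import Counter


def term_frequencies(tokens):
    """Raw term frequency: how many times each term occurs in one document."""
    return Counter(tokens)


def normalized_term_frequencies(tokens):
    """Divide each raw count by the total number of words in the document."""
    total = len(tokens)
    if total == 0:
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


def filter_vocabulary(tokenized_docs, min_docs=1, max_fraction=1.0):
    """Keep terms that are neither too rare nor too common across documents."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {
        term
        for term, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_fraction
    }


doc = ["jazz", "swing", "jazz", "bebop"]
print(term_frequencies(doc))
print(normalized_term_frequencies(doc))
```

With length normalization, a term that fills half of a short document gets the same weight as one that fills half of a long document.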
TF-IDF

$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$

• Inverse Document Frequency (IDF) of a term $t$:

$$\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$$
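
The TF-IDF weighting, with IDF(t) = 1 + log(total number of documents / number of documents containing t), can be sketched as follows. Two assumptions are made for this illustration because the slides do not specify them: the logarithm is the natural log, and TF is the raw count of the term in the document. The function names are hypothetical.

```python
import math
from collections import Counter


def idf(term, tokenized_docs):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    n_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    # Natural log is an assumption; the slides do not state the base.
    return 1 + math.log(len(tokenized_docs) / n_containing)


def tfidf(term, tokens, tokenized_docs):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF taken as the raw count."""
    return Counter(tokens)[term] * idf(term, tokenized_docs)


corpus = [["jazz", "sax"], ["jazz", "piano"], ["jazz", "sax", "sax"]]
for term in ("jazz", "sax", "piano"):
    print(term, round(idf(term, corpus), 3))
```

A term that occurs in every document gets the minimum IDF of 1, while rarer terms receive higher weights.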
Example: Jazz Musicians

• 15 prominent jazz musicians and excerpts of their biographies from Wikipedia.

• Nearly 2,000 features after stemming and stop-word removal!

• Consider the sample phrase “Famous jazz saxophonist born in Kansas who played bebop and latin” after stemming.

[Figure: Representation of the query “Famous jazz saxophonist born in Kansas who played bebop and latin” after stop-word removal and term frequency normalization.]

[Figure: Final TFIDF representation of the query “Famous jazz saxophonist born in Kansas who played bebop and latin”.]
Beyond “Bag of Words”

• N-gram Sequences

• Named Entity Extraction

• Topic Models
N-gram Sequences
• In some cases, word order is important and you want to preserve
some information about it in the representation

• A next step up in complexity is to include sequences of adjacent words as terms

• Adjacent pairs are commonly called bi-grams

• Example: “The quick brown fox jumps”

• With the stop-word “the” removed, it would be transformed into {quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps}

• N-grams greatly increase the size of the feature set
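
A short sketch of how adjacent-word terms might be generated from an already pre-processed token list. The function names are hypothetical, and joining words with an underscore simply mirrors the notation of the example above.

```python
def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unigrams_and_bigrams(tokens):
    """Individual words plus adjacent pairs (bi-grams) as terms."""
    return list(tokens) + ngrams(tokens, 2)


tokens = ["quick", "brown", "fox", "jumps"]
print(unigrams_and_bigrams(tokens))
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']
```

Four words already yield seven terms here, which illustrates how quickly the feature set grows once longer sequences are included.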

Topic Models
Text Mining Example

Task: predict the stock market based on the stories that appear on the news wires.

Mining News Stories to Predict Stock Price Movement
