Module 5
Representing and Mining Text
Dealing with Text
• Data are represented in ways natural to the problems from which
they were derived.
• A vast amount of data exists as text.
• If we want to apply the many data mining tools that we have at
our disposal, we must
• either engineer the data representation to match the tools
(representation engineering), or
• build new tools to match the data.
Why Text is Difficult
• Text is “unstructured” data.
• Linguistic structure is intended for human communication,
not for computers.
• Word order matters sometimes.
• Text can be dirty.
• People write ungrammatically, misspell words, abbreviate
unpredictably, and punctuate randomly.
• It may contain synonyms, homographs, abbreviations, etc.
• Context matters.
Text Representation
• Goal: Take a set of documents, each of which is a relatively
free-form sequence of words, and turn it into our familiar
feature-vector form.
• A collection of documents is called a corpus.
• A document is composed of individual tokens or terms.
• Each document is one instance,
• but we don’t know in advance what the features will be.
“Bag of Words”
• Treat every document as just a collection of individual words.
• Ignore grammar, word order, sentence structure, and (usually)
punctuation.
• Treat every word in a document as a potentially important keyword of the
document.
• What will be the feature’s value in a given document?
• Each token is a feature whose value is one (if the token is present in the
document) or zero (if the token is not present in the document).
• Straightforward representation
• Inexpensive to generate.
• Tends to work well for many tasks.
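The binary representation described above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the function name `binary_bag_of_words`, the whitespace tokenization, and the two-document corpus are all assumptions made for the example.

```python
def binary_bag_of_words(documents):
    """Return (vocabulary, vectors) with one 0/1 feature per distinct token."""
    # Naive tokenization (an assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # The features are only known after scanning the whole corpus.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        present = set(tokens)
        # 1 if the token occurs in this document, 0 otherwise.
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors


corpus = ["jazz is great", "great jazz great sax"]  # made-up documents
vocabulary, vectors = binary_bag_of_words(corpus)
print(vocabulary)  # ['great', 'is', 'jazz', 'sax']
print(vectors)     # [[1, 1, 1, 0], [1, 0, 1, 1]]
```

Note that "great" appears twice in the second document but still contributes a single 1: the binary form records presence only.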
Pre-processing of Text
The following steps should be performed:
• The case should be normalized
• Every term is in lowercase
• Words should be stemmed
• Suffixes are removed
• E.g., noun plurals are transformed to singular forms
• Stop-words should be removed
• A stop-word is a very common word in English (or whatever language is
being parsed)
• Typical stop-words such as the, and, of, and on are removed
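The three steps above might look as follows in Python. This is a rough sketch under stated assumptions: `STOP_WORDS` is a tiny hand-picked list, and `toy_stem` is a hypothetical stand-in that only strips a trailing plural "s". A real stemmer (for example a Porter-style one) handles many more suffixes. Stop-words are removed before stemming here so that the stop list can be written with unstemmed words; that ordering is a choice made for this example.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"the", "and", "of", "on", "a", "in"}


def toy_stem(word):
    """Crude plural stripping for illustration only (not a real stemmer)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def preprocess(text):
    # 1. Normalize case and keep only alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # 2. Remove stop-words.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # 3. Stem what remains.
    return [toy_stem(t) for t in tokens]


print(preprocess("The Saxophonists and the Drummers played on stages"))
# ['saxophonist', 'drummer', 'played', 'stage']
```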
Term Frequency
• Use the word count (frequency) in the document instead of just a zero
or one.
• Distinguishes a word used once from a word used many times.
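Counting occurrences instead of recording presence is a one-line change with the standard library's `collections.Counter`. The token list below is made up for illustration, and `term_frequencies` is a hypothetical helper name.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to the number of times it occurs in one document."""
    return Counter(tokens)


tokens = ["jazz", "sax", "jazz", "bebop", "jazz"]  # one pre-tokenized document
tf = term_frequencies(tokens)
# A Counter returns 0 for terms that never occur.
print(tf["jazz"], tf["sax"], tf["latin"])  # 3 1 0
```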
Normalized Term Frequency
• Documents vary in length.
• Words occur with different frequencies.
• Words should not be too common or too rare.
• Both upper and lower limit on the number (or fraction) of documents in
which a word may occur.
• Feature selection is often employed.
• The raw term frequencies are normalized in some way,
• such as by dividing each by the total number of words in the document
• or the frequency of the specific term in the corpus.
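Two of the ideas above, length normalization and document-frequency limits, can be sketched as follows. The function names, the `min_docs` and `max_fraction` parameters, and the three-document corpus are assumptions chosen for the example. Dividing by document length is only one of the normalization options mentioned.

```python
from collections import Counter


def normalized_tf(tokens):
    """Divide each raw count by the total number of words in the document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def filter_by_document_frequency(tokenized_docs, min_docs, max_fraction):
    """Keep terms that are neither too rare nor too common across the corpus."""
    n_docs = len(tokenized_docs)
    # Count each term once per document in which it appears.
    doc_freq = Counter(term for tokens in tokenized_docs for term in set(tokens))
    return {
        term
        for term, n in doc_freq.items()
        if n >= min_docs and n / n_docs <= max_fraction
    }


docs = [["jazz", "sax", "jazz", "bebop"], ["jazz", "piano"], ["jazz", "sax"]]
print(normalized_tf(docs[0]))  # {'jazz': 0.5, 'sax': 0.25, 'bebop': 0.25}
print(filter_by_document_frequency(docs, min_docs=2, max_fraction=0.9))  # {'sax'}
```

In this toy corpus "jazz" is dropped for appearing in every document, while "bebop" and "piano" are dropped for appearing in only one.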
TF-IDF
$$\mathrm{TFIDF}(t, d) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t)$$
• Inverse Document Frequency (IDF) of a term
$$\mathrm{IDF}(t) = 1 + \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing } t}\right)$$
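The two formulas above translate directly into code. In this sketch the natural logarithm is an assumption (the slide does not state the base), raw counts are used for TF, and the function names and toy corpus are hypothetical.

```python
import math
from collections import Counter


def idf(term, tokenized_docs):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    n_docs = len(tokenized_docs)
    n_containing = sum(1 for tokens in tokenized_docs if term in tokens)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    # Natural log is an assumption; the source does not specify the base.
    return 1 + math.log(n_docs / n_containing)


def tfidf(term, tokens, tokenized_docs):
    """TFIDF(t, d) = TF(t, d) * IDF(t), with TF as the raw count in d."""
    return Counter(tokens)[term] * idf(term, tokenized_docs)


docs = [["jazz", "sax", "jazz", "bebop"], ["jazz", "piano"], ["jazz", "sax"]]
print(idf("jazz", docs))             # 1.0, since "jazz" is in every document
print(idf("bebop", docs))            # 1 + ln(3), since "bebop" is in one of three
print(tfidf("jazz", docs[0], docs))  # 2.0, i.e. count 2 times IDF 1.0
```

A term that appears in every document gets the minimum IDF of 1, so its TFIDF value reduces to its term frequency; rarer terms are boosted.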
Example: Jazz Musicians
• 15 prominent jazz musicians and excerpts of their
biographies from Wikipedia.
• Nearly 2,000 features after stemming and stop-word
removal.
• Consider the sample phrase “Famous jazz saxophonist
born in Kansas who played bebop and latin” after
stemming.
Representation of the query “Famous jazz saxophonist born in Kansas who played
bebop and latin” after stop-word removal and term frequency normalization.
Final TFIDF Representation of the query “Famous jazz saxophonist born in
Kansas who played bebop and latin”.
Beyond “Bag of Words”
• N-gram Sequences
• Named Entity Extraction
• Topic Models
N-gram Sequences
• In some cases, word order is important and you want to preserve
some information about it in the representation
• A next step up in complexity is to include sequences of adjacent
words as terms
• Adjacent pairs are commonly called bi-grams
• Example: “The quick brown fox jumps”
• After lowercasing and stop-word removal, it would be transformed into
{quick, brown, fox, jumps, quick_brown, brown_fox, fox_jumps}
• N-grams greatly increase the size of the feature set
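The fox example can be reproduced with a short helper. The function name `ngrams` and the underscore joiner are illustrative choices, and the input is assumed to be already lowercased with stop-words removed.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["quick", "brown", "fox", "jumps"]  # "The" already removed as a stop-word
features = tokens + ngrams(tokens, 2)        # single words plus bi-grams
print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']
```

Four tokens already yield three extra bi-gram features. Across a corpus, the number of distinct adjacent pairs typically far exceeds the number of distinct words, which is why the feature set grows so quickly.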
Topic Models
Text Mining Example
Task: predict the stock market based on the stories that appear on
the news wires.
Mining News Stories to Predict Stock Price
Movement