Introduction To Text Mining
CONTENTS
∙ Text Mining: How Can It Differentiate Products and Services?
∙ Getting Started
∙ Development Environment
∙ Conclusion
CHRIS LAMB
PROFESSOR AND PRINCIPAL SCIENTIST
WHAT IS TEXT MINING?
Text mining is an ambiguous term for extracting useful information from otherwise unstructured text. There are two particular terms we need to pay close attention to when defining "text mining." Unstructured text means that the information is not stored in a structured format like XML or a database table. The text is still unstructured, and not accessible to modern data analysis techniques as a result. Mining text allows you to extract underlying information from these unstructured data sources that you can then structure, analyze, and process.

… we read. Like driving a car, once we learn how to do it, we take it for granted.

For example, you may be analyzing English text, but that text may …

1. Loper, Edward, Klein, Ewan, and Bird, Steven. Natural Language Processing with Python. O'Reilly, 2009.
… explosion of computation power, additional algorithm development, and machine learning techniques. New advances in this area promise to revolutionize customer service, business intelligence, and a myriad of other fields. Due to that complexity, these topics are out of the scope of this Refcard, and instead, we will focus on traditional text analytic techniques.

Effective text mining opens up new application areas while improving the quality of existing ones. Customer service systems with integrated text analysis and effective voice-to-text capabilities can build analytic pipelines supporting real-time sentiment analysis, allowing representatives to engage with customers knowing their …

… the Natural Language Toolkit,2 for the most part. R is another popular platform for text processing, but I prefer using Python because of its extensive collection of libraries. I suggest using Anaconda3 for this kind of work as well, as it will allow you to create custom isolated Python environments that you can use for a variety of things.

I'll omit this in the following examples, but you can insert it wherever needed. The downloader won't download the books if you've already done so, so you can include this at the top of any code you write.
KEY METHODS AND TECHNIQUES

WORD FREQUENCY
Word frequency measures a given text and provides insight into the topics discussed and key concepts.

from nltk.probability import FreqDist

distribution = FreqDist(paradise_lost)
print(distribution.most_common(50))
distribution.plot(50, cumulative=False)

Here, we're extracting the tokens from the text and graphing them. This allows you to see the most common tokens.

We can also examine the characteristics of words, like so:

long_words = [w for w in paradise_lost if len(w) > 10]
distribution = FreqDist(long_words)
print(distribution.most_common(50))

The group of words in paradise_lost can be treated as a Python list, and then used as an argument to the various distribution tools, like FreqDist or ConditionalFreqDist.

N-GRAMS
N-grams are essentially lists of N-sized contiguous tokens in a corpus. You can derive N-grams using native Python tools. However, this can be time-consuming, and the resulting list of N-grams needs to be processed into something useful.

trigrams = generate_ngrams(n=3, corpus=paradise_lost)

Here, we generate an initial collection of trigrams, and then we count the most common ones, sorting the resulting list in descending order. Note that we've converted all words into lower case prior to processing to avoid differences between 'Word' and 'word'.
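The generate_ngrams helper is defined elsewhere in the Refcard and not shown in this excerpt; a minimal pure-Python reconstruction consistent with how it's used here, including the lower-casing described above, might look like:

```python
from collections import Counter

def generate_ngrams(n, corpus):
    # Hypothetical reconstruction: lower-case the tokens so 'Word' and
    # 'word' count as the same item, then slide a window of size n.
    words = [w.lower() for w in corpus]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

# Small demonstration token list in place of paradise_lost:
sample = ['Of', 'Man', 'and', 'of', 'man', 'and', 'of', 'man']
trigrams = generate_ngrams(n=3, corpus=sample)

# Count the trigrams and list the most common in descending order.
print(Counter(trigrams).most_common(2))
```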
COLLOCATION
A collocation is a sequence of words that occur more frequently than you'd expect. They can provide insight into common terminology, overall sentiment, and the primary theme of a given corpus. Bigrams and trigrams are examples of collocations. When we covered N-grams, we did things the hard way. Now that we understand more clearly what they are, we can lean on NLTK to find these for us:

import nltk
from nltk.collocations import BigramAssocMeasures, TrigramAssocMeasures, BigramCollocationFinder, TrigramCollocationFinder

nltk.download('gutenberg')
paradise_lost = nltk.corpus.gutenberg.words('milton-paradise.txt')

bigram_measures = BigramAssocMeasures()
trigram_measures = TrigramAssocMeasures()

bigram_finder = BigramCollocationFinder.from_words(paradise_lost)
bigrams = bigram_finder.nbest(bigram_measures.raw_freq, 10)

trigram_finder = TrigramCollocationFinder.from_words(paradise_lost)
trigrams = trigram_finder.nbest(trigram_measures.raw_freq, 10)
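Because raw_freq ranks pairs purely by how often they occur, very common function-word pairs tend to dominate the results. One possible refinement, sketched here on a tiny in-memory token list standing in for paradise_lost, uses NLTK's apply_freq_filter to drop rare pairs before scoring:

```python
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Tiny stand-in token list; in practice you'd pass paradise_lost.
words = ['of', 'man', 'and', 'the', 'fruit', 'of', 'man', 'and',
         'the', 'tree', 'of', 'man']

bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(words)

# Ignore bigrams seen fewer than 2 times before scoring.
finder.apply_freq_filter(2)

# Score the survivors by frequency (bigram_measures.pmi is another option).
print(finder.nbest(bigram_measures.raw_freq, 5))
```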