Text Analysis

This document discusses text mining and word clouds. It explains that text mining is used to convert unstructured text data into a meaningful form by identifying and building features. The bag of words approach treats each document as a collection of words while ignoring word order and structure. Cleaning text involves steps like converting to lowercase, removing stopwords and punctuation, and reducing words to their root form. Word clouds then visually represent the most frequent terms in a document with the term sizes proportional to their frequencies.

Uploaded by Grace Yin

Text Mining and Word Clouds

Text is Everywhere

• Medical Records
• Consumer Complaint Logs
• Product Inquiries
• Social Media Posts (Twitter feed, Emails, Facebook status,
Reddit comments, etc.)
• Personal Webpages

Text Mining deals with converting this vast amount of data into a meaningful form.
• Structured data:
  • Well organized
  • Common, agreed-upon features in each data sample
  • Formats: tables, relational databases, etc.
  • Sources: government, industry, CRM, markets, etc.
• Unstructured data:
  • Not well organized; unclear what the features are
  • A lot of heterogeneity between data samples
  • Formats: text, images, video, audio, etc.
  • Sources: social media, security cameras, etc.

• Unstructured data is unstructured because adding structure is hard work
• It is not designed for analysis
Adding Structure

[Diagram: raw, unstructured text (“blah, blah, blah, …”) passes through an “Identify and Build Features” step to become a table of features (f1, f2, …) with values (val11, val12, …; val21, val22, …), which can then be used to Explore, Explain, and Predict.]

Text Data is Difficult to Analyze

• Text data is “unstructured”: it does not come in a well-formatted table with each field having a specific meaning!
• Text has a linguistic structure that is easily understood by humans (not computers)
• Words vary in length, and the order of words matters
• The data tends to have poor quality: spelling mistakes, abbreviations, punctuation, etc.

Text data must undergo extensive preprocessing before being used in any analytics algorithm/application.
Bag of Words Approach
• Treat a document as a collection of individual words, i.e.
Ignore Grammar, Word Order, Sentence Structure, etc.
• Each word is equally likely to be an important keyword.
• The words that appear the most in the document are the most important keywords (the most valuable features).
• The term frequency TF(t, d) is the number of times a particular word t appears in a document d (it may also be normalized).

“all data mining involves the use of machine learning but not all
machine learning requires data mining”

Term      Freq.
all       2
data      2
mining    2
involves  1
the       1
use       1
of        1
machine   2
learning  2
but       1
not       1
requires  1
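The term frequencies above can be reproduced in a few lines by treating the sentence as a bag of words and counting. A minimal sketch, written here in Python purely for illustration (the slides' own implementation, in R, follows later):

```python
from collections import Counter

sentence = ("all data mining involves the use of machine learning "
            "but not all machine learning requires data mining")

# Bag of words: split on whitespace and count, ignoring order and structure.
tf = Counter(sentence.split())

print(tf["data"], tf["machine"], tf["involves"])  # → 2 2 1
```

Counter is simply a dictionary from term to count, i.e. exactly the TF(t, d) table above.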
Advantages
• A very simple representation
• Inexpensive to generate
• Works in many settings: technical reports, prescriptions, …
• Often works surprisingly well!
• “a duck walked up to a lemonade stand”
• “a horse walked up to a lemonade
stand”
• “The Duck walks near the Lemonade
Stand”
According to bag of words:

[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is similar to
[“a”, “horse”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]

BUT
[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is not similar to
[“The”, “Duck”, “walks”, “near”, “the”, “Lemonade”, “Stand”]
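This (dis)similarity can be made concrete with a word-overlap measure. A sketch using Jaccard similarity on raw, uncleaned tokens (Jaccard is one common choice here, not the only one, and is used for illustration only):

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

duck  = "a duck walked up to a lemonade stand"
horse = "a horse walked up to a lemonade stand"
duck2 = "The Duck walks near the Lemonade Stand"

print(jaccard(duck, horse))  # high: only "duck" vs "horse" differ
print(jaccard(duck, duck2))  # zero: case and inflection hide the overlap
```

The third sentence shares no exact tokens with the first, which motivates the cleaning steps (lowercasing, stemming) described next.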
Cleaning the Text

• Convert the text to lower case.
• Remove common stopwords like “the”, “we”, “and”, etc.
  • “not” is not a good stopword. Why?
• Remove numbers (or replace them with words).
• Remove punctuation like “.”, “,”, etc.
• Reduce the words to their root (word stemming). Example: “announces”, “announced”, “announcing” are all reduced to “announc”.
• Remove unnecessary white space.
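The steps above can be sketched end to end without any text-mining library. This is a deliberately crude Python illustration: the tiny hand-written stopword list and the suffix-stripping "stemmer" are stand-ins for tm's stopwords("english") and SnowballC's real Porter stemmer used in the R code below.

```python
import re

# Tiny illustrative stopword list; "not" is deliberately kept, since removing
# it would flip the meaning of negated sentences.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "but", "we"}

def crude_stem(word):
    # Crude stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) > 2:
            return word[: -len(suffix)]
    return word

def clean(text):
    text = text.lower()                                      # convert to lower case
    text = re.sub(r"\d+", "", text)                          # remove numbers
    text = re.sub(r"[^\w\s]", "", text)                      # remove punctuation
    words = [w for w in text.split() if w not in STOPWORDS]  # remove stopwords
    words = [crude_stem(w) for w in words]                   # reduce words to their root
    return " ".join(words)                                   # split/join normalizes white space

print(clean("Announces, announced, ANNOUNCING!"))  # → announc announc announc
```

Note that all three inflections collapse to the same root, matching the “announc” example above.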
Cleaning the Text in R

Load the required libraries


library("tm") # Text Mining Library
library("SnowballC") # For reducing words to their root

Create the text document object

myDocument <- Corpus(VectorSource("All data mining involves the use of machine learning, but not all machine learning requires data mining."))

Clean the Text


myDocument <- tm_map(myDocument, content_transformer(tolower)) #Convert to lower case
myDocument <- tm_map(myDocument, removeWords, stopwords("english")) #Remove stopwords
myDocument <- tm_map(myDocument, removeNumbers) #Remove numbers
myDocument <- tm_map(myDocument, removePunctuation) #Remove punctuation
myDocument <- tm_map(myDocument, stemDocument) #Reduce the words to their root
myDocument <- tm_map(myDocument, stripWhitespace) #Remove unnecessary white space
Getting Term Frequency Table in R

termMatrix <- as.matrix(TermDocumentMatrix(myDocument)) # Get term/frequency matrix
sortedtermMatrix <- sort(rowSums(termMatrix), decreasing = TRUE) # Sort in decreasing order of frequency
d <- data.frame("Term" = names(sortedtermMatrix), "Freq." = sortedtermMatrix,
                row.names = NULL) # Store as a data frame
print(d) # Display the data frame
Word Cloud

Word clouds are commonly used to visualize/highlight keywords in documents.
• Artistically place words with sizes proportional to their frequency of occurrence.
• Typically, the exact position of a word does not mean anything.

library("wordcloud") # Word Cloud Library
library("RColorBrewer") # For the brewer.pal color palettes
wordcloud(words = d$Term, freq = d$Freq., colors = brewer.pal(8, "Dark2"))
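The "size proportional to frequency" rule can be sketched as a simple linear map from term frequency to font size. The Python helper below (`font_sizes`) is hypothetical and for illustration only; it is not part of the wordcloud package, whose actual scaling is controlled by its `scale` argument:

```python
def font_sizes(freqs, min_pt=10, max_pt=48):
    """Map each term's frequency linearly onto [min_pt, max_pt] points."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all frequencies are equal
    return {t: min_pt + (f - lo) * (max_pt - min_pt) / span
            for t, f in freqs.items()}

sizes = font_sizes({"data": 2, "mining": 2, "involves": 1})
print(sizes)  # the most frequent terms get the largest font
```

Terms tied for the highest frequency share the largest size, which is why “data”, “mining”, and “machine” would dominate a cloud of the example sentence.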
