Text Mining in R: A Tutorial
This tutorial is built for people who want to learn the essential tasks required to process text
for meaningful analysis in R, one of the most popular open-source programming languages
for data science. By the end of this tutorial, you’ll have developed the skills to read in large files
of text and derive meaningful insights you can share from that analysis. You’ll have learned how
to do text mining in R, an essential data mining skill. The tutorial is meant to be followed along with
plenty of tangible code examples. The full repository with all of the files and data is here if you wish
to follow along.
If you don’t have an R environment set up already, the easiest way to follow along is to
use Jupyter with R. Jupyter offers an interactive R environment where you can easily modify inputs
and see the outputs immediately, so you can quickly get up to speed on text mining in R.
Text mining deals with helping computers understand the “meaning” of text. Common
text mining applications include sentiment analysis, e.g. determining whether a Tweet about a movie is
positive or negative, and text classification, e.g. classifying the mail you receive as spam or ham.
In this tutorial, we’ll learn about text mining and use some R libraries to implement some common
text mining techniques. We’ll learn how to do sentiment analysis, how to build word clouds, and
how to process your text so that you can do meaningful analysis with it.
R
R is succinctly described as “a language and environment for statistical computing and graphics,”
which makes it worth knowing if you’re dabbling in data science, statistics, or
exploratory data analysis. R has a wide variety of useful packages.
Here, we’ll focus on the R packages useful for understanding and extracting insights from text.
Text preprocessing
Before we dive into analyzing text, we need to preprocess it. Raw text contains white space,
punctuation, stop words, and other characters that do not convey much information and are hard to
process. For example, English stop words like “the” and “is” tell you little
about the sentiment of the text, the entities mentioned in it, or the relationships between those
entities. Depending upon the task at hand, we deal with such characters differently; removing them
helps focus the analysis on the important words.
Word cloud
A word cloud is a simple yet informative way to understand textual data and to do text analysis. In
this example, we will visualize Hillary Clinton’s emails. This will help us quantify the content
of the emails, derive insights, and better communicate our results. Along the way, we’ll
also learn about some data preprocessing steps that are immensely helpful in other text mining
tasks as well. Let’s start with getting the data. You can head over to Kaggle to download the
dataset.
Let’s read the data and learn to implement the preprocessing steps.
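Here’s a minimal sketch of that step. The table and column names, Emails and ExtractedBodyText, follow the Kaggle dataset’s published schema; adjust them if your copy differs.

library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver

# Connect to the SQLite file downloaded from Kaggle
db <- dbConnect(RSQLite::SQLite(), dbname = "database.sqlite")

# Pull the extracted body text of every non-empty email into a data frame
emails <- dbGetQuery(db, "SELECT ExtractedBodyText FROM Emails
                          WHERE ExtractedBodyText != ''")
dbDisconnect(db)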
The above code reads the “database.sqlite” file into R. SQLite is an embedded SQL database
engine. Unlike most other SQL databases, SQLite does not have a separate server process; it
reads and writes directly to ordinary disk files. So, you can read an SQLite file much as you would
read a CSV or a text file. The same idea applies to any type of CSV, text, or other
input file you can work with in R, though each format calls for a different reading function.
This guide shows how you would read different file formats, such as Excel, R, and .txt files, into R,
along with other data sources (including social media data).
Here, we’ll use the package RSQLite to read in a SQLite file containing all of Hillary Clinton’s emails.
Next, we’ll query the column containing the email text body. Then we’ll be ready to do an
analysis of the Clinton emails that shaped this political season.
We’ll perform the following steps to make sure that the text we’re dealing with is clean:
Convert the text to lower case, so that words like “write” and “Write” are considered the same word
for analysis
Remove numbers
Remove English stopwords, e.g. “the”, “is”, “of”, etc.
Remove punctuation, e.g. “,”, “?”, etc.
Eliminate extra white spaces
Stem our text
Stemming is the process of reducing inflected (or sometimes derived) words to their word stem,
base, or root form, e.g. changing “car”, “cars”, “car’s”, and “cars’” to “car”. This also helps collapse
different verb forms with the same semantic meaning, such as “digs”, “digging”, and “dig”.
One very useful library for performing these text mining steps in R is the “tm”
package. The main structure for managing documents in tm is called a Corpus, which represents a
collection of text documents.
Once we have our email corpus (all of Hillary’s emails) stored in the variable “docs”, we’ll want to
transform the words within the emails using the techniques we discussed above, such as stemming
and stopword removal. With the tm library, this can be done easily. Transformations are
done via the tm_map() function, which applies a function to all elements of the corpus. Basically, all
transformations work on single text documents, and tm_map() just applies them to every document
in a corpus. If you wanted to convert all the text of Hillary’s emails into lowercase at once, you’d
use the tm library and the techniques below to do so easily.
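Here is a sketch of the whole pipeline, assuming the emails data frame from the RSQLite step above. Note that stemDocument relies on the SnowballC package.

library(tm)
library(SnowballC)  # provides the stemmer used by stemDocument

# Build a corpus from the email bodies
docs <- VCorpus(VectorSource(emails$ExtractedBodyText))

# Apply the cleaning steps described above to every document
docs <- tm_map(docs, content_transformer(tolower))       # lowercase
docs <- tm_map(docs, removeNumbers)                      # drop numbers
docs <- tm_map(docs, removeWords, stopwords("english"))  # drop stopwords
docs <- tm_map(docs, removePunctuation)                  # drop punctuation
docs <- tm_map(docs, stripWhitespace)                    # collapse extra spaces
docs <- tm_map(docs, stemDocument)                       # stem each word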
Next, we represent the corpus as a term-document matrix, in which each row corresponds to a term
and each column to a document. Naturally, some documents may not contain a given term, so this
matrix is sparse. The value in each cell of the matrix is the term frequency.
tm makes it very easy to create the term-document matrix. With the term-document matrix made,
we can then proceed to build a word cloud for Hillary’s emails, highlighting which words appear
most frequently.
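A sketch of both steps, assuming the preprocessed docs corpus from above. On the full email set the dense matrix can get large, so you may want to trim rare terms with removeSparseTerms() first.

library(wordcloud)
library(RColorBrewer)  # color palettes for the cloud

# Count how often each term appears across all emails
tdm <- TermDocumentMatrix(docs)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# Plot the 100 most frequent words
wordcloud(names(freq), freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))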
Sentiment analysis
Sentiment analysis is the process of determining whether a piece of writing is positive, negative, or
neutral. Here, we’ll work with the package “syuzhet”.
Just as in the previous example, we’ll read the emails from the database.
“syuzhet” uses the NRC Emotion Lexicon, a list of words and their
associations with eight emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust)
and two sentiments (negative and positive).
The get_nrc_sentiment function returns a data frame in which each row represents a sentence
from the original file. The columns include one for each emotion type as well as the positive and
negative sentiment valence. It allows us to take a body of text and return which emotions it
represents, and also whether the sentiment is positive or negative.
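A minimal sketch, again assuming the emails data frame from the RSQLite step:

library(syuzhet)

# Score each email body against the NRC lexicon
sentiments <- get_nrc_sentiment(emails$ExtractedBodyText)

# One column per emotion, plus negative and positive counts
head(sentiments)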
Now, we’ll use “ggplot2” to create a bar graph. Each bar represents how prominent each
emotion is in the text.
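One way to draw that graph from the sentiments data frame above; the first eight columns returned by get_nrc_sentiment are the emotion counts:

library(ggplot2)

# Total up each emotion column across all emails
totals <- data.frame(
  emotion = names(sentiments)[1:8],
  count = colSums(sentiments[, 1:8])
)

ggplot(totals, aes(x = reorder(emotion, -count), y = count, fill = emotion)) +
  geom_col() +
  labs(x = "Emotion", y = "Total count",
       title = "Emotions in Hillary Clinton's emails") +
  theme(legend.position = "none")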
N-grams
You must have noticed YouTube’s auto-captioning feature. Auto-captioning is a speech recognition
problem, and one part of generating captions automatically from audio is predicting what word
comes after a given sequence of words. For example, given a phrase like “When you get there, give
me a ___”, what word comes next?
Hopefully, you concluded that the next word in the sequence is “call”. We do this by first analyzing
which words frequently co-occur. We formalize this by introducing n-grams. An n-gram is a
contiguous sequence of n items from a given sequence of text or speech. In other words, we’ll be
finding collocations. A collocation is a sequence of words or terms that co-occur more often than
would be expected by chance. An example of this would be the term “very much.”
In this section, we’ll use the R library “quanteda” to compute tri-grams and find commonly occurring
sequences of three words.
We’ll use quanteda’s collocations function to do so. Finally, we’ll remove stopwords from the
collocations so we can get a clear view of the most frequently used three-word
sequences in Hillary’s emails.
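A sketch of this step. Note that the collocations() function from the original post has since been superseded; in current releases of quanteda the equivalent is textstat_collocations() in the companion quanteda.textstats package, which this sketch uses, again starting from the emails data frame.

library(quanteda)
library(quanteda.textstats)  # home of textstat_collocations in current quanteda

# Tokenize the email bodies and drop punctuation, numbers, and stopwords
toks <- tokens(emails$ExtractedBodyText,
               remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, stopwords("english"))

# Score three-word collocations and show the most frequent ones
trigrams <- textstat_collocations(toks, size = 3)
head(trigrams[order(-trigrams$count), ], 10)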
Conclusion
We set out to show you how to do some of the most common text mining tasks in R, with
examples and sample code. Leave a comment below if you think we’re missing something or if you
want to add anything to this discussion of text mining in R!