
Learning R

TEXT ANALYTICS

Medical Information Systems


Lecture 8

Assoc. Prof. Dr. Timuçin AVŞAR


Text Analytics
• Pre-processing the data can be difficult, but, luckily, R's packages provide easy-to-use functions for the
most common tasks.

• In this section, we'll load and process our data in R. In your R console, let's load the data set tweets.csv
with the read.csv function.

• But since we're working with text data here, we need one extra argument: stringsAsFactors=FALSE.
• So we'll call our data set tweets, and we'll use the read.csv function to read in the data file tweets.csv, adding the extra argument stringsAsFactors=FALSE.

• You'll always need to add this extra argument when working on a text analytics problem so that the text is
read in properly.

• tweets = read.csv("tweets.csv", stringsAsFactors = FALSE)


• Now let's take a look at the structure of our data with the str function.

• str(tweets)

• We can see that we have 1,181 observations of two variables, the text of the tweet, called Tweet, and the
average sentiment score, called Avg for average.
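
• For reference, the str output should look roughly like this (tweet text and values omitted):
  'data.frame': 1181 obs. of 2 variables:
   $ Tweet: chr ...
   $ Avg  : num ...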

• The tweet texts are real tweets that we found on the internet directed at Apple, with a few words cleaned up.

25
Text Analytics
• We're more interested in being able to detect the tweets with clear negative sentiment, so let's define a new
variable in our data set tweets called Negative.

• And we'll set this equal to as.factor(tweets$Avg <= -1).

• This will set tweets$Negative equal to true if the average sentiment score is less than or equal to negative 1
and will set tweets$Negative equal to false if the average sentiment score is greater than negative 1.

• tweets$Negative = as.factor(tweets$Avg <= -1)


• Let's look at a table of this new variable, Negative.

• table(tweets$Negative)
We can see that 182 of the 1,181 tweets, or about 15%, are negative.
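
• If you'd rather see the proportion directly, one option (not shown in the lecture) is the prop.table function:

• prop.table(table(tweets$Negative))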

• Now to pre-process our text data so that we can use the bag of words approach, we'll be using the tm text
mining package.
• We'll need to install and load two packages to do this.

• install.packages("tm")

• library(tm)

• Then we also need to install the package SnowballC. This package helps us use the tm package.

• install.packages("SnowballC")
• library(SnowballC)
26
Pre-Processing in R
• One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of
documents.

• We'll need to convert our tweets to a corpus for pre-processing.

• tm can create a corpus in many different ways, but we'll create it from the Tweet column of our data frame
using two functions, Corpus and VectorSource.

• We'll call our corpus "corpus" and then use the Corpus and VectorSource functions called on the Tweet
variable of our tweets data set. So that's tweets$Tweet.

• corpus = Corpus(VectorSource(tweets$Tweet))

• We can check that this has worked by typing corpus and seeing that our corpus has 1,181 text documents.

• And we can check that the documents match our tweets by using double brackets.

• So type corpus[[1]].
• This shows us the first tweet in our corpus.
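
• Note: in newer versions of the tm package, typing corpus[[1]] may print only document metadata; if that happens, as.character(corpus[[1]]) (or content(corpus[[1]])) will display the text of the first tweet.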

• Now we're ready to start pre-processing our data.

• Pre-processing is easy in tm.

27
Pre-Processing in R
• Each operation, like stemming or removing stop words, can be done with one line in R, where we use the
tm_map function.

• Let's try it out by changing all of the text in our tweets to lowercase. To do that, we'll replace our corpus with
the output of the tm_map function, where the first argument is the name of our corpus and the second
argument is what we want to do. In this case, tolower.

• tolower is a standard function in R, and this is like when we pass mean to the tapply function.
• We're passing the tm_map function a function to use on our corpus.

• corpus = tm_map(corpus, tolower)

• tweets$Tweet <- iconv(tweets$Tweet, "ASCII", "UTF-8", sub="byte") (OPTIONAL)

• corpus = tm_map(corpus, PlainTextDocument)

• Go ahead and hit the up arrow twice to get back to corpus[[1]], and now we can see that all of our
letters are lowercase.

• corpus[[1]]

28
Pre-Processing in R
• Now let's remove all punctuation.
• This is done in a very similar way, except this time we give the argument removePunctuation instead of
tolower.

• corpus = tm_map(corpus, removePunctuation)

• corpus[[1]]

• Let's see what this did to our first tweet again.


• Now the comma after "say", the exclamation point after "received", and the @ symbols before "Apple" are
all gone.

• Now we want to remove the stop words in our tweets.

• tm provides a list of stop words for the English language.

• We can check it out by typing


• stopwords("english")

• We see that these are words like "I," "me," "my," "myself," et cetera.

• Removing words can be done with the removeWords argument to the tm_map function, but we need one
extra argument this time: the stop words that we want to remove.

29
Pre-Processing in R
• We'll remove all of these English stop words, but we'll also remove the word "apple" since all of these
tweets have the word "apple" and it probably won't be very useful in our prediction problem.

• So go ahead and hit the up arrow to get back to the tm_map function, delete removePunctuation and,
instead, type removeWords.

• Then we need to add one extra argument, c("apple").

• corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))


• This is us removing the word "apple", and then stopwords("english"). So this will remove the word "apple"
and all of the English stop words.

• Let's take a look at our first tweet again to see what happened. Now we can see that we have significantly
fewer words, only the words that are not stop words. Lastly, we want to stem our documents with the
stemDocument argument.
• Go ahead and scroll back up to the tm_map line, delete removeWords and its extra argument, and type
stemDocument.

• corpus = tm_map(corpus, stemDocument)

• If you hit Enter and then look at the first tweet again, we can see that this took off the ending of "customer,"
"service," "received," and "appstore." In the next section, we'll investigate our corpus and prepare it for our
prediction problem.

30
Bag of Words in R
• In the previous section, we preprocessed our data, and we're now ready to extract the word frequencies to
be used in our prediction problem.

• The tm package provides a function called DocumentTermMatrix that generates a matrix where the rows
correspond to documents, in our case tweets, and the columns correspond to words in those tweets.

• The values in the matrix are the number of times that word appears in each document.

• Let's go ahead and generate this matrix and call it "frequencies." So we'll use the DocumentTermMatrix
function called on our corpus that we created in the previous section.

• frequencies = DocumentTermMatrix(corpus)

• Let's take a look at our matrix by typing frequencies.

• frequencies

• We can see that there are 3,289 terms or words in our matrix and 1,181 documents or tweets after
preprocessing.

• Let's see what this matrix looks like using the inspect function.

• inspect(frequencies[1:50,50:70])

• In this range we can see that a word like "cheap" appears in only a couple of these tweets, and most of the
entries in the matrix are zero.

31
Bag of Words in R
• This data is what we call sparse.
• This means that there are many zeros in our matrix.

• We can look at what the most popular terms are, or words, with the function findFreqTerms.

• We want to call this on our matrix frequencies, and then we want to give an argument lowfreq, which is
equal to the minimum number of times a term must appear to be displayed. Let's type 20.

• findFreqTerms(frequencies, lowfreq=20)
• We see here 56 different words.

• So out of the 3,289 words in our matrix, only 56 words appear at least 20 times in our tweets.

• This means that we probably have a lot of terms that will be pretty useless for our prediction model. The
number of terms is an issue for two main reasons. One is computational.

• More terms means more independent variables, which usually means it takes longer to build our models.
• The other is that, in building models, as we mentioned before, the ratio of independent variables to observations
will affect how well the model generalizes.

• So let's remove some terms that don't appear very often.

32
Bag of Words in R
• The sparsity threshold works as follows.
• sparse = removeSparseTerms(frequencies, 0.995)

• If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets.

• If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets.

• If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more
tweets.
• We'll go ahead and use this sparsity threshold.

• If you type sparse, you can see that there are only 309 terms in our sparse matrix.

• This is only about 9% of the previous count of 3,289.

• Now let's convert the sparse matrix into a data frame that we'll be able to use for our predictive models.

• We'll call it tweetsSparse and use the as.data.frame function called on the as.matrix function called on our
matrix sparse.

• tweetsSparse = as.data.frame(as.matrix(sparse))

• This converts sparse to a data frame called tweetsSparse.

33
Bag of Words in R
• Since R struggles with variable names that start with a number, and we probably have some words here
that start with a number, let's run the make.names function to make sure all of our words are appropriate
variable names.

• To do this, type colnames and then in parentheses the name of our data frame:

• colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

• This will just convert our variable names to make sure they're all appropriate names before we build our
predictive models.

• You should do this each time you build a data frame using text analytics.

• Now let's add our dependent variable to this data set.

• We'll call it tweetsSparse$Negative = tweets$Negative.

• tweetsSparse$Negative = tweets$Negative

34
Bag of Words in R
• Lastly, let's split our data into a training set and a testing set, putting 70% of the data in the training set.
• First we'll have to load the caTools library so that we can use the sample.split function.

• library(caTools)

• Then let's set the seed to 123 and create our split using sample.split, where the dependent variable is
tweetsSparse$Negative.

• set.seed(123)

• And then our split ratio will be 0.7. We'll put 70% of the data in the training set.
• split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

• Then let's just use subset to create a training set called trainSparse and a testing set called testSparse.

• trainSparse = subset(tweetsSparse, split==TRUE)

• testSparse = subset(tweetsSparse, split==FALSE)

• Our data is now ready, and we can build our predictive model.

• In the next section, we'll use CART and logistic regression to predict negative sentiment.

35
Predicting Sentiment
• Now that we've prepared our data set, let's use CART to build a predictive model.
• First, we need to load the necessary packages in our R Console by typing library(rpart), and then
library(rpart.plot).
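
• library(rpart)

• library(rpart.plot)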

• Now let's build our model.

• We'll call it tweetCART, and we'll use the rpart function to predict Negative using all of the other variables as
our independent variables and the data set trainSparse.
• We'll add one more argument here, which is method = "class" so that the rpart function knows to build a
classification model.

• tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")

• We're just using the default parameter settings so we won't add anything for minbucket or cp. Now let's plot
the tree using the prp function.
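
• prp(tweetCART)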
• Our tree says that if the word "freak" is in the tweet, then predict TRUE, or negative sentiment. If the word
"freak" is not in the tweet, but the word "hate" is, again predict TRUE.

• If neither of these two words are in the tweet, but the word "wtf" is, also predict TRUE, or negative
sentiment.

• If none of these three words are in the tweet, then predict FALSE, or non-negative sentiment.
• This tree makes sense intuitively since these three words are generally seen as negative words.
36
Predicting Sentiment

37
Predicting Sentiment
• Now, let's go back to our R Console and evaluate the numerical performance of our model by making
predictions on the test set.

• We'll call our predictions predictCART.

• And we'll use the predict function to predict using our model tweetCART on the new data set testSparse.

• We'll add one more argument, which is type = "class" to make sure we get class predictions.

• predictCART = predict(tweetCART, newdata=testSparse, type="class")


• Now let's make our confusion matrix using the table function.

• We'll give as the first argument the actual outcomes, testSparse$Negative, and then as the second
argument, our predictions, predictCART.

• table(testSparse$Negative, predictCART)

• To compute the accuracy of our model, we add up the numbers on the diagonal, 294 plus 18 (these are the
observations we predicted correctly), and divide by the total number of observations in the table, which is the
total number of observations in our test set.

• So the accuracy of our CART model is about 0.88.
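
• Since the test set has 355 observations in total, we can compute this directly in the console:

• (294+18)/355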

• Let's compare this to a simple baseline model that always predicts non-negative.

38
Predicting Sentiment
• To compute the accuracy of the baseline model, let's make a table of just the outcome variable Negative.
• So we'll type table, and then in parentheses, testSparse$Negative.

• table(testSparse$Negative)

• This tells us that in our test set we have 300 observations with non-negative sentiment and 55 observations
with negative sentiment.

• 300/(300+55)
• So the accuracy of a baseline model that always predicts non-negative would be 300 divided by 355, or
0.845. So our CART model does better than the simple baseline model.

• How about a random forest model?

• How well would that do?

39
Predicting Sentiment
• Let's first load the random forest package with library(randomForest), and then we'll set the seed to 123
so that we can replicate our model if we want to.
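
• library(randomForest)

• set.seed(123)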

• Keep in mind that even if you set the seed to 123, you might get a different random forest model than I do,
depending on your operating system.

• Now, let's create our model.

• We'll call it tweetRF and use the randomForest function to predict Negative again using all of our other
variables as independent variables and the data set trainSparse.

• We'll again use the default parameter settings. The random forest model takes significantly longer to build
than the CART model.

• tweetRF = randomForest(Negative ~ ., data=trainSparse)

• We've seen this before when building CART and random forest models, but in this case, the difference is
particularly drastic.

• This is because we have so many independent variables, about 300 different words. So far in this course,
we haven't seen data sets with this many independent variables.

• So keep in mind that for text analytics problems, building a random forest model will take significantly longer
than building a CART model.

40
Predicting Sentiment
• So now that our model's finished, let's make predictions on our test set.
• We'll call them predictRF, and again, we'll use the predict function to make predictions using the model
tweetRF this time, and again, the new data set testSparse.

• predictRF = predict(tweetRF, newdata=testSparse)

• Now let's make our confusion matrix using the table function, first giving the actual outcomes,
testSparse$Negative, and then giving our predictions, predictRF.
• table(testSparse$Negative, predictRF)

• To compute the accuracy of the random forest model, we again sum up the cases we got right, 293 plus 21,
and divide by the total number of observations in the table.

• (293+21)/(293+7+34+21)

• So our random forest model has an accuracy of 0.885.


• This is a little better than our CART model, but due to the interpretability of our CART model, I'd probably
prefer it over the random forest model.

• If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would
increase to about the same as the random forest model.

• So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a
relatively small data set of tweets.
42
