
Learning R

TEXT ANALYTICS

Medical Information Systems


Lecture 8

Assoc. Prof. Dr. Timuçin AVŞAR


Text Analytics
• Pre-processing the data can be difficult, but, luckily, R's packages provide easy-to-use functions for the
most common tasks.

• In this section, we'll load and process our data in R. In your R console, let's load the data set tweets.csv
with the read.csv function.

• But since we're working with text data here, we need one extra argument: stringsAsFactors=FALSE.
• So we'll call our data set tweets, and we'll use the read.csv function to read in the data file tweets.csv, adding the extra argument stringsAsFactors=FALSE.

• You'll always need to add this extra argument when working on a text analytics problem so that the text is
read in properly.

• tweets = read.csv("tweets.csv", stringsAsFactors = FALSE)


• Now let's take a look at the structure of our data with the str function.

• str(tweets)

• We can see that we have 1,181 observations of two variables, the text of the tweet, called Tweet, and the
average sentiment score, called Avg for average.
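
• For reference, the str output should look roughly like this (tweet text and values omitted):
  'data.frame': 1181 obs. of 2 variables:
   $ Tweet: chr ...
   $ Avg  : num ...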

• The tweet texts are real tweets that we found on the internet directed at Apple, with a few words cleaned up.

25
Text Analytics
• We're more interested in being able to detect the tweets with clear negative sentiment, so let's define a new
variable in our data set tweets called Negative.

• And we'll set this equal to as.factor(tweets$Avg <= -1).

• This will set tweets$Negative equal to true if the average sentiment score is less than or equal to negative 1
and will set tweets$Negative equal to false if the average sentiment score is greater than negative 1.

• tweets$Negative = as.factor(tweets$Avg <= -1)


• Let's look at a table of this new variable, Negative.

• table(tweets$Negative)
We can see that 182 of the 1,181 tweets, or about 15%, are negative.
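
• If you'd rather see the proportion directly, one option (not shown in the lecture) is the prop.table function:

• prop.table(table(tweets$Negative))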

• Now to pre-process our text data so that we can use the bag of words approach, we'll be using the tm text
mining package.
• We'll need to install and load two packages to do this.

• install.packages("tm")

• library(tm)

• Then we also need to install the package SnowballC. This package helps us use the tm package.

• install.packages("SnowballC")
• library(SnowballC)
26
Pre-Processing in R
• One of the concepts introduced by the tm package is that of a corpus. A corpus is a collection of
documents.

• We'll need to convert our tweets to a corpus for pre-processing.

• tm can create a corpus in many different ways, but we'll create it from the Tweet column of our data frame
using two functions, Corpus and VectorSource.

• We'll call our corpus "corpus" and then use the Corpus and VectorSource functions called on the Tweet
variable of our tweets data set. So that's tweets$Tweet.

• corpus = Corpus(VectorSource(tweets$Tweet))

• We can check that this has worked by typing corpus and seeing that our corpus has 1,181 text documents.

• And we can check that the documents match our tweets by using double brackets.

• So type corpus[[1]].
• This shows us the first tweet in our corpus.
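
• Note: in newer versions of the tm package, typing corpus[[1]] may print only document metadata; if that happens, as.character(corpus[[1]]) (or content(corpus[[1]])) will display the text of the first tweet.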

• Now we're ready to start pre-processing our data.

• Pre-processing is easy in tm.

27
Pre-Processing in R
• Each operation, like stemming or removing stop words, can be done with one line in R, where we use the
tm_map function.

• Let's try it out by changing all of the text in our tweets to lowercase. To do that, we'll replace our corpus with
the output of the tm_map function, where the first argument is the name of our corpus and the second
argument is what we want to do. In this case, tolower.

• tolower is a standard function in R, and this is like when we pass mean to the tapply function.
• We're passing the tm_map function a function to use on our corpus.

• corpus = tm_map(corpus, tolower)

• tweets$Tweet <- iconv(tweets$Tweet, "ASCII", "UTF-8", sub="byte") (OPTIONAL)

• corpus = tm_map(corpus, PlainTextDocument)

• Go ahead and hit the up arrow twice to get back to corpus[[1]], and now we can see that all of our
letters are lowercase.

• corpus[[1]]

28
Pre-Processing in R
• Now let's remove all punctuation.
• This is done in a very similar way, except this time we give the argument removePunctuation instead of
tolower.

• corpus = tm_map(corpus, removePunctuation)

• corpus[[1]]

• Let's see what this did to our first tweet again.


• Now the comma after "say", the exclamation point after "received", and the @ symbols before "Apple" are
all gone.

• Now we want to remove the stop words in our tweets.

• tm provides a list of stop words for the English language.

• We can check it out by typing


• stopwords("english")

• We see that these are words like "I," "me," "my," "myself," et cetera.

• Removing words can be done with the removeWords argument to the tm_map function, but we need one
extra argument this time: the stop words that we want to remove.

29
Pre-Processing in R
• We'll remove all of these English stop words, but we'll also remove the word "apple" since all of these
tweets have the word "apple" and it probably won't be very useful in our prediction problem.

• So go ahead and hit the up arrow to get back to the tm_map function, delete removePunctuation and,
instead, type removeWords.

• Then we need to add one extra argument, c("apple").

• corpus = tm_map(corpus, removeWords, c("apple", stopwords("english")))


• This is us removing the word "apple", and then stopwords("english"). So this will remove the word "apple"
and all of the English stop words.

• Let's take a look at our first tweet again to see what happened. Now we can see that we have significantly
fewer words, only the words that are not stop words. Lastly, we want to stem our documents with the
stemDocument argument.
• Go ahead and scroll back up to the tm_map line, delete removeWords and its extra argument, and type
stemDocument.

• corpus = tm_map(corpus, stemDocument)

• If you hit Enter and then look at the first tweet again, we can see that this took off the ending of "customer,"
"service," "received," and "appstore." In the next section, we'll investigate our corpus and prepare it for our
prediction problem.

30
Bag of Words in R
• In the previous section, we preprocessed our data, and we're now ready to extract the word frequencies to
be used in our prediction problem.

• The tm package provides a function called DocumentTermMatrix that generates a matrix where the rows
correspond to documents, in our case tweets, and the columns correspond to words in those tweets.

• The values in the matrix are the number of times that word appears in each document.

• Let's go ahead and generate this matrix and call it "frequencies." So we'll use the DocumentTermMatrix
function called on our corpus that we created in the previous section.

• frequencies = DocumentTermMatrix(corpus)

• Let's take a look at our matrix by typing frequencies.

• frequencies

• We can see that there are 3,289 terms or words in our matrix and 1,181 documents or tweets after
preprocessing.

• Let's see what this matrix looks like using the inspect function.

• inspect(frequencies[1:50,50:70])

• In this range we can see that a word like "cheap" appears in only a couple of these tweets, and most of the
entries in the matrix are zero.

31
Bag of Words in R
• This data is what we call sparse.
• This means that there are many zeros in our matrix.

• We can look at what the most popular terms are, or words, with the function findFreqTerms.

• We want to call this on our matrix frequencies, and then we want to give an argument lowfreq, which is
equal to the minimum number of times a term must appear to be displayed. Let's type 20.

• findFreqTerms(frequencies, lowfreq=20)
• We see here 56 different words.

• So out of the 3,289 words in our matrix, only 56 words appear at least 20 times in our tweets.

• This means that we probably have a lot of terms that will be pretty useless for our prediction model. The
number of terms is an issue for two main reasons. One is computational.

• More terms means more independent variables, which usually means it takes longer to build our models.
• The other is that, in building models, as we mentioned before, the ratio of independent variables to observations
will affect how well the model generalizes.

• So let's remove some terms that don't appear very often.

32
Bag of Words in R
• The sparsity threshold works as follows.
• sparse = removeSparseTerms(frequencies, 0.995)

• If we say 0.98, this means to only keep terms that appear in 2% or more of the tweets.

• If we say 0.99, that means to only keep terms that appear in 1% or more of the tweets.

• If we say 0.995, that means to only keep terms that appear in 0.5% or more of the tweets, about six or more
tweets.
• We'll go ahead and use this sparsity threshold.

• If you type sparse, you can see that there are only 309 terms in our sparse matrix.

• This is only about 9% of the previous count of 3,289.

• Now let's convert the sparse matrix into a data frame that we'll be able to use for our predictive models.

• We'll call it tweetsSparse and use the as.data.frame function called on the as.matrix function called on our
matrix sparse.

• tweetsSparse = as.data.frame(as.matrix(sparse))

• This converts sparse to a data frame called tweetsSparse.

33
Bag of Words in R
• Since R struggles with variable names that start with a number, and we probably have some words here
that start with a number, let's run the make.names function to make sure all of our words are appropriate
variable names.

• To do this, type colnames and then in parentheses the name of our data frame:

• colnames(tweetsSparse) = make.names(colnames(tweetsSparse))

• This will just convert our variable names to make sure they're all appropriate names before we build our
predictive models.

• You should do this each time you build a data frame using text analytics.

• Now let's add our dependent variable to this data set.

• We'll call it tweetsSparse$Negative = tweets$Negative.

• tweetsSparse$Negative = tweets$Negative

34
Bag of Words in R
• Lastly, let's split our data into a training set and a testing set, putting 70% of the data in the training set.
• First we'll have to load the caTools library so that we can use the sample.split function.

• library(caTools)

• Then let's set the seed to 123 and create our split using sample.split, where the dependent variable is
tweetsSparse$Negative.

• set.seed(123)

• And then our split ratio will be 0.7. We'll put 70% of the data in the training set.
• split = sample.split(tweetsSparse$Negative, SplitRatio = 0.7)

• Then let's just use subset to create a training set called trainSparse and a testing set called testSparse.

• trainSparse = subset(tweetsSparse, split==TRUE)

• testSparse = subset(tweetsSparse, split==FALSE)

• Our data is now ready, and we can build our predictive model.

• In the next section, we'll use CART and logistic regression to predict negative sentiment.

35
Predicting Sentiment
• Now that we've prepared our data set, let's use CART to build a predictive model.
• First, we need to load the necessary packages in our R Console by typing library(rpart), and then
library(rpart.plot).
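
• library(rpart)

• library(rpart.plot)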

• Now let's build our model.

• We'll call it tweetCART, and we'll use the rpart function to predict Negative using all of the other variables as
our independent variables and the data set trainSparse.
• We'll add one more argument here, which is method = "class" so that the rpart function knows to build a
classification model.

• tweetCART = rpart(Negative ~ ., data=trainSparse, method="class")

• We're just using the default parameter settings so we won't add anything for minbucket or cp. Now let's plot
the tree using the prp function.
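
• prp(tweetCART)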
• Our tree says that if the word "freak" is in the tweet, then predict TRUE, or negative sentiment. If the word
"freak" is not in the tweet, but the word "hate" is, again predict TRUE.

• If neither of these two words are in the tweet, but the word "wtf" is, also predict TRUE, or negative
sentiment.

• If none of these three words are in the tweet, then predict FALSE, or non-negative sentiment.
• This tree makes sense intuitively since these three words are generally seen as negative words.
36
Predicting Sentiment

37
Predicting Sentiment
• Now, let's go back to our R Console and evaluate the numerical performance of our model by making
predictions on the test set.

• We'll call our predictions predictCART.

• And we'll use the predict function to predict using our model tweetCART on the new data set testSparse.

• We'll add one more argument, which is type = "class" to make sure we get class predictions.

• predictCART = predict(tweetCART, newdata=testSparse, type="class")


• Now let's make our confusion matrix using the table function.

• We'll give as the first argument the actual outcomes, testSparse$Negative, and then as the second
argument, our predictions, predictCART.

• table(testSparse$Negative, predictCART)

• To compute the accuracy of our model, we add up the numbers on the diagonal, 294 plus 18 (these are the
observations we predicted correctly), and divide by the total number of observations in the table, which is the
total number of observations in our test set.

• So the accuracy of our CART model is about 0.88.
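
• Since the test set has 355 observations in total, we can compute this directly in the console:

• (294+18)/355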

• Let's compare this to a simple baseline model that always predicts non-negative.

38
Predicting Sentiment
• To compute the accuracy of the baseline model, let's make a table of just the outcome variable Negative.
• So we'll type table, and then in parentheses, testSparse$Negative.

• table(testSparse$Negative)

• This tells us that in our test set we have 300 observations with non-negative sentiment and 55 observations
with negative sentiment.

• 300/(300+55)
• So the accuracy of a baseline model that always predicts non-negative would be 300 divided by 355, or
0.845. So our CART model does better than the simple baseline model.

• How about a random forest model?

• How well would that do?

39
Predicting Sentiment
• Let's first load the random forest package with library(randomForest), and then we'll set the seed to 123
so that we can replicate our model if we want to.
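
• library(randomForest)

• set.seed(123)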

• Keep in mind that even if you set the seed to 123, you might get a different random forest model than I do,
depending on your operating system.

• Now, let's create our model.

• We'll call it tweetRF and use the randomForest function to predict Negative again using all of our other
variables as independent variables and the data set trainSparse.

• We'll again use the default parameter settings. The random forest model takes significantly longer to build
than the CART model.

• tweetRF = randomForest(Negative ~ ., data=trainSparse)

• We've seen this before when building CART and random forest models, but in this case, the difference is
particularly drastic.

• This is because we have so many independent variables, about 300 different words. So far in this course,
we haven't seen data sets with this many independent variables.

• So keep in mind that for text analytics problems, building a random forest model will take significantly longer
than building a CART model.

40
Predicting Sentiment
• So now that our model's finished, let's make predictions on our test set.
• We'll call them predictRF, and again, we'll use the predict function to make predictions using the model
tweetRF this time, and again, the new data set testSparse.

• predictRF = predict(tweetRF, newdata=testSparse)

• Now let's make our confusion matrix using the table function, first giving the actual outcomes,
testSparse$Negative, and then giving our predictions, predictRF.
• table(testSparse$Negative, predictRF)

• To compute the accuracy of the random forest model, we again sum up the cases we got right, 293 plus 21,
and divide by the total number of observations in the table.

• (293+21)/(293+7+34+21)

• So our random forest model has an accuracy of 0.885.


• This is a little better than our CART model, but due to the interpretability of our CART model, I'd probably
prefer it over the random forest model.

• If you were to use cross-validation to pick the cp parameter for the CART model, the accuracy would
increase to about the same as the random forest model.

• So by using a bag-of-words approach and these models, we can reasonably predict sentiment even with a
relatively small data set of tweets.
42
