
Text Analysis in R
Table of Contents

1 Text Analysis: Introduction
2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
2.2 Text Analysis of a text ta(b)
2.3 Text Analysis of a text ta(c)
2.4 Sentiment Analysis of a text ta(d)
2.5 Sentiment Analysis of a CSV file ta(e)
2.6 Sentiment Analysis of a PDF file ta(f)
2.7 Sentiment Analysis of Chapter 7.1 (HKD)
2.8 Sentiment Analysis of Chapter 7.2 (HKD)
1 Text Analysis: Introduction
Suppose you have a mountain of text data: customer reviews, news articles, books – the
possibilities are endless. Text analysis is like having a powerful magnifying glass and a set of
tools to sift through this mountain and uncover hidden patterns, understand the underlying
meaning, and extract valuable insights.

Think of it this way:

You have a box full of jigsaw puzzles. Each puzzle piece is a word, and the entire box is a
collection of texts.

Text analysis helps you:

• Find all the corner pieces: Identify the most frequent words (like "the," "a," "is")
– these are common but not always the most meaningful.
• Group similar pieces: Find words that often appear together (like "delicious" and
"food," "fast" and "delivery") to understand themes and topics.
• Determine the overall picture: Analyse the sentiment (positive, negative, neutral)
expressed in the text, identify the main topics discussed, and even predict future
trends.

Let's take a simple example.

Let's say you have a collection of customer reviews for a restaurant. You can use text analysis to:

• Identify common words: "delicious," "tasty," "service," "slow," "friendly," "disappointed."
• Analyse sentiment: Determine if the overall sentiment of the reviews is positive, negative, or neutral.
• Find common themes: Identify recurring themes, such as slow service, delicious food, or friendly staff.

Key Techniques in Text Analysis:

• Tokenisation: Breaking down text into individual words or sentences.
• Sentiment Analysis: Determining the emotional tone of the text (positive, negative, neutral).
• Topic Modeling: Identifying the main topics discussed in the text.
• Named Entity Recognition: Identifying and classifying named entities (people, organizations, locations).

Tools for Text Analysis:

• R: A powerful programming language with many libraries for text analysis (like tidytext, tm, sentimentr); a minimal example follows this list.
• Python: Another popular language with libraries like NLTK, spaCy, and scikit-learn.
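
As a minimal sketch of what tokenisation and word-level sentiment look like in R with tidytext (the data frame, column names, and reviews below are made up for illustration; they are not the examples used later in this document):

library(dplyr)
library(tidytext)

# Illustrative data: three made-up restaurant reviews
toy_reviews <- data.frame(
  id   = 1:3,
  text = c("The food was delicious and the staff friendly",
           "Service was slow and I was disappointed",
           "Tasty dishes and fast delivery")
)

# Tokenisation: one row per word
tokens <- toy_reviews %>% unnest_tokens(word, text)

# Most frequent words after dropping common stop words
tokens %>% anti_join(stop_words, by = "word") %>% count(word, sort = TRUE)

# Word-level sentiment: label words with the Bing lexicon and tally positive vs negative
tokens %>% inner_join(get_sentiments("bing"), by = "word") %>% count(sentiment)

The sections below do the same kind of work with the tm package instead of tidytext.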

Text analysis is a rapidly growing field with applications in various domains, including
business, marketing, social sciences, and even healthcare.

Text Analysis Flowchart (processing steps and the R packages used at each stage):

• Import Data: dplyr, tidyverse, pdftools, VCorpus
• Data Cleaning: tm, tidytext, textstem
• Lemmatisation
• Plot: ggplot2
• Word cloud: wordcloud
• Sentiment Analysis: syuzhet
2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Anubha - finance teacher eager to explore the potential of AI in education.",
           "My goal is to provide the best possible learning experience for my students.",
           "I believe that AI tools can revolutionise the way we teach and learn.",
           "I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           "I am a firm believer in student-centred learning and inquiry-based learning.",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))
#VCorpus stands for Volatile Corpus (VCorpus is suitable for smaller datasets that can be comfortably held in memory)
print(crp[[1]])
#tm_map() allows you to apply a specified function to each document within a corpus.
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
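
To check the effect of a transformation on every document rather than only the first, a small sketch (not part of the original script; it just loops over the corpus built above) is:

# Print the lower-cased content of each document in the corpus
for (i in seq_along(crp)) {
  cat("Doc", i, ":", content(crp[[i]]), "\n")
}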

2.2 Text Analysis of a text ta(b)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process

text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Anubha - finance teacher eager to explore the potential of AI in education.",
           "My goal is to provide the best possible learning experience for my students.",
           "I believe that AI tools can revolutionise the way we teach and learn.",
           "I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           "I am a firm believer in student-centred learning and inquiry-based learning.",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))
#VCorpus stands for Volatile Corpus (VCorpus is suitable for smaller datasets that can be comfortably held in memory)
print(crp[[1]])
#tm_map() allows you to apply a specified function to each document within a corpus.
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
#Examples of common English stop words:
##Articles: a, an, the
##Prepositions: in, on, at, to, from, with, for
##Conjunctions: and, but, or, if, because
##Pronouns: I, you, he, she, it, they, we, me, him, her, them, us
##Other: no, not, only, very, this, that, these, those
mystopwords<- c(stopwords("english"),"anubha")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization (Lemmatization in Natural Language Processing (NLP) is the process of reducing a word to its base or dictionary form, known as the lemma)
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm <- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1, Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)
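
Before plotting (which the next variant adds), a quick and purely illustrative way to glance at the most frequent terms in the df built above:

# Show the ten most frequent terms, highest count first
head(df[order(-df$freq), ], 10)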

2.3 Text Analysis of a text ta(c)
library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Anubha - finance teacher eager to explore the potential of AI in education.",
           "My goal is to provide the best possible learning experience for my students.",
           "I believe that AI tools can revolutionise the way we teach and learn.",
           "I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           "I am a firm believer in student-centred learning and inquiry-based learning.",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"
colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))

print(crp[[1]])

crp<- tm_map(crp, content_transformer(tolower))


print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"anubha", "godara")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm <- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1, Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot <- df %>%
  top_n(10, freq)

# Plot word frequency

ggplot(df_plot, aes(x = reorder(term, freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  xlab("Terms") + ylab("Count") +
  coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))
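
The findAssocs() call earlier in this section stores the word associations but never displays them; a quick, illustrative inspection (the result is an empty list if the term "text" does not occur in this small corpus) is simply:

# Print terms correlated with "text" (empty if the term is absent from the TDM)
find_assocs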

2.4 Sentiment Analysis of a text ta(d)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First Text mining Process
text_data <- data.frame(cbind(
  id = 1:6,
  text = c("I am Anubha - finance teacher eager to explore the potential of AI in education.",
           "My goal is to provide the best possible learning experience for my students.",
           "I believe that AI tools can revolutionise the way we teach and learn.",
           "I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           "I am a firm believer in student-centred learning and inquiry-based learning.",
           "I love data science. Data science is amazing!")
))

#A data frame source interprets each row of the data frame x as a document.
#The first column must be named "doc_id" and contain a unique string identifier for each document.
#The second column must be named "text"

colnames(text_data)=c('doc_id','text')
crp<- VCorpus(DataframeSource(text_data))

print(crp[[1]])

crp<- tm_map(crp, content_transformer(tolower))


print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"anubha")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm <- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1, Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot <- df %>%
  top_n(10, freq)

# Plot word frequency

ggplot(df_plot, aes(x = reorder(term, freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  xlab("Terms") + ylab("Count") +
  coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))

# Get sentiment lexicon


sentiment_lexicon <- get_sentiments("bing")

colnames(df)[1]='word'
# Perform sentiment analysis
sentiment_analysis <- df %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)

# View sentiment analysis


print(sentiment_analysis)
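
The table above lists each matched word with its Bing label. For an overall picture, a small follow-on sketch (reusing df and sentiment_lexicon from above, and weighting each word by how often it occurs) could be:

# Tally positive vs negative words, weighted by word frequency
sentiment_totals <- df %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  count(sentiment, wt = freq)

ggplot(sentiment_totals, aes(x = sentiment, y = n, fill = sentiment)) +
  geom_col() +
  xlab("Sentiment") + ylab("Weighted word count")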

2.5 Sentiment Analysis of a CSV file ta(e)


library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
#Analysing text from a CSV file of reviews
data=read.csv("C:/Users/ADMIN/OneDrive/Desktop/R/AM/amazon_vfl_reviews_session2.csv")
summary(data)
str(data)
data$sn <- as.character(seq_len(nrow(data)))  # unique string identifier for each review

colnames(data)[c(6, 5)] <- c('doc_id', 'text')
# DataframeSource expects "doc_id" as the first column and "text" as the second
data <- data[, c('doc_id', 'text')]

crp <- VCorpus(DataframeSource(data))

print(crp[[1]])
crp<- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp<- tm_map(crp, stripWhitespace) # removes whitespaces
crp<- tm_map(crp, removePunctuation) # removes punctuations
crp<- tm_map(crp, removeNumbers) # removes numbers
crp<- tm_map(crp, removeWords, stopwords("english"))
mystopwords<- c(stopwords("english"),"book","people")
crp<- tm_map(crp, removeWords, mystopwords)

# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp

tdm <- TermDocumentMatrix(review_corpus, control = list(wordLengths = c(1, Inf)))


# inspect frequent words
freq_terms<- findFreqTerms(tdm, lowfreq=1)

term_freq<- rowSums(as.matrix(tdm))
term_freq<- subset(term_freq, term_freq>=1)
df<- data.frame(term = names(term_freq), freq = term_freq)

#association of words
find_assocs= findAssocs(tdm,"text",corlimit = 0.1)

# Now plotting the top frequent words


library(ggplot2)

df_plot <- df %>%
  top_n(10, freq)

# Plot word frequency

ggplot(df_plot, aes(x = reorder(term, freq), y = freq, fill = freq)) +
  geom_bar(stat = "identity") +
  scale_fill_gradientn(colors = terrain.colors(10)) +
  xlab("Terms") + ylab("Count") +
  coord_flip()

# Create word cloud


wordcloud(words = df$term, freq = df$freq, min.freq = 1,
random.order = FALSE, colors = brewer.pal(8, "Dark2"))

# Get sentiment lexicon


sentiment_lexicon <- get_sentiments("bing")

colnames(df)[1]='word'
# Perform sentiment analysis
sentiment_analysis <- df %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)

# View sentiment analysis


print(sentiment_analysis)
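
Beyond word-level Bing labels, an overall sentiment score per review can be computed directly on the raw CSV text. A small sketch using syuzhet (assuming the text column created above; syuzhet appears in the flowchart but is not loaded earlier in this script):

library(syuzhet)

# One sentiment score per review (positive values indicate more positive wording)
review_scores <- get_sentiment(iconv(data$text), method = "syuzhet")
summary(review_scores)
hist(review_scores, main = "Sentiment scores of reviews", xlab = "Score")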

2.6 Sentiment Analysis of a PDF file ta(f)


#Read PDF Files
##Reading PDF Files From location
#identifying multiple pdf files from folder
library(pdftools)
library(tm)
stop_words2=c(stopwords("en"),"makes")
setwd("C:/Users/ADMIN/OneDrive/Desktop/R/AM/PDF")

files<- list.files(pattern = "pdf$")


files #files contain the named vector of pdf files

read_function<- readPDF(control=list(text="-layout"))
read_corpus<- Corpus(URISource(files[1:5]),readerControl = list(reader=read_function))

read_corpus<-tm_map(read_corpus,removePunctuation)

dtm <- DocumentTermMatrix(read_corpus,
                          control = list(removePunctuation = TRUE, stopwords = TRUE,
                                         tolower = TRUE, removeNumbers = TRUE,
                                         stemming = TRUE,
                                         bounds = list(global = c(3, Inf))))

dtm_matrix<-as.matrix(dtm) # converting dtm to a matrix so that data becomes viewable


#some inverted commas, hashtags etc. are not removed by removePunctuation,
# so we can use the textclean package for those cases.

View(dtm_matrix) # running this might take 5 to 10 seconds as it shows the word count of each word in 15 pdfs

dtm_matrix <- t(dtm_matrix) # transpose so that terms are rows and documents are columns

number_occurance <- rowSums(dtm_matrix) # use rowSums, not rowsum, as this is a matrix

number_occurance[1:20] # square brackets limit the output to the first 20 terms

number_occurance_sorted <- sort(number_occurance, decreasing = TRUE)

number_occurance_sorted[1:20] # the 20 most frequent terms

library(wordcloud)
set.seed(123)
wordcloud(names(number_occurance_sorted), number_occurance_sorted, max.words=25,
scale=c(3, .1), colors=brewer.pal(6, "Dark2"))

cor_word <- findAssocs(dtm, "marketing", corlimit = 0.2)

cor_word$marketing[1:20] # as we are correlating with "marketing"

library(treemap)

data_frame<- data.frame(word=names(number_occurance_sorted),
freq=number_occurance_sorted)
data_frame[1:20,]

# Treemap of all words above a minimum frequency
treemap(subset(data_frame, freq > 10), index = c('word'), vSize = 'freq')

# Treemap of a fixed number of top words (here, the 10 most frequent)
treemap(data_frame[1:10, ], index = c('word'), vSize = 'freq')

#cluster analysis (use only the numeric freq column; the words remain as row names / labels)
distance <- dist(data_frame[1:20, "freq", drop = FALSE])

distance
clust <- hclust(distance)
plot(clust) # add hang = -1 inside plot() for symmetric cluster roots
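
This section's heading mentions sentiment analysis, but the code above stops at clustering. A minimal, hedged sketch of how the PDF text in read_corpus could be scored with syuzhet (object names reuse those defined above; this is not part of the original script):

library(syuzhet)

# Collapse each PDF document into a single string
pdf_text_vec <- sapply(read_corpus, function(doc) paste(content(doc), collapse = " "))

# NRC emotion and polarity counts per document, then an overall bar plot
pdf_sent <- get_nrc_sentiment(iconv(pdf_text_vec))
head(pdf_sent)
barplot(colSums(pdf_sent), las = 2, col = rainbow(10),
        ylab = "Count", main = "Sentiment across PDF documents")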

2.7 Sentiment Analysis of Chapter 7.1 (HKD)


#install readxl (run once if not already installed)
#install.packages("readxl")
library(readxl)
#Replace the path below with the actual path to your Excel file
reviews <- read_excel("C:/Users/ADMIN/OneDrive/Desktop/R/R Data/socialmediareviews.xlsx")
#install tm (run once if not already installed)
#install.packages("tm")
library(tm)
review_corp<-VCorpus(VectorSource(reviews$reviews))
review_corp[[1]]$content  # inspect the first review
review_corp <- tm_map(review_corp, removeWords,
                      c("now","Know","took","that's","air","away","war","job","one","like",
                        "actually","new","guy","don't","things","lot","try","bit","anything",
                        "thing","say","also","can","get","used","got","take","just","will",
                        "it's","want","whatever","become","said","given","give","much"))
head(reviews)
tail(reviews)
tdm <- TermDocumentMatrix(review_corp, control = list(removePunctuation = TRUE, stopwords = TRUE))
tdm_matrix <- as.matrix(tdm) # a TermDocumentMatrix already has terms as rows, so no transpose is needed
tdm_matrix[1:20]
number_occurrence <- rowSums(tdm_matrix) # total count of each term across all reviews
number_occurrence[1:20]
number_occurrence_sorted <- sort(number_occurrence, decreasing = TRUE)
number_occurrence_sorted[1:60]

2.8 Sentiment Analysis of Chapter 7.2 (HKD)


#install readxl (run once if not already installed)
#install.packages("readxl")
library(readxl)
#Replace the path below with the actual path to your Excel file
reviews <- read_excel("C:/Users/ADMIN/OneDrive/Desktop/R/R Data/socialmediareviews.xlsx")
#install tm (run once if not already installed)
#install.packages("tm")
library(tm)
review_corp<-VCorpus(VectorSource(reviews$reviews))
review_corp[[1]]$content  # inspect the first review
review_corp <- tm_map(review_corp, removeWords,
                      c("now","Know","took","that's","air","away","war","job","one","like",
                        "actually","new","guy","don't","things","lot","try","bit","anything",
                        "thing","say","also","can","get","used","got","take","just","will",
                        "it's","want","whatever","become","said","given","give","much"))
head(reviews)
tail(reviews)
tdm <- TermDocumentMatrix(review_corp, control = list(removePunctuation = TRUE, stopwords = TRUE))
tdm_matrix <- as.matrix(tdm) # a TermDocumentMatrix already has terms as rows, so no transpose is needed
tdm_matrix[1:20]
number_occurrence <- rowSums(tdm_matrix) # total count of each term across all reviews
number_occurrence[1:20]
number_occurrence_sorted <- sort(number_occurrence, decreasing = TRUE)
number_occurrence_sorted[1:60]

library(wordcloud)
wordcloud(names(number_occurrence_sorted), number_occurrence_sorted, max.words=25,
scale=c(3, .1), colors=brewer.pal(6, "Dark2"))

#association of words
cor_word <- findAssocs(tdm, "time", corlimit = 0.1)
cor_word$time

library(treemap)
data_frame<- data.frame(word=names(number_occurrence_sorted),
freq=number_occurrence_sorted)
data_frame[1:20,]

# Treemap of all words above a minimum frequency
treemap(subset(data_frame, freq > 10), index = c('word'), vSize = 'freq')

# Treemap of a fixed number of top words (here, the 10 most frequent)
treemap(data_frame[1:10, ], index = c('word'), vSize = 'freq')

#cluster analysis (use only the numeric freq column; the words remain as row names / labels)
distance <- dist(data_frame[1:20, "freq", drop = FALSE])
distance
clust<- hclust(distance)
plot(clust)

#sentiment analysis
library(syuzhet)
sent_corpus<- iconv(reviews$reviews)
review_sent<- get_nrc_sentiment(sent_corpus)
head(review_sent)
sentiment_counts <- colSums(review_sent)
barplot(sentiment_counts, las = 2, col = rainbow(10), ylab = 'Count', main = 'Sentiment of reviews')

#Important Notes:
#Accuracy: Sentiment analysis accuracy depends heavily on the quality of the text data, the
# chosen lexicon or model, and the complexity of the text.
#Lexicon Selection: get_nrc_sentiment() from the syuzhet package uses the NRC lexicon. You can
# explore other lexicons (e.g., Bing Liu, AFINN) for potentially better results.
#Advanced Techniques: For more sophisticated sentiment analysis, consider using machine
# learning models like Naive Bayes or Support Vector Machines.
#Error Handling: Implement robust error handling for potential issues like invalid input files
# or unexpected text formats.
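
To act on the lexicon-selection note, a short sketch comparing lexicons with syuzhet's get_sentiment() (reusing sent_corpus from the code above; "afinn", "bing", and "nrc" are methods built into syuzhet) could be:

# Document-level polarity scores under three different lexicons
afinn_scores <- get_sentiment(sent_corpus, method = "afinn")
bing_scores  <- get_sentiment(sent_corpus, method = "bing")
nrc_scores   <- get_sentiment(sent_corpus, method = "nrc")

# Compare overall polarity and how strongly the lexicons agree
summary(data.frame(afinn = afinn_scores, bing = bing_scores, nrc = nrc_scores))
cor(cbind(afinn_scores, bing_scores, nrc_scores))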
