Text Analysis in R
Table of Contents
1 Text Analysis: Introduction
2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
2.2 Text Analysis of a text ta(b)
2.3 Text Analysis of a text ta(c)
2.4 Sentiment Analysis of a text ta(d)
2.5 Sentiment Analysis of a CSV file ta(e)
2.6 Sentiment Analysis of a PDF file ta(f)
2.7 Sentiment Analysis of Chapter 7.1 (HKD)
2.8 Sentiment Analysis of Chapter 7.2 (HKD)
1 Text Analysis: Introduction
Suppose you have a mountain of text data: customer reviews, news articles, books – the
possibilities are endless. Text analysis is like having a powerful magnifying glass and a set of
tools to sift through this mountain and uncover hidden patterns, understand the underlying
meaning, and extract valuable insights.
Imagine a box full of jigsaw puzzles: each puzzle piece is a word, and the entire box is a collection of texts. Working through the box, you can:
• Find all the corner pieces: Identify the most frequent words (like "the," "a," "is")
– these are common but not always the most meaningful.
• Group similar pieces: Find words that often appear together (like "delicious" and
"food," "fast" and "delivery") to understand themes and topics.
• Determine the overall picture: Analyse the sentiment (positive, negative, neutral)
expressed in the text, identify the main topics discussed, and even predict future
trends.
Let's say you have a collection of customer reviews for a restaurant. Text analysis lets you surface the most frequent complaints and compliments, group reviews by theme, and gauge the overall sentiment. Two popular toolkits for this kind of work are:
• R: A powerful programming language with many libraries for text analysis (such as tidytext, tm, and sentimentr).
• Python: Another popular language, with libraries like NLTK, spaCy, and scikit-learn.
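To make this concrete, here is a minimal sketch in R (the review strings are invented for illustration) that finds the most frequent meaningful words in a couple of restaurant reviews using tidytext:

library(dplyr)
library(tidytext)

# Toy reviews (invented for illustration)
reviews <- data.frame(
  text = c("The food was delicious and the delivery was fast",
           "Fast delivery, delicious food, friendly staff")
)

reviews %>%
  unnest_tokens(word, text) %>%            # one row per word
  anti_join(stop_words, by = "word") %>%   # drop "the", "and", "was", ...
  count(word, sort = TRUE)                 # most frequent words first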
Text analysis is a rapidly growing field with applications in various domains, including
business, marketing, social sciences, and even healthcare.
Workflow: Import Data → Data Cleaning → Lemmatisation → Plot (ggplot2) → Word cloud (wordcloud) → Sentiment Analysis (syuzhet)
2 Text Analysis in R
2.1 Text Analysis of a text ta(a)
library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)

# First text mining process
text_data <- data.frame(
  id = as.character(1:6),
  text = c("I am Anubha - finance teacher eager to explore the potential of AI in education.",
           "My goal is to provide the best possible learning experience for my students.",
           "I believe that AI tools can revolutionise the way we teach and learn.",
           "I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           "I am a firm believer in student-centred learning and inquiry-based learning.",
           "I love data science. Data science is amazing!"),
  stringsAsFactors = FALSE
)
# A data frame source interprets each row of the data frame as a document.
# The first column must be named "doc_id" and contain a unique string identifier for each document.
# The second column must be named "text".
colnames(text_data) <- c('doc_id', 'text')
crp <- VCorpus(DataframeSource(text_data))
# VCorpus stands for Volatile Corpus; it is suitable for smaller datasets
# that can be comfortably held in memory.
print(crp[[1]])
# tm_map() applies a specified function to each document within a corpus.
crp <- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
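To confirm the transformation worked on every document, not just the first, a quick check (assumed, not in the original extract):

# Sketch (assumed): print the lower-cased content of all six documents
sapply(crp, content)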
2.2 Text Analysis of a text ta(b)
# Same setup as in 2.1: build the corpus and lower-case it
text_data <- data.frame(
  id = as.character(1:6),
  text = c("I am Anubha - finance teacher eager to explore the potential of AI in education.",
           "My goal is to provide the best possible learning experience for my students.",
           "I believe that AI tools can revolutionise the way we teach and learn.",
           "I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           "I am a firm believer in student-centred learning and inquiry-based learning.",
           "I love data science. Data science is amazing!"),
  stringsAsFactors = FALSE
)
colnames(text_data) <- c('doc_id', 'text')
crp <- VCorpus(DataframeSource(text_data))
print(crp[[1]])
crp <- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp <- tm_map(crp, stripWhitespace)    # removes extra whitespace
crp <- tm_map(crp, removePunctuation)  # removes punctuation
crp <- tm_map(crp, removeNumbers)      # removes numbers
crp <- tm_map(crp, removeWords, stopwords("english"))
# Examples of common English stop words:
## Articles: a, an, the
## Prepositions: in, on, at, to, from, with, for
## Conjunctions: and, but, or, if, because
## Pronouns: I, you, he, she, it, they, we, me, him, her, them, us
## Other: no, not, only, very, this, that, these, those
mystopwords <- c(stopwords("english"), "anubha")
crp <- tm_map(crp, removeWords, mystopwords)
# Assumed step (not shown in the original extract): build the term-document matrix
tdm <- TermDocumentMatrix(crp)
term_freq <- rowSums(as.matrix(tdm))   # total count of each term
term_freq <- subset(term_freq, term_freq >= 1)
df <- data.frame(term = names(term_freq), freq = term_freq)
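The workflow above ends in a word cloud; that call is not shown in this part of the extract. A minimal sketch using the df just built (wordcloud is already loaded):

# Sketch (assumed follow-up): word cloud of the cleaned terms
set.seed(123)   # word placement is random; fixing the seed makes it reproducible
wordcloud(words = df$term, freq = df$freq,
          min.freq = 1, colors = brewer.pal(6, "Dark2"))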
2.3 Text Analysis of a text ta(c)
library(dplyr)
library(tm)
library(ggplot2)
library(tidyverse)
library(tidytext)
library(pdftools)
library(wordcloud)
library(textstem)
# First text mining process (same corpus as before)
text_data <- data.frame(
  id = as.character(1:6),
  text = c("I am Anubha - finance teacher eager to explore the potential of AI in education.",
           "My goal is to provide the best possible learning experience for my students.",
           "I believe that AI tools can revolutionise the way we teach and learn.",
           "I am open to collaborating with anyone who shares my passion for using AI to enhance education.",
           "I am a firm believer in student-centred learning and inquiry-based learning.",
           "I love data science. Data science is amazing!"),
  stringsAsFactors = FALSE
)
# DataframeSource expects the first two columns to be named "doc_id" and "text"
colnames(text_data) <- c('doc_id', 'text')
crp <- VCorpus(DataframeSource(text_data))
print(crp[[1]])
# Lemmatization
crp<- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus<- crp
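Before building the term-document matrix, it helps to see what lemmatize_strings() actually does; a quick standalone check (illustrative input, not from the corpus above):

# Sketch: lemmatization maps inflected forms back to their dictionary form
lemmatize_strings("The teachers are teaching better studies")
# expected output along the lines of: "The teacher be teach good study"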
# Assumed step (not shown in the original extract): build the term-document matrix
tdm <- TermDocumentMatrix(crp)
term_freq <- rowSums(as.matrix(tdm))
term_freq <- subset(term_freq, term_freq >= 1)
df <- data.frame(term = names(term_freq), freq = term_freq)
# Association of words: terms correlated with "text" at r >= 0.1
find_assocs <- findAssocs(tdm, "text", corlimit = 0.1)
df_plot <- df %>%
  top_n(10, freq)   # keep the ten most frequent terms
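The extract builds df_plot but never plots it; a minimal ggplot2 sketch of that step (assumed, matching the workflow above):

# Sketch (assumed follow-up): bar chart of the ten most frequent terms
ggplot(df_plot, aes(x = reorder(term, freq), y = freq)) +
  geom_col() +
  coord_flip() +   # horizontal bars keep long terms readable
  labs(x = "Term", y = "Frequency", title = "Most frequent terms")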
2.4 Sentiment Analysis of a text ta(d)
# Same setup as in 2.3: build the corpus, lemmatize, and compute term frequencies
colnames(text_data) <- c('doc_id', 'text')
crp <- VCorpus(DataframeSource(text_data))
print(crp[[1]])
# Lemmatization
crp <- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus <- crp
tdm <- TermDocumentMatrix(crp)   # assumed step, as above
term_freq <- rowSums(as.matrix(tdm))
term_freq <- subset(term_freq, term_freq >= 1)
df <- data.frame(term = names(term_freq), freq = term_freq)
# Association of words
find_assocs <- findAssocs(tdm, "text", corlimit = 0.1)
df_plot <- df %>%
  top_n(10, freq)
colnames(df)[1] <- 'word'
# Assumed (the original does not show where the lexicon comes from):
# use the Bing lexicon bundled with tidytext
sentiment_lexicon <- get_sentiments("bing")
# Perform sentiment analysis
sentiment_analysis <- df %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)
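sentiment_analysis now holds one row per word with its sentiment label; a short sketch (not in the original) to visualise it:

# Sketch (assumed follow-up): words coloured by their sentiment label
ggplot(sentiment_analysis, aes(x = reorder(word, n), y = n, fill = sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Count", title = "Words by sentiment")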
2.5 Sentiment Analysis of a CSV file ta(e)
# Assumed (the file name is not shown in the original): read the reviews CSV
data <- read.csv("reviews.csv", stringsAsFactors = FALSE)
# Columns 6 and 5 hold the document id and the review text respectively
colnames(data)[c(6, 5)] <- c('doc_id', 'text')
data <- data[, c('doc_id', 'text')]   # DataframeSource expects these as the first two columns
crp <- VCorpus(DataframeSource(data))
print(crp[[1]])
crp <- tm_map(crp, content_transformer(tolower))
print(crp[[1]]$content)
crp <- tm_map(crp, stripWhitespace)    # removes extra whitespace
crp <- tm_map(crp, removePunctuation)  # removes punctuation
crp <- tm_map(crp, removeNumbers)      # removes numbers
crp <- tm_map(crp, removeWords, stopwords("english"))
mystopwords <- c(stopwords("english"), "book", "people")
crp <- tm_map(crp, removeWords, mystopwords)
# Lemmatization
crp <- tm_map(crp, content_transformer(lemmatize_strings))
print(crp[[1]]$content)
review_corpus <- crp
tdm <- TermDocumentMatrix(crp)   # assumed step (not shown in the original extract)
term_freq <- rowSums(as.matrix(tdm))
term_freq <- subset(term_freq, term_freq >= 1)
df <- data.frame(term = names(term_freq), freq = term_freq)
# Association of words
find_assocs <- findAssocs(tdm, "text", corlimit = 0.1)
df_plot <- df %>%
  top_n(10, freq)
colnames(df)[1] <- 'word'
# Perform sentiment analysis (sentiment_lexicon as defined in 2.4)
sentiment_analysis <- df %>%
  inner_join(sentiment_lexicon, by = "word") %>%
  count(word, sentiment, sort = TRUE)
2.6 Sentiment Analysis of a PDF file ta(f)
# Assumed (not shown in the original): collect the PDF files in the working directory
files <- list.files(pattern = "\\.pdf$")
read_function <- readPDF(control = list(text = "-layout"))
read_corpus <- Corpus(URISource(files[1:5]), readerControl = list(reader = read_function))
read_corpus <- tm_map(read_corpus, removePunctuation)
# Assumed steps (not shown in the original): build the document-term matrix
dtm <- DocumentTermMatrix(read_corpus)
dtm_matrix <- as.matrix(dtm)
View(dtm_matrix) # may take a few seconds: shows the count of each word in each PDF read above
number_occurrence <- colSums(dtm_matrix)   # total count of each word across PDFs
number_occurrence_sorted <- sort(number_occurrence, decreasing = TRUE)
library(wordcloud)
set.seed(123)
wordcloud(names(number_occurrence_sorted), number_occurrence_sorted,
          max.words = 25, scale = c(3, .1), colors = brewer.pal(6, "Dark2"))
library(treemap)
data_frame <- data.frame(word = names(number_occurrence_sorted),
                         freq = number_occurrence_sorted)
data_frame[1:20, ]
treemap(data_frame[1:20, ], index = "word", vSize = "freq",
        title = "Top 20 words")   # assumed call; the original loads treemap but omits it
# Cluster analysis on the 20 most frequent words
distance <- dist(data_frame[1:20, "freq", drop = FALSE])   # dist() needs numeric input
distance
clust <- hclust(distance)
plot(clust)   # hang = -1 gives symmetric cluster roots
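The dendrogram shows how the terms group; to turn it into concrete clusters, a common follow-up (assumed, not in the original) is cutree():

# Sketch (assumed): cut the tree into, say, three clusters of words
groups <- cutree(clust, k = 3)
table(groups)   # number of words in each cluster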
2.7 Sentiment Analysis of Chapter 7.1 (HKD)
hing","say","also","can","get","used","got","take","just","now","will","it's","want","whatever","beco
me","that's","said","given","give","much"))
head(reviews)
tail(reviews)
tdm <- TermDocumentMatrix(review_corp,
                          control = list(removePunctuation = TRUE, stopwords = TRUE))
tdm_matrix <- as.matrix(tdm)               # terms as rows, documents as columns
tdm_matrix[1:20, ]                         # inspect the first 20 terms
number_occurrence <- rowSums(tdm_matrix)   # total count of each term across documents
number_occurrence[1:20]
number_occurrence_sorted <- sort(number_occurrence, decreasing = TRUE)
number_occurrence_sorted[1:60]
library(wordcloud)
wordcloud(names(number_occurrence_sorted), number_occurrence_sorted,
          max.words = 25, scale = c(3, .1), colors = brewer.pal(6, "Dark2"))
# Association of words: terms correlated with "time" at r >= 0.1
cor_word <- findAssocs(tdm, "time", corlimit = 0.1)
cor_word$time
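findAssocs() returns every term whose correlation with "time" across documents is at least corlimit, so raising the threshold keeps only the strongest associations; a quick sketch (threshold chosen for illustration):

# Sketch: keep only terms correlated with "time" at r >= 0.5
findAssocs(tdm, "time", corlimit = 0.5)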
library(treemap)
data_frame <- data.frame(word = names(number_occurrence_sorted),
                         freq = number_occurrence_sorted)
data_frame[1:20, ]
treemap(data_frame[1:20, ], index = "word", vSize = "freq",
        title = "Top 20 words")   # assumed call; the extract loads treemap but omits it
# Cluster analysis on the 20 most frequent words
distance <- dist(data_frame[1:20, "freq", drop = FALSE])   # dist() needs numeric input
distance
clust <- hclust(distance)
plot(clust)
# Sentiment analysis
library(syuzhet)
sent_corpus <- iconv(reviews$reviews)          # normalise the text encoding
review_sent <- get_nrc_sentiment(sent_corpus)  # NRC emotion and polarity scores per review
head(review_sent)
sentiment_counts <- colSums(review_sent)
barplot(sentiment_counts, las = 2, col = rainbow(10),
        ylab = 'Count', main = 'Sentiment of reviews')
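Beyond per-emotion counts, syuzhet can also assign one overall polarity score per review; a brief sketch (assumed follow-up):

# Sketch (assumed): one polarity score per review (positive > 0, negative < 0)
polarity <- get_sentiment(sent_corpus, method = "syuzhet")
summary(polarity)
hist(polarity, main = "Polarity of reviews", xlab = "Sentiment score")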
Important Notes:
• Accuracy: Sentiment analysis accuracy depends heavily on the quality of the text data, the chosen lexicon or model, and the complexity of the text.
• Lexicon Selection: Packages such as syuzhet and sentimentr use built-in lexicons; you can explore other lexicons (e.g., Bing Liu, AFINN) for potentially better results.
• Advanced Techniques: For more sophisticated sentiment analysis, consider machine learning models such as Naive Bayes or Support Vector Machines.
• Error Handling: Implement robust error handling for potential issues like invalid PDF files or unexpected text formats.