
An Introduction to Text Mining in R


Ingo Feinerer

Introduction
Text mining has attracted considerable interest in both academic research and business intelligence applications over the last decade. There is an enormous amount of textual data available in machine-readable format which can be easily accessed via the Internet or databases. This ranges from scientific articles, abstracts, and books to memos, letters, online forums, mailing lists, blogs, and other communication media delivering valuable information.
Text mining is a highly interdisciplinary research field utilizing techniques from computer science, linguistics, and statistics. For the latter, R is one of the leading computing environments, offering a broad range of statistical methods. However, until recently, R has lacked an explicit framework for text mining purposes. This has changed with the tm (Feinerer, 2008; Feinerer et al., 2008) package which provides a text mining infrastructure for R. This allows R users to work efficiently with texts and corresponding meta data and to transform the texts into structured representations where existing R methods can be applied, e.g., for clustering or classification.
In this article we will give a short overview of the tm package. Then we will present two exemplary text mining applications: one in the field of stylometry and authorship attribution, and one analyzing a mailing list. Our aim is to show that tm provides the necessary abstraction over the actual text management so that readers can apply a wide variety of text mining techniques to their own texts.

The tm package
The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the handling of heterogeneous text formats in R. The package has integrated database backend support to minimize memory demands. Advanced meta data management is implemented for collections of text documents to ease the use of large, meta data enriched document sets. The package ships with native support for handling the Reuters-21578 data set, Gmane RSS feeds, e-mails, and several classic file formats (e.g., plain text, CSV text, or PDFs). The data structures and algorithms can be extended to fit custom demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations, and filter operations.
tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal,
stemming, or conversion between file formats. Furthermore, a generic filter architecture is available to filter documents for certain criteria or to perform full text search. The package supports the export of document collections to term-document matrices, and string kernels can be easily constructed from text documents.
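As a minimal sketch of this workflow (not taken from the article itself), the following lines read a small collection of plain text files, apply two of the preprocessing steps mentioned above, and export the result to a term-document matrix. The directory name Texts/ is made up, and the stemDoc transformation is an assumption about the tm API of that time; stripWhitespace, tmMap(), and TermDocMatrix() appear later in this article.

library(tm)
## read one document per file from a (hypothetical) directory of plain texts
txt <- Corpus(DirSource("Texts/"))
## collapse runs of whitespace, then stem the remaining terms
## (stemDoc is assumed here; see the tmMap() calls later in the article)
txt <- tmMap(txt, stripWhitespace)
txt <- tmMap(txt, stemDoc)
## export the preprocessed collection to a term-document matrix
tdm <- TermDocMatrix(txt)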

Wizard of Oz stylometry
The Wizard of Oz book series has been among the most popular children's novels of the last century. The first book was written by Lyman Frank Baum and published in 1900. A series of Oz books followed until Baum died in 1919. After his death Ruth Plumly Thompson continued the series of Oz books, but there was longstanding doubt about which was the first Oz book actually written by Thompson. In particular, the authorship of the 15th book of Oz, The Royal Book of Oz, has long been disputed among literature experts. Today it is commonly attributed to Thompson as her first Oz book, supported by several statistical stylometric analyses within the last decade.
Based on some techniques shown in the work of
Binongo (2003) we will investigate a subset of the Oz
book series for authorship attribution. Our data set
consists of five books of Oz, three attributed to Lyman Frank Baum, and two attributed to Ruth Plumly
Thompson:
The Wonderful Wizard of Oz is the first Oz book
written by Baum and was published in 1900.
The Marvelous Land of Oz is the second Oz book
by Baum published in 1904.
Ozma of Oz was published in 1907. It is authored
by Baum and forms the third book in the Oz
series.
The Royal Book of Oz is nowadays attributed to Thompson, but its authorship has been disputed for decades. It was published in 1921 and is considered the 15th book in the series of Oz novels.
Ozoplaning with the Wizard of Oz was written by
Thompson, is the 33rd book of Oz, and was
published in 1939.
Most books of Oz, including the five books we use as corpus for our analysis, can be freely downloaded at the Gutenberg Project website (http://www.gutenberg.org/) or at the Wonderful Wizard of Oz website (http://thewizardofoz.info/).
The main data structure in tm is a corpus consisting of text documents and additional meta data. So-called reader functions can be used to read in texts
from various sources, like text files on a hard disk.


E.g., the following code creates the corpus oz by reading in the text files from the directory OzBooks/, utilizing a standard reader for plain texts.
oz <- Corpus(DirSource("OzBooks/"))

The directory OzBooks/ contains five text files in plain text format holding the five Oz books described above, so the resulting corpus has five elements representing them:
> oz
A text document collection with 5 text documents

Since our input consists of plain texts without meta data annotations, we might want to add the meta information manually. Technically we use the meta() function to operate on the meta data structures of the individual text documents. In our case the meta data is stored locally in explicit slots of the text documents (type = "local"):
meta(oz, tag = "Author", type = "local") <-
  c(rep("Lyman Frank Baum", 3),
    rep("Ruth Plumly Thompson", 2))
meta(oz, "Heading", "local") <-
  c("The Wonderful Wizard of Oz",
    "The Marvelous Land of Oz",
    "Ozma of Oz",
    "The Royal Book of Oz",
    "Ozoplaning with the Wizard of Oz")

E.g., for the first book in our corpus this yields


> meta(oz[[1]])
Available meta data pairs are:
  Author       : Lyman Frank Baum
  Cached       : TRUE
  DateTimeStamp: 2008-04-21 16:23:52
  ID           : 1
  Heading      : The Wonderful Wizard of Oz
  Language     : en_US
  URI          : file OzBooks//01_TheWonderfulWizardOfOz.txt UTF-8

The first step in our analysis will be the extraction of the most frequent terms, similar to a procedure by Binongo (2003). At first we create a so-called term-document matrix, that is, a representation of the corpus in the form of a matrix with documents as rows and terms as columns. The matrix elements are frequencies counting how often a term occurs in a document.
ozMatBaum <- TermDocMatrix(oz[1:3])
ozMatRoyal <- TermDocMatrix(oz[4])
ozMatThompson <- TermDocMatrix(oz[5])

After creating three term-document matrices for the three cases we want to distinguish (first, books by Baum; second, The Royal Book of Oz; and third, a book by Thompson), we can use the function findFreqTerms(mat, low, high) to compute the terms in the matrix mat occurring at least low times and at most high times (high defaults to Inf).
baum <- findFreqTerms(ozMatBaum, 70)
royal <- findFreqTerms(ozMatRoyal, 50)
thomp <- findFreqTerms(ozMatThompson, 40)

The lower bounds have been chosen such that we obtain about 100 terms from each term-document matrix. A simple test of intersection within the 100 most common words shows that The Royal Book of Oz has more matches with the Thompson book than with the Baum books, a first indication of Thompson's authorship.
> length(intersect(thomp, royal))
[1] 73
> length(intersect(baum, royal))
[1] 65

The next step applies a Principal Component Analysis (PCA), where we use the kernlab (Karatzoglou et al., 2004) package for computation and visualization. Instead of using the whole books we create equally sized chunks to account for stylometric fluctuations within single books. In detail, the function makeChunks(corpus, chunksize) takes a corpus and the chunk size as input and returns a new corpus containing the chunks. Then we can compute a term-document matrix for a corpus holding chunks of 500 lines each. Note that we use a binary weighting of the matrix elements, i.e., multiple term occurrences are only counted once.
ozMat <- TermDocMatrix(makeChunks(oz, 500),
                       list(weighting = weightBin))

This matrix is the input for the Kernel PCA, where we use a standard Gaussian kernel and request two feature dimensions for better visualization.
k <- kpca(as.matrix(ozMat), features = 2)
plot(rotated(k),
     col = c(rep("black", 10), rep("red", 14),
             rep("blue", 10),
             rep("yellow", 6), rep("green", 4)),
     pty = "s",
     xlab = "1st Principal Component",
     ylab = "2nd Principal Component")

Figure 1 shows the results. Circles in black are chunks from the first book of Oz by Baum, red circles denote chunks from the second book by Baum, and blue circles correspond to chunks of the third book by Baum. Yellow circles depict chunks from the long disputed 15th book (by Thompson), whereas green circles correspond to the 33rd book of Oz by Thompson. The results show that there is a visual correspondence between the 15th and the 33rd book (yellow and green), which suggests that both books were authored by Thompson. However, the results also reveal parts that match books by Baum.

Figure 1: Principal component plot for five Oz books using 500-line chunks.

Similarly, we can redo this plot using chunks of 100 lines. This time we use a weighting scheme called term frequency-inverse document frequency (tf-idf), a weighting very common in text mining and information retrieval.
ozMat <- TermDocMatrix(makeChunks(oz, 100),
                       list(weighting = weightTfIdf))
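For reference (the exact normalization used by weightTfIdf is not spelled out here, so this is the generic textbook form), tf-idf assigns a term t in document d the weight

  tfidf(t, d) = tf(t, d) * log(N / df(t))

where tf(t, d) is the frequency of t in d, df(t) is the number of documents containing t, and N is the total number of documents; terms that are frequent within a document but rare across the corpus therefore receive the highest weights.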

Figure 2 shows the corresponding plot using the same colors as before. Again we see that there is a correspondence between the 15th book and the 33rd book by Thompson (yellow and green), but also parts that match books by Baum.

Figure 2: Principal component plot for five Oz books using 100-line chunks.

R-help mailing list analysis
Our second example deals with aspects of analyzing a mailing list. As data set we use the R-help mailing list since its archives are publicly available and can be accessed via an RSS feed or downloaded in mailbox format. For the RSS feed we can use the Gmane (Ingebrigtsen, 2008) service, so e.g., with

rss <- Corpus(GmaneSource("http://rss.gmane.org/gmane.comp.lang.r.general"))

we get the latest news from the mailing list. tm ships with a variety of sources for different purposes, including a source for RSS feeds in the format delivered by Gmane. The source automatically uses a reader function for newsgroups as it detects the RSS feed structure. This means that the meta data in the e-mail headers is extracted, e.g., we obtain

> meta(rss[[1]])
Available meta data pairs are:
  Author       : Peter Dalgaard
  Cached       : TRUE
  DateTimeStamp: 2008-04-22 08:16:54
  ID           : http://permalink.gmane.org/gmane.comp.lang.r.general/112060
  Heading      : R 2.7.0 is released
  Language     : en_US
  Origin       : Gmane Mailing List Archive
  URI          : file http://rss.gmane.org/gmane.comp.lang.r.general UTF-8

for a recent mail announcing the availability of R 2.7.0.

If we want to work with a bigger collection of mails we use the mailing list archives in mailbox format. For demonstration we downloaded the file 2008-January.txt.gz from https://stat.ethz.ch/pipermail/r-help/. In the next step we convert the single file containing all mails into a collection of files, each containing a single mail. This is especially useful when adding new files or when lazy loading of the mail corpora is intended, i.e., when the loading of the mail content into memory is deferred until it is first accessed. For the conversion we use the function convertMboxEml(), which extracts the R-help January 2008 mailing list postings to the directory Mails/2008-January/:

convertMboxEml(
  gzfile("Mails/2008-January.txt.gz"),
  "Mails/2008-January/")

Then we can create the corpus from the created directory. This time we explicitly define the reader to be used, as the directory source cannot automatically infer the internal structure of its files:

rhelp <- Corpus(DirSource("Mails/2008-January/"),
                list(reader = readNewsgroup))

The newly created corpus rhelp representing the January archive contains almost 2500 documents:
> rhelp
A text document collection with 2486 documents

Then we can perform a set of transformations, like converting the e-mails to plain text, stripping extra whitespace, converting the mails to lower case, or removing e-mail signatures (i.e., text after a "-- " mark, two hyphens followed by a space).
rhelp <- tmMap(rhelp, asPlain)
rhelp <- tmMap(rhelp, stripWhitespace)
rhelp <- tmMap(rhelp, tmTolower)
rhelp <- tmMap(rhelp, removeSignature)

This simple preprocessing already enables us to work with the e-mails in a comfortable way, concentrating on the main content in the mails. This way we avoid manually handling and parsing the internals of newsgroup postings.
Since we have the meta data available we can
find out who wrote the most postings during January 2008 in the R-help mailing list. We extract the
author information from all documents in the corpus
and normalize multiple entries (i.e., several lines for
the same mail) to a single line:
authors <- lapply(rhelp, Author)
authors <- sapply(authors, paste, collapse = " ")

Then we can easily find out the most active writers:


> sort(table(authors), decreasing = TRUE)[1:3]
Gabor Grothendieck 100
Prof Brian Ripley 97
Duncan Murdoch 63

Finally, we perform a full text search on the corpus. We want to find out the percentage of mails dealing with problems. In detail, we search for those documents in the corpus that explicitly contain the term problem. The function tmIndex() filters out those documents and returns their index where the full text search matches.
p <- tmIndex(rhelp, FUN = searchFullText,
             "problem", doclevel = TRUE)

The return variable p is a logical vector of the same size as the corpus, indicating for each document whether the search has matched or not. So we obtain
> sum(p) / length(rhelp)
[1] 0.2373290

as the percentage of explicit problem mails in relation to the whole corpus.

Outlook
The tm package provides the basic infrastructure for text mining applications in R. However, there are open challenges for future research: First, larger data sets easily consist of several thousand documents, resulting in large term-document matrices. This causes a severe memory problem as soon as dense data structures are computed from sparse term-document matrices. Nevertheless, in many cases we can significantly reduce the size of a term-document matrix by either removing stopwords, i.e., words with low information entropy, or by using a controlled vocabulary. Both techniques are supported by tm via the stopwords and dictionary arguments for the TermDocMatrix() constructor (a short sketch follows this paragraph). Second, operations on large corpora are time consuming and should be avoided as far as possible. A first solution resulted in so-called lazy transformations, which materialize operations only when documents are later accessed, but further improvement is necessary. Finally, we have to work on better integration with other packages for natural language processing, e.g., with packages for tag-based annotation.
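As a hedged sketch of the two reduction techniques just mentioned (the article only names the stopwords and dictionary arguments; the calling convention below mirrors the control lists used earlier in this article, and the dictionary terms are made up):

## drop common stopwords while building the matrix (calling convention assumed)
ozMatSmall <- TermDocMatrix(oz, list(stopwords = TRUE))
## alternatively, restrict the matrix to a small controlled vocabulary
ozMatDict <- TermDocMatrix(oz,
                           list(dictionary = c("wizard", "scarecrow", "emerald")))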

Acknowledgments
I would like to thank Kurt Hornik, David Meyer,
and Vince Carey for their constructive comments and
feedback.

Bibliography
J. N. G. Binongo. Who wrote the 15th book of Oz? An application of multivariate analysis to authorship attribution. Chance, 16(2):9-17, 2003.
I. Feinerer. tm: Text Mining Package, 2008. URL http://CRAN.R-project.org/package=tm. R package version 0.3-1.
I. Feinerer, K. Hornik, and D. Meyer. Text mining infrastructure in R. Journal of Statistical Software, 25(5):1-54, March 2008. ISSN 1548-7660. URL http://www.jstatsoft.org/v25/i05.
L. M. Ingebrigtsen. Gmane: A mailing list archive, 2008. URL http://gmane.org/.
A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis. kernlab - An S4 package for kernel methods in R. Journal of Statistical Software, 11(9):1-20, 2004. URL http://www.jstatsoft.org/v11/i09/.

Ingo Feinerer
Wirtschaftsuniversität Wien, Austria
[email protected]
