Text Mining in R (Intro)
Introduction
Text mining has attracted considerable interest in both academic research and business intelligence applications within the last decade. There is an enormous
amount of textual data available in machine readable
format which can be easily accessed via the Internet
or databases. This ranges from scientific articles, abstracts and books to memos, letters, online forums,
mailing lists, blogs, and other communication media
delivering sensible information.
Text mining is a highly interdisciplinary research
field utilizing techniques from computer science, linguistics, and statistics. For the latter, R is one of the
leading computing environments offering a broad
range of statistical methods. However, until recently,
R has lacked an explicit framework for text mining
purposes. This has changed with the tm (Feinerer,
2008; Feinerer et al., 2008) package which provides
a text mining infrastructure for R. This allows R
users to work efficiently with texts and corresponding meta data and to transform the texts into structured
representations to which existing R methods can be applied, e.g., for clustering or classification.
In this article we will give a short overview of the
tm package. Then we will present two exemplary
text mining applications, one in the field of stylometry and authorship attribution, the other an analysis of a mailing list. Our aim is to show that tm provides the necessary abstraction over the actual text
management so that readers can use their own texts
for a wide variety of text mining techniques.
The tm package
The tm package offers functionality for managing text documents, abstracts the process of document manipulation, and eases the use of heterogeneous text formats in R. The package has integrated
database backend support to minimize memory demands. Advanced meta data management is implemented for collections of text documents to ease the use of large document sets enriched with meta data.
The package ships with native support for handling the Reuters 21578 data set, Gmane
RSS feeds, e-mails, and several classic file formats
(e.g., plain text, CSV text, or PDFs). The data structures and algorithms can be extended to fit custom
demands, since the package is designed in a modular way to enable easy integration of new file formats, readers, transformations, and filter operations.
tm provides easy access to preprocessing and manipulation mechanisms such as whitespace removal,
stemming, or conversion between file formats. Furthermore, a generic filter architecture is available
to filter documents by certain criteria or to perform
full text search. The package supports the export
from document collections to term-document matrices, and string kernels can be easily constructed from
text documents.
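To make the notion of a term-document matrix concrete, the following is a minimal sketch in base R only. It does not use tm and is not meant to mirror the package's implementation; the two toy documents and all object names (docs, clean, tokens, tdm) are assumptions made for illustration.

```r
# Two made-up documents with irregular whitespace and mixed case.
docs <- c(doc1 = "Text   mining in R",
          doc2 = "Mining text with  R is fun")

# Preprocessing: trim, collapse runs of whitespace, convert to lower case.
clean <- tolower(gsub("\\s+", " ", trimws(docs)))

# Tokenize on single spaces and keep the document names.
tokens <- strsplit(clean, " ", fixed = TRUE)
names(tokens) <- names(docs)

# Term-document matrix: one row per term, one column per document,
# each cell holding the number of occurrences of the term.
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(tokens,
              function(tok) as.integer(table(factor(tok, levels = terms))),
              integer(length(terms)))
rownames(tdm) <- terms

print(tdm)
```

Once texts are in such a matrix form, standard R methods for clustering or classification can be applied to the rows or columns.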
Wizard of Oz stylometry
The Wizard of Oz book series has been among the
most popular children's novels of the last century.
The first book was created and written by Lyman
Frank Baum, published in 1900. A series of Oz books
followed until Baum died in 1919. After his death
Ruth Plumly Thompson continued the story of Oz
books, but there was long-standing doubt about which
Oz book was the first written by Thompson. In particular, the authorship of the 15th book of Oz, The
Royal Book of Oz, has long been disputed among
literature experts. Today it is commonly attributed
to Thompson as her first Oz book, supported by several statistical stylometric analyses within the last
decade.
Based on some techniques shown in the work of
Binongo (2003) we will investigate a subset of the Oz
book series for authorship attribution. Our data set
consists of five books of Oz, three attributed to Lyman Frank Baum, and two attributed to Ruth Plumly
Thompson:
The Wonderful Wizard of Oz is the first Oz book
written by Baum and was published in 1900.
The Marvelous Land of Oz is the second Oz book
by Baum and was published in 1904.
Ozma of Oz was published in 1907. It is authored
by Baum and forms the third book in the Oz
series.
The Royal Book of Oz is nowadays attributed to
Thompson, but its authorship has been disputed for decades. It was published in 1921
and is considered the 15th book in the series
of Oz novels.
Ozoplaning with the Wizard of Oz was written by
Thompson, is the 33rd book of Oz, and was
published in 1939.
Most books of Oz, including the five books we use
as the corpus for our analysis, can be freely downloaded from the Project Gutenberg website
(http://www.gutenberg.org/) or from the Wonderful Wizard of
Oz website (http://thewizardofoz.info/).
The main data structure in tm is a corpus consisting of text documents and additional meta data. So-called reader functions can be used to read in texts
ISSN 1609-3631
Since our input consists of plain texts without meta data annotations, we might want to add the meta information manually. Technically, we use the meta() function to operate on the meta data structures of the individual text documents. In our case the meta data is
stored locally in explicit slots of the text documents
(type = "local"):
# Authors in corpus order: three books by Baum, then two by Thompson
meta(oz, tag = "Author", type = "local") <-
  c(rep("Lyman Frank Baum", 3),
    rep("Ruth Plumly Thompson", 2))
# Titles in the same order
meta(oz, "Heading", "local") <-
  c("The Wonderful Wizard of Oz",
    "The Marvelous Land of Oz",
    "Ozma of Oz",
    "The Royal Book of Oz",
    "Ozoplaning with the Wizard of Oz")
> meta(rss[[1]])
Available meta data pairs are:
  Author       : Peter Dalgaard
  Cached       : TRUE
  DateTimeStamp: 2008-04-22 08:16:54
  ID           : http://permalink.gmane.org/gmane.comp.lang.r.general/112060
  Heading      : R 2.7.0 is released
  Language     : en_US
  Origin       : Gmane Mailing List Archive
  URI          : file http://rss.gmane.org/gmane.comp.lang.r.general UTF-8
Then we can create the corpus using the created directory. This time we explicitly specify the reader to
be used, because the directory source cannot automatically infer the internal structure of its files.
> rhelp
A text document collection with 2486 documents
rhelp <- tmMap(rhelp, asPlain)
rhelp <- tmMap(rhelp, stripWhitespace)
rhelp <- tmMap(rhelp, tmTolower)
rhelp <- tmMap(rhelp, removeSignature)
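The effect of such a preprocessing chain on a single message can be sketched in base R. This is an illustration under stated assumptions, not the code behind tmMap(): the message is made up, strip_signature() is a hypothetical helper, and we assume that a signature starts at the conventional "-- " delimiter line.

```r
# A made-up message, one element per line.
mail <- c("Hi all,",
          "I have a   PROBLEM with lm().",
          "-- ",
          "Jane Doe",
          "Some University")

# Hypothetical helper: drop everything from the first "-- " line onwards.
strip_signature <- function(lines) {
  sig <- which(lines == "-- ")
  if (length(sig) > 0) {
    lines <- lines[seq_len(sig[1] - 1)]
  }
  lines
}

# Same order of steps as in the text: whitespace, lower case, signature.
body <- gsub("\\s+", " ", mail)   # collapse runs of whitespace
body <- tolower(body)             # convert to lower case
body <- strip_signature(body)     # remove the signature block

print(body)
```

Each step takes a message and returns a transformed message, which is what makes it possible to apply the steps one after another to every document in a collection.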
Finally, we perform a full text search on the corpus. We want to find out the percentage of mails
dealing with problems. Specifically, we search for those
documents in the corpus that explicitly contain the
term "problem". The function tmIndex() filters out
those documents and returns an index of the documents that the
full text search matches.
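The idea behind such a search, and the percentage we are after, can be sketched in base R. The messages below are made up, the object names (mails, hits, pct) are assumptions, and grepl() merely stands in for the search function of the package.

```r
# Made-up messages standing in for the mailing-list corpus.
mails <- c("I have a problem with lm()",
           "R 2.7.0 is released",
           "Problem installing a package",
           "Thanks, that worked")

# Logical index: TRUE where a message contains the term "problem".
# Lower-casing first makes the match case-insensitive.
hits <- grepl("problem", tolower(mails), fixed = TRUE)

# Share of matching messages, in percent.
pct <- 100 * sum(hits) / length(mails)

print(which(hits))
print(pct)
```

With tm itself, the search is expressed by the following call: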
p <- tmIndex(rhelp, FUN = searchFullText,
"problem", doclevel = TRUE)
Outlook
The tm package provides the basic infrastructure for
text mining applications in R. However, there are
R News
Acknowledgments
I would like to thank Kurt Hornik, David Meyer,
and Vince Carey for their constructive comments and
feedback.
Bibliography
J. N. G. Binongo. Who wrote the 15th book of Oz? An
application of multivariate analysis to authorship
attribution. Chance, 16(2):9-17, 2003.
I. Feinerer. tm: Text Mining Package, 2008. URL
http://CRAN.R-project.org/package=tm. R package
version 0.3-1.
I. Feinerer, K. Hornik, and D. Meyer. Text mining
infrastructure in R. Journal of Statistical Software,
25(5):1-54, March 2008. ISSN 1548-7660. URL
http://www.jstatsoft.org/v25/i05.
L. M. Ingebrigtsen. Gmane: A mailing list archive,
2008. URL http://gmane.org/.
A. Karatzoglou, A. Smola, K. Hornik, and A. Zeileis.
kernlab - an S4 package for kernel methods in R.
Journal of Statistical Software, 11(9):1-20, 2004. URL
http://www.jstatsoft.org/v11/i09/.
Ingo Feinerer
Wirtschaftsuniversität Wien, Austria