A Tutorial of Text Mining in R Using TM Package

This document provides a tutorial for performing text mining in R using the TM package. It explains the basics of text mining concepts like corpora, document term matrices, stemming, stop words, and n-grams. The tutorial then walks through the steps of loading libraries, reading text files, cleaning the corpus, exploratory analysis including word clouds, and generating n-grams and histograms of the top n-grams. The goal is to provide an introductory guide for performing basic text mining in R.

Uploaded by

Angel Montilla
Copyright
© All Rights Reserved

A Tutorial of Text Mining in R Using TM Package
Anyone working in data analytics will sooner or later come across data mining. Data mining is about
examining large to extremely large amounts of structured and unstructured data to derive actionable
insights.
This article is a guide to getting started with Text Mining in R using the TM package. It shows some of
what R and its packages have to offer for Text Mining. Anyone with elementary R knowledge can
follow it. It takes the reader as far as exploratory data analysis and N-Grams generation.
Important Terms:
Before we dig deep into Text Mining, we need to get familiar with some important concepts
related to it.
a. TM package: R package for Text Mining [1]
b. Corpus & Corpora: A corpus is a large collection of text: a body of written or spoken material
on which a linguistic analysis is based. The plural of corpus is corpora, i.e. several
collections of documents containing natural language text. [2]
c. Document Term Matrix (DTM): A Document Term Matrix is a mathematical matrix that describes
the frequency of terms that occur in a collection of documents. It has one row per document and one
column per term; each cell holds the number of times that term occurs in that document.
d. Stemming: Stemming is the process of reducing words to their base form, which makes analysis
easier, e.g. words like win, wins and winning are all converted to, and counted under, their base
form, i.e. win.
e. Stop Words: These are the most common words in a language, which occur again and again but add
little value to text mining, e.g. I, our, they’ll, etc. The English stop word list that ships with the
TM package contains 174 words.
f. Bad Words: These are offensive words that need to be removed before we start mining the data.
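To make the terms above concrete, here is a minimal sketch on three made-up sentences. It assumes the tm and SnowballC packages are installed; the toy documents and variable names are illustrative only and are not taken from the original article.

```r
library(tm)         # text mining framework (also attaches NLP)
library(SnowballC)  # stemming backend used by tm's stemDocument()

# Three toy documents, invented for this illustration
docs <- c("the cat is winning", "cats win", "the dog wins the game")

# Corpus: a collection of documents
corpus <- VCorpus(VectorSource(docs))

# Stop words: drop very common words such as "the" and "is"
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Stemming: reduce each word to its base form ("winning", "wins" -> "win")
corpus <- tm_map(corpus, stemDocument)

# Document Term Matrix: one row per document, one column per term
dtm <- DocumentTermMatrix(corpus)
m <- as.matrix(dtm)
print(m)
```

After stop word removal and stemming, the three different surface forms of win fall into a single column, so each document contributes a count of 1 to it.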
With the introduction and basics above, let’s get started with implementing Text Mining in R.
Step 1: Install and load the necessary libraries. Of these, TM is R’s text mining package. The others
are supplementary packages used for reading lines from a file, plotting, preparing word clouds,
N-Gram generation, etc.
Note: If any of the above libraries is not installed, use install.packages() to install it.
Set constants for values that are used multiple times. This is considered good programming practice.
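The original listing for this step is not reproduced here, so the following is only a sketch of what such a setup might look like. The package selection (tm, SnowballC, wordcloud, RColorBrewer) is an assumption, and the constant names and values are hypothetical.

```r
# Packages assumed for this sketch; install any that are missing first, e.g.
# install.packages(c("tm", "SnowballC", "wordcloud", "RColorBrewer"))
library(tm)            # text mining framework (attaches NLP, used later for n-grams)
library(SnowballC)     # stemming backend for stemDocument()
library(wordcloud)     # word clouds
library(RColorBrewer)  # colour palettes for the word clouds

# Constants reused across the later steps (hypothetical names and values)
FILE_PATH <- "text_mining.txt"  # placeholder path to the text file to analyse
TOP_N     <- 10                 # number of top n-grams to plot
MAX_WORDS <- 100                # upper limit of words shown in a word cloud
```

Keeping such values in one place means that a change, for example plotting the top 20 instead of the top 10, only has to be made once.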

Step 2: Read the text file contents [3]. Optionally, gather and display basic file attributes, viz. file
size, number of lines in the file and number of words in the file.
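A possible way to do this with base R only is sketched below. To keep the example runnable on its own, it first writes a small temporary file with made-up content as a stand-in for the real text file; the variable names are illustrative.

```r
# Stand-in for the tutorial's text file: a temporary file with made-up content
file_path <- tempfile(fileext = ".txt")
writeLines(c("Text mining derives information from text.",
             "It is also called text data mining."), file_path)

# Read all lines of the file
con <- file(file_path, open = "r")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Basic file attributes
size_bytes <- file.info(file_path)$size                    # file size in bytes
n_lines    <- length(lines)                                # number of lines
n_words    <- sum(lengths(strsplit(trimws(lines), "\\s+")))  # whitespace-separated words

cat("Size (bytes):", size_bytes, "\n")
cat("Lines:", n_lines, "\n")
cat("Words:", n_words, "\n")
```

Counting words by splitting on whitespace is a rough measure; punctuation attached to a word is counted as part of that word.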

Step 3: Create a corpus from the file contents and clean the corpus.
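The exact cleaning steps of the original are not shown, so the sequence below is an assumption: lower-casing, then removing punctuation, numbers, stop words and bad words, then collapsing whitespace. The sample lines and the bad word list are made up.

```r
library(tm)

# Made-up input lines standing in for the lines read from the file
text_lines <- c("Text Mining in 2 words: it is FUN!",
                "There are 100 more darn examples.")

# Hypothetical bad word list; in practice this would be loaded from a file
bad_words <- c("darn")

# Build the corpus: one document per line
corpus <- VCorpus(VectorSource(text_lines))

# Clean the corpus step by step
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case everything
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, removeWords, bad_words)             # drop bad words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse repeated spaces

# Extract the cleaned text for inspection
cleaned <- unname(sapply(corpus, as.character))
print(cleaned)
```

Base functions such as tolower must be wrapped in content_transformer() so that tm_map keeps the documents as corpus documents instead of turning them into plain character vectors.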


Step 4: This step illustrates a few basic exploratory data analysis steps that can act as a reference for
more detailed exploratory data analysis.
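As a rough sketch of what such an exploration might include, the example below builds a Document Term Matrix from three made-up sentences, prints a summary, and lists the most frequent terms. The choice of checks is an assumption, not the original article's code.

```r
library(tm)

# Made-up documents standing in for the cleaned corpus
docs <- c("text mining finds patterns in text",
          "data mining finds patterns in data",
          "text analysis supports data mining")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Document Term Matrix of the corpus
dtm <- DocumentTermMatrix(corpus)

# Summary of the matrix: dimensions, sparsity and a sample of the counts
inspect(dtm)

# Total frequency of each term across all documents, most frequent first
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
print(head(freq, 5))

# Terms that occur at least three times in the whole corpus
frequent_terms <- findFreqTerms(dtm, lowfreq = 3)
print(frequent_terms)
```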

Output is not shown.


Step 5: Visualize the frequency of words occurring in the text file by using word clouds. The following
code snippet generates two word clouds, one for the un-stemmed and one for the stemmed corpus:
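The sketch below is one way to draw the two clouds side by side, assuming the wordcloud and RColorBrewer packages. The sample documents, the helper term_freq and the plotting parameters are illustrative choices rather than the original snippet.

```r
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

# Made-up documents standing in for the cleaned corpus
docs <- c("mining texts and mining documents",
          "text miners mine documents")
corpus <- VCorpus(VectorSource(docs))
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# Stemmed copy of the corpus
stemmed <- tm_map(corpus, stemDocument)

# Hypothetical helper: term frequencies of a corpus, most frequent first
term_freq <- function(corp) {
  sort(colSums(as.matrix(DocumentTermMatrix(corp))), decreasing = TRUE)
}
freq_raw  <- term_freq(corpus)
freq_stem <- term_freq(stemmed)

# Two clouds next to each other: un-stemmed on the left, stemmed on the right
par(mfrow = c(1, 2))
set.seed(42)  # makes the word placement reproducible
wordcloud(names(freq_raw), freq_raw, min.freq = 1, max.words = 50,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
wordcloud(names(freq_stem), freq_stem, min.freq = 1, max.words = 50,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```

In the stemmed cloud, variants such as text and texts are merged into one entry, so it typically shows fewer but larger words.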
Step 6: The last step of this guide is to generate N-Grams (uni-, bi- and tri-grams) and plot histograms
of the 10 most frequent N-Grams.
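One way to sketch this step without extra dependencies is to use ngrams() and words() from the NLP package, which is attached together with tm. Whether the original article used this tokenizer is not known; the helper names ngram_tokenizer and top_ngrams and the sample documents are hypothetical.

```r
library(tm)  # attaches NLP, which provides ngrams() and words()

# Made-up documents standing in for the cleaned corpus
docs <- c("text mining is fun",
          "text mining in r is fun",
          "text mining with r")
corpus <- VCorpus(VectorSource(docs))

# Returns a tokenizer that splits a document into n-grams of size n
ngram_tokenizer <- function(n) {
  function(x) {
    unlist(lapply(ngrams(words(x), n), paste, collapse = " "), use.names = FALSE)
  }
}

# Frequencies of the most common n-grams of size n in a corpus
top_ngrams <- function(corp, n, top = 10) {
  tdm <- TermDocumentMatrix(corp, control = list(tokenize = ngram_tokenizer(n),
                                                 wordLengths = c(1, Inf)))
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  head(freq, top)
}

uni <- top_ngrams(corpus, 1)
bi  <- top_ngrams(corpus, 2)
tri <- top_ngrams(corpus, 3)

# One bar chart per n-gram size, most frequent n-gram at the top
par(mfrow = c(1, 3), mar = c(4, 9, 2, 1))
barplot(rev(uni), horiz = TRUE, las = 1, main = "Top unigrams")
barplot(rev(bi),  horiz = TRUE, las = 1, main = "Top bigrams")
barplot(rev(tri), horiz = TRUE, las = 1, main = "Top trigrams")
```

Setting wordLengths to c(1, Inf) keeps one-letter tokens such as r, which the default minimum term length of three characters would drop.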
As a further step, the N-Grams generated above could be used for text mining activities such as word
prediction.
References:
a. [1] TM package: https://cran.r-project.org/web/packages/tm/tm.pdf
b. [2] Corpus & Corpora: http://language.worldofcomputing.net/linguistics/introduction/what-is-corpus.html
c. [3] The text file used in this guide is a text dump of the following Wikipedia page:
https://en.wikipedia.org/wiki/Text_mining

Taken from: https://medium.com/text-mining-in-data-science-a-tutorial-of-text/text-mining-in-data-science-51299e4e594
