A Tutorial of Text Mining in R Using TM Package

This document provides a tutorial for performing text mining in R using the TM package. It explains the basics of text mining concepts like corpora, document term matrices, stemming, stop words, and n-grams. The tutorial then walks through the steps of loading libraries, reading text files, cleaning the corpus, exploratory analysis including word clouds, and generating n-grams and histograms of the top n-grams. The goal is to provide an introductory guide for performing basic text mining in R.

Uploaded by

Angel Montilla

Anyone working in data analytics will sooner or later come across data mining: examining large to
very large volumes of structured and unstructured data to derive actionable insights.
This article is a guide to getting started with text mining in R using the tm package, and it shows
how much R and its packages have to offer for the task. A reader with elementary knowledge of R can
follow along. The guide goes as far as exploratory data analysis and N-gram generation.
Important Terms:
Before we dig deep into text mining, we need to get familiar with some of its important concepts.
a. TM package: The R package for text mining [1].
b. Corpus & Corpora: A corpus is a large collection of text, a body of written or spoken material
on which a linguistic analysis is based. The plural of corpus is corpora, i.e. collections of
documents containing natural language text. [2]
c. Document Term Matrix (DTM): A Document Term Matrix is a matrix that describes the frequency of
terms occurring in a collection of documents. It has one row per document and one column per term;
each cell holds the number of times that term occurs in that document.
d. Stemming: Stemming is the process of reducing words to their base form, which makes analysis
easier, e.g. win and winning are both counted under the stem win. How far a derived form such as
winner is reduced depends on the stemming algorithm.
e. Stop Words: These are the most common words in a language. They occur frequently but add little
value to text mining, e.g. I, our, they’ll, etc. The English stop word list shipped with the tm
package contains 174 words.
f. Bad Words: These are offensive words that need to be removed before we start text mining.
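To make the Document Term Matrix concrete, here is a minimal sketch. The two documents are made up for illustration and are not part of the original article; only tm functions are used.

```r
library(tm)

# Two tiny made-up documents (illustrative only).
docs <- c("text mining is fun", "mining text with R is fun fun")

# Build a corpus from the character vector and derive its DTM.
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Rows are documents, columns are terms, cells are counts.
# By default tm lowercases terms and keeps only those with at least
# three characters, so "is" and "R" do not appear as columns.
m <- as.matrix(dtm)
print(m)
```

The second row should show a count of 2 for the term fun, since that word occurs twice in the second document.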
With the above introduction and basics in place, let’s get started with implementing text mining in R.
Step 1: Install and load the necessary libraries. Of these, tm is R’s text mining package; the others
are supplementary packages used for reading lines from a file, plotting, preparing word clouds,
generating N-grams, etc.
Note: If any of these libraries is not installed, use install.packages() to install it.
Set constants for values that are used multiple times. This is considered good programming practice.
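The original code listing for this step is not preserved in this text, so the following is only a sketch. The package set, the file name, and the constant names are assumptions chosen to match the later steps, not the author's actual choices.

```r
# Assumed package set: tm for text mining, SnowballC for stemming,
# wordcloud and RColorBrewer for word clouds.
pkgs <- c("tm", "SnowballC", "wordcloud", "RColorBrewer")

# Check which packages are available without attaching them yet.
available <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!available]

if (length(missing_pkgs) > 0) {
  # Tell the user what to install instead of installing silently.
  message("Install first: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  invisible(lapply(pkgs, library, character.only = TRUE))
}

# Constants used in later steps (hypothetical names and values).
TEXT_FILE <- "text_mining.txt"  # text dump of the page in reference [3]
TOP_N <- 10                     # number of top N-grams to plot
```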

Step 2: Read the text file contents [3]. Optionally, gather and display basic file attributes, viz.
file size, number of lines, and number of words.
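A minimal base R sketch of this step follows. So that it runs on its own, it first writes a three-line stand-in file to a temporary location; in practice text_file would point to the actual text dump. The variable names are illustrative.

```r
# Stand-in for the real text dump so the sketch is self-contained.
text_file <- tempfile(fileext = ".txt")
writeLines(c("Text mining is the process of deriving information from text.",
             "",
             "It is also called text data mining."),
           text_file)

# Read all lines from the file.
con <- file(text_file, open = "r")
lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

# Basic file attributes.
size_kb <- file.info(text_file)$size / 1024
n_lines <- length(lines)
# Split each line on whitespace; empty lines contribute zero words.
n_words <- sum(lengths(strsplit(trimws(lines), "\\s+")))

cat(sprintf("Size: %.2f KB, lines: %d, words: %d\n", size_kb, n_lines, n_words))
```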

Step 3: Create the file corpus and clean it.
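One possible cleaning pipeline is sketched below using standard tm transformations. The order of the transformations, the sample lines, and the bad_words list are assumptions; the original listing is not preserved.

```r
library(tm)

# Made-up input lines standing in for the lines read in Step 2.
raw_lines <- c("Text Mining in 2024: the BEST tools!",
               "Miners were mining 3 large corpora.")
bad_words <- c("darn")  # hypothetical placeholder for an offensive-word list

corpus <- VCorpus(VectorSource(raw_lines))
corpus <- tm_map(corpus, content_transformer(tolower))       # lower-case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, removeWords, bad_words)             # drop bad words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse spaces

# Stemming needs the SnowballC package; keep the un-stemmed corpus as well.
if (requireNamespace("SnowballC", quietly = TRUE)) {
  stemmed_corpus <- tm_map(corpus, stemDocument)
}

first_doc <- content(corpus[[1]])
print(first_doc)
```

Keeping both the un-stemmed and the stemmed corpus makes it possible to compare the two later, for example in the word clouds.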

Step 4: This step illustrates a few basic exploratory data analysis steps that can serve as a
reference for more detailed exploratory data analysis.

Output is not shown.
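As a rough illustration of what such exploratory steps can look like, the sketch below builds a term-document matrix and lists the most frequent terms. The documents are made up, and the choice of checks is an assumption rather than the author's original analysis.

```r
library(tm)

# Made-up documents standing in for the cleaned corpus from Step 3.
docs <- c("text mining finds patterns in text",
          "data mining finds patterns in data",
          "text analysis")
corpus <- VCorpus(VectorSource(docs))

# Terms in rows, documents in columns.
tdm <- TermDocumentMatrix(corpus)

# Overall frequency of each term, most frequent first.
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(head(freq, 10))

# Terms that occur at least twice across the corpus.
frequent_terms <- findFreqTerms(tdm, lowfreq = 2)
print(frequent_terms)
```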

Step 5: Visualize the frequency of words occurring in the text file using word clouds. This step
generates two word clouds, one for the un-stemmed corpus and one for the stemmed corpus.

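A sketch of the two word clouds follows, assuming the wordcloud, RColorBrewer, and SnowballC packages are installed. The documents, the helper term_freq, and the plotting parameters are illustrative assumptions.

```r
library(tm)
library(SnowballC)
library(wordcloud)
library(RColorBrewer)

# Hypothetical helper: term frequencies of a corpus, most frequent first.
term_freq <- function(corpus) {
  sort(rowSums(as.matrix(TermDocumentMatrix(corpus))), decreasing = TRUE)
}

# Made-up documents standing in for the cleaned corpus.
docs <- c("text mining mines text for patterns",
          "miners mined the mining data",
          "patterns and pattern mining")
corpus <- VCorpus(VectorSource(docs))
stemmed_corpus <- tm_map(corpus, stemDocument)

freq_raw <- term_freq(corpus)
freq_stem <- term_freq(stemmed_corpus)

set.seed(42)          # make the layout reproducible
par(mfrow = c(1, 2))  # two clouds side by side
wordcloud(names(freq_raw), freq_raw, min.freq = 1, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
wordcloud(names(freq_stem), freq_stem, min.freq = 1, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))
```

Because stemming merges related word forms, the stemmed cloud typically contains fewer distinct terms, each with a higher count.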
Step 6: The last step of this guide is to generate N-grams (unigrams, bigrams, and trigrams) and
plot histograms of the 10 most frequent N-grams.
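The tokenizer used in the original listing is not preserved. The sketch below swaps in the ngrams() function from the NLP package, which is attached together with tm, and draws the plots with base R barplot(). The helper names ngram_tokenizer and top_ngrams and the sample text are hypothetical.

```r
library(tm)  # also attaches NLP, which provides ngrams() and words()

# Hypothetical helper: returns a tokenizer that emits n-word sequences.
ngram_tokenizer <- function(n) {
  function(x) {
    unlist(lapply(NLP::ngrams(words(x), n), paste, collapse = " "),
           use.names = FALSE)
  }
}

# Hypothetical helper: the k most frequent n-grams of a corpus.
top_ngrams <- function(corpus, n, k = 10) {
  tdm <- TermDocumentMatrix(
    corpus,
    control = list(tokenize = ngram_tokenizer(n), wordLengths = c(1, Inf))
  )
  head(sort(rowSums(as.matrix(tdm)), decreasing = TRUE), k)
}

# Made-up text standing in for the cleaned corpus.
corpus <- VCorpus(VectorSource("text mining is fun and text mining is useful"))

par(mfrow = c(3, 1), mar = c(4, 10, 2, 1))
for (n in 1:3) {
  top <- top_ngrams(corpus, n)
  # Horizontal bars, most frequent N-gram at the top.
  barplot(rev(top), horiz = TRUE, las = 1,
          main = paste0("Top ", n, "-grams"), xlab = "Frequency")
}

bigrams <- top_ngrams(corpus, 2)
print(bigrams)
```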
As a further step, the N-grams generated above could be used for text mining activities such as word
prediction.
References:
a. [1] TM package: https://cran.r-project.org/web/packages/tm/tm.pdf
b. [2] Corpus & Corpora: http://language.worldofcomputing.net/linguistics/introduction/what-is-corpus.html
c. [3] The text file referred to in this guide is a text dump of the following Wikipedia page:
https://en.wikipedia.org/wiki/Text_mining

Source: https://medium.com/text-mining-in-data-science-a-tutorial-of-text/text-mining-in-data-science-51299e4e594
