Text Analysis

This document discusses text mining and word clouds. It explains that text mining is used to convert unstructured text data into a meaningful form by identifying and building features. The bag of words approach treats each document as a collection of words while ignoring word order and structure. Cleaning text involves steps like converting to lowercase, removing stopwords and punctuation, and reducing words to their root form. Word clouds then visually represent the most frequent terms in a document with the term sizes proportional to their frequencies.

Uploaded by Grace Yin

Text Mining and Word Clouds

Text is Everywhere

• Medical Records
• Consumer Complaint Logs
• Product Inquiries
• Social Media Posts (Twitter feed, Emails, Facebook status,
Reddit comments, etc.)
• Personal Webpages

Text Mining deals with converting this vast amount of data into a meaningful form.
• Structured data:
  • Well organized
  • Common, agreed-upon features in each data sample
  • Formats: tables, relational databases, etc.
  • Sources: government, industry, CRM, markets, etc.
• Unstructured data:
  • Not well organized; unclear what the features are
  • A lot of heterogeneity between data samples
  • Formats: text, images, video, audio, etc.
  • Sources: social media, security cameras, etc.

• Unstructured data is unstructured because adding structure is hard work
• It is not designed for analysis
Adding Structure

[Diagram: raw, unstructured text (“blah, blah, blah, …”) passes through an “Identify and Build Features” step to become a table of features (f1, f2, …) with values (val11, val12, …; val21, val22, …), which can then be used to Explore, Explain, and Predict.]

Text Data is Difficult to Analyze

• Text data is “unstructured”: it does not come in a well-formatted table with each field having a specific meaning!
• Text has a linguistic structure that is easily understood by humans (not computers)
• Words vary in length, and the order of words matters
• The data tends to have poor quality: spelling mistakes, abbreviations, punctuation, etc.

Text data must undergo extensive preprocessing before being used in any analytics algorithm/application.
Bag of Words Approach
• Treat a document as a collection of individual words, i.e.
Ignore Grammar, Word Order, Sentence Structure, etc.
• Each word is equally likely to be an important keyword.
• The words that appear the most in the document are the most important keywords (the most valuable features).
• The term frequency TF(t, d) is the number of times a particular word t appears in a document d (it may also be normalized).

“all data mining involves the use of machine learning but not all
machine learning requires data mining”

Term      Freq.
all       2
data      2
mining    2
involves  1
the       1
use       1
of        1
machine   2
learning  2
but       1
not       1
requires  1
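The term frequencies above can be reproduced in a few lines by treating the sentence as a bag of words and counting. A minimal sketch, written here in Python purely for illustration (the slides' own implementation, in R, follows later):

```python
from collections import Counter

sentence = ("all data mining involves the use of machine learning "
            "but not all machine learning requires data mining")

# Bag of words: split on whitespace and count, ignoring order and structure.
tf = Counter(sentence.split())

print(tf["data"], tf["machine"], tf["involves"])  # → 2 2 1
```

Counter is simply a dictionary from term to count, i.e. exactly the TF(t, d) table above.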
Advantages
• A very simple representation
• Inexpensive to generate
• Works in many settings: technical reports, prescriptions, …
• Often works surprisingly well!
• “a duck walked up to a lemonade stand”
• “a horse walked up to a lemonade
stand”
• “The Duck walks near the Lemonade
Stand”
According to bag of words:

[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is similar to
[“a”, “horse”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]

BUT
[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is not similar to
[“The”, “Duck”, “walks”, “near”, “the”, “Lemonade”, “Stand”]
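This (dis)similarity can be made concrete with a word-overlap measure. A sketch using Jaccard similarity on raw, uncleaned tokens (Jaccard is one common choice here, not the only one, and is used for illustration only):

```python
def jaccard(a, b):
    """Jaccard similarity between the word sets of two sentences."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

duck  = "a duck walked up to a lemonade stand"
horse = "a horse walked up to a lemonade stand"
duck2 = "The Duck walks near the Lemonade Stand"

print(jaccard(duck, horse))  # high: only "duck" vs "horse" differ
print(jaccard(duck, duck2))  # zero: case and inflection hide the overlap
```

The third sentence shares no exact tokens with the first, which motivates the cleaning steps (lowercasing, stemming) described next.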
Cleaning the Text

• Convert the text to lower case.
• Remove common stopwords like “the”, “we”, “and”, etc.
  • “not” is not a good stopword. Why?
• Remove numbers (or replace them with words).
• Remove punctuation like “.”, “,”, etc.
• Reduce the words to their root (word stemming). Example: “announces”, “announced”, “announcing” are all reduced to “announc”.
• Remove unnecessary white space.
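The steps above can be sketched end to end without any text-mining library. This is a deliberately crude Python illustration: the tiny hand-written stopword list and the suffix-stripping "stemmer" are stand-ins for tm's stopwords("english") and SnowballC's real Porter stemmer used in the R code below.

```python
import re

# Tiny illustrative stopword list; "not" is deliberately kept, since removing
# it would flip the meaning of negated sentences.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "but", "we"}

def crude_stem(word):
    # Crude stand-in for a real stemmer: strip a few common suffixes.
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) > 2:
            return word[: -len(suffix)]
    return word

def clean(text):
    text = text.lower()                                      # convert to lower case
    text = re.sub(r"\d+", "", text)                          # remove numbers
    text = re.sub(r"[^\w\s]", "", text)                      # remove punctuation
    words = [w for w in text.split() if w not in STOPWORDS]  # remove stopwords
    words = [crude_stem(w) for w in words]                   # reduce words to their root
    return " ".join(words)                                   # split/join normalizes white space

print(clean("Announces, announced, ANNOUNCING!"))  # → announc announc announc
```

Note that all three inflections collapse to the same root, matching the “announc” example above.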
Cleaning the Text in R

Load the required libraries


library("tm") # Text Mining Library
library("SnowballC") # For reducing words to their root

Create the text document object

myDocument <- Corpus(VectorSource("All data mining involves the use of machine learning, but not all machine learning requires data mining."))

Clean the Text


myDocument <- tm_map(myDocument, content_transformer(tolower)) #Convert to lower case
myDocument <- tm_map(myDocument, removeWords, stopwords("english")) #Remove stopwords
myDocument <- tm_map(myDocument, removeNumbers) #Remove numbers
myDocument <- tm_map(myDocument, removePunctuation) #Remove punctuation
myDocument <- tm_map(myDocument, stemDocument) #Reduce the words to their root
myDocument <- tm_map(myDocument, stripWhitespace) #Remove unnecessary white space
Getting Term Frequency Table in R

termMatrix <- as.matrix(TermDocumentMatrix(myDocument)) # Get term/frequency matrix
sortedtermMatrix <- sort(rowSums(termMatrix), decreasing = TRUE) # Sort in decreasing order of frequency
d <- data.frame("Term" = names(sortedtermMatrix), "Freq." = sortedtermMatrix,
                row.names = NULL) # Store as a data frame
print(d) # Display the data frame
Word Cloud

Word clouds are commonly used to visualize/highlight keywords in documents.
• Artistically place words with sizes proportional to their frequency of occurrence.
• Typically, the exact position of a word does not mean anything.

library("wordcloud") # Word Cloud Library
library("RColorBrewer") # For the brewer.pal color palettes
wordcloud(words = d$Term, freq = d$Freq., colors = brewer.pal(8, "Dark2"))
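The "size proportional to frequency" rule can be sketched as a simple linear map from term frequency to font size. The Python helper below (`font_sizes`) is hypothetical and for illustration only; it is not part of the wordcloud package, whose actual scaling is controlled by its `scale` argument:

```python
def font_sizes(freqs, min_pt=10, max_pt=48):
    """Map each term's frequency linearly onto [min_pt, max_pt] points."""
    lo, hi = min(freqs.values()), max(freqs.values())
    span = (hi - lo) or 1  # avoid division by zero when all frequencies are equal
    return {t: min_pt + (f - lo) * (max_pt - min_pt) / span
            for t, f in freqs.items()}

sizes = font_sizes({"data": 2, "mining": 2, "involves": 1})
print(sizes)  # the most frequent terms get the largest font
```

Terms tied for the highest frequency share the largest size, which is why “data”, “mining”, and “machine” would dominate a cloud of the example sentence.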
