R Code NB

This document summarizes the process of building a Naive Bayes classifier to perform text classification on the 20 Newsgroups dataset. It involves preprocessing the text by converting to lowercase, removing punctuation, numbers and stopwords. Then it creates document-term matrices for training and test data, builds a Naive Bayes model on the training data, makes predictions on the test data and evaluates the model using a confusion matrix.

Uploaded by

brahmesh_sm


# Text classification using a Naive Bayes scheme
# Data : 20 Newsgroups
# Download link : http://www.cs.umb.edu/~smimarog/textmining/datasets/

# Load all the required libraries. Note : Packages need to be installed first.
library(dplyr)
library(caret)
library(tm)
library(RTextTools)
library(doMC)
library(e1071)
# Register a parallel backend on all available cores (doMC is not available on Windows)
registerDoMC(cores=detectCores())
# Load data.
# We will use the 'train-all-terms' file, which contains 11293 messages.
# Read file as a dataframe
ng.df <- read.table("20ng-train-all-terms.txt", header=FALSE, sep="\t", quote="",
stringsAsFactors=FALSE, col.names = c("topic", "text"))

# Preview the dataframe

# head(ng.df) # or use View(ng.df)
# How many messages does each of the 20 categories contain?
table(ng.df$topic)
# Convert the topic variable to a factor
ng.df$topic <- as.factor(ng.df$topic)

# Randomize : Shuffle rows randomly.

set.seed(2016)
ng.df <- ng.df[sample(nrow(ng.df)), ]
# Create corpus of the entire text
corpus <- Corpus(VectorSource(ng.df$text))

# Total size of the corpus

length(corpus)

# Inspect the corpus

inspect(corpus[1:5])
# Tidy up the corpus using the 'tm_map' function. Make the following transformations on
# the corpus : change to lower case, remove punctuation, numbers and extra white space.
# We also eliminate common English stop words like "his", "our", "hadn't", "couldn't", etc.
# using the stopwords() function.
# Use the pipe operator (%>%), made available here by the 'dplyr' package, to chain the steps neatly
corpus.clean <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords(kind="en")) %>%
  tm_map(stripWhitespace)
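# Illustration (not part of the original analysis): a rough base-R sketch of the
# same cleaning steps applied to a single made-up string. 'toy_stop' is a tiny
# hypothetical stop word list, not tm's stopwords("en"), and all 'toy_' names
# are assumptions for this example only. Because punctuation is removed before
# stop words, "hadn't" becomes "hadnt" and no longer matches the stop word
# "hadn't"; the same ordering is used in the pipeline above.
toy_text <- "Our 2 shuttles, launched in 1993, hadn't landed!"
toy_stop <- c("our", "in", "hadn't")
toy_clean <- tolower(toy_text)
toy_clean <- gsub("[[:punct:]]", "", toy_clean)   # drop punctuation
toy_clean <- gsub("[[:digit:]]", "", toy_clean)   # drop numbers
toy_words <- strsplit(toy_clean, "[[:space:]]+")[[1]]
toy_words <- toy_words[nzchar(toy_words) & !(toy_words %in% toy_stop)]
toy_clean <- paste(toy_words, collapse = " ")     # also collapses extra spaces
toy_clean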
# Create document term matrix
dtm <- DocumentTermMatrix(corpus.clean)
dim(dtm)
# Create a 75:25 data partition : rows 1 to 8470 (75% of the 11293 shuffled messages)
# are used for training and rows 8471 to 11293 (2823 messages) for testing.

ng.df.train <- ng.df[1:8470,]

ng.df.test <- ng.df[8471:11293,]

dtm.train <- dtm[1:8470,]

dtm.test <- dtm[8471:11293,]
dim(dtm.test)
corpus.train <- corpus.clean[1:8470]
corpus.test <- corpus.clean[8471:11293]
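# Illustration: the hard-coded split point 8470 can also be derived from the
# row count, assuming a 75% training share. 'toy_n' stands in for nrow(ng.df);
# the 'toy_' names are hypothetical and are not used elsewhere in this script.
toy_n <- 11293
toy_cut <- round(0.75 * toy_n)               # number of training rows
toy_train_idx <- seq_len(toy_cut)            # rows 1 .. toy_cut
toy_test_idx <- (toy_cut + 1):toy_n          # remaining rows
c(train = length(toy_train_idx), test = length(toy_test_idx))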
# Find frequent words which appear five times or more

fivefreq <- findFreqTerms(dtm.train, 5)

length(fivefreq)
dim(dtm.train)
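# Illustration: a minimal base-R sketch of what a frequency filter such as
# findFreqTerms(dtm.train, 5) selects, namely terms whose total count across
# all documents is at least the threshold. The matrix 'toy_counts' and its
# values are made up for this example.
toy_counts <- matrix(c(3, 0, 1,
                       2, 1, 0,
                       1, 0, 0),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(paste0("doc", 1:3),
                                     c("space", "nasa", "hockey")))
toy_totals <- colSums(toy_counts)            # total occurrences per term
toy_frequent <- names(toy_totals[toy_totals >= 5])
toy_frequent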
# Build dtm using fivefreq words only. Reduce number of features to length(fivefreq)
system.time( dtm.train.five <- DocumentTermMatrix(corpus.train, control =
list(dictionary=fivefreq)) )
system.time( dtm.test.five <- DocumentTermMatrix(corpus.test, control =
list(dictionary=fivefreq)) )
# Convert word counts (0 or more) to presence or absence (yes or no) for each word
convert_count <- function(x) {
  y <- ifelse(x > 0, 1, 0)
  y <- factor(y, levels=c(0,1), labels=c("No", "Yes"))
  y
}
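# Illustration: how a presence/absence conversion behaves on a small count
# matrix. 'toy_presence' is a hypothetical stand-in that mirrors convert_count,
# and the counts are made up. Note that apply() drops the factor class, so the
# result is a character matrix of "No"/"Yes" values.
toy_presence <- function(x) {
  factor(ifelse(x > 0, 1, 0), levels = c(0, 1), labels = c("No", "Yes"))
}
toy_m <- matrix(c(0, 2,
                  3, 0),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("doc1", "doc2"), c("space", "hockey")))
toy_yn <- apply(toy_m, 2, toy_presence)
toy_yn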
# Apply yes/no function to get final training and testing dtms
system.time( ng.train <- apply(dtm.train.five, 2, convert_count) )
system.time ( ng.test <- apply(dtm.test.five, 2, convert_count) )
# Build the NB classifier
system.time (ngclassifier <- naiveBayes(ng.train, ng.df.train$topic))
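# Illustration: what a Naive Bayes classifier computes for Yes/No features,
# worked by hand in base R on made-up data. This is a simplified sketch that
# assumes Laplace smoothing of 1; it does not reproduce e1071's internals, and
# all 'toy_' names are hypothetical.
toy_x <- data.frame(space  = c("Yes", "Yes", "No", "No"),
                    hockey = c("No", "No", "Yes", "Yes"))
toy_y <- factor(c("sci", "sci", "rec", "rec"))
toy_new <- c(space = "Yes", hockey = "No")   # document to classify

toy_score <- sapply(levels(toy_y), function(cl) {
  rows <- toy_y == cl
  prior <- mean(rows)                        # P(class)
  # P(feature value | class), smoothed over the two possible values
  lik <- sapply(names(toy_new), function(f) {
    (sum(toy_x[rows, f] == toy_new[[f]]) + 1) / (sum(rows) + 2)
  })
  prior * prod(lik)                          # features treated as independent
})
toy_posterior <- toy_score / sum(toy_score)  # normalise over classes
toy_posterior
names(which.max(toy_posterior))              # predicted class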

# Make predictions on the test set

system.time( predictions <- predict(ngclassifier, newdata=ng.test) )
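# Print the predicted topic for every message in the test set
predictions
# Compare the predictions with the true topics using caret's confusion matrix
cm <- confusionMatrix(predictions, ng.df.test$topic)
cm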
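# Illustration: overall accuracy, one of the statistics a confusion matrix
# summarises, is the share of cases on the diagonal of the table. The labels
# below are made up and the 'toy_' names are hypothetical.
toy_truth <- factor(c("rec", "rec", "sci", "sci", "sci"))
toy_pred  <- factor(c("rec", "sci", "sci", "sci", "rec"),
                    levels = levels(toy_truth))
toy_tab <- table(Prediction = toy_pred, Reference = toy_truth)
toy_accuracy <- sum(diag(toy_tab)) / sum(toy_tab)
toy_accuracy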
