DATA CLEANING:
Data cleaning is the process of transforming raw data into consistent data that can be analysed.
Data Cleaning using the airquality Dataset:
Install and load the required packages:
library(tidyverse)
library(grid)
library(gridExtra)
library(forcats)
library(modelr)
library(caret)
library(kknn)
1. Load the Data:
airquality            # built-in R dataset of daily New York air-quality measurements
summary(airquality)
Outcome:
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
 NA's   :37       NA's   :7
2. Impute the NA's (replace missing values with the column median):
air = airquality
summary(air)
# replace the 37 missing Ozone values with the median of the observed values
air$Ozone = ifelse(is.na(air$Ozone), median(air$Ozone, na.rm = TRUE), air$Ozone)
summary(air)
# replace the 7 missing Solar.R values with the median of the observed values
air$Solar.R = ifelse(is.na(air$Solar.R), median(air$Solar.R, na.rm = TRUE), air$Solar.R)
summary(air)
Outcome:
     Ozone           Solar.R           Wind             Temp           Month
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00   Min.   :5.000
 1st Qu.: 21.00   1st Qu.:120.0   1st Qu.: 7.400   1st Qu.:72.00   1st Qu.:6.000
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00   Median :7.000
 Mean   : 39.56   Mean   :186.8   Mean   : 9.958   Mean   :77.88   Mean   :6.993
 3rd Qu.: 46.00   3rd Qu.:256.0   3rd Qu.:11.500   3rd Qu.:85.00   3rd Qu.:8.000
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00   Max.   :9.000
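Since tidyverse is already loaded above, the same median imputation can also be written with dplyr. A minimal sketch, equivalent to the ifelse() calls above (not part of the original code):
library(dplyr)
# Median-impute Ozone and Solar.R in one step; across() applies the same rule to both columns
air_clean <- airquality %>%
  mutate(across(c(Ozone, Solar.R), ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))
summary(air_clean)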
Reference: https://data.world/cdc/air-quality-measures
WORD CLOUD
DATA: Text of Finance Minister Nirmala Sitharaman's Union Budget 2019 speech
CODE:
library("wordcloud")
library("tm")
library(RColorBrewer)
abcd=readLines("C:/Users/hp/Desktop/rs.txt")
abcd
corpus = Corpus(VectorSource(abcd))
inspect(corpus)
data = tm_map(corpus, content_transformer(tolower))   # lower-case all text
data = tm_map(data, removeNumbers)
data = tm_map(data, removePunctuation)
data = tm_map(data, removeWords, stopwords("english"))
data = tm_map(data, stripWhitespace)                  # continue from data, not corpus
inspect(data)
dtm <- TermDocumentMatrix(data)
dtm
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)              # term frequencies, most frequent first
d <- data.frame(word = names(v), freq = v)
head(d)
wordcloud(d$word, freq = d$freq, random.order = FALSE, colors = brewer.pal(8, "Dark2"))
OUTCOME: word cloud of the most frequent terms in the budget speech (figure not reproduced here).
TOPIC MODELLING:
Topic modelling is a quantitative approach to discovering the abstract topics in a collection of text documents, based on the statistics of the words they contain. Simply put, it is the process of examining a large collection of documents, identifying clusters of words that tend to occur together, grouping them by similarity, and using these clusters to identify the themes that run through the corpus.
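Before the Twitter example below, a minimal sketch of the idea using the topicmodels package on a tiny hand-made corpus (the three documents and k = 2 are illustrative assumptions, not part of the original data):
library(tm)
library(topicmodels)
# Tiny illustrative corpus: two loose themes (finance vs. weather)
docs <- c("budget tax revenue growth economy",
          "rain wind temperature humidity forecast",
          "tax economy budget deficit revenue")
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
# Fit an LDA model with 2 topics and inspect it
lda <- LDA(dtm, k = 2, control = list(seed = 1234))
terms(lda, 3)   # most probable words in each topic
topics(lda)     # most likely topic for each document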
Twitter Data Analysis
SOURCE CODE:
1. Install and load packages:
library("twitteR")
install.packages("tm")
library("tm")
install.packages("wordcloud")
library("wordcloud")
install.packages("RColorBrewer")
library("RColorBrewer")
install.packages("slam")
library("slam")
install.packages("topicmodels")
library("topicmodels")
2. Load the data and clean the data in the R environment
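The object tweee used below is assumed to already hold the raw tweet text as a character vector. A minimal sketch of one way to populate it with twitteR (the search term and count are illustrative assumptions, and a Twitter API token must already be registered with setup_twitter_oauth()):
raw_tweets <- searchTwitter("Trump", n = 500, lang = "en")   # hypothetical query
tweee <- sapply(raw_tweets, function(x) x$getText())         # keep only the tweet text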
tweee
# remove retweet/via markers, links, @mentions, extra whitespace and punctuation
tweee = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", tweee)
tweee = gsub("http[^[:blank:]]+", "", tweee)
tweee = gsub("@\\w+", "", tweee)
tweee = gsub("[ \t]{2,}", " ", tweee)
tweee = gsub("^\\s+|\\s+$", "", tweee)
tweee = gsub("[[:punct:]]", "", tweee)
corpus = Corpus(VectorSource(tweee))
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, content_transformer(tolower))
corpus = tm_map(corpus, removeWords, stopwords("english"))
Create a Document-Term Matrix and Calculate TF-IDF
tdm = DocumentTermMatrix(corpus)   # one row per tweet, one column per term
term_tfidf <- tapply(tdm$v / row_sums(tdm)[tdm$i], tdm$j, mean) * log2(nDocs(tdm) / col_sums(tdm > 0))
summary(term_tfidf)
tdm <- tdm[, term_tfidf >= 0.1]    # keep only reasonably informative terms
tdm <- tdm[row_sums(tdm) > 0, ]    # drop tweets left with no terms
summary(col_sums(tdm))
best.model <- lapply(seq(2, 50, by = 1), function(d){LDA(tdm, d)})         # fit LDA for k = 2..50
best.model.logLik <- as.data.frame(as.matrix(lapply(best.model, logLik)))  # log-likelihood of each fit
doc.lengths <- rowSums(as.matrix(DocumentTermMatrix(corpus)))
dtm <- DocumentTermMatrix(corpus[doc.lengths > 0])                         # rebuild DTM from non-empty tweets
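The code above stores the log-likelihood of each fit in best.model.logLik but does not show how the number of topics is then chosen. A minimal sketch of one common way to inspect it (the plotting code is an assumption, not from the original):
# Pair each candidate number of topics with its log-likelihood and look for the maximum
logLik.df <- data.frame(topics = 2:50, LL = as.numeric(unlist(best.model.logLik)))
plot(logLik.df$topics, logLik.df$LL, type = "b",
     xlab = "number of topics (k)", ylab = "log-likelihood")
logLik.df$topics[which.max(logLik.df$LL)]   # candidate value for k (the outputs below use k = 10)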
3. Calculate the optimal number of topics (k) in the corpus and apply the LDA method using the topicmodels package
k <- 10        # number of topics; 10 matches the model output shown below
SEED <- 2019   # any fixed seed for reproducibility (value assumed, not given in the original)
models <- list(
  CTM       = CTM(dtm, k = k, control = list(seed = SEED, var = list(tol = 10^-4), em = list(tol = 10^-3))),
  VEM       = LDA(dtm, k = k, control = list(seed = SEED)),
  VEM_Fixed = LDA(dtm, k = k, control = list(estimate.alpha = FALSE, seed = SEED)),
  Gibbs     = LDA(dtm, k = k, method = "Gibbs", control = list(seed = SEED, burnin = 1000, thin = 100, iter = 1000)))
lapply(models, terms, 10)              # top 10 terms per topic for each model
assignments <- sapply(models, topics)  # most likely topic for each tweet
head(assignments, n = 10)
OUTCOME:
$CTM
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "true" "46562444" "false" "1990" "46562444" "false" "82278" "82278"
[2,] "ctrue" "46562447" "cfalse" "6973" "46562534" "cfalse" "82286" "82277"
[3,] "82281" "46562434" "c45" "586" "46562449" "586" "82285" "82286"
[4,] "82278" "46562443" "6973" "c24" "46562489" "c45" "82290" "82291"
[5,] "82277" "46562441" "1990" "3270" "46562443" "1203" "82275" "82280"
[6,] "82291" "46562449" "586" "7945" "46562432" "46562463" "82288" "82290"
[7,] "82288" "46562445" "1203" "1203" "46562441" "46562444" "82291" "82281"
[8,] "82290" "46562463" "3270" "c45" "46562483" "46562500" "82281" "82285"
[9,] "82275" "46562573" "c24" "36760" "46562560" "46562545" "82280" "82269"
[10,] "82269" "46562473" "7945" "46562556" "46562545" "1990" "82287" "82287"
Topic 9 Topic 10
[1,] "46562444" "36760"
[2,] "46562449" "11022"
[3,] "46562550" "5244"
[4,] "46562443" "20966"
[5,] "46562434" "c36760"
[6,] "46562547" "10403"
[7,] "46562473" "46562441"
[8,] "46562560" "46562534"
[9,] "46562463" "46562550"
[10,] "46562447" "46562443"
$VEM
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
[1,] "586" "82278" "586" "46562444" "false" "1990" "46562502"
[2,] "c45" "82286" "1203" "46562443" "36760" "6973" "46562444"
[3,] "1203" "82291" "36760" "46562449" "11022" "7945" "46562534"
[4,] "1990" "82281" "1990" "46562534" "5244" "c24" "46562443"
[5,] "82278" "82277" "82278" "46562485" "cfalse" "3270" "46562531"
[6,] "82286" "82290" "46562502" "46562560" "10403" "586" "46562447"
[7,] "82277" "82287" "82291" "46562473" "20966" "c45" "46562464"
[8,] "46562444" "82275" "c45" "46562459" "c36760" "1203" "46562449"
[9,] "3270" "82288" "46562444" "46562477" "586" "46562531" "46562550"
[10,] "false" "82280" "82286" "46562433" "46562502" "46562502" "46562560"
Topic 8 Topic 9 Topic 10
[1,] "true" "46562434" "82278"
[2,] "ctrue" "46562447" "82285"
[3,] "586" "46562444" "82286"
[4,] "c45" "46562550" "82290"
[5,] "1203" "46562449" "82275"
[6,] "36760" "46562441" "82288"
[7,] "1990" "46562463" "82280"
[8,] "false" "46562483" "82281"
[9,] "3270" "46562432" "82277"
[10,] "c24" "46562445" "82292"
$VEM_Fixed
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7 Topic 8
[1,] "82278" "82278" "46562502" "46562444" "false" "1990" "46562502" "true"
[2,] "82286" "82286" "46562444" "46562443" "36760" "6973" "46562444" "ctrue"
[3,] "1990" "82291" "46562450" "46562449" "11022" "586" "46562534" "586"
[4,] "586" "82281" "46562447" "46562534" "5244" "7945" "46562443" "c45"
[5,] "c45" "82277" "46562434" "46562485" "cfalse" "c24" "46562531" "1990"
[6,] "1203" "82290" "46562449" "46562560" "10403" "3270" "46562447" "1203"
[7,] "82277" "82287" "46562464" "46562473" "20966" "c45" "46562464" "36760"
[8,] "82288" "82275" "46562499" "46562459" "c36760" "1203" "46562449" "3270"
[9,] "82285" "82288" "46562534" "46562477" "586" "46562531" "46562560" "c24"
[10,] "82287" "82280" "46562556" "46562433" "1990" "46562502" "46562550" "6973"
Topic 9 Topic 10
[1,] "46562434" "82278"
[2,] "46562447" "82285"
[3,] "46562444" "82286"
[4,] "46562550" "82290"
[5,] "46562449" "82275"
[6,] "46562441" "82288"
[7,] "46562463" "82280"
[8,] "46562483" "82281"
[9,] "46562432" "82277"
[10,] "46562445" "82292"
$Gibbs
Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6 Topic 7
[1,] "36760" "82278" "false" "46562449" "true" "46562441" "1990"
[2,] "11022" "82286" "cfalse" "46562447" "cfalse" "46562452" "586"
[3,] "5244" "82290" "1742667" "46562483" "false" "46551946" "6973"
[4,] "20966" "82285" "2449959" "46562550" "1742667" "1742667" "1203"
[5,] "c36760" "82291" "465128" "46562534" "2449959" "46562466" "c45"
[6,] "cfalse" "82275" "46551916" "46562547" "465128" "46562502" "2185"
[7,] "false" "82281" "46551946" "46562432" "46551916" "46562445" "36760"
[8,] "1742667" "82288" "46551959" "46562502" "46551946" "46562449" "3270"
[9,] "2449959" "82280" "46562401" "46562579" "46551959" "46562484" "7945"
[10,] "465128" "82277" "46562405" "46562472" "46562401" "46562485" "c24"
Topic 8 Topic 9 Topic 10
[1,] "46562447" "46562445" "46562444"
[2,] "46562472" "46562556" "46562443"
[3,] "46562531" "46562563" "46562434"
[4,] "46562539" "46562477" "46562441"
[5,] "46562477" "46562450" "46562473"
[6,] "46562528" "46562545" "46562560"
[7,] "46562450" "46562454" "46562463"
[8,] "46562483" "46562460" "46562489"
[9,] "46562521" "46562501" "46562485"
[10,] "46562548" "46562535" "46562555
CTM VEM VEM_Fixed Gibbs
1 6 5 5 3
2 9 9 9 10
3 4 1 1 7
4 7 10 10 2
5 10 5 5 1
6 4 6 6 7
7 1 8 8 5
REFERENCE: https://github.com/mkearney/trumptweets
SENTIMENT ANALYSIS
Sentiment Analysis is a method for extracting views of different polarities from text; by polarities we mean positive, negative or neutral. It is also known as opinion mining or polarity detection. With the aid of sentiment analysis you can find out the kind of opinion expressed in reports, blogs, social media feeds, and so on. Sentiment Analysis is a classification method in which the data are categorised into various classes. These classes may be binary (positive or negative) or multiple (happy, sad, angry, etc.).
SOURCE CODE:
Required packages:
library(twitteR)
library(sentiment)      # provides classify_emotion() and classify_polarity(); no longer on CRAN
library(plyr)
library(ggplot2)
library(wordcloud)
library(RColorBrewer)
some_tweets = searchTwitter("Trump", n=1500, lang="en")
some_txt = sapply(some_tweets, function(x) x$getText())
Clean data
> some_txt = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", "", some_txt)
> some_txt = gsub("@\\w+", "", some_txt)
> # remove punctuation
> some_txt = gsub("[[:punct:]]", "", some_txt)
> # remove numbers
> some_txt = gsub("[[:digit:]]", "", some_txt)
> # remove html links
> some_txt = gsub("http\\w+", "", some_txt)
> # remove unnecessary spaces
> some_txt = gsub("[ \t]{2,}", "", some_txt)
> some_txt = gsub("^\\s+|\\s+$", "", some_txt)
Classify emotion
> class_emo = classify_emotion(some_txt, algorithm="bayes", prior=1.0)
> emotion = class_emo[,7]
> emotion[is.na(emotion)] = "unknown"
> # classify polarity
> class_pol = classify_polarity(some_txt, algorithm="bayes")
> # get polarity best fit
> polarity = class_pol[,4]
> # data frame with results
> sent_df = data.frame(text=some_txt, emotion=emotion,
+ polarity=polarity, stringsAsFactors=FALSE)
> sent_df = within(sent_df,
+ emotion <- factor(emotion, levels=names(sort(table(emotion), decreasing=TRUE))))
> head(sent_df)
Plot distribution of emotions
> ggplot(sent_df, aes(x=emotion)) +
+ geom_bar(aes(y=..count.., fill=emotion)) +
+ scale_fill_brewer(palette="Dark2") +
+ labs(x="emotion categories", y="number of tweets") +
+ ggtitle("Sentiment Analysis of Tweets about Trump\n(classification by emotion)")
Plot distribution of polarity
> ggplot(sent_df, aes(x=polarity)) +
+ geom_bar(aes(y=..count.., fill=polarity)) +
+ scale_fill_brewer(palette="RdGy") +
+ labs(x="polarity categories", y="number of tweets") +
+ ggtitle("Sentiment Analysis of Tweets about Starbucks\n(classification by polarity)")
# separating text by emotion
> emos = levels(factor(sent_df$emotion))
> nemo = length(emos)
> emo.docs = rep("", nemo)
> for (i in 1:nemo)
+{
+ tmp = some_txt[emotion == emos[i]]
+ emo.docs[i] = paste(tmp, collapse=" ")
+}
>
> # remove stopwords
> emo.docs = removeWords(emo.docs, stopwords("english"))
> # create corpus
> corpus = Corpus(VectorSource(emo.docs))
> tdm = TermDocumentMatrix(corpus)
> tdm = as.matrix(tdm)
> colnames(tdm) = emos
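The term-document matrix built above has one column per emotion, which is typically visualised with a comparison cloud. A minimal sketch of that final step (not shown in the original code):
> # comparison word cloud: a panel of characteristic words per emotion category
> comparison.cloud(tdm, colors = brewer.pal(max(3, nemo), "Dark2"),
+                  scale = c(3, 0.5), random.order = FALSE, title.size = 1.5)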
OUTCOME:
The bar graphs above depict the Twitter users' sentiment towards Trump: a negative polarity score, denoted by the (-) symbol, indicates unhappiness with Trump's statements, a positive score indicates that users are quite happy, and zero represents a neutral stance.