0% found this document useful (0 votes)

40 views20 pages

Chapter 26 Text Mining - Introduction To Data Science

Uploaded by

izzieapptest

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

40 views20 pages

Chapter 26 Text Mining - Introduction To Data Science

Uploaded by

izzieapptest

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

Chapter 26 Text mining

With the exception of labels used to represent categorical data, we have focused on numerical
data. But in many applications, data starts as text. Well-known examples are spam filtering,
cyber-crime prevention, counter-terrorism and sentiment analysis. In all these cases, the raw data
is composed of free form text. Our task is to extract insights from these data. In this section, we
learn how to generate useful numerical summaries from text data to which we can apply some of
the powerful data visualization and analysis techniques we have learned.

26.1 Case study: Trump tweets

During the 2016 US presidential election, then candidate Donald J. Trump used his twitter
account as a way to communicate with potential voters. On August 6, 2016, Todd Vaziri tweeted90
about Trump that “Every non-hyperbolic tweet is from iPhone (his staff). Every hyperbolic tweet is
from Android (from him).” Data scientist David Robinson conducted an analysis91 to determine if
data supported this assertion. Here, we go through David’s analysis to learn some of the basics
of text mining. To learn more about text mining in R, we recommend the Text Mining with R
book92 by Julia Silge and David Robinson.

We will use the following libraries:

library(tidyverse)
library(lubridate)
library(scales)

In general, we can extract data directly from Twitter using the rtweet package. However, in this
case, a group has already compiled data for us and made it available at
http://www.trumptwitterarchive.com. We can get the data from their JSON API using a script like
this:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 1/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

url <- 'http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json'

trump_tweets <- map(2009:2017, ~sprintf(url, .x)) |>
map_df(jsonlite::fromJSON, simplifyDataFrame = TRUE) |>
filter(!is_retweet & !str_detect(text, '^"')) |>
mutate(created_at = parse_date_time(created_at,

orders = "a b! d! H!:M!:S! z!* Y!",

tz="EST"))

For convenience, we include the result of the code above in the dslabs package:

library(dslabs)
data("trump_tweets")

You can see the data frame with information about the tweets by typing

head(trump_tweets)

with the following variables included:

names(trump_tweets)
#> [1] "source" "id_str"
#> [3] "text" "created_at"
#> [5] "retweet_count" "in_reply_to_user_id_str"
#> [7] "favorite_count" "is_retweet"

The help file ?trump_tweets provides details on what each variable represents. The tweets are
represented by the text variable:

trump_tweets$text[16413] |> str_wrap(width = options()$width) |> cat()

#> Great to be back in Iowa! #TBT with @JerryJrFalwell joining me in
#> Davenport- this past winter. #MAGA https://t.co/A5IF0QHnic

and the source variable tells us which device was used to compose and upload each tweet:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 2/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

trump_tweets |> count(source) |> arrange(desc(n)) |> head(5)

#> source n
#> 1 Twitter Web Client 10718
#> 2 Twitter for Android 4652
#> 3 Twitter for iPhone 3962

#> 4 TweetDeck 468

#> 5 TwitLonger Beta 288

We are interested in what happened during the campaign, so for this analysis we will focus on
what was tweeted between the day Trump announced his campaign and election day. We define
the following table containing just the tweets from that time period. Note that we use extract to
remove the Twitter for part of the source and filter out retweets.

campaign_tweets <- trump_tweets |>

extract(source, "source", "Twitter for (.*)") |>
filter(source %in% c("Android", "iPhone") &
created_at >= ymd("2015-06-17") &

created_at < ymd("2016-11-08")) |>

filter(!is_retweet) |>
arrange(created_at) |>
as_tibble()

We can now use data visualization to explore the possibility that two different groups were
tweeting from these devices. For each tweet, we will extract the hour, East Coast time (EST), it
was tweeted and then compute the proportion of tweets tweeted at each hour for each device:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 3/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

ungroup() |>
ggplot(aes(hour, percent, color = source)) +
geom_line() +
geom_point() +
scale_y_continuous(labels = percent_format()) +
labs(x = "Hour of day (EST)", y = "% of tweets", color = "")

We notice a big peak for the Android in the early hours of the morning, between 6 and 8 AM.
There seems to be a clear difference in these patterns. We will therefore assume that two
different entities are using these two devices.

We will now study how the tweets differ when we compare Android to iPhone. To do this, we
introduce the tidytext package.

26.2 Text as data

The tidytext package helps us convert free form text into a tidy table. Having the data in this
format greatly facilitates data visualization and the use of statistical techniques.
https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 4/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

library(tidytext)

The main function needed to achieve this is unnest_tokens . A token refers to a unit that we are
considering to be a data point. The most common token will be words, but they can also be single
characters, ngrams, sentences, lines, or a pattern defined by a regex. The functions will take a
vector of strings and extract the tokens so that each one gets a row in the new table. Here is a
simple example:

poem <- c("Roses are red,", "Violets are blue,",

"Sugar is sweet,", "And so are you.")
example <- tibble(line = c(1, 2, 3, 4),
text = poem)

example
#> # A tibble: 4 × 2
#> line text
#> <dbl> <chr>
#> 1 1 Roses are red,

#> 2 2 Violets are blue,

#> 3 3 Sugar is sweet,
#> 4 4 And so are you.
example |> unnest_tokens(word, text)
#> # A tibble: 13 × 2

#> line word

#> <dbl> <chr>
#> 1 1 roses
#> 2 1 are
#> 3 1 red
#> 4 2 violets

#> 5 2 are
#> # ℹ 8 more rows

Now let’s look at an example from the tweets. We will look at tweet number 3008 because it will
later permit us to illustrate a couple of points:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 5/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

i <- 3008
campaign_tweets$text[i] |> str_wrap(width = 65) |> cat()
#> Great to be back in Iowa! #TBT with @JerryJrFalwell joining me in
#> Davenport- this past winter. #MAGA https://t.co/A5IF0QHnic
campaign_tweets[i,] |>

unnest_tokens(word, text) |>

pull(word)
#> [1] "great" "to" "be" "back"
#> [5] "in" "iowa" "tbt" "with"
#> [9] "jerryjrfalwell" "joining" "me" "in"
#> [13] "davenport" "this" "past" "winter"

#> [17] "maga" "https" "t.co" "a5if0qhnic"

Note that the function tries to convert tokens into words. A minor adjustment is to remove the links
to pictures:

links <- "https://t.co/[A-Za-z\\d]+|&"

campaign_tweets[i,] |>
mutate(text = str_replace_all(text, links, "")) |>
unnest_tokens(word, text) |>
pull(word)
#> [1] "great" "to" "be" "back"

#> [5] "in" "iowa" "tbt" "with"

#> [9] "jerryjrfalwell" "joining" "me" "in"
#> [13] "davenport" "this" "past" "winter"
#> [17] "maga"

Now we are now ready to extract the words for all our tweets.

tweet_words <- campaign_tweets |>

mutate(text = str_replace_all(text, links, "")) |>
unnest_tokens(word, text)

And we can now answer questions such as “what are the most commonly used words?”:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 6/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

tweet_words |>
count(word) |>
arrange(desc(n))
#> # A tibble: 6,264 × 2
#> word n

#> <chr> <int>

#> 1 the 2330
#> 2 to 1413
#> 3 and 1245
#> 4 in 1190
#> 5 i 1151

#> # ℹ 6,259 more rows

It is not surprising that these are the top words. The top words are not informative. The tidytext
package has a database of these commonly used words, referred to as stop words, in text
mining:

stop_words
#> # A tibble: 1,149 × 2
#> word lexicon
#> <chr> <chr>
#> 1 a SMART

#> 2 a's SMART

#> 3 able SMART
#> 4 about SMART
#> 5 above SMART
#> # ℹ 1,144 more rows

If we filter out rows representing stop words with filter(!word %in% stop_words$word) :

tweet_words <- campaign_tweets |>

mutate(text = str_replace_all(text, links, "")) |>
unnest_tokens(word, text) |>
filter(!word %in% stop_words$word )

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 7/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

we end up with a much more informative set of top 10 tweeted words:

tweet_words |>

count(word) |>
top_n(10, n) |>
mutate(word = reorder(word, n)) |>
arrange(desc(n))
#> # A tibble: 10 × 2

#> word n
#> <fct> <int>
#> 1 trump2016 415
#> 2 hillary 407
#> 3 people 304
#> 4 makeamericagreatagain 298

#> 5 america 255

#> # ℹ 5 more rows

Some exploration of the resulting words (not shown here) reveals a couple of unwanted
characteristics in our tokens. First, some of our tokens are just numbers (years, for example). We
want to remove these and we can find them using the regex ^\d+$ . Second, some of our
tokens come from a quote and they start with ' . We want to remove the ' when it is at the
start of a word so we will just str_replace . We add these two lines to the code above to
generate our final table:

tweet_words <- campaign_tweets |>

mutate(text = str_replace_all(text, links, "")) |>

unnest_tokens(word, text) |>

filter(!word %in% stop_words$word &
!str_detect(word, "^\\d+$")) |>
mutate(word = str_replace(word, "^'", ""))

Now that we have all our words in a table, along with information about what device was used to
compose the tweet they came from, we can start exploring which words are more common when
comparing Android to iPhone.

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 8/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

For each word, we want to know if it is more likely to come from an Android tweet or an iPhone
tweet. Odds ratio are a summary statistic useful for quantifying these differences. .For each
device and a given word, let’s call it y , we compute the odds or the ratio between the proportion
of words that are y and not y and compute the ratio of those odds. Here we will have many
proportions that are 0, so we use the 0.5 correction. You can learn more about odds ratio in an
statistics or epidemiology textbook.

android_iphone_or <- tweet_words |>

count(word, source) |>

pivot_wider(names_from = "source", values_from = "n", values_fill = 0) |>
mutate(or = (Android + 0.5) / (sum(Android) - Android + 0.5) /
( (iPhone + 0.5) / (sum(iPhone) - iPhone + 0.5)))

Here are the highest odds ratios for Android

android_iphone_or |> arrange(desc(or))

#> # A tibble: 5,607 × 4
#> word Android iPhone or
#> <chr> <int> <int> <dbl>
#> 1 a.m 65 0 113.

#> 2 p.m 39 0 68.0

#> 3 mails 22 0 38.7
#> 4 poor 13 0 23.2
#> 5 t.v 13 0 23.2
#> # ℹ 5,602 more rows

and the top for iPhone:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 9/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

android_iphone_or |> arrange(or)

#> # A tibble: 5,607 × 4
#> word Android iPhone or
#> <chr> <int> <int> <dbl>
#> 1 makeamericagreatagain 0 298 0.00141

#> 2 americafirst 0 71 0.00598

#> 3 draintheswamp 0 63 0.00673
#> 4 trump2016 3 412 0.00707
#> 5 votetrump 0 56 0.00757
#> # ℹ 5,602 more rows

Given that several of these words are overall low frequency words, we can impose a filter based
on the total frequency like this:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 10/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

android_iphone_or |> filter(Android+iPhone > 100) |>

arrange(desc(or))
#> # A tibble: 30 × 4
#> word Android iPhone or
#> <chr> <int> <int> <dbl>

#> 1 bad 104 26 3.40

#> 2 crooked 156 49 2.73
#> 3 cnn 116 37 2.68
#> 4 ted 86 28 2.62
#> 5 interviewed 76 25 2.59
#> # ℹ 25 more rows

android_iphone_or |> filter(Android+iPhone > 100) |>

arrange(or)
#> # A tibble: 30 × 4
#> word Android iPhone or

#> <chr> <int> <int> <dbl>

#> 1 makeamericagreatagain 0 298 0.00141
#> 2 trump2016 3 412 0.00707
#> 3 join 1 157 0.00809
#> 4 tomorrow 24 101 0.206

#> 5 vote 46 67 0.591

#> # ℹ 25 more rows

We already see somewhat of a pattern in the types of words that are being tweeted more from
one device versus the other. However, we are not interested in specific words but rather in the
tone. Vaziri’s assertion is that the Android tweets are more hyperbolic. So how can we check this
with data? Hyperbolic is a hard sentiment to extract from words as it relies on interpreting
phrases. However, words can be associated to more basic sentiment such as anger, fear, joy, and
surprise. In the next section, we demonstrate basic sentiment analysis.

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 11/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

26.3 Sentiment analysis

In sentiment analysis, we assign a word to one or more “sentiments”. Although this approach will
miss context-dependent sentiments, such as sarcasm, when performed on large numbers of
words, summaries can provide insights.

The first step in sentiment analysis is to assign a sentiment to each word. As we demonstrate, the
tidytext package includes several maps or lexicons. We will also be using the textdata package.

library(tidytext)
library(textdata)

The bing lexicon divides words into positive and negative sentiments. We can see this
using the tidytext function get_sentiments :

get_sentiments("bing")

The AFINN lexicon assigns a score between -5 and 5, with -5 the most negative and 5 the most
positive. Note that this lexicon needs to be downloaded the first time you call the function
get_sentiment :

get_sentiments("afinn")

The loughran and nrc lexicons provide several different sentiments. Note that these also
have to be downloaded the first time you use them.

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 12/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

get_sentiments("loughran") |> count(sentiment)

#> # A tibble: 6 × 2
#> sentiment n
#> <chr> <int>
#> 1 constraining 184

#> 2 litigious 904

#> 3 negative 2355
#> 4 positive 354
#> 5 superfluous 56
#> # ℹ 1 more row

get_sentiments("nrc") |> count(sentiment)

#> # A tibble: 10 × 2
#> sentiment n

#> <chr> <int>

#> 1 anger 1245
#> 2 anticipation 837
#> 3 disgust 1056
#> 4 fear 1474

#> 5 joy 687

#> # ℹ 5 more rows

For our analysis, we are interested in exploring the different sentiments of each tweet so we will
use the nrc lexicon:

nrc <- get_sentiments("nrc") |>

select(word, sentiment)

We can combine the words and sentiments using inner_join , which will only keep words
associated with a sentiment. Here are 10 random words extracted from the tweets:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 13/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

tweet_words |> inner_join(nrc, by = "word", relationship = "many-to-many") |>

select(source, word, sentiment) |>
sample_n(5)
#> # A tibble: 5 × 3
#> source word sentiment

#> <chr> <chr> <chr>

#> 1 Android enjoy joy
#> 2 iPhone terrific sadness
#> 3 iPhone tactics trust
#> 4 Android clue anticipation
#> 5 iPhone change fear

Now we are ready to perform a quantitative analysis comparing Android and iPhone by comparing
the sentiments of the tweets posted from each device. Here we could perform a tweet-by-tweet
analysis, assigning a sentiment to each tweet. However, this will be challenging since each tweet
will have several sentiments attached to it, one for each word appearing in the lexicon. For
illustrative purposes, we will perform a much simpler analysis: we will count and compare the
frequencies of each sentiment appearing in each device.

sentiment_counts <- tweet_words |>

left_join(nrc, by = "word", relationship = "many-to-many") |>
count(source, sentiment) |>

pivot_wider(names_from = "source", values_from = "n") |>

mutate(sentiment = replace_na(sentiment, replace = "none"))
sentiment_counts
#> # A tibble: 11 × 3
#> sentiment Android iPhone
#> <chr> <int> <int>

#> 1 anger 962 527

#> 2 anticipation 917 707
#> 3 disgust 639 314
#> 4 fear 799 486
#> 5 joy 695 536

#> # ℹ 6 more rows

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 14/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

For each sentiment, we can compute the odds of being in the device: proportion of words with
sentiment versus proportion of words without, and then compute the odds ratio comparing the two
devices.

sentiment_counts |>
mutate(Android = Android / (sum(Android) - Android) ,
iPhone = iPhone / (sum(iPhone) - iPhone),
or = Android/iPhone) |>

arrange(desc(or))
#> # A tibble: 11 × 4
#> sentiment Android iPhone or
#> <chr> <dbl> <dbl> <dbl>
#> 1 disgust 0.0299 0.0181 1.65
#> 2 anger 0.0457 0.0307 1.49

#> 3 negative 0.0814 0.0556 1.46

#> 4 sadness 0.0427 0.0300 1.42
#> 5 fear 0.0377 0.0283 1.33
#> # ℹ 6 more rows

So we do see some differences and the order is interesting: the largest three sentiments are
disgust, anger, and negative! But are these differences just due to chance? How does this
compare if we are just assigning sentiments at random? To answer this question we can
compute, for each sentiment, an odds ratio and a confidence interval. We will add the two values
we need to form a two-by-two table and the odds ratio:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 15/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

library(broom)
log_or <- sentiment_counts |>
mutate(log_or = log((Android / (sum(Android) - Android)) /
(iPhone / (sum(iPhone) - iPhone))),
se = sqrt(1/Android + 1/(sum(Android) - Android) +

1/iPhone + 1/(sum(iPhone) - iPhone)),

conf.low = log_or - qnorm(0.975)*se,
conf.high = log_or + qnorm(0.975)*se) |>
arrange(desc(log_or))

log_or

#> # A tibble: 11 × 7
#> sentiment Android iPhone log_or se conf.low conf.high
#> <chr> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 disgust 639 314 0.502 0.0697 0.366 0.639
#> 2 anger 962 527 0.397 0.0552 0.288 0.505

#> 3 negative 1657 931 0.381 0.0423 0.298 0.464

#> 4 sadness 901 514 0.354 0.0562 0.244 0.464
#> 5 fear 799 486 0.287 0.0584 0.172 0.401
#> # ℹ 6 more rows

A graphical visualization shows some sentiments that are clearly overrepresented:

log_or |>
mutate(sentiment = reorder(sentiment, log_or)) |>
ggplot(aes(x = sentiment, ymin = conf.low, ymax = conf.high)) +
geom_errorbar() +
geom_point(aes(sentiment, log_or)) +

ylab("Log odds ratio for association between Android and sentiment") +

coord_flip()

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 16/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

We see that the disgust, anger, negative, sadness, and fear sentiments are associated with the
Android in a way that is hard to explain by chance alone. Words not associated to a sentiment
were strongly associated with the iPhone source, which is in agreement with the original claim
about hyperbolic tweets.

If we are interested in exploring which specific words are driving these differences, we can refer
back to our android_iphone_or object:

android_iphone_or |> inner_join(nrc) |>

filter(sentiment == "disgust" & Android + iPhone > 10) |>
arrange(desc(or))
#> Joining with `by = join_by(word)`
#> # A tibble: 20 × 5

#> word Android iPhone or sentiment

#> <chr> <int> <int> <dbl> <chr>
#> 1 mess 15 2 5.33 disgust
#> 2 finally 12 2 4.30 disgust
#> 3 unfair 12 2 4.30 disgust

#> 4 bad 104 26 3.40 disgust

#> 5 lie 13 3 3.32 disgust
#> # ℹ 15 more rows

and we can make a graph:

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 17/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

android_iphone_or |> inner_join(nrc, by = "word") |>

mutate(sentiment = factor(sentiment, levels = log_or$sentiment)) |>
mutate(log_or = log(or)) |>
filter(Android + iPhone > 10 & abs(log_or)>1) |>
mutate(word = reorder(word, log_or)) |>

ggplot(aes(word, log_or, fill = log_or < 0)) +

facet_wrap(~sentiment, scales = "free_x", nrow = 2) +
geom_bar(stat="identity", show.legend = FALSE) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))

This is just a simple example of the many analyses one can perform with tidytext. To learn more,
we again recommend the Tidy Text Mining book93.

26.4 Exercises

Project Gutenberg is a digital archive of public domain books. The R package gutenbergr
facilitates the importation of these texts into R.

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 18/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

You can install and load by typing:

install.packages("gutenbergr")

library(gutenbergr)

You can see the books that are available like this:

gutenberg_metadata

1. Use str_detect to find the ID of the novel Pride and Prejudice.

2. We notice that there are several versions. The gutenberg_works() function filters this table
to remove replicates and include only English language works. Read the help file and use this
function to find the ID for Pride and Prejudice.

3. Use the gutenberg_download function to download the text for Pride and Prejudice. Save it to
an object called book .

4. Use the tidytext package to create a tidy table with all the words in the text. Save the table in
an object called words

5. We will later make a plot of sentiment versus location in the book. For this, it will be useful to
add a column with the word number to the table.

6. Remove the stop words and numbers from the words object. Hint: use the anti_join .

7. Now use the AFINN lexicon to assign a sentiment value to each word.

8. Make a plot of sentiment score versus location in the book and add a smoother.

9. Assume there are 300 words per page. Convert the locations to pages and then compute the
average sentiment in each page. Plot that average score by page. Add a smoother that appears
to go through data.

90. https://twitter.com/tvaziri/status/762005541388378112/photo/1↩︎

91. http://varianceexplained.org/r/trump-tweets/↩︎

92. https://www.tidytextmining.com/↩︎

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 19/20
11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

93. https://www.tidytextmining.com/↩︎

https://rafalab.dfci.harvard.edu/dsbook/text-mining.html#exercises-46 20/20

Anatomy of Desire by Emily Jamea Final
No ratings yet
Anatomy of Desire by Emily Jamea Final
304 pages
Columbus Problem Analysis Report 04.09.2021
No ratings yet
Columbus Problem Analysis Report 04.09.2021
29 pages
Data Science Capstone - Week 2 Milestone - Exploratory Data Analysis On Text Files
No ratings yet
Data Science Capstone - Week 2 Milestone - Exploratory Data Analysis On Text Files
7 pages
Annotations Notes Inventory
No ratings yet
Annotations Notes Inventory
1 page
Zoology Solved MCQS
88% (24)
Zoology Solved MCQS
67 pages
SWM Questionnaire (Team2)
100% (5)
SWM Questionnaire (Team2)
3 pages
SOP - How To Prepare The Hot Box
No ratings yet
SOP - How To Prepare The Hot Box
3 pages
Sentiment 2
No ratings yet
Sentiment 2
7 pages
RDataMining Slides Text Mining
No ratings yet
RDataMining Slides Text Mining
35 pages
Lecture 8
No ratings yet
Lecture 8
45 pages
21 Recipes For Mining Twitter With Rtweet
No ratings yet
21 Recipes For Mining Twitter With Rtweet
63 pages
Assign 5 TT
No ratings yet
Assign 5 TT
13 pages
Rtweet Workshop PDF
No ratings yet
Rtweet Workshop PDF
60 pages
Text Analysis
No ratings yet
Text Analysis
15 pages
ChatGPT Twitter Sentiment Analyzer
No ratings yet
ChatGPT Twitter Sentiment Analyzer
50 pages
SMTA - Lab Record - Aim, Procedures and Results
No ratings yet
SMTA - Lab Record - Aim, Procedures and Results
31 pages
Methodology
No ratings yet
Methodology
9 pages
02 Lab 2 Instructions
No ratings yet
02 Lab 2 Instructions
37 pages
Basic Tweet Preprocessing in Python: 1. Hashtag Extraction Using Regex
No ratings yet
Basic Tweet Preprocessing in Python: 1. Hashtag Extraction Using Regex
2 pages
RPubs - Text-Mining With Rvest and Qdap
No ratings yet
RPubs - Text-Mining With Rvest and Qdap
17 pages
Text Mining With R
No ratings yet
Text Mining With R
15 pages
Text Mining in R: A Tutorial
No ratings yet
Text Mining in R: A Tutorial
7 pages
Earthquake Shakes Twitter User:: Analyzing Tweets For Real-Time Event Detection
No ratings yet
Earthquake Shakes Twitter User:: Analyzing Tweets For Real-Time Event Detection
50 pages
Clustering Thesis
No ratings yet
Clustering Thesis
55 pages
ML 2
No ratings yet
ML 2
25 pages
Project Report
No ratings yet
Project Report
12 pages
RDataMining Slides Text Mining
No ratings yet
RDataMining Slides Text Mining
34 pages
Data Science Project
No ratings yet
Data Science Project
34 pages
Lab5 Instructions
No ratings yet
Lab5 Instructions
51 pages
Transportation System 上台報告ppt
No ratings yet
Transportation System 上台報告ppt
47 pages
Fake News Detection
No ratings yet
Fake News Detection
2 pages
Ayushi Data Science Final File
No ratings yet
Ayushi Data Science Final File
30 pages
Twittermining: 1 Twitter Text Mining - Required Libraries
No ratings yet
Twittermining: 1 Twitter Text Mining - Required Libraries
4 pages
FIT5196-S2-2020 Assessment 1: Task 1: Parsing Text Files (U)
No ratings yet
FIT5196-S2-2020 Assessment 1: Task 1: Parsing Text Files (U)
4 pages
Analyzing Ideological Discourse On Social Media - A Case Study of The Abortion Debate
No ratings yet
Analyzing Ideological Discourse On Social Media - A Case Study of The Abortion Debate
4 pages
RDataMining Slides Twitter Analysis
100% (1)
RDataMining Slides Twitter Analysis
40 pages
Step 1: Create A CSV File: # For Text Mining
No ratings yet
Step 1: Create A CSV File: # For Text Mining
9 pages
Character-Based Neural Embeddings For Tweet Clustering
No ratings yet
Character-Based Neural Embeddings For Tweet Clustering
9 pages
Twitter Sentiment Analysis in Python
0% (1)
Twitter Sentiment Analysis in Python
9 pages
DSBA+Master+Codebook+ +Text+Mining+&+TSF
No ratings yet
DSBA+Master+Codebook+ +Text+Mining+&+TSF
11 pages
Text Analysis
No ratings yet
Text Analysis
15 pages
Do US Government and Commercial Media Concern Similar Topics? - A Text-Mining (NLP) Approach
No ratings yet
Do US Government and Commercial Media Concern Similar Topics? - A Text-Mining (NLP) Approach
6 pages
Group11 Report
No ratings yet
Group11 Report
18 pages
Wrangle Report
No ratings yet
Wrangle Report
4 pages
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
No ratings yet
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
6 pages
Data Cleaning Using Dataset
No ratings yet
Data Cleaning Using Dataset
12 pages
Twitter's Big Data Analysis Using RStudio
No ratings yet
Twitter's Big Data Analysis Using RStudio
9 pages
We Are Intechopen, The World'S Leading Publisher of Open Access Books Built by Scientists, For Scientists
No ratings yet
We Are Intechopen, The World'S Leading Publisher of Open Access Books Built by Scientists, For Scientists
43 pages
Tutorial Text Analysis
No ratings yet
Tutorial Text Analysis
48 pages
4.10. Text Data Pre-Processing - Use Case - Ipynb - Colaboratory
No ratings yet
4.10. Text Data Pre-Processing - Use Case - Ipynb - Colaboratory
2 pages
Fake News Detection
100% (1)
Fake News Detection
25 pages
Fake
No ratings yet
Fake
14 pages
DMlab 2021
No ratings yet
DMlab 2021
4 pages
Monitoring The Public Opinion About The Vaccination Topic From Tweets Analysis
100% (1)
Monitoring The Public Opinion About The Vaccination Topic From Tweets Analysis
18 pages
More Than Sentiments
No ratings yet
More Than Sentiments
6 pages
Mlds5 Code
No ratings yet
Mlds5 Code
7 pages
Challenge 2024
No ratings yet
Challenge 2024
5 pages
Sentiment Analysis On User-Generated Tweets
No ratings yet
Sentiment Analysis On User-Generated Tweets
15 pages
PSOSM Lectures
No ratings yet
PSOSM Lectures
37 pages
NLP - (1) (1) .Ipynb - Colab
No ratings yet
NLP - (1) (1) .Ipynb - Colab
10 pages
Big Data
No ratings yet
Big Data
5 pages
Math513 CW 2023 24
No ratings yet
Math513 CW 2023 24
16 pages
The Essential R Reference
From Everand
The Essential R Reference
Mark Gardener
No ratings yet
Data Science Programming In Python
From Everand
Data Science Programming In Python
Anita Raichand
No ratings yet
Profound Python Data Science
From Everand
Profound Python Data Science
Onder Teker
No ratings yet
Python For Beginners
From Everand
Python For Beginners
Célio Azevedo
No ratings yet
Test Construction
No ratings yet
Test Construction
20 pages
Notes On Composition
No ratings yet
Notes On Composition
1 page
Immediate Download Hollows Peepers and Highlanders An Appalachian Mountain Ecology 2nd Edition George Constantz Ebooks 2024
100% (2)
Immediate Download Hollows Peepers and Highlanders An Appalachian Mountain Ecology 2nd Edition George Constantz Ebooks 2024
67 pages
Synthetic Air Data Estimation
No ratings yet
Synthetic Air Data Estimation
110 pages
Tabela Preços Lukma
No ratings yet
Tabela Preços Lukma
105 pages
Kerala Second Term Christmas Exam Timetable 2024 Dec For Plus One Plus Two
No ratings yet
Kerala Second Term Christmas Exam Timetable 2024 Dec For Plus One Plus Two
2 pages
Lembar Penilaian Hotel Reception
No ratings yet
Lembar Penilaian Hotel Reception
10 pages
MEEN 20052 - Week 4 - Salamat, Andre Agassi D.
No ratings yet
MEEN 20052 - Week 4 - Salamat, Andre Agassi D.
4 pages
Testoviron - DEPOT 250 MG / 1 ML: Bayer Middle East
No ratings yet
Testoviron - DEPOT 250 MG / 1 ML: Bayer Middle East
5 pages
Integrated Townships As A Policy Response To Changing Supply and Demand Dynamics of Urban Growth - India Infrastructure Report - 2009
No ratings yet
Integrated Townships As A Policy Response To Changing Supply and Demand Dynamics of Urban Growth - India Infrastructure Report - 2009
10 pages
Past Paper Question - Particulate Theory of Matter-Separation-Diffusion
0% (1)
Past Paper Question - Particulate Theory of Matter-Separation-Diffusion
4 pages
6403 Assignment (1) ..
No ratings yet
6403 Assignment (1) ..
27 pages
0721634274.TERM 2.Gr 8.sci - Grade 8 Rationalised Spotlight Integrated Science Schemes of Work Term 2
No ratings yet
0721634274.TERM 2.Gr 8.sci - Grade 8 Rationalised Spotlight Integrated Science Schemes of Work Term 2
18 pages
Quantitative-Research/: What Are The Main Features That Differentiate Qualitative Research From Quantitative Research?
No ratings yet
Quantitative-Research/: What Are The Main Features That Differentiate Qualitative Research From Quantitative Research?
1 page
Roads and Maritime Services (RMS) Rms Specification D&C B284 Installation of Bridge Bearings
No ratings yet
Roads and Maritime Services (RMS) Rms Specification D&C B284 Installation of Bridge Bearings
17 pages
Hvco Case Study
No ratings yet
Hvco Case Study
10 pages
MATH 311 - Mathematical Modelling
No ratings yet
MATH 311 - Mathematical Modelling
109 pages
Carlson 1981
No ratings yet
Carlson 1981
9 pages
Use of Generated Artificial Road Profiles in Road Roughness Evaluation
No ratings yet
Use of Generated Artificial Road Profiles in Road Roughness Evaluation
10 pages
Ultrasound Use in The Frozen Process.20140224.154936
No ratings yet
Ultrasound Use in The Frozen Process.20140224.154936
2 pages
B31.1 Page 120 NDT
No ratings yet
B31.1 Page 120 NDT
1 page
Wa0008
No ratings yet
Wa0008
3 pages
Content Server
No ratings yet
Content Server
18 pages
Mock Test CH-1, 2 & 3
No ratings yet
Mock Test CH-1, 2 & 3
2 pages

Chapter 26 Text Mining - Introduction To Data Science

Uploaded by

Chapter 26 Text Mining - Introduction To Data Science

Uploaded by

11/15/24, 7:14 AM Chapter 26 Text mining | Introduction to Data Science

Chapter 26 Text mining

26.1 Case study: Trump tweets

We will use the following libraries:

url <- 'http://www.trumptwitterarchive.com/data/realdonaldtrump/%s.json'

orders = "a b! d! H!:M!:S! z!* Y!",

with the following variables included:

trump_tweets$text[16413] |> str_wrap(width = options()$width) |> cat()

trump_tweets |> count(source) |> arrange(desc(n)) |> head(5)

#> 4 TweetDeck 468

campaign_tweets <- trump_tweets |>

created_at < ymd("2016-11-08")) |>

26.2 Text as data

poem <- c("Roses are red,", "Violets are blue,",

#> 2 2 Violets are blue,

#> line word

unnest_tokens(word, text) |>

#> [17] "maga" "https" "t.co" "a5if0qhnic"

links <- "https://t.co/[A-Za-z\\d]+|&amp;"

#> [5] "in" "iowa" "tbt" "with"

tweet_words <- campaign_tweets |>

#> <chr> <int>

#> # ℹ 6,259 more rows

#> 2 a's SMART

tweet_words <- campaign_tweets |>

we end up with a much more informative set of top 10 tweeted words:

#> 5 america 255

tweet_words <- campaign_tweets |>

unnest_tokens(word, text) |>

android_iphone_or <- tweet_words |>

count(word, source) |>

Here are the highest odds ratios for Android

android_iphone_or |> arrange(desc(or))

#> 2 p.m 39 0 68.0

and the top for iPhone:

android_iphone_or |> arrange(or)

#> 2 americafirst 0 71 0.00598

android_iphone_or |> filter(Android+iPhone > 100) |>

#> 1 bad 104 26 3.40

android_iphone_or |> filter(Android+iPhone > 100) |>

#> <chr> <int> <int> <dbl>

#> 5 vote 46 67 0.591

26.3 Sentiment analysis

get_sentiments("loughran") |> count(sentiment)

#> 2 litigious 904

get_sentiments("nrc") |> count(sentiment)

#> <chr> <int>

#> 5 joy 687

nrc <- get_sentiments("nrc") |>

tweet_words |> inner_join(nrc, by = "word", relationship = "many-to-many") |>

#> <chr> <chr> <chr>

sentiment_counts <- tweet_words |>

pivot_wider(names_from = "source", values_from = "n") |>

#> 1 anger 962 527

#> # ℹ 6 more rows

#> 3 negative 0.0814 0.0556 1.46

1/iPhone + 1/(sum(iPhone) - iPhone)),

#> 3 negative 1657 931 0.381 0.0423 0.298 0.464

A graphical visualization shows some sentiments that are clearly overrepresented:

ylab("Log odds ratio for association between Android and sentiment") +

android_iphone_or |> inner_join(nrc) |>

#> word Android iPhone or sentiment

#> 4 bad 104 26 3.40 disgust

and we can make a graph:

android_iphone_or |> inner_join(nrc, by = "word") |>

ggplot(aes(word, log_or, fill = log_or < 0)) +

You can install and load by typing:

1. Use str_detect to find the ID of the novel Pride and Prejudice.

You might also like

links <- "https://t.co/[A-Za-z\\d]+|&"