
Twitter's Big Data Analysis Using RStudio

This document summarizes a study that analyzed tweets related to Donald Trump and Justin Trudeau using the R programming language. The study extracted over 5,000 tweets for each leader since 2017, cleaned the data, and generated word clouds to visualize the most common words. The word cloud for Trump showed many political terms that provided insights into issues he discusses, while the word cloud for Trudeau did not reveal much meaningful information due to fewer significant words. The study demonstrated how text analysis of social media can generate insights into prominent figures but has limitations depending on the quantity and content of tweets.


CHAPTER 4

Twitter’s Big Data Analysis Using RStudio


HOUSSAME EDDINE BALOULI1 and LAZHAR CHINE2
1 National High School of Statistics and Applied Economics (ENSSEA),
Koléa, Algeria, E-mail: [email protected]
2 Associate Professor, Boumerdes University, Algeria,
E-mail: [email protected]

ABSTRACT

This work applies tweet analytics using the R programming language [1] and
its interface RStudio. We used several packages, including “twitteR,” “tm,”
and “wordcloud.” Two political figures, US President Donald Trump and
Canadian Prime Minister Justin Trudeau, were chosen because they are among
the most influential leaders of the last four years. The research aims to
collect a large set of tweets that mention each leader. The data obtained are
then cleaned in several steps. Finally, we create a word cloud (a visual
representation of words in which the size of each word is proportional to its
frequency in a given text) for each leader and analyze it.

4.1 INTRODUCTION

Big data is a modern term for data sets with several characteristics:
high volume, high velocity, and high value. Such data come from many areas of
life, such as social media, public health, industry, and agriculture. Advanced
statistical methods and technologies are essential to extract, store,
analyze, and visualize the information [2].
34 Big Data Analytics: Harnessing Data for New Business Models

4.1.1 BIG DATA ANALYTICS PROCESS


Generally, the big data analytics process consists of several steps. The first
step is to identify the business problem (the subject) to be solved. Next, all
sources of data (input) are collected, which is an essential step. The data
are then cleaned to make them ready for analysis. In the analytics step, a
model, picture, equation, or other type of result is constructed. Finally, we
interpret the output in order to improve the quality of the decision-making
process and to clarify future directions [3] (Figure 4.1).

FIGURE 4.1 Big data analytics process.


Source: Team [1].

4.1.2 SOCIAL MEDIA ANALYTICS (SMA)


SMA refers to collecting, cleaning, and processing structured or unstructured,
qualitative or quantitative data loaded from social media websites. Social
media is a broad term encompassing a variety of online platforms that allow
subscribers to exchange content, sentiment, information, and more.
Twitter, with its 140-character rule, is one of the most popular social media
websites (after Facebook), with more than 500 million tweets per day. Twitter
has grown rapidly since its creation 13 years ago. An advantage of Twitter
is that all tweets are shown in real time, so information can reach
a large number of subscribers in a very short time [4].

4.1.3 TWEETS ANALYTICS PROCESS


The tweets analytics process can be described as follows:
i. Creating a Connection to the Twitter Server: Create a new app on the
developer website (http://developer.twitter.com/). The server administrators
will ask several questions about the nature of your project and the approach
followed. The objective is to obtain access to tweets.
ii. Select the Data Type: Select one or many keywords that you need to
study and analyze.
iii. Extract the Data: Using the “searchTwitter” function of the “twitteR”
package, or other packages. You can choose the number of tweets you need and
specify the language, the date, and other parameters. All these options
depend on the nature of the keywords you search for.
iv. Clean the Data: Using the “tm” (text mining) package, we extract
only the text from the loaded data. We then remove numbers, punctuation, and
other special characters from the text. The next step eliminates a group of
common English words such as “did,” “do,” “must,” “they,” and others. The
list of these words is shown in the practical part. At this level, we also
eliminate repeated words.
v. Data Processing: The data are now cleaned and ready for analysis.
Using the “wordcloud” package [5], we create the word clouds of
both Trump and Trudeau. Many options are available, such as the colors
and the position of the words in the cloud.

4.1.4 WORD CLOUD UTILITY

A word cloud is a powerful communication tool. It is very easy to understand
and share, and it summarizes a topic well. It is an efficient method for text
analysis that adds simplicity and clarity: the most used keywords stand out,
and the result is easier to read than a data table filled with text. But who
uses word clouds?
Word clouds have several uses:

• Laboratories: For the presentation of both quantitative and qualitative data.
• Marketing Campaigns: To highlight customer needs and identify
dissatisfaction.
• Education: To support essential topics.
• Politicians and journalists.
• Social Network: To collect, analyze, and share the user’s sentiments.

4.2 EXPERIMENTAL METHODS AND MATERIALS

4.2.1 LOADING LIBRARIES (PACKAGES)

We used several packages in our study: “twitteR” [6] to connect to the
Twitter website; “tm” [7] for the text mining phase; and “wordcloud” for the
presentation.

library(twitteR)
library(tm)
library(wordcloud)

4.2.2 CREATING A CONNECTION

The next step is to connect to the Twitter server. For this, we need
authentication keys, which can be obtained by registering on the developer
website (https://developer.twitter.com/). The process is not very
complex.

consumer_key <- "aSRrP8oQSc29YyvdglqflvogH"


consumer_secret <- "xaA9ihDYXiGWkECOxPC45S6VRz1cnNR29rZWchORLGWqvDgPVw"
access_token <- "1013074431550291968-jxjLtzaELHQB0xqQIrBTkzf2EOsNAg"
access_secret <- "LDEzIC5kw1JwpZK39nsH5gBapE5a93gAVFn7du45zEHKX"

Once obtained, we pass the strings consumer_key, consumer_secret,
access_token, and access_secret to the setup_twitter_oauth command:

setup_twitter_oauth(consumer_key, consumer_secret, access_token, access_secret)

[1] "Using direct authentication"

The message “Using direct authentication” should appear in the console,
indicating that the authentication succeeded.
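
As a possible variation on the connection step, the keys can be read from environment variables instead of being written into the script. The following is only a sketch: the variable names (TWITTER_CONSUMER_KEY and so on) are assumptions made for illustration and are not part of the original study. The call to setup_twitter_oauth is skipped when the keys or the twitteR package are not available.

```r
# Assumed environment variable names; they are NOT part of the original chapter.
key_names <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
               "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")

# Named character vector; an unset variable comes back as "".
keys <- Sys.getenv(key_names)

if (all(nzchar(keys)) && requireNamespace("twitteR", quietly = TRUE)) {
  # Same call as in the text, with the four strings taken from the environment.
  twitteR::setup_twitter_oauth(keys[[1]], keys[[2]], keys[[3]], keys[[4]])
} else {
  message("Twitter credentials or the twitteR package are missing; skipping authentication.")
}
```

Keeping the keys outside the script makes it easier to share the code without sharing the credentials.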

4.2.3 EXTRACTION OF TWEETS

The searchTwitter function is used to load tweets online. In our study, we
specify two keywords: @realDonaldTrump and @JustinTrudeau.

We limit the number of extracted tweets to n = 5000 and set the start date to
01/01/2017. We are interested only in tweets in English. The extraction was
performed on 25/01/2019 at 16:00 UTC.
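
The extraction described above could be written roughly as follows. This is a sketch under stated assumptions, not the authors' exact code: the helper name fetch_tweets and the commented variable names are hypothetical, while searchTwitter and its n, lang, and since arguments come from the twitteR package.

```r
# Hypothetical helper (not from the original chapter): wraps twitteR::searchTwitter
# with the settings described in the text (5000 tweets, English, since 2017-01-01).
fetch_tweets <- function(keyword, n = 5000, since = "2017-01-01") {
  twitteR::searchTwitter(keyword, n = n, lang = "en", since = since)
}

# Usage, after a successful setup_twitter_oauth() call (needs network access):
# trump_tweets   <- fetch_tweets("@realDonaldTrump")
# trudeau_tweets <- fetch_tweets("@JustinTrudeau")
```

Wrapping the call in one function keeps the two searches identical except for the keyword, which makes the resulting word clouds easier to compare.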

4.2.4 THE STRUCTURE OF THE OUTPUT

Using R’s “str” function, we confirm that the output is a list. A list can
hold elements of different types, so our output contains characters, numbers,
and other types of data.
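
To illustrate what such a check looks like, the sketch below applies str to a small mock list. The object mock_tweets and its fields are invented for illustration; the real output of searchTwitter is a list of twitteR status objects with more fields.

```r
# Mock stand-in for the search result (illustrative data only):
# a list of two records whose fields have different types.
mock_tweets <- list(
  list(text = "First example tweet",  retweets = 12L, created = as.Date("2019-01-25")),
  list(text = "Second example tweet", retweets = 3L,  created = as.Date("2019-01-24"))
)

# Compact display of the structure, limited to the top level.
str(mock_tweets, max.level = 1)

# Confirms the container type.
class(mock_tweets)
```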

4.2.5 FIRST TWEETS

We can display any tweet we need and learn many things about it, such as its
publication date, its author, the source it was posted from, and other
information.

4.2.6 CLEANING THE OUTPUT

The first step is to extract only the text using the “sapply” function.

Then, we create a corpus using the “Corpus” function.

After that, we remove numbers, punctuation, special characters, extra
spaces, and a group of common English words such as “they,” “you,” and
“must” from the corpus. The text is now ready to be analyzed by constructing
the word cloud.
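
The cleaning steps above can be sketched with the tm package as follows. The helper name clean_texts, the variable name tweets in the comment, and the sample sentence are assumptions made for illustration; VCorpus is used here as one of tm's corpus constructors, and the exact order of the steps in the original study is not known.

```r
# Sketch of the cleaning steps with tm; `texts` is a character vector of tweet texts.
# With real twitteR results it could be obtained as (assumed variable name `tweets`):
#   texts <- sapply(tweets, function(x) x$getText())
clean_texts <- function(texts) {
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))           # lower case
  corpus <- tm::tm_map(corpus, tm::removeNumbers)                          # drop digits
  corpus <- tm::tm_map(corpus, tm::removePunctuation)                      # drop punctuation
  corpus <- tm::tm_map(corpus, tm::removeWords, tm::stopwords("english"))  # drop common words
  corpus <- tm::tm_map(corpus, tm::stripWhitespace)                        # collapse spaces
  corpus
}

# Small demonstration on an invented sentence; runs only if tm is installed.
if (requireNamespace("tm", quietly = TRUE)) {
  cleaned <- clean_texts(c("They said 2 things: Build the WALL!"))
  print(as.character(cleaned[[1]]))
}
```

Lower-casing comes first so that the stop-word list, which is in lower case, also matches capitalized words such as “They.”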

4.3 RESULTS AND DISCUSSION

The first result is the Trump word cloud, created using the “wordcloud”
function. We limit the number of displayed words to 100.

The word cloud of Trump is given in Figure 4.2.
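
A call of this kind might look like the sketch below. The toy text is invented for illustration and stands in for the cleaned tweets; only the max.words = 100 limit comes from the study, and the other arguments (min.freq, random.order) are illustrative choices. The word counts are computed with base R so that the link between frequency and word size is visible.

```r
# Toy cleaned text standing in for the real tweets (illustrative data only).
cleaned_text <- c("border wall border government", "government shutdown border")

# Count how often each word occurs; these counts drive the word sizes in the cloud.
words <- unlist(strsplit(cleaned_text, "\\s+"))
freq  <- sort(table(words), decreasing = TRUE)

# Draw the cloud only if the wordcloud package is installed.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(freq), freq = as.numeric(freq),
                       min.freq = 1, max.words = 100, random.order = FALSE)
}
```

With random.order = FALSE, the most frequent words are placed first, near the center of the plot.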


In the word cloud, we notice many words related to different contexts:
political, economic, and others. These words include government, democrats,
Robert Mueller (the special counsel investigating possible Russian
interference in the elections), border (the Mexican border), Maduro (the
Venezuelan President), FBI, and Nancy Pelosi (Speaker of the US House of
Representatives). These words point to the issues the American president is
facing and to his involvement even in matters outside the United States
(Venezuela, for example).

FIGURE 4.2 Trump word cloud.

The word cloud of Trump allows us to discover many things about
his personality, his thinking, his views on many political, social, and
economic events, and his interaction with the outside world.
The second word cloud is about Canadian Prime Minister Justin Trudeau.
We use the same function as for the preceding Trump word cloud (Figure 4.3).

FIGURE 4.3 Trudeau word cloud.



With this simple analysis, the word cloud does not tell us much about Trudeau.
There are no significant words except Canada or Canadian. Although we
collected more than 5,000 tweets about the Canadian Prime Minister, we could
not analyze his personality or learn much about him.

4.4 CONCLUSION, LIMITATION, AND FUTURE RESEARCH

Unlike some previous studies published in non-peer-reviewed outlets, which
only showed how to extract tweets, we compared tweets extracted in a specific
area (politics) and explained how they can be used to extract important
information.
Twitter’s big data is a treasure that we need to conserve, discover, and
analyze. Many personal data, statistics, emotions, visions, reports, plans,
strategies, and other types of data are available to analyze.
The study of tweets is a strong focus of social network analysis
because Twitter has become an important factor in communication. This
example shows that it is easy to start a first analysis from data extracted
directly online. The data preparation phase is as important as ever.
Moreover, the R programming language, with its interface RStudio, is a
powerful tool to extract big data online and clean it so that it is ready for
use and study. It allows us to build many powerful plots and graphs that help
managers, researchers, governments, and other actors in the decision-making
process.
Regarding the limits of the study, the word cloud alone is not enough to
extract information about the studied keywords, because it does not always
reveal meaningful words, as the Trudeau example shows. The study can be
improved with advanced techniques such as sentiment analysis, which is the
subject of our next work.

KEYWORDS

• big data
• programming language R
• RStudio
• Trudeau word cloud
• Twitter
• word cloud

REFERENCES

1. R Core Team, (2018). R: A Language and Environment for Statistical Computing.
Retrieved on CRAN: URL: https://www.R-project.org/ (accessed on 22 October 2020).
2. Baesens, B., (2014). Analytics in a Big Data World: The Essential Guide to Data Science and
its Applications. John Wiley & Sons.
3. Gandomi, A., & Haider, M., (2015). Beyond the hype: Big data concepts, methods, and analytics.
International Journal of Information Management, 35(2), 137–144.
4. Zhao, Y., (2013). Analyzing Twitter data with text mining and social network analysis.
The 11th Australasian Data Mining and Analytics Conference (AusDM 2013).
5. Fellows, I., (2018). wordcloud: Word Clouds. Retrieved on CRAN:
https://CRAN.R-project.org/package=wordcloud (accessed on 22 October 2020).
6. Gentry, J., (2015). twitteR: R Based Twitter Client. Retrieved on CRAN:
https://cran.r-project.org/web/packages/twitteR/index.html (accessed on 22 October 2020).
7. Feinerer, I., & Hornik, K., (2018). tm: Text Mining Package. Retrieved from CRAN:
https://CRAN.R-project.org/package=tm (accessed on 22 October 2020).
