Twitter's Big Data Analysis Using RStudio
ABSTRACT
4.1 INTRODUCTION
Big data is a modern term for data sets with several defining characteristics:
high volume, high velocity, and high value. It comes from many areas of
our lives, such as social media, public health, industry, and agriculture.
Extracting, storing, analyzing, and visualizing this information requires
advanced statistical methods and the technologies that support them [2].
34 Big Data Analytics: Harnessing Data for New Business Models
server admin will ask you many questions about the nature of your
project and the approach you will follow. The objective is to obtain
access to tweets.
ii. Select the Data Type: Select one or more keywords that you want to
study and analyze.
iii. Extract the Data: Use the “searchTwitter” function of the “twitteR”
package, or other packages. You can choose the number of tweets to
retrieve and specify the language, the date, and other parameters.
The right settings depend on the keywords you search for.
iv. Clean the Data: Using the “tm” (text mining) package, we extract
only the text from the loaded data. We then strip numbers,
punctuation, and other special characters from the text. The next
step removes a group of common English words (stop words) such as
“did,” “do,” “must,” “they,” and others. The list of these words is
shown in the practical part. At this level, we also eliminate
repeated words.
v. Data Processing: The data is now clean and ready for analysis.
Using the “wordcloud” package [5], we create word clouds for
both Trump and Trudeau. Many options are available, such as the
colors and the position of the words in the cloud.
library(twitteR)
library(tm)
library(wordcloud)
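Before loading the packages, it can help to check that they are installed. The following is a minimal sketch using only base R; the helper name `missing_packages` is our own illustration and is not part of any package.

```r
# Packages named in the text.
required <- c("twitteR", "tm", "wordcloud")

# Return the subset of `pkgs` that cannot be loaded on this machine.
# `missing_packages` is a hypothetical helper introduced for illustration.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(required)
if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
}
```

Any package reported as missing would need to be installed, for example with install.packages(), before the library() calls succeed.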
4.2.2 CREATING A CONNECTION
The next step is to connect to the Twitter server. For this, we must have
authentication keys, which can be obtained by registering on the developer
website (https://developer.twitter.com/). The process is not very
complex.
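A connection of this kind can be sketched with the setup_twitter_oauth() function of the twitteR package. This is only an illustration under stated assumptions: the wrapper `connect_twitter` and the four environment variable names are hypothetical, and reading the keys from the environment is our choice so that they never appear in the script.

```r
# Open an authenticated twitteR session from four credentials.
# The wrapper name and the environment variable names are hypothetical.
connect_twitter <- function() {
  vars <- c("TWITTER_CONSUMER_KEY", "TWITTER_CONSUMER_SECRET",
            "TWITTER_ACCESS_TOKEN", "TWITTER_ACCESS_SECRET")
  keys <- Sys.getenv(vars)  # unset variables come back as ""
  if (any(keys == "")) {
    stop("Missing credentials: ", paste(vars[keys == ""], collapse = ", "))
  }
  twitteR::setup_twitter_oauth(
    consumer_key    = keys[[1]],
    consumer_secret = keys[[2]],
    access_token    = keys[[3]],
    access_secret   = keys[[4]]
  )
}

# connect_twitter()  # call once per session, after setting the variables
```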
4.2.3 EXTRACTION OF TWEETS
The “searchTwitter” function is used to retrieve tweets from the server. In
our study, we specify two keywords: @realDonaldTrump and @JustinTrudeau.
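The extraction step could look like the sketch below. It assumes an authenticated session; the wrapper `fetch_tweets`, the requested count of 5,000, and the language filter are assumptions made for illustration, not the settings used in the study.

```r
# Thin wrapper around twitteR::searchTwitter().
# `fetch_tweets` and its default values are hypothetical.
fetch_tweets <- function(keyword, n = 5000, lang = "en") {
  # Fail early on malformed input, before any request is sent.
  stopifnot(is.character(keyword), length(keyword) == 1, n >= 1)
  twitteR::searchTwitter(keyword, n = n, lang = lang)
}

# Requires a session opened with setup_twitter_oauth():
# trump_tweets   <- fetch_tweets("@realDonaldTrump")
# trudeau_tweets <- fetch_tweets("@JustinTrudeau")
```

The server may return fewer tweets than requested, so the length of the result is worth checking before going further.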
Using the base R function “str”, we confirm that the output is a list. This
means that the output can hold characters, numbers, and other types of data.
We can display any tweet we need and read its details, such as the date of
publication, the author, the application it was posted from, and other
information.
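As an illustration, the sketch below collects a few fields of a single tweet. The field names (text, created, screenName, statusSource) are those of the twitteR status class; the helper `describe_tweet` is hypothetical.

```r
# Collect a few fields of one twitteR status object.
# `describe_tweet` is a hypothetical helper introduced for illustration.
describe_tweet <- function(tweet) {
  list(
    text    = tweet$text,         # body of the tweet
    created = tweet$created,      # date and time of publication
    author  = tweet$screenName,   # account that posted it
    source  = tweet$statusSource  # application used to post it
  )
}

# With a real result returned by searchTwitter():
# class(trump_tweets)
# str(trump_tweets, max.level = 1, list.len = 3)
# describe_tweet(trump_tweets[[1]])
```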
The first step is to extract only the text of each tweet using the “sapply” function.
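The extraction and the cleaning steps described earlier can be sketched as follows. The functions called are real tm functions, but the helper names, the order of the transformations, and the conversion to ASCII (added because emoji and accented characters can make tolower() fail) are our assumptions rather than the exact procedure of the study.

```r
# Pull the text out of each status object with sapply().
# `tweet_text` is a hypothetical helper.
tweet_text <- function(tweets) {
  sapply(tweets, function(tweet) tweet$getText())
}

# Clean a character vector of tweets with the tm package.
# `clean_tweets` is a hypothetical helper; the step order is an assumption.
clean_tweets <- function(texts, extra_stopwords = character(0)) {
  texts <- iconv(texts, to = "ASCII", sub = "")  # drop non-ASCII characters
  corpus <- tm::VCorpus(tm::VectorSource(texts))
  corpus <- tm::tm_map(corpus, tm::content_transformer(tolower))
  corpus <- tm::tm_map(corpus, tm::removeNumbers)
  corpus <- tm::tm_map(corpus, tm::removePunctuation)
  corpus <- tm::tm_map(corpus, tm::removeWords,
                       c(tm::stopwords("english"), extra_stopwords))
  corpus <- tm::tm_map(corpus, tm::stripWhitespace)
  # Return plain text again, one string per tweet.
  unname(trimws(sapply(corpus, as.character)))
}

# trump_text  <- tweet_text(trump_tweets)
# trump_clean <- clean_tweets(trump_text)
```

The `extra_stopwords` argument leaves room for words that are frequent but uninformative in a given search, such as the searched account name itself.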
The first result is the Trump word cloud, created using the “wordcloud”
function. We limit the number of displayed words to 100.
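A word cloud limited to 100 words could be drawn as in the sketch below. The helper names and the color palette are hypothetical; `max.words`, `min.freq`, `random.order`, and `colors` are arguments of the wordcloud() function. Note that wordcloud() drops words occurring fewer than three times by default, so `min.freq` is set explicitly here.

```r
# Count how often each word occurs in a vector of cleaned tweets.
# `word_frequencies` is a hypothetical helper built on base R.
word_frequencies <- function(clean_text) {
  words <- unlist(strsplit(clean_text, "\\s+"))
  words <- words[nzchar(words)]  # drop empty strings
  counts <- table(words)
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

# Draw the cloud, keeping at most `max_words` of the most frequent words.
# `plot_cloud` and the palette are hypothetical choices.
plot_cloud <- function(freq, max_words = 100) {
  wordcloud::wordcloud(words = names(freq), freq = freq,
                       min.freq = 1, max.words = max_words,
                       random.order = FALSE,
                       colors = c("grey40", "steelblue", "firebrick"))
}

# plot_cloud(word_frequencies(trump_clean))
```

With `random.order = FALSE`, the most frequent words are placed near the center of the cloud, which makes the dominant terms easier to spot.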
This simple word cloud tells us little about Trudeau. It contains no
significant words other than “Canada” and “Canadian.” Although we retrieved
more than 5,000 tweets about the Canadian Prime Minister, we could not
analyze his personality or learn much about him.
Unlike some previous studies published in non-peer-reviewed venues, which
showed only how to extract tweets, we compared extracted tweets in a specific
area (politics) and explained how they can be used to obtain important
information.
Twitter’s big data is a treasure that we need to preserve, explore, and
analyze. It offers a great deal of personal data, statistics, emotions,
visions, reports, plans, strategies, and other types of data for analysis.
The study of tweets is a major focus of social network analysis
because Twitter has become an important factor in communication. This
example shows that it is easy to start a first analysis from data extracted
directly online. The data preparation phase remains as important as ever.
The programming language R, with its RStudio interface, is a powerful tool
for extracting big data online and cleaning it so that it is ready for use
and study. It also lets us build many powerful plots and graphs that help
managers, researchers, governments, and other actors in the
decision-making process.
Regarding the limits of the study, the word cloud alone is not enough to
extract information about the studied keywords, because it does not always
bring out meaningful words. We need to improve the study with advanced
techniques such as sentiment analysis, which is the subject of our next work.
KEYWORDS
big data
programming language R
RStudio
Trudeau word cloud
Twitter
word cloud
REFERENCES