Bopardikar Yash Masters Project Report
A Project
MASTER OF SCIENCE
in
Computer Science
by
Yash Bopardikar
FALL
2016
© 2016
Yash Bopardikar
A Project
by
Yash Bopardikar
Approved by:
____________________________
Date
Student: Yash Bopardikar
I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.
Abstract
by
Yash Bopardikar
The idea is to solve people's dilemma of choosing a certain product or service over another. The objective is to collect data from Twitter feeds on trending topics like various operating systems, new technologies, different products, etc., and to categorize them according to need. This project uses Scala and a Spark cluster. All the Big Data that is fetched is collected in a MySQL database. Then, using the MapReduce APIs and data mining algorithms, the data is classified accordingly. Here we use the Naïve Bayes algorithm for sentiment analysis over the data. Twitter provides a streaming API to stream real-time Twitter data. The Twitter4J library, available for Java and Scala, is used to access the Streaming API and download Twitter data. The filter on this data is based on a list of keywords supplied. This analysis employs a distributed data processing system known as Apache Spark using several worker and master nodes. This cluster is set up in standalone mode on a single machine.
To filter out the huge amount of data, we use the MapReduce technique on Spark. The input file contains a JSON object for each tweet. This file is loaded into Spark's data structures, which are replicated and distributed to multiple nodes; the mapper thus takes all the files present in the directory and classifies them according to the filter set. These filtered tweets pass through the data mining algorithms, allowing a one-to-one comparison of the data that helps consumers make informed choices.
ACKNOWLEDGEMENTS
I would like to thank Dr. Jun Dai, my advisor, for providing me an opportunity to work on this project and for motivating me to do better and better. His constant encouragement and interest in making the best out of such a powerful platform helped me throughout the project. In addition, I would also like to thank Dr. Ying Jin for her willingness to serve on the committee. Last but not least, I would like to thank the entire faculty and staff of California State University, Sacramento.
TABLE OF CONTENTS
Acknowledgements
Chapters
1. INTRODUCTION
4. IMPLEMENTATION
5. EXECUTION
8. CONCLUSION
Bibliography
LIST OF FIGURES
13. Density, Sentiment and Location Graph for iPhone vs Samsung (New York, USA)
14. Density, Sentiment and Location Graph for iPhone vs Samsung (Washington, USA)
Chapter 1
INTRODUCTION
1.1 Overview
In today's world, a lot of informative data is available on social media. This data has been a powerful resource for data miners to gain deep insights. Collecting this data and analyzing it can give us a lot of useful information which can help solve many of our day-to-day problems. Taking this idea into consideration, I have built a tool which collects data from Twitter and analyzes it to produce intelligent information. From a consumer's or buyer's perspective, it can be used in the following way. The app collects the tweets that relate to the brands, products or topics set as filters and tells us which is the best for a given geographical location, what sentiments are associated with it at a particular location, which brand, product or topic is trending most, what the trend looks like on different dates, how popular it is and how many followers it has, thus helping the consumer make wise choices. Consider buying a gaming console as our target mission: we want to analyze gaming consoles that belong to three brands, say Xbox, PlayStation and Nintendo. The tool is not limited to gaming consoles, but can be used to get reviews and insights for any product or trending topic, for example phones, cars or electronic devices. The results can be visualized in various graphs like heat maps, bar graphs, trending graphs, density graphs, etc.
1.2 Introduction to Big Data
Today, we live in the digital world. Because of the growing digitization and innovations
in technologies, the amount of data (structured or unstructured) and data collection have
also tremendously increased. The data is deposited in databases which grow substantially
and become difficult to collect, manage, share, examine and visualize via traditional
database software tools [1]. This is the reason for emergence of big data and a recent area
Big Data can be defined as a collection of data which may be structured or unstructured
and so large and complicated that it may become complex to process it using simple
• Volume
Volume [3] refers to the huge amount of data generated every second, which is not possible for traditional systems to manage. For example, consider a social networking website like Facebook, where people upload pictures, send messages, post comments and click the Like button, producing enormous volumes of data every second.
• Variety
Variety refers to the different forms data can take: structured or unstructured, videos, audio, images, text messages, tables, graphs, etc. These different forms of data can be collected from various sources like social media, sensors, mobile devices, government datasets, user-generated content, etc. This data can have many dimensions and take various forms.
• Velocity
Velocity refers to the speed at which new data is generated and moves around. Examples are messages or videos on social media going viral within a minute, or quickly detecting suspicious activity in credit card transactions. Systems designed to handle big data can process millions of rows per second, which makes it easier to get the desired results in near real time.
• Veracity
Veracity refers to the reliability or disorderliness of the data [3]. Quality and accuracy are not easy to manage with many forms of big data, for example posts on Twitter or Facebook with many hashtags, typos and informal speech. The system has many worker nodes and threads running to complete a desired task, and they work in synchronization to produce trustworthy results from such noisy data.
• Value
Value refers to the business value obtained through managing big data, for example by managing very large unstructured data from blogs or social media streams. The value of data depends on how accurate the insights obtained from it are, the processing time, how deep the mining goes and how it is visualized.
With the evolution and innovation of technology, the amount of generated data has also increased, and this data is collected in a variety of formats. Sources of Big Data include the following.
• Enterprise Data
There is a humongous amount of data spread across various businesses, companies and institutions in different formats, known as Enterprise Data. The formats include emails, Word documents, spreadsheets, presentations, HTML pages, PDF files, XMLs, flat files, etc.
• Transactional Data
There are countless applications like web applications, mobile applications, CRM systems, banking systems, service-providing platforms, etc., in every organization which are involved in different kinds of transactions, so their relational databases can be used as a source of transactional big data.
• Social Media
Social networks like Facebook, Twitter, etc., produce a vast amount of unstructured data every day, which can be text, images, video and more. This data can be cleansed, ordered and transformed into structured data to make it useful.
• Activity Generated
The data generated by machines is referred to as activity-generated data. These kinds of data originate from satellites, medical devices, sensors, industrial machinery, surveillance videos, cell phone towers, etc. This data may be present in log files, log tables or JSON format and can be used to make future predictions about the system.
• Public Data
Publicly available data like Wikipedia, data from weather departments, data published by research institutes, open datasets provided by governments and other types of data which are easily available and accessible to the public at no cost are called Public Data.
• Archives
In today's world, even seemingly insignificant data is being archived by enterprises. As hardware gets cheaper day by day, enterprises can afford to pile up this data. This kind of data is referred to as archive data.
Normal system hardware and software are not capable of dealing with very large amounts of data of various types created and collected at such high speed. Big data is the term for data sets so large and complicated that it becomes difficult for traditional data warehousing to store, analyze, manage, process and visualize them. The insights gained by processing big data can be significant for a business, can help consumers, can be used to build predictive models for natural calamities and avoid them, and can be used to predict behavioral patterns and trends. That is why big data and its analysis are a main focus of modern science and business.
“Big data analytics is the process of examining large data sets containing a variety of data types, like big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.” Using big data analysis, important insights can be gained. For this project we use Big Data Analytics on data collected from Twitter.
2.1 Twitter
Twitter [5] is a social networking platform used to share small bits of information across the world. It is the most popular micro-blogging website today. The 140-character messages/posts are called tweets, and people can follow each other to receive each other's updates. Like on all social media forums, these tweets can include URLs, photographs and hashtags. A person can share another person's tweet (re-tweet) if he is following the latter. The hashtag (#) acts as a trendsetter on Twitter and is used to easily look for information on a particular topic.
Twitter has more than a billion registered user accounts and around 317 million monthly active users [5]. It contains a massive amount of data and includes users from all fields, like movie stars, brands, news outlets, sportsmen, common people, politicians, etc., thus giving us a global outlook on people around the world. Collecting the expressions of people from different walks of life can therefore give us a clear picture on a particular topic.
2.2 Apache Spark
Spark [6] is a framework which provides a cluster computing platform to perform various tasks like data analytics, machine learning, data streaming, database management, parallel computing, graph operations, etc. Spark can run as a standalone cluster as well as on external cluster managers.
Spark uses RDDs (resilient distributed datasets) to distribute items over the cluster. RDDs are read-only datasets and are handled by Spark in a way that maintains its fault-tolerant behavior. Aside from RDDs, Spark also has DataFrames, which are table-like structures used to store data in tabular format; for manipulating data in DataFrames, Spark provides SQL libraries and functions through the Spark SQL library.
Apache Spark supports Java, Scala, Python and R [6]. Spark Core is the main engine on which all the other Spark components are built.
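As a brief illustration of the two data abstractions mentioned above, the sketch below (names and values are illustrative, not taken from the project code) creates an RDD, transforms it, and exposes the same data as a DataFrame:

```scala
import org.apache.spark.sql.SparkSession

object RddDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddDataFrameSketch").getOrCreate()
    import spark.implicits._

    // RDD: a distributed, read-only collection that Spark partitions across the cluster.
    val words = spark.sparkContext.parallelize(Seq("iphone", "samsung", "iphone", "xbox"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    // DataFrame: the same data with a tabular structure and named columns.
    val df = counts.toDF("keyword", "count")
    df.show()

    spark.stop()
  }
}
```

The main runtime components that execute such a program on a cluster are described next.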
• Driver Program
The driver program [7] manages the allocation of tasks to the worker nodes. It is also called the master. The driver listens to the worker nodes for possible incoming messages. It also keeps tasks isolated so there is no data leak between them, and it prioritizes managing the jobs submitted to the Spark cluster.
• Cluster Manager
The cluster manager [7] distributes tasks to the various worker nodes. It acts as the intermediary between the driver program and the worker nodes and manages the cluster when it is distributed. The cluster manager handles requests when either the driver or a worker node is busy, which helps make Spark fault tolerant.
• Worker Node
Each worker node [7] has an executor which can perform many tasks. Every worker has a configurable amount of cache memory allocated. Every executor isolates its tasks so there is no memory leak between the multiple tasks submitted; the only way to share data across tasks is through Spark's shared variables (broadcast variables and accumulators).
• Executor
An executor [7] is the process initiated for the execution of an application. Each application running on the cluster has its own executors. The executor is responsible for keeping the data and doing input/output operations for the application.
• Task
A task is the unit of work sent to one executor. A Spark cluster can be configured with various parameter settings: the number of executors, worker nodes, memory allocation, cache memory, worker cores, executor cores, etc.
Spark Streaming is an add-on package to the core Spark API which enables scalable, high-throughput, fault-tolerant processing of live data streams [8]. For streaming, Apache Spark can use Flume, HDFS, Apache Kafka, Twitter and Kinesis as data sources. This data can then be cleaned and structured in Spark itself and used for further processing.
Spark Core has an SQL extension which supports further optimization on datasets (RDDs) in a structured format, allowing data to be retrieved using SQL queries. Spark SQL provides a convenient way to perform several transformations on the data. Spark SQL [9] uses the DataFrame and Dataset abstractions together with a query optimizer to run these queries efficiently.
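A minimal sketch of this usage (the column names and values here are purely illustrative) might look like:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()
    import spark.implicits._

    // A small DataFrame standing in for the tweet table (illustrative columns).
    val tweets = Seq(
      ("iphone", "POSITIVE", "New York"),
      ("samsung", "NEGATIVE", "Seattle"),
      ("iphone", "NEUTRAL", "New York")
    ).toDF("keyword", "sentiment", "location")

    // Register the DataFrame as a temporary view and query it with plain SQL.
    tweets.createOrReplaceTempView("tweets")
    spark.sql(
      "SELECT keyword, sentiment, COUNT(*) AS n FROM tweets GROUP BY keyword, sentiment"
    ).show()

    spark.stop()
  }
}
```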
Spark MLlib [10] provides a wide range of advanced machine learning libraries. Spark has two flavors of these libraries, one operating on RDDs and one on DataFrames. Spark can process large amounts of data and run advanced machine learning algorithms on it.
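Although this project ultimately scores sentiment with an external NLP library (Chapter 4), the following hedged sketch shows how a simple DataFrame-based MLlib pipeline (with illustrative, hand-labelled examples) could train a Naïve Bayes text classifier:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object MlLibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MlLibSketch").getOrCreate()
    import spark.implicits._

    // Tiny hand-labelled training set (1.0 = positive, 0.0 = negative); illustrative only.
    val training = Seq(
      (1.0, "love the new phone camera"),
      (1.0, "great battery life"),
      (0.0, "screen keeps freezing"),
      (0.0, "worst update ever")
    ).toDF("label", "text")

    // Tokenize the text, hash tokens into term-frequency vectors, then fit Naive Bayes.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val nb = new NaiveBayes()
    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, nb)).fit(training)

    // Score unseen text.
    val test = Seq("battery is great").toDF("text")
    model.transform(test).select("text", "prediction").show()

    spark.stop()
  }
}
```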
2.3 SBT
“SBT is an open source build tool for Scala and Java projects.” [11]. It is tool for
compiling Scala code and integrating various Scala frameworks. SBT builds a jar file of
the project which includes all the dependencies needed for project to run. This jar then
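A hedged sketch of what such a build definition might contain is shown below; the library versions and artifact choices are assumptions for illustration, not the project's actual build file:

```scala
// build.sbt (sketch); versions below are assumptions, not the project's exact ones.
name := "twitter-spark-analysis"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Core Spark libraries are "provided" because the cluster already ships them.
  "org.apache.spark" %% "spark-core"      % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-sql"       % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.0.1" % "provided",
  // Twitter receiver for Spark Streaming (published under Apache Bahir for Spark 2.x).
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1",
  // MySQL JDBC driver used when writing DataFrames to the database.
  "mysql" % "mysql-connector-java" % "5.1.40"
)
```

The single fat jar described in Chapter 5 is typically produced with the sbt-assembly plugin, although the exact plugin configuration is not reproduced here.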
2.4 Tableau
Tableau [12] is data visualization software. Data visualization is one of the important
features of Big Data. Visualization views can project data in different dimensions which
can make an impact. Tableau is a powerful tool which can project data in many formats
and graphs. It has the power to extract data from various data sources as well as stream
data from data sources to provide a live view. Tableau has custom settings for color, toot-
tips, text representation, scale, etc. It has various versions which can be accessed online
Chapter 3
SYSTEM DESIGN
I have set up a Spark cluster in standalone mode on my laptop, which has 8 GB of memory (RAM), 512 GB of storage and an Intel i5 processor. This Spark cluster is built using Maven, which is included in the Spark package. Below are the Spark environment settings used.
Here I have set the Spark worker instances to 4, which launches 4 worker instances as soon as the Spark cluster is started. The executor instances are set to 5, and the workers are allowed up to 8 GB of memory, i.e. the Spark cluster can use the total amount of memory available on the system. The Spark executor and worker cores are set to 6 and 10 respectively, which controls how many tasks can run concurrently.
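The worker-level settings above live in Spark's standalone configuration (conf/spark-env.sh), while per-application resources can also be requested programmatically. A hedged sketch of the application-side configuration, using standard Spark property keys and illustrative values, is:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ClusterConfigSketch {
  def main(args: Array[String]): Unit = {
    // Application-side resource requests; the standalone master honors these
    // within the limits configured for its workers in spark-env.sh.
    val conf = new SparkConf()
      .setAppName("TwitterAnalysis")
      .set("spark.executor.instances", "5") // requested executor count (primarily honored on YARN)
      .set("spark.executor.cores", "6")     // cores per executor
      .set("spark.executor.memory", "2g")   // memory per executor (illustrative value)
      .set("spark.cores.max", "10")         // cap on total cores for this application

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println("Spark application started with id " + spark.sparkContext.applicationId)
    spark.stop()
  }
}
```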
Chapter 4
IMPLEMENTATION
We use Spark Streaming to stream live Twitter data. To load Twitter data into Apache Spark, Twitter provides an interface for developers which can be used to access Twitter data. After registering an application on the Twitter developer portal, I have noted the Twitter tokens into TwitterKeys.txt; these are needed to initialize the Spark streaming context.
The streaming data is captured in batches. The interval for downloading batches can be set; I have set the interval to 10 seconds, so every 10 seconds a new batch of tweets is downloaded. We set a filter which contains the set of keywords used to filter the tweets; only tweets matching these keywords are collected.
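A hedged sketch of this setup is shown below; the property names follow the spark-streaming-twitter/twitter4j integration, while the token placeholders and the filter list are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterStreamSketch {
  def main(args: Array[String]): Unit = {
    // OAuth tokens (read from TwitterKeys.txt in the project; the values here are placeholders).
    System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
    System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
    System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

    val conf = new SparkConf().setAppName("TwitterStreamSketch")
    // 10-second batch interval, matching the setting described above.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Only tweets containing these keywords are delivered by the stream.
    val filters = Seq("iphone", "samsung", "xbox", "playstation", "nintendo")
    val stream = TwitterUtils.createStream(ssc, None, filters)

    // Print the text of a few tweets from each batch.
    stream.map(status => status.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```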
The foremost and initial stage in data mining is to perceive and recognize the problem and to note down the objectives [14]. We have equipped ourselves with the domain knowledge needed to carry out the analysis with effectiveness and potency. We have thoroughly understood the sources and types of related data, and we have gathered and collected some useful data. We are focusing on tweets about consumer products and trending topics.
The data we receive from Twitter streaming is in JSON format. This data needs to be understood before using it; selecting the fields from this data is one of the important aspects. We use the twitter4j library, on which the TwitterUtils library is built, for getting specific fields out of the data. The raw JSON for a tweet contains a large number of fields, including the tweet text, id, timestamps, user profile, location and hashtags.
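The stream delivers twitter4j Status objects, so individual fields can be pulled out with its accessors. A minimal sketch follows; the chosen fields and the record type are illustrative of what the project extracts, not its exact schema:

```scala
import org.apache.spark.streaming.dstream.DStream
import twitter4j.Status

// Illustrative record holding the fields we keep from each tweet.
case class TweetRecord(
  id: Long,
  text: String,
  user: String,
  location: String,
  followers: Int,
  createdAt: java.util.Date
)

object FieldExtractionSketch {
  // Map raw twitter4j Status objects to the fields we care about.
  def extract(stream: DStream[Status]): DStream[TweetRecord] =
    stream.map { s =>
      TweetRecord(
        id = s.getId,
        text = s.getText,
        user = s.getUser.getScreenName,
        location = Option(s.getUser.getLocation).getOrElse(""),
        followers = s.getUser.getFollowersCount,
        createdAt = s.getCreatedAt
      )
    }
}
```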
The collected data was noisy, missing useful information and inconsistent [14]. During the data preparation process we have to check for empty or invalid fields before the data is analyzed. To improve the efficiency of our analysis, the data should be in a simple format. Data mining is done on this data, so to get efficient results the data must be processed with transformations like deriving dates out of month, day and year fields, or deriving the age of a record from its timestamps.
For sentiment analysis we use a library based on Stanford CoreNLP [15]. Text input is sent to this library to get the sentiment in return. The library constructs a tree-like structure out of the plain text passed to it; this structure is created after cleaning the data and removing all the stop words.
We call the SentimentAnalysisUtils function with the plain text of the tweet as its parameter. The returned value is saved into a map so that the status id is tagged with its sentiment and can later be joined to the other tweet fields. After we filter the tweet and get the plain text, we pass it to a function which creates a tree out of it to obtain the sentiment score.
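A minimal sketch of such a scoring function is shown below, assuming the Stanford CoreNLP sentiment annotator; the project's actual SentimentAnalysisUtils implementation may differ in details:

```scala
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.JavaConverters._

object SentimentSketch {
  // Build the CoreNLP pipeline once; parsing and sentiment are the expensive annotators.
  private val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
  private lazy val pipeline = new StanfordCoreNLP(props)

  // Returns an average sentiment score: 0 = very negative ... 4 = very positive.
  def sentimentScore(text: String): Double = {
    val annotation = pipeline.process(text)
    val sentences = annotation.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala
    var weightedSum = 0.0
    var totalLength = 0
    for (sentence <- sentences) {
      val tree = sentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])
      val score = RNNCoreAnnotations.getPredictedClass(tree) // 0..4 for this sentence
      val length = sentence.get(classOf[CoreAnnotations.TextAnnotation]).length
      weightedSum += score * length // weight each sentence by its length
      totalLength += length
    }
    if (totalLength == 0) 2.0 else weightedSum / totalLength
  }
}
```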
The score is computed by calculating an average, i.e. by dividing the sum of the scores by the length of the sentence, since each sub-tree and each word has a score associated with it. The score is then used to derive a weighted sentiment for the status, and the status is mapped to a sentiment label based on this score. Here a score of 0 is very negative, 1 is negative, 2 is neutral, 3 is positive and 4 is very positive.
The model should be reviewed and checked before using it to obtain results, and the results obtained after mining the data should be analyzed meticulously and interpreted carefully [14]. We use our raw data to extract the relevant information. The extracted information is saved into case classes and then into DataFrames. The DataFrame is created using the structure we need.
The extracted fields are modeled as a case class. When constructing a DataFrame from a schema string, the data type of each column needs to be specified. A DataFrame looks like a table; when we use the show() method on a DataFrame, it is printed as a table with its column headers.
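A hedged sketch of this step is shown below; the field names are illustrative, not the project's exact schema:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative case class describing one analyzed tweet.
case class AnalyzedTweet(
  id: Long,
  keyword: String,
  text: String,
  location: String,
  sentiment: String,
  followers: Int
)

object DataFrameSketch {
  // Turn an RDD of case-class records into a DataFrame whose columns follow the case class fields.
  def toTweetDF(spark: SparkSession, rows: RDD[AnalyzedTweet]): DataFrame = {
    import spark.implicits._
    val df = rows.toDF()
    df.show(5) // prints the first rows as a table
    df
  }
}
```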
I am using MySQL to store the collected data. MySQL is widely used and is very efficient at storing tabular data, and it is easy to store data into MySQL using a JDBC connection. The connection needs a driver, which then makes a connection to the database and pushes the data.
Every time a new DataFrame is created, it gets pushed and appended to the existing table in the database. The driver needs the MySQL server URL and password along with the table and database name to store the data. In order for the writes to be isolated, we also need to set the transaction isolation level on the connection.
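A minimal sketch of this write path follows; the URL, credentials and table name are placeholders, not the project's actual configuration:

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object MySqlWriterSketch {
  // Append a DataFrame of analyzed tweets to a MySQL table over JDBC.
  def saveToMySql(df: DataFrame): Unit = {
    val url = "jdbc:mysql://localhost:3306/twitter_db" // placeholder database URL
    val props = new Properties()
    props.setProperty("user", "root")                  // placeholder credentials
    props.setProperty("password", "<password>")
    props.setProperty("driver", "com.mysql.jdbc.Driver")
    props.setProperty("isolationLevel", "READ_COMMITTED") // write isolation (option name in recent Spark versions)

    // Each new batch/DataFrame is appended to the existing table.
    df.write.mode(SaveMode.Append).jdbc(url, "tweets", props)
  }
}
```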
Chapter 5
EXECUTION
Launching the project on the Spark cluster needs a jar file. I use SBT as the package builder for generating a .JAR of the project; SBT is widely used for building Scala and Java projects. SBT downloads all the dependencies needed for execution and bundles them into the JAR. This jar is called a FAT JAR or UBER JAR because it contains all the code segments and dependencies.
The Spark cluster needs to be started before submitting any job to it, and all worker nodes have to be initiated while starting the cluster. Spark ships with shell scripts for starting the master, starting workers, stopping the master, stopping workers, starting all, stopping all, etc.
5.3 Tableau
Tableau is set up with the database as its data source [12]. Tableau has two connection modes: live and extract.
One of the most common dilemmas for people is the comparison between the Apple iPhone and Samsung (Android) phones [17]. The application filters the live tweets based on the filters set. After extracting all the features from the clean data, sentiment analysis is performed on the data set. The comparison is made based on the sentiment of each tweet, its location and other factors, and gives us a brief idea of how, what and where people think about these products (Apple and Samsung).
The questions considered include, for example: which is the more popular phone, and are there any trend changes after the recent changes in models?
The observations below are based upon data collected on different dates and on the total dataset collected for this use case.
Figure 13: Density, Sentiment and Location Graph for iPhone vs Samsung (New York, USA)
Figure 14: Density, Sentiment and Location Graph for iPhone vs Samsung (Washington, USA)
Another confusing task, found especially amongst the younger generation, is game and console selection. There are many game consoles available in today's world, but the Sony PlayStation, Microsoft Xbox and Nintendo consoles are the market leaders. All of them have their own follower base and come with their own specialties. To address this dilemma, I picked this up as a second use case. The total size of the dataset collected for this use case is more than 200,000 tweets.
FUTURE WORK
The world is currently moving towards mobile computing. Mobile applications have become an important factor for a product's success, and it is very convenient to access mobile applications on the go. In the future, a mobile application could be developed which would give instant reviews as soon as you set the filters you desire. The challenge for mobile computing would be optimizing access to the cluster from such a small device with minimal resources.
Chapter 8
CONCLUSION
I have learnt a lot of technologies while developing this project, and the problems I faced made me think deeper and out of the box. The vibrant and multidimensional visualization would help consumers select a better product and avail themselves of the benefits of a real-time review system based on location, public view and trend. It will also benefit businesses: based on the feedback they get, there would be a huge scope for development of their product or service.
During the course of implementation, I have gained in-depth knowledge in the field of big data analytics. I got to learn concepts like cluster computing on Apache Spark, which is a very powerful and trending platform for cluster computing; a language like Scala, which is object-oriented as well as functional, provides a lot more flexibility for writing optimized code in fewer lines, and integrates very well with the Spark cluster; and tools like SBT and Tableau, which are used widely across technology companies.
Last but not least, I have learned how to overcome problems and develop a product that would help people in their lives. The experience has made me technically stronger and more confident.
Bibliography
Forbes. 12 Big Data Definitions: What's Yours? [Online]. Available: http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-
[3] ResearchGate. The Five V's of Big Data. [Online]. Available: https://www.researchgate.net/figure/281404634_fig1_Figure-1-The-five-V's-of-Big-Data-Adapted-from-IBM-big-data-platform-Bringing-big. Accessed in September 2015.
SlideShare. What is Big Data and its Characteristics. [Online]. Available: http://www.slideshare.net/venturehire/what-is-big-data-and-its-characteristics. Accessed in 2013.
[8] Spark. Spark Streaming Programming Guide. [Online]. Available: http://spark.apache.org/docs/2.0.1/streaming-programming-guide.html. Accessed in October 2016.
[9] Spark. Spark SQL, DataFrames and Datasets Guide. [Online]. Available: http://spark.apache.org/docs/2.0.1/sql-programming-guide.html. Accessed in October 2016.
[10] Spark. Machine Learning Library (MLlib) Guide. [Online]. Available: http://spark.apache.org/docs/2.0.1/ml-guide.html. Accessed in October 2016.
[14] Big Sky Associates. The Data Analysis Process. [Online]. Available: http://www.bigskyassociates.com/blog/bid/372186/The-Data-Analysis-Process-