Bopardikar Yash Masters Project Report
A Project
MASTER OF SCIENCE
in
Computer Science
by
Yash Bopardikar
FALL
2016
© 2016
Yash Bopardikar
A Project
by
Yash Bopardikar
Approved by:
____________________________
Date
Student: Yash Bopardikar
I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the project.
Abstract
by
Yash Bopardikar
The idea is to solve people's dilemma of choosing a certain product or service over another. The objective is to collect data from Twitter feeds on trending topics like various operating systems, new technologies, different products, etc., and to categorize them according to need. This project uses Scala and a Spark cluster. All the Big Data that is fetched is collected in a MySQL database. Then, using the MapReduce APIs and data mining algorithms, the data is classified accordingly. Here we use the Naïve Bayes algorithm for sentiment analysis over the data. Twitter provides a streaming API to stream real-time Twitter data. The Twitter4J library, available for Java and Scala, is used to access the Streaming API and download Twitter data. The filter on this data is based on a list of keywords supplied. This analysis employs a distributed data processing system known as Apache Spark using several worker and master nodes. This cluster is set up in standalone mode on a single machine.
To filter out the huge amount of data, we use the MapReduce technique on Spark. The input file contains a JSON object for each tweet. This file is loaded into Spark's data structures, which are replicated and distributed to multiple nodes; the mapper thus takes all the files present in the directory and classifies them according to the filter set. These filtered tweets pass through the data mining algorithms, allowing a one-to-one comparison of the data that helps consumers make informed choices.
ACKNOWLEDGEMENTS
I would like to thank Dr. Jun Dai, my advisor, for providing me an opportunity to work on this project and for motivating me to do better and better. His constant encouragement and interest in making the best out of such a powerful platform helped me throughout the project. In addition, I would also like to thank Dr. Ying Jin for her willingness to serve on the committee. Last but not least, I would like to thank the entire faculty and staff of California State University, Sacramento.
TABLE OF CONTENTS
Acknowledgements
Chapters
1. INTRODUCTION
4. IMPLEMENTATION
5. EXECUTION
8. CONCLUSION
Bibliography
LIST OF FIGURES
13. Density, Sentiment and Location Graph for iPhone vs Samsung (New York, USA)
14. Density, Sentiment and Location Graph for iPhone vs Samsung (Washington, USA)
Chapter 1
INTRODUCTION
1.1 Overview
In today's world, a lot of informative data is available on social media. This data has been a powerful resource for data miners to gain deep insights. Collecting this data and analyzing it can give us a lot of useful information which can help solve many of our day-to-day problems. Taking this idea into consideration, I have built a tool which collects data from Twitter and analyzes it to produce intelligent information. From a consumer's or buyer's perspective, it can be used in the following way. The app collects the tweets that relate to the brands, products or topics set as filters and tells us which is the best for a given geographical location, what sentiments are associated with it at a particular location, which brand, product or topic is trending most, what the trend looks like on different dates, how popular it is and how many followers it has, thus helping the consumer make wise choices. Consider buying a gaming console as our target mission: we want to analyze gaming consoles that belong to three brands, say Xbox, PlayStation and Nintendo. The tool is not limited to gaming consoles, but can be used to get reviews and insights for any product or trending topic, for example phones, cars or electronic devices. The results can be visualized in various graphs like heat maps, bar graphs, trending graphs, density graphs, etc.
1.2 Introduction to Big Data
Today, we live in the digital world. Because of the growing digitization and innovations
in technologies, the amount of data (structured or unstructured) and data collection have
also tremendously increased. The data is deposited in databases which grow substantially
and become difficult to collect, manage, share, examine and visualize via traditional
database software tools [1]. This is the reason for emergence of big data and a recent area
Big Data can be defined as a collection of data which may be structured or unstructured
and so large and complicated that it may become complex to process it using simple
• Volume
Volume [3] refers to the huge amount of data generated every second, which is not possible for traditional systems to manage. For example, consider a social networking website like Facebook, where people upload pictures, send messages, post comments and click the Like button, producing enormous volumes of data every second.
• Variety
Variety refers to the different forms data can take: structured or unstructured, videos, audio, images, text messages, tables, graphs, etc. These different forms of data can be collected from various sources like social media, sensors, mobile devices, government datasets, user-generated content, etc. This data can have many dimensions and take various forms.
• Velocity
Velocity refers to the speed at which new data is generated and moves around. Examples are messages or videos on social media going viral within a minute, or quickly detecting suspicious activity in credit card transactions. Systems designed to handle big data can process millions of rows per second, which makes it easier to get the desired results in near real time.
• Veracity
Veracity refers to the reliability or disorderliness of the data [3]. Quality and accuracy are not easy to manage with many forms of big data, for example posts on Twitter or Facebook with many hashtags, typos and informal speech. The system has many worker nodes and threads running to complete a desired task, and they work in synchronization to produce trustworthy results from such noisy data.
• Value
Value refers to the business value obtained through managing big data, for example by managing very large unstructured data from blogs or social media streams. The value of data depends on how accurate the insights obtained from it are, the processing time, how deep the mining goes and how it is visualized.
With the evolution and innovation of technology, the amount of generated data has also increased, and this data is collected in a variety of formats. Sources of Big Data include the following.
• Enterprise Data
There is a humongous amount of data spread across various businesses, companies and institutions in different formats, known as Enterprise Data. The formats include emails, Word documents, spreadsheets, presentations, HTML pages, PDF files, XMLs, flat files, etc.
• Transactional Data
There are countless applications like web applications, mobile applications, CRM systems, banking systems, service-providing platforms, etc., in every organization which are involved in different kinds of transactions, so their relational databases can be used as a source of transactional big data.
• Social Media
Social networks like Facebook, Twitter, etc., produce a vast amount of unstructured data every day, which can be text, images, video and more. This data can be cleansed, ordered and transformed into structured data to make it useful.
• Activity Generated
The data generated by machines is referred to as activity-generated data. These kinds of data originate from satellites, medical devices, sensors, industrial machinery, surveillance videos, cell phone towers, etc. This data may be present in log files, log tables or JSON format and can be used to make future predictions about the system.
• Public Data
Publicly available data like Wikipedia, data from weather departments, data published by research institutes, open datasets provided by governments and other types of data which are easily available and accessible to the public at no cost are called Public Data.
• Archives
In today's world, even seemingly insignificant data is being archived by enterprises. As hardware gets cheaper day by day, enterprises can afford to pile up this data. This kind of data is referred to as archive data.
Normal system hardware and software are not capable of dealing with very large amounts of data of various types created and collected at such high speed. Big data is the term for data sets so large and complicated that it becomes difficult for traditional data warehousing to store, analyze, manage, process and visualize them. The insights gained by processing big data can be significant for a business, can help consumers, can be used to build predictive models for natural calamities and avoid them, and can be used to predict behavioral patterns and trends. That is why big data and its analysis are a main focus of modern science and business.
“Big data analytics is the process of examining large data sets containing a variety of data types, like big data, to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.” Using big data analysis, important insights can be gained. For this project we use Big Data Analytics on data collected from Twitter.
2.1 Twitter
Twitter [5] is a social networking platform used to share small bits of information across the world. It is the most popular micro-blogging website today. The 140-character messages/posts are called tweets, and people can follow each other to receive each other's updates. Like on all social media forums, these tweets can include URLs, photographs and hashtags. A person can share another person's tweet (re-tweet) if he is following the latter. The hashtag (#) acts as a trendsetter on Twitter and is used to easily look for information on a particular topic.
Twitter has more than a billion registered user accounts and around 317 million monthly active users [5]. It contains a massive amount of data and includes users from all fields, like movie stars, brands, news outlets, sportsmen, common people, politicians, etc., thus giving us a global outlook on people around the world. Collecting the expressions of people from different walks of life can therefore give us a clear picture on a particular topic.
2.2 Apache Spark
Spark [6] is a framework which provides a cluster computing platform to perform various tasks like data analytics, machine learning, data streaming, database management, parallel computing, graph operations, etc. Spark can run as a standalone cluster as well as on external cluster managers.
Spark uses RDDs (resilient distributed datasets) to distribute items over the cluster. RDDs are read-only datasets and are handled by Spark in a way that maintains its fault-tolerant behavior. Aside from RDDs, Spark also has DataFrames, which are table-like structures used to store data in tabular format; for manipulating data in DataFrames, Spark provides SQL libraries and functions through the Spark SQL library.
Apache Spark supports Java, Scala, Python and R [6]. Spark Core is the main engine on which all the other Spark components are built.
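As a brief illustration of the two data abstractions mentioned above, the sketch below (names and values are illustrative, not taken from the project code) creates an RDD, transforms it, and exposes the same data as a DataFrame:

```scala
import org.apache.spark.sql.SparkSession

object RddDataFrameSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("RddDataFrameSketch").getOrCreate()
    import spark.implicits._

    // RDD: a distributed, read-only collection that Spark partitions across the cluster.
    val words = spark.sparkContext.parallelize(Seq("iphone", "samsung", "iphone", "xbox"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)
    counts.collect().foreach(println)

    // DataFrame: the same data with a tabular structure and named columns.
    val df = counts.toDF("keyword", "count")
    df.show()

    spark.stop()
  }
}
```

The main runtime components that execute such a program on a cluster are described next.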
• Driver Program
The driver program [7] manages the allocation of tasks to the worker nodes. It is also called the master. The driver listens to the worker nodes for possible incoming messages. It also keeps tasks isolated so there is no data leak between them, and it prioritizes managing the jobs submitted to the Spark cluster.
• Cluster Manager
The cluster manager [7] distributes tasks to the various worker nodes. It acts as the intermediary between the driver program and the worker nodes and manages the cluster when it is distributed. The cluster manager handles requests when either the driver or a worker node is busy, which helps make Spark fault tolerant.
• Worker Node
Each worker node [7] has an executor which can perform many tasks. Every worker has a configurable amount of cache memory allocated. Every executor isolates its tasks so there is no memory leak between the multiple tasks submitted; the only way to share data across tasks is through Spark's shared variables (broadcast variables and accumulators).
• Executor
An executor [7] is the process initiated for the execution of an application. Each application running on the cluster has its own executors. The executor is responsible for keeping the data and doing input/output operations for the application.
• Task
A task is the unit of work sent to one executor. A Spark cluster can be configured with various parameter settings: the number of executors, worker nodes, memory allocation, cache memory, worker cores, executor cores, etc.
Spark Streaming is an add-on package to the core Spark API which enables scalable, high-throughput, fault-tolerant processing of live data streams [8]. For streaming, Apache Spark can use Flume, HDFS, Apache Kafka, Twitter and Kinesis as data sources. This data can then be cleaned and structured in Spark itself and used for further processing.
Spark Core has an SQL extension which supports further optimization on datasets (RDDs) in a structured format, allowing data to be retrieved using SQL queries. Spark SQL provides a convenient way to perform several transformations on the data. Spark SQL [9] uses the DataFrame and Dataset abstractions together with a query optimizer to run these queries efficiently.
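A minimal sketch of this usage (the column names and values here are purely illustrative) might look like:

```scala
import org.apache.spark.sql.SparkSession

object SparkSqlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SparkSqlSketch").getOrCreate()
    import spark.implicits._

    // A small DataFrame standing in for the tweet table (illustrative columns).
    val tweets = Seq(
      ("iphone", "POSITIVE", "New York"),
      ("samsung", "NEGATIVE", "Seattle"),
      ("iphone", "NEUTRAL", "New York")
    ).toDF("keyword", "sentiment", "location")

    // Register the DataFrame as a temporary view and query it with plain SQL.
    tweets.createOrReplaceTempView("tweets")
    spark.sql(
      "SELECT keyword, sentiment, COUNT(*) AS n FROM tweets GROUP BY keyword, sentiment"
    ).show()

    spark.stop()
  }
}
```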
Spark MLlib [10] provides a wide range of advanced machine learning libraries. Spark has two flavors of these libraries, one operating on RDDs and one on DataFrames. Spark can process large amounts of data and run advanced machine learning algorithms on it.
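Although this project ultimately scores sentiment with an external NLP library (Chapter 4), the following hedged sketch shows how a simple DataFrame-based MLlib pipeline (with illustrative, hand-labelled examples) could train a Naïve Bayes text classifier:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object MlLibSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("MlLibSketch").getOrCreate()
    import spark.implicits._

    // Tiny hand-labelled training set (1.0 = positive, 0.0 = negative); illustrative only.
    val training = Seq(
      (1.0, "love the new phone camera"),
      (1.0, "great battery life"),
      (0.0, "screen keeps freezing"),
      (0.0, "worst update ever")
    ).toDF("label", "text")

    // Tokenize the text, hash tokens into term-frequency vectors, then fit Naive Bayes.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val nb = new NaiveBayes()
    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, nb)).fit(training)

    // Score unseen text.
    val test = Seq("battery is great").toDF("text")
    model.transform(test).select("text", "prediction").show()

    spark.stop()
  }
}
```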
2.3 SBT
“SBT is an open source build tool for Scala and Java projects.” [11]. It is tool for
compiling Scala code and integrating various Scala frameworks. SBT builds a jar file of
the project which includes all the dependencies needed for project to run. This jar then
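A hedged sketch of what such a build definition might contain is shown below; the library versions and artifact choices are assumptions for illustration, not the project's actual build file:

```scala
// build.sbt (sketch); versions below are assumptions, not the project's exact ones.
name := "twitter-spark-analysis"
version := "1.0"
scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  // Core Spark libraries are "provided" because the cluster already ships them.
  "org.apache.spark" %% "spark-core"      % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-sql"       % "2.0.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.0.1" % "provided",
  // Twitter receiver for Spark Streaming (published under Apache Bahir for Spark 2.x).
  "org.apache.bahir" %% "spark-streaming-twitter" % "2.0.1",
  // MySQL JDBC driver used when writing DataFrames to the database.
  "mysql" % "mysql-connector-java" % "5.1.40"
)
```

The single fat jar described in Chapter 5 is typically produced with the sbt-assembly plugin, although the exact plugin configuration is not reproduced here.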
2.4 Tableau
Tableau [12] is data visualization software. Data visualization is one of the important
features of Big Data. Visualization views can project data in different dimensions which
can make an impact. Tableau is a powerful tool which can project data in many formats
and graphs. It has the power to extract data from various data sources as well as stream
data from data sources to provide a live view. Tableau has custom settings for color, toot-
tips, text representation, scale, etc. It has various versions which can be accessed online
Chapter 3
SYSTEM DESIGN
I have set up a Spark cluster in standalone mode on my laptop, which has 8 GB of memory (RAM), 512 GB of storage and an Intel i5 processor. This Spark cluster is built using Maven, which is included in the Spark package. Below are the Spark environment settings used.
Here I have set the Spark worker instances to 4, which launches 4 worker instances as soon as the Spark cluster is started. The executor instances are set to 5, and the workers are allowed up to 8 GB of memory, i.e. the Spark cluster can use the total amount of memory available on the system. The Spark executor and worker cores are set to 6 and 10 respectively, which controls how many tasks can run concurrently.
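The worker-level settings above live in Spark's standalone configuration (conf/spark-env.sh), while per-application resources can also be requested programmatically. A hedged sketch of the application-side configuration, using standard Spark property keys and illustrative values, is:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object ClusterConfigSketch {
  def main(args: Array[String]): Unit = {
    // Application-side resource requests; the standalone master honors these
    // within the limits configured for its workers in spark-env.sh.
    val conf = new SparkConf()
      .setAppName("TwitterAnalysis")
      .set("spark.executor.instances", "5") // requested executor count (primarily honored on YARN)
      .set("spark.executor.cores", "6")     // cores per executor
      .set("spark.executor.memory", "2g")   // memory per executor (illustrative value)
      .set("spark.cores.max", "10")         // cap on total cores for this application

    val spark = SparkSession.builder().config(conf).getOrCreate()
    println("Spark application started with id " + spark.sparkContext.applicationId)
    spark.stop()
  }
}
```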
Chapter 4
IMPLEMENTATION
We use Spark Streaming to stream live Twitter data. To load Twitter data into Apache Spark, Twitter provides an interface for developers which can be used to access Twitter data. After registering an application on the Twitter developer portal, I have noted the Twitter tokens into TwitterKeys.txt; these are needed to initialize the Spark streaming context.
The streaming data is captured in batches. The interval for downloading batches can be set; I have set the interval to 10 seconds, so every 10 seconds a new batch of tweets is downloaded. We set a filter which contains the set of keywords used to filter the tweets; only tweets matching these keywords are collected.
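A hedged sketch of this setup is shown below; the property names follow the spark-streaming-twitter/twitter4j integration, while the token placeholders and the filter list are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

object TwitterStreamSketch {
  def main(args: Array[String]): Unit = {
    // OAuth tokens (read from TwitterKeys.txt in the project; the values here are placeholders).
    System.setProperty("twitter4j.oauth.consumerKey", "<consumerKey>")
    System.setProperty("twitter4j.oauth.consumerSecret", "<consumerSecret>")
    System.setProperty("twitter4j.oauth.accessToken", "<accessToken>")
    System.setProperty("twitter4j.oauth.accessTokenSecret", "<accessTokenSecret>")

    val conf = new SparkConf().setAppName("TwitterStreamSketch")
    // 10-second batch interval, matching the setting described above.
    val ssc = new StreamingContext(conf, Seconds(10))

    // Only tweets containing these keywords are delivered by the stream.
    val filters = Seq("iphone", "samsung", "xbox", "playstation", "nintendo")
    val stream = TwitterUtils.createStream(ssc, None, filters)

    // Print the text of a few tweets from each batch.
    stream.map(status => status.getText).print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```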
The foremost and initial stage in data mining is to perceive and recognize the problem and to note down the objectives [14]. We have equipped ourselves with the domain knowledge needed to carry out the analysis with effectiveness and potency. We have thoroughly understood the sources and types of related data, and we have gathered and collected some useful data. We are focusing on tweets about consumer products and trending topics.
The data we receive from Twitter streaming is in JSON format. This data needs to be understood before using it; selecting the fields from this data is one of the important aspects. We use the twitter4j library, on which the TwitterUtils library is built, for getting specific fields out of the data. The raw JSON for a tweet contains a large number of fields, including the tweet text, id, timestamps, user profile, location and hashtags.
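The stream delivers twitter4j Status objects, so individual fields can be pulled out with its accessors. A minimal sketch follows; the chosen fields and the record type are illustrative of what the project extracts, not its exact schema:

```scala
import org.apache.spark.streaming.dstream.DStream
import twitter4j.Status

// Illustrative record holding the fields we keep from each tweet.
case class TweetRecord(
  id: Long,
  text: String,
  user: String,
  location: String,
  followers: Int,
  createdAt: java.util.Date
)

object FieldExtractionSketch {
  // Map raw twitter4j Status objects to the fields we care about.
  def extract(stream: DStream[Status]): DStream[TweetRecord] =
    stream.map { s =>
      TweetRecord(
        id = s.getId,
        text = s.getText,
        user = s.getUser.getScreenName,
        location = Option(s.getUser.getLocation).getOrElse(""),
        followers = s.getUser.getFollowersCount,
        createdAt = s.getCreatedAt
      )
    }
}
```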
The collected data was noisy, missing useful information and inconsistent [14]. During the data preparation process we have to check for empty or invalid fields before the data is analyzed. To improve the efficiency of our analysis, the data should be in a simple format. Data mining is done on this data, so to get efficient results the data must be processed with transformations like deriving dates out of month, day and year fields, or deriving the age of a record from its timestamps.
For sentiment analysis we use a library based on Stanford CoreNLP [15]. Text input is sent to this library to get the sentiment in return. The library constructs a tree-like structure out of the plain text passed to it; this structure is created after cleaning the data and removing all the stop words.
We call the SentimentAnalysisUtils function with the plain text of the tweet as its parameter. The returned value is saved into a map so that the status id is tagged with its sentiment and can later be joined to the other tweet fields. After we filter the tweet and get the plain text, we pass it to a function which creates a tree out of it to obtain the sentiment score.
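A minimal sketch of such a scoring function is shown below, assuming the Stanford CoreNLP sentiment annotator; the project's actual SentimentAnalysisUtils implementation may differ in details:

```scala
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.JavaConverters._

object SentimentSketch {
  // Build the CoreNLP pipeline once; parsing and sentiment are the expensive annotators.
  private val props = new Properties()
  props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
  private lazy val pipeline = new StanfordCoreNLP(props)

  // Returns an average sentiment score: 0 = very negative ... 4 = very positive.
  def sentimentScore(text: String): Double = {
    val annotation = pipeline.process(text)
    val sentences = annotation.get(classOf[CoreAnnotations.SentencesAnnotation]).asScala
    var weightedSum = 0.0
    var totalLength = 0
    for (sentence <- sentences) {
      val tree = sentence.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree])
      val score = RNNCoreAnnotations.getPredictedClass(tree) // 0..4 for this sentence
      val length = sentence.get(classOf[CoreAnnotations.TextAnnotation]).length
      weightedSum += score * length // weight each sentence by its length
      totalLength += length
    }
    if (totalLength == 0) 2.0 else weightedSum / totalLength
  }
}
```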
The score is computed by calculating an average, i.e. by dividing the sum of the scores by the length of the sentence, since each sub-tree and each word has a score associated with it. The score is then used to derive a weighted sentiment for the status, and the status is mapped to a sentiment label based on this score. Here a score of 0 is very negative, 1 is negative, 2 is neutral, 3 is positive and 4 is very positive.
The model should be reviewed and checked before using it to obtain results, and the results obtained after mining the data should be analyzed meticulously and interpreted carefully [14]. We use our raw data to extract the relevant information. The extracted information is saved into case classes and then into DataFrames. The DataFrame is created using the structure we need.
The extracted fields are modeled as a case class. When constructing a DataFrame from a schema string, the data type of each column needs to be specified. A DataFrame looks like a table; when we use the show() method on a DataFrame, it is printed as a table with its column headers.
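A hedged sketch of this step is shown below; the field names are illustrative, not the project's exact schema:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Illustrative case class describing one analyzed tweet.
case class AnalyzedTweet(
  id: Long,
  keyword: String,
  text: String,
  location: String,
  sentiment: String,
  followers: Int
)

object DataFrameSketch {
  // Turn an RDD of case-class records into a DataFrame whose columns follow the case class fields.
  def toTweetDF(spark: SparkSession, rows: RDD[AnalyzedTweet]): DataFrame = {
    import spark.implicits._
    val df = rows.toDF()
    df.show(5) // prints the first rows as a table
    df
  }
}
```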
I am using MySQL to store the collected data. MySQL is widely used and is very efficient at storing tabular data, and it is easy to store data into MySQL using a JDBC connection. The connection needs a driver, which then makes a connection to the database and pushes the data.
Every time a new DataFrame is created, it gets pushed and appended to the existing table in the database. The driver needs the MySQL server URL and password along with the table and database name to store the data. In order for the writes to be isolated, we also need to set the transaction isolation level on the connection.
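A minimal sketch of this write path follows; the URL, credentials and table name are placeholders, not the project's actual configuration:

```scala
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SaveMode}

object MySqlWriterSketch {
  // Append a DataFrame of analyzed tweets to a MySQL table over JDBC.
  def saveToMySql(df: DataFrame): Unit = {
    val url = "jdbc:mysql://localhost:3306/twitter_db" // placeholder database URL
    val props = new Properties()
    props.setProperty("user", "root")                  // placeholder credentials
    props.setProperty("password", "<password>")
    props.setProperty("driver", "com.mysql.jdbc.Driver")
    props.setProperty("isolationLevel", "READ_COMMITTED") // write isolation (option name in recent Spark versions)

    // Each new batch/DataFrame is appended to the existing table.
    df.write.mode(SaveMode.Append).jdbc(url, "tweets", props)
  }
}
```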
Chapter 5
EXECUTION
Launching the project on the Spark cluster needs a jar file. I use SBT as the package builder for generating a .JAR of the project; SBT is widely used for building Scala and Java projects. SBT downloads all the dependencies needed for execution and bundles them into the JAR. This jar is called a FAT JAR or UBER JAR because it contains all the code segments and dependencies.
The Spark cluster needs to be started before submitting any job to it, and all worker nodes have to be initiated while starting the cluster. Spark ships with shell scripts for starting the master, starting workers, stopping the master, stopping workers, starting all, stopping all, etc.
5.3 Tableau
Tableau is set up with the database as its data source [12]. Tableau has two connection modes: live and extract.
One of the most common dilemmas for people is the comparison between the Apple iPhone and Samsung (Android) phones [17]. The application filters the live tweets based on the filters set. After extracting all the features from the clean data, sentiment analysis is performed on the data set. The comparison is made based on the sentiment of each tweet, its location and other factors, and gives us a brief idea of how, what and where people think about these products (Apple and Samsung).
The questions considered include, for example: which is the more popular phone, and are there any trend changes after the recent changes in models?
The observations below are based upon data collected on different dates and on the total dataset collected for this use case.
Figure 13: Density, Sentiment and Location Graph for iPhone vs Samsung (New York, USA)
Figure 14: Density, Sentiment and Location Graph for iPhone vs Samsung (Washington, USA)
Another confusing task, found especially amongst the younger generation, is game and console selection. There are many game consoles available in today's world, but the Sony PlayStation, Microsoft Xbox and Nintendo consoles are the market leaders. All of them have their own follower base and come with their own specialties. To address this dilemma, I picked this up as a second use case. The total size of the dataset collected for this use case is more than 200,000 tweets.
FUTURE WORK
The world is currently moving towards mobile computing. Mobile applications have become an important factor for a product's success, and it is very convenient to access mobile applications on the go. In the future, a mobile application could be developed which would give instant reviews as soon as you set the filters you desire. The challenge for mobile computing would be optimizing access to the cluster from such a small device with minimal resources.
Chapter 8
CONCLUSION
I have learnt a lot of technologies while developing this project, and the problems I faced made me think deeper and out of the box. The vibrant and multidimensional visualization would help consumers select a better product and avail themselves of the benefits of a real-time review system based on location, public view and trend. It will also benefit businesses: based on the feedback they get, there would be a huge scope for development of their product or service.
During the course of implementation, I have gained in-depth knowledge in the field of big data analytics. I got to learn concepts like cluster computing on Apache Spark, which is a very powerful and trending platform for cluster computing; a language like Scala, which is object-oriented as well as functional, provides a lot more flexibility for writing optimized code in fewer lines, and integrates very well with the Spark cluster; and tools like SBT and Tableau, which are used widely across technology companies.
Last but not least, I have learned how to overcome problems and develop a product that would help people in their lives. The experience has made me technically stronger and more confident.
Bibliography
Forbes. 12 Big Data Definitions: What's Yours? [Online]. Available: http://www.forbes.com/sites/gilpress/2014/09/03/12-big-data-definitions-whats-
[3] ResearchGate. The Five V's of Big Data. [Online]. Available: https://www.researchgate.net/figure/281404634_fig1_Figure-1-The-five-V's-of-Big-Data-Adapted-from-IBM-big-data-platform-Bringing-big. Accessed in September 2015.
SlideShare. What is Big Data and its Characteristics. [Online]. Available: http://www.slideshare.net/venturehire/what-is-big-data-and-its-characteristics. Accessed in 2013.
[8] Spark. Spark Streaming Programming Guide. [Online]. Available: http://spark.apache.org/docs/2.0.1/streaming-programming-guide.html. Accessed in October 2016.
[9] Spark. Spark SQL, DataFrames and Datasets Guide. [Online]. Available: http://spark.apache.org/docs/2.0.1/sql-programming-guide.html. Accessed in October 2016.
[10] Spark. Machine Learning Library (MLlib) Guide. [Online]. Available: http://spark.apache.org/docs/2.0.1/ml-guide.html. Accessed in October 2016.
[14] Big Sky Associates. The Data Analysis Process. [Online]. Available: http://www.bigskyassociates.com/blog/bid/372186/The-Data-Analysis-Process-