Mtech Thesis 2020-22


A thesis on

TWITTER SENTIMENT ANALYSIS ARTICLE 370


Submitted in Partial Fulfilment of the Requirements for the Degree of

MASTER OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
by
Gurpreet Kaur Grewal
(200060705004)
Under the supervision of

Mrs. Supriya Shukla


(Assistant Professor)

To the
College Of Engineering Roorkee (COER), Roorkee

Veer Madho Singh Bhandari Uttarakhand Technical


University, Uttarakhand-248001

October, 2022
CANDIDATE’S DECLARATION
I hereby declare that the work presented in the thesis entitled
“Sentiment Analysis of Article 370”, submitted by me in partial fulfilment of the
requirements for the award of the degree of Master of Technology (M. Tech.) in the
Department of Computer Science & Engineering, Uttarakhand Technical
University, is an authentic record of my own work carried out under the supervision
of Prof. Mrs. Supriya Shukla, Department of Computer Science and
Engineering, College Of Engineering Roorkee.

Date:
Gurpreet Kaur Grewal
M. Tech(CSE)
Enrolment No.: 200060705004
College Of Engineering Roorkee, Roorkee

Approved By:
Dr. Taresh Singh
Head Of Department
(Computer Science & Engineering)
College Of Engineering Roorkee, Roorkee

CERTIFICATE
I hereby submit that the work presented in the thesis entitled
“Sentiment Analysis of Article 370”, in fulfilment of the requirements for the award of
the degree of Master of Technology in Computer Science, is a record of my own work
carried out under the supervision of Mrs. Supriya Shukla.

Prof. Mrs. Supriya Shukla


College Of Engineering Roorkee, Roorkee

1.3 ABSTRACT
After many days and weeks of deliberation on the situation in Jammu & Kashmir, the
Narendra Modi government finally made its decision public. Our respected Home Minister,
Mr. Amit Shah, announced in Parliament that the special status granted to Jammu &
Kashmir under Article 370 would be revoked.
A great deal of activity followed, especially on Twitter, where people share their views
and opinions. In this thesis I elaborate on how we can analyse what people are sharing on
Twitter about a particular topic. On the basis of Twitter data we can report on the
positive and negative reactions of people and on their thinking. Using this technique, we
can better understand whether a decision is perceived as good or bad.
Python is a simple yet powerful, high-level, interpreted and dynamic programming language,
which is well suited to processing natural language data.
The goal of this thesis is to classify Twitter data into positive or negative comments by
using different supervised machine learning classifiers on data collected for different
Indian political parties, and to determine which political party the public perceives as
performing best.

ACKNOWLEDGEMENT
I would like to thank my guide, Mrs. Supriya Shukla, Assistant Professor, Computer Science
and Engineering Department, College Of Engineering Roorkee, Roorkee, for helping me
complete my work and submit my thesis. I am very thankful to Dr. Taresh, HOD,
Computer Science Engineering Department, College Of Engineering Roorkee, Roorkee, for
setting good standards for his students and encouraging them from time to time.

Last but not least, I would like to thank my parents for their years of unyielding love and
encouragement. They wanted the best for me and I admire their sacrifice and determination.

Date:
Gurpreet Kaur Grewal
Enrolment No.: 200060705004

DISSERTATION APPROVAL SHEET
This is to certify that the dissertation titled
TWITTER SENTIMENT ANALYSIS ARTICLE 370

By

Gurpreet Kaur Grewal

is approved for the degree of Master of Technology

Guide Name & Signature


(Prof. Mrs. Supriya Shukla, College Of Engineering Roorkee)

Internal Examiner Name & Signature

External Examiner Name & Signature


Date:

Table Of Contents
Declaration
Certificate
Abstract
Acknowledgement
Approval Sheet
Table of Contents
List of Abbreviations

Chapter 1
1.9.5 Applications
1.10 Literature review
1.11 Motivation
1.12 Problem Statement
1.13 Training Data
1.14 Data Storage
1.15 Objective
1.16 Implementation Details
1.17 Classifier Accuracy for Training Data
1.18 Conclusion
1.19 Future Scope

LIST OF ABBREVIATIONS
NLTK: Natural Language Toolkit
NLP: Natural Language Processing
NB: Naïve Bayes
SVM: Support Vector Machines
MAP: Maximum A Posteriori
BJP: Bharatiya Janata Party
AAP: Aam Aadmi Party
INC: Indian National Congress
API: Application Programming Interface

LIST OF FIGURES
Figure 1: Hyper-plane in SVM
Figure 2: Applications
Figure 3: Twitter Analysis
Figure 4: Positive tweets of BJP in different Indian states
Figure 5: Example of reactions
Figure 6: Process to classify tweets using the built classifier
Figure 7: Code of Execution
Figure 8: Reviews
Figure 9: Data Storage for import
Figure 10: Code for extracting features from tweets
Figure 11: Classifiers accuracy for training data
Figure 12: Sentiment Analysis for BJP, AAP and INC in 2016

CHAPTER 1
INTRODUCTION
This chapter introduces Sentiment Analysis, Python and the Natural Language Toolkit
(NLTK). We then turn to the objective of the thesis, the need for sentiment analysis and
the applications of sentiment analysis in daily life.
1.9.1 Introduction to Sentiment Analysis
Sentiment Analysis is the process of collecting and analysing data that expresses
personal feelings, reviews and thoughts. Sentiment analysis is often called
opinion mining because it mines the important features from people's opinions.
It is performed using machine learning techniques, such as statistical models,
together with Natural Language Processing (NLP), to extract features from
large volumes of data.

Twitter is a microblogging platform where anyone can read or write short
messages, which are called tweets. The quantity of data gathered on Twitter is very
large. This data is unstructured and written in natural language. Twitter
Sentiment Analysis is the process of accessing tweets on a particular topic and
predicting the emotions they express.

1.9.2 Introduction to Python


Python is a high-level, dynamic programming language which is used extensively in
this thesis. It is an interpreted language, which makes testing and debugging quick
as there is no separate compilation step. Many open-source libraries are
available for Python, and it has a large community of users.
Python is a simple yet powerful language which is
well known for its support for processing natural language data, such as
English text, using NLTK. Other high-level programming languages such as ‘R’ and
‘MATLAB’ were also considered because they offer advantages such as
ease of use, but they do not offer the same flexibility.

1.9.3 Introduction to NLTK


Natural Language Toolkit (NLTK) is a library for Python which provides a base for
building text-processing programs and classifying data. NLTK is a collection of
resources that can be used for text processing, classification and tagging. This
toolkit plays a vital role in converting the text of tweets into a format from which
sentiment can be extracted. NLTK supports various machine learning
algorithms which are used for training classifiers and for calculating the accuracy of
different classifiers. In this thesis we use Python as our base programming language,
and NLTK plays a crucial role in converting natural language text into a sentiment,
either positive or negative. NLTK also provides various datasets which are prepared
for training classifiers. These datasets are structured and stored in the NLTK library,
and can be accessed easily from Python. NLTK also provides various
functions for data pre-processing, so that the data collected from
Twitter becomes fit for mining and feature extraction.

1.9.4 SVC (Support Vector Classifier)


SVMs are supervised machine learning techniques which are used for classification,
regression and outlier detection. SVMs are effective in high-dimensional
spaces. SVCs are capable of multi-class classification. SVC and Nu SVC are similar
kernel-based methods, whereas Linear SVC is a separate implementation restricted to
a linear kernel.
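
As a rough illustration of the separating hyper-plane idea, the sketch below classifies points with a linear decision function, sign(w·x + b). The weight vector, bias and sample points are made-up values chosen for illustration; they are not parameters learned in this thesis, and a real SVM would learn them from training data.

```python
def decision_value(weights, bias, point):
    """Signed score w.x + b; its sign tells which side of the hyper-plane the point lies on."""
    return sum(w * x for w, x in zip(weights, point)) + bias


def classify(weights, bias, point):
    """Return 'pos' for points on or above the hyper-plane, 'neg' otherwise."""
    return "pos" if decision_value(weights, bias, point) >= 0 else "neg"


# Hypothetical 2-D hyper-plane x1 + x2 - 1 = 0 (illustrative values only).
WEIGHTS = [1.0, 1.0]
BIAS = -1.0

labels = [classify(WEIGHTS, BIAS, p) for p in [(2.0, 2.0), (0.0, 0.0)]]
```

Here `labels` holds one class per sample point; the first point lies on the positive side of the assumed hyper-plane and the second on the negative side.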

Figure 1: Hyper-plane in SVM

Sentiment Analysis has various applications. Sentiment analysis is domain-centred, i.e.
results obtained in one domain cannot be applied directly to another domain. It is used in
many real-life situations: to gather reviews of products or movies, to assess the financial
reports of a company, and for predictions or marketing. It is used to derive the opinions of
people on social media by analysing the feelings or thoughts which they express in the form
of text.

Figure 2: Applications

1.9.5 APPLICATIONS
• Customer support: It helps in knowing whether a decision is good or bad, so that
further analysis can be done. It also helps the government to take future decisions
according to people's demands.
• Hospital
• Bus service
• Movie review
• Hotel review: It helps a hotel learn, from the reviews of people who stayed there,
whether its services were good or bad, so that future services can be improved.
• Company product review

Figure 3: Twitter Analysis

Much research has been done on the subject of sentiment analysis in
the past. Most research on sentiment analysis relies on machine learning
algorithms, whose focus is to find whether a given text is in favour or against. Recent
research in this area applies sentiment analysis to user-generated data from
websites such as Facebook, Twitter and Amazon.

1.10 LITERATURE REVIEW
1. Social media platforms make it possible to learn about people's decisions and to draw
out their emotions; it has been explained how Twitter offers an advantage in politics
during elections. The concept of the hashtag is also used for classification of text, as
it expresses emotions in words.
2. This approach reduces the number of tweets in the training set, which is then
passed to Support Vector Machine and Naïve Bayes
classification algorithms to determine the polarity of tweets.
3. A multistage classification approach was used, in which an entity classification stage
sorts general tweets by individual candidate for a good comparison.
4. The common approach found in almost all the related research
consists of data collection using the Twitter API, pre-processing of data, filtering of
data and so on.

Many researchers have proposed systems based on different locations.
According to them, sentiment analysis is carried out using Natural Language Processing
(NLP) and machine learning algorithms. A tweet carries a location field
which can be easily accessed by a script, and therefore tweets from a particular
location can be gathered to identify patterns and sequences. They studied many
applications of location-based sentiment analysis, using a data source from which data for
different locations can be extracted very easily.

Figure 4: Positive tweets of BJP in different Indian states


1.11 MOTIVATION
Sentiment Analysis focuses on identifying whether a given piece of text is objective or
subjective; if it is subjective, it is further categorized as negative or positive.
The motivation for sentiment analysis is two-fold: both consumers and producers highly
value customers' opinions about products and services.

Figure 5: Example of reactions

1.12 PROBLEM STATEMENT
Sentiment analysis is very necessary in today's world, as people are always affected by the
thinking and opinions of other people. The outcome of sentiment analysis is a classification
of natural text into classes such as ‘+’, ‘-’ and neutral. In today's world, if anyone wants
to buy a product, watch a movie or cast a vote, that person would first want to know
other people's reviews, reactions and opinions about that product, movie or candidate on
social media websites like Twitter, Facebook, Tumblr, etc.
The main objective of the thesis is to perform sentiment analysis on Indian political
parties like BJP, INC and AAP, so that people's opinions about these parties' progress,
workers, policies, etc. are monitored.
The methodology consists of the following steps:
• A thorough study of existing approaches and techniques in the field of sentiment analysis.
• Collection of relevant data from Twitter with the help of the Twitter API.
• Pre-processing of the data collected from Twitter so that it is fit for mining.
• Building a classifier based on different supervised machine learning techniques.
• Training and testing the built classifier using large datasets.
• Computing the results of the different classifiers on the dataset collected from Twitter.
• Comparing the results of the classifiers and plotting a graph that shows the trend of ‘+’
and ‘-’ sentiment for the various political parties.

Figure 6: Process to classify tweets using the built classifier


In this script we use all the keys which we obtained for the API. To collect data, we first
set up the ‘OAuth’ protocol. OAuth is a standard authorization protocol. It allows a user to
log in to third-party websites using a social network account without exposing the password.
OAuth thus provides security and authorization to the user.
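
A minimal sketch of how the keys might be handled before any request is made is given below. It only reads the credentials and checks that none is missing; it does not contact Twitter. The environment variable names and the `load_credentials` helper are hypothetical, and the use of four values (consumer key, consumer secret, access token, access token secret) is an assumption based on the usual OAuth 1.0a setup rather than on the script in Figure 7.

```python
import os

# Hypothetical environment variable names; the developer portal only supplies the values.
REQUIRED_KEYS = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_TOKEN_SECRET",
)


def load_credentials(env=None):
    """Collect the OAuth credentials, failing early if any of them is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_KEYS if not env.get(name)]
    if missing:
        raise KeyError("missing credentials: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_KEYS}
```

Keeping the keys outside the script in this way means the script can be shared without exposing them.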

Figure 7: Code Of Execution

1.13 TRAINING DATA


The other data gathered for this thesis is the training data. This data is used to
train the classifiers which we are going to build. To gather this data, we use the Python
library NLTK. NLTK contains corpora, which are very large, structured sets
of text files used to perform analysis. These corpora contain
many types of files, such as quotes, reviews, chats, history, etc. From these corpora we
select the movie review files for training. A sample of these reviews is shown
in Figure 8.

Figure 8: Reviews

1.14 DATA STORAGE


Once we start collecting data from the Twitter API, our next step is to store
that data so that we can use it for sentiment analysis. We ran our scripts for one month
and collected tweets about the different political parties. Every time we ran the script
described in the figure, a .csv (comma-separated values) file was generated containing the
tweets extracted from the Twitter API. We use the .csv format for our collected data
files because the data consists of many fields. CSV separates each field with a
comma, which makes it easy to access the field that contains the tweet text. CSV files
also give comparatively fast read/write times. We create separate directories to keep
the tweets of the different political parties for each month. We keep the files on our hard
drive, from where they can be easily imported into our scripts for further
analysis. Once the tweets are collected and stored, we have to pre-process the stored data
before applying it to a classifier, because the data we collect from the API is not fit
for mining as it stands. Therefore, pre-processing the data is our next step.
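
The storage step can be sketched with Python's standard csv module as follows. The column names, the sample rows and the helper names `save_tweets` and `load_texts` are assumptions made for illustration; the actual fields depend on what the collection script extracts from the API.

```python
import csv
import os
import tempfile

FIELDS = ["created_at", "user", "text"]  # assumed column layout


def save_tweets(path, tweets):
    """Write a list of tweet dictionaries to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(tweets)


def load_texts(path):
    """Read back only the text field, which is what the classifier needs."""
    with open(path, newline="", encoding="utf-8") as handle:
        return [row["text"] for row in csv.DictReader(handle)]


# Made-up sample rows; real rows would come from the collection script.
sample = [
    {"created_at": "2016-04-01", "user": "user_a", "text": "Great rally today, well done"},
    {"created_at": "2016-04-02", "user": "user_b", "text": "Bad policy, very disappointed"},
]
csv_path = os.path.join(tempfile.mkdtemp(), "tweets.csv")
save_tweets(csv_path, sample)
texts = load_texts(csv_path)
```

Because the csv module quotes fields that contain commas, tweet text with commas in it is read back unchanged.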

Figure 9: Data Storage for import

We write code in Python in which we define a function for each of the following cleaning
steps:

* Remove quotes - lets the user remove quotation marks from the text
* Remove @ - offers the choice of deleting the @ symbol, removing the @ together with the
user name, or replacing the @ and the user name with the word 'AT_USER' and adding it to
the stop words
* Remove URL (Uniform Resource Locator) - gives the choice of deleting URLs or replacing
them with the word 'URL' and adding it to the stop words
* Remove RT (Re-Tweet) - deletes the word RT from all the tweets
* Remove emoticons - deletes emoticons from tweets and replaces them with their specific
meaning
* Delete duplicates - removes all repeated words from the text so that there are no
duplicates
* Remove # - removes the hash tag
* Remove stop words - removes all stop words like a, he, the, etc., which
carry no meaning for classification
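
The cleaning steps listed above can be sketched as a single function using regular expressions. This is a simplified illustration, not the code used in the thesis: the stop-word set is a tiny hand-picked sample, the function name `clean_tweet` is hypothetical, and emoticon replacement is left out.

```python
import re

# Small illustrative stop-word set; a fuller list would be used in practice.
STOP_WORDS = {"a", "an", "the", "he", "she", "is", "at_user", "url"}


def clean_tweet(text):
    """Apply the cleaning steps in order and return the remaining words."""
    text = re.sub(r"https?://\S+|www\.\S+", " URL ", text)  # replace links
    text = re.sub(r"@\w+", " AT_USER ", text)               # replace mentions
    text = re.sub(r"\bRT\b", " ", text)                     # drop the retweet marker
    text = re.sub(r"#(\w+)", r"\1", text)                   # keep the hashtag word, drop '#'
    text = re.sub(r"[\"']", "", text)                       # remove quotes
    words = re.findall(r"[a-z0-9_]+", text.lower())         # lower-case and tokenise
    seen, result = set(), []
    for word in words:
        if word in STOP_WORDS or word in seen:
            continue  # skip stop words and repeated words
        seen.add(word)
        result.append(word)
    return result


cleaned = clean_tweet('RT @someone: "The" new policy is good good! #Article370 https://t.co/abc')
```

The placeholder words AT_USER and URL are listed among the stop words in lower case, so they are dropped after the text has been lower-cased.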

Table: Removed and modified content

1.15 OBJECTIVE
1) The main objective of sentiment analysis in this thesis is to gather the feedback
of people on the government's decision regarding Article 370.
2) Our basic motive is to analyse whether the decision has followed the
NICE principle, which is:
N = Need, I = Interest, C = Concern, E = Expectation
Implementation details
The steps for implementation of sentiment analysis are:
• Load Twitter API
• Load word dictionaries
• Search Twitter feeds
• Define text cleaning functions
• Clean and split Twitter feeds
• Analyse Twitter feeds
• Plot high-frequency negative or positive words

In order to build our classifier, we use seven built-in classifiers, which are:

1. Naïve Bayes Classifier
2. Multinomial Naïve Bayes Classifier
3. Bernoulli Naïve Bayes Classifier
4. Logistic Regression Classifier
5. SGD Classifier (SGDC)
6. Linear SVC
7. Nu SVC
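
To give a feel for the first classifier in the list, the sketch below implements a very small multinomial Naïve Bayes with add-one smoothing in plain Python. It is an illustrative stand-in for the library classifiers, not the implementation used in the thesis, and the training sentences are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count documents per class and word occurrences per class from (words, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the label with the highest log-posterior (the MAP decision)."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in words:
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up training set, for illustration only.
TRAINING = [
    (["good", "great", "decision"], "pos"),
    (["happy", "good", "support"], "pos"),
    (["bad", "terrible", "decision"], "neg"),
    (["sad", "bad", "oppose"], "neg"),
]
model = train_naive_bayes(TRAINING)
```

Working in log space keeps the product of many small probabilities from underflowing when tweets contain many words.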

Figure 10: Code for extracting features from tweets
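
One common way to turn a cleaned tweet into classifier input is a dictionary of word-presence features over the most frequent words. The sketch below shows this idea under that assumption; it is not a reproduction of the code in Figure 10, and the function names and sample documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents, size):
    """Pick the `size` most frequent words across all tokenised documents."""
    counts = Counter(word for doc in documents for word in doc)
    return [word for word, _ in counts.most_common(size)]


def extract_features(words, vocabulary):
    """Map a tokenised tweet to {word: True/False} presence features."""
    present = set(words)
    return {word: word in present for word in vocabulary}


# Made-up tokenised documents, for illustration only.
docs = [["good", "decision"], ["bad", "decision"], ["good", "move"]]
vocab = build_vocabulary(docs, 2)
features = extract_features(["good", "day"], vocab)
```

Each tweet is thus reduced to a fixed set of keys, which is the shape of input that dictionary-based classifiers expect.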

1.16 STEPS IN DETAIL
1) Load Twitter API
The first step is to register on the Twitter application developer portal and obtain
authorization.
You need: Consumer Key
Consumer Secret Key
Access Key
2) Load word dictionaries
The next step is to load the lists of positive and negative sentiment words into the
working directory. The words are then assigned to variables as positive or negative.
3) Search Twitter feeds
The following step is to define the Twitter search and assign it to a variable. The number
of tweets to be extracted is assigned to another variable. The time taken to perform
the Twitter search and extraction depends on this number. A slow internet
connection or a complex query also adds to the time taken.
4) Getting text from feed
A tweet consists of a large number of extra fields and data. We use the getText()
function to extract the text field. The function is applied to each of the
tweets:
sustaintweetT=lapply(tweet,function(t)t$getText())
5) Defining text cleaning functions
In this step, we write a function which executes all the commands to clean the text,
remove punctuation, special characters, etc. The function also converts upper-case
characters to lower case using the tolower() command. To make this robust, we
write an error-catching function and embed it in the text-cleaning
function.
6) Cleaning and splitting Twitter feeds
In this step we split the tweets into individual words, and the resulting feeds are stored
in a list object called sentiment analysis.
7) Analysing Twitter feeds
Here we get to the actual task of analysing the feeds. We compare
the stored tweet text with the word dictionaries and retrieve all the matching words.
To do this, we first define a function that counts all the positive and
negative words that match our dictionaries.
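
The counting function described in step 7 can be sketched in Python as follows. The two word sets are tiny made-up samples standing in for the full positive and negative dictionaries, and the function names are hypothetical.

```python
# Tiny illustrative dictionaries; real word lists would be loaded from files.
POSITIVE_WORDS = {"good", "great", "happy", "support", "welcome"}
NEGATIVE_WORDS = {"bad", "sad", "angry", "oppose", "wrong"}


def score_tweet(words):
    """Return (positive matches, negative matches, net score) for a tokenised tweet."""
    positive = sum(1 for word in words if word in POSITIVE_WORDS)
    negative = sum(1 for word in words if word in NEGATIVE_WORDS)
    return positive, negative, positive - negative


def label_tweet(words):
    """Turn the net score into a sentiment label."""
    net = score_tweet(words)[2]
    if net > 0:
        return "positive"
    if net < 0:
        return "negative"
    return "neutral"
```

For example, a tweet tokenised as `["good", "decision", "great", "bad"]` has two positive matches and one negative match, so its net score is 1 and it is labelled positive.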

1.17 RESULTS AND DISCUSSION

Classifier Accuracy for Training Data


Once we run the script, we get the accuracy of each classifier on the movie reviews
training data. The output is shown in Figure 11.

Figure 11: Classifiers accuracy for training data
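
Accuracy here means the fraction of test items for which the predicted label equals the true label. A minimal sketch of that calculation is given below; the label lists are made up for illustration and are not results from this thesis.

```python
def accuracy(predicted, actual):
    """Fraction of items whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


# Made-up labels for illustration; these are not results from the thesis.
ACTUAL = ["pos", "neg", "pos", "neg"]
PREDICTED = ["pos", "neg", "neg", "neg"]
example_accuracy = accuracy(PREDICTED, ACTUAL)
```

Three of the four made-up predictions match, so `example_accuracy` is 0.75.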

Figure 12: Sentiment Analysis for BJP, AAP and INC for April 2016

1.18 CONCLUSION
The thesis helps us to analyse a huge amount of data. The data is gathered
through the Twitter streaming API. The collected data is then analysed and scored, and on
the basis of the score we determine the user's emotions. We can also visualize users'
opinions towards other products in the market by drawing them in the form of a graph, such
as a bar graph.
1.19 FUTURE SCOPE
Some of the future directions for our research work are:
• A parser can be embedded into the system for better results.
• A web-based application can be built on this work in future.
• The system can be improved to deal with sentences that have multiple meanings.
• The number of classification categories can be increased to get better results.
• The work can be extended to multiple languages.