Big Data & Sentiment Analysis Using Python
Publication History
Manuscript Reference No: IRJCS/RS/Vol.07/Issue06/JNCS10089
Received: 05, June 2020
Accepted: 16, June 2020
Published: 26, June 2020
DOI: https://doi.org/10.26562/irjcs.2020.v0706.003
Citation: Akshansh, Aman, Abhhek & Aman (2020). Big Data & Sentiment Analysis Using Python. IRJCS: International Research Journal of Computer Science, Volume VII, 159-166. DOI: https://doi.org/10.26562/irjcs.2020.v0706.003
Peer-review: Double-blind Peer-reviewed
Editor: Dr.A.Arul Lawrence Selvakumar, Chief Editor, IRJCS, AM Publications, India
Copyright: ©2020 This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Abstract: Social media platforms today hold a large amount of information about their users, and extracting that information serves many fields of application and research. In product analysis, mining social media offers several advantages, such as knowledge of the latest technology and real-time awareness of the market situation. One such platform is Twitter, which lets a user post messages (tweets) of a limited number of characters, share them with followers, and allows developers to access this information through its API. In the implemented module, tweets are collected and sentiment analysis is performed on them. Based on the results of the data and sentiment analysis, tips and information can be provided to the user. The module can perform data and sentiment analysis on data from various fields, including consumer opinions and suggestions on various products; these results can help companies stay up to date. In this way, the implemented system can help predict the reception of various products and activities across fields.
Keywords: Big data; Sentiment analysis; Python
I. INTRODUCTION
In the present era, social media and its many applications allow users to express their opinions about a particular topic and show their attitudes by liking or disliking content. Users continuously accumulate actions on social media, generating data of high variety, volume, velocity, value, and variability, termed big social data. This data comprises massive sets of individual opinions that can be processed to understand people's tendencies in the digital world. Many researchers have taken a keen interest in exploiting huge social data to explain, determine, and predict human mindset in several domains. Processing this kind of data involves various research avenues, particularly text analysis. In fact, 85% of online data is text, and the analysis of text data has become a key element for finding the sentiments of the public and their opinions towards content. Sentiment analysis, also called opinion mining, aims to find out the sentiments of users about a topic by analysing their posts and other actions on social media. The polarity of each post is then classified into three categories: positive, negative, and neutral.
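The three-way polarity classification described above can be illustrated with a minimal lexicon-based sketch in Python. The word lists here are hypothetical placeholders, not taken from the paper; a real system would use a full sentiment lexicon.

```python
# Minimal lexicon-based polarity classifier: each known word contributes
# +1 (positive) or -1 (negative); the sign of the total gives the class.
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}

def polarity(text: str) -> str:
    score = 0
    for word in text.lower().split():
        if word in POSITIVE:
            score += 1
        elif word in NEGATIVE:
            score -= 1
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(polarity("I love this great phone"))  # -> positive
```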
© 2014-20, IRJCS - All Rights Reserved
International Research Journal of Computer Science (IRJCS), ISSN: 2393-9842, Issue 06, Volume 07 (June 2020), https://www.irjcs.com/archives
• Opinion and volume (OV)-based approach, in which opinion mining and volume approaches are combined. Finn et al. proposed an approach to measure perceived political polarization without using text; they relied on a co-retweeted network and the retweeting behavior of social media users.
• Emoji-based approach, in which posts are classified based on the emoji they contain. Researchers selected different emoji and grouped them into categories such as happy, sad, fear, laughter, and angry, then took the sentiment of the first emoji in the post.
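The emoji-based rule (first emoji wins) can be sketched as follows; the emoji-to-class mapping here is a tiny hypothetical sample standing in for a full emoji lexicon.

```python
# Hypothetical emoji lexicon grouped into the classes named above
# (happy, sad, fear, laughter, angry).
EMOJI_CLASS = {
    "😀": "happy", "😢": "sad", "😱": "fear",
    "😂": "laughter", "😠": "angry",
}

def first_emoji_sentiment(post: str) -> str:
    # Scan characters in order and return the class of the first known emoji.
    for ch in post:
        if ch in EMOJI_CLASS:
            return EMOJI_CLASS[ch]
    return "neutral"  # no emoji found in the post

print(first_emoji_sentiment("great rally today 😂😢"))  # -> laughter
```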
• Step 3 The purpose of this step is to refine the annotated dictionary into positive posSW(), negative negSW(), and neutral neutSW() dictionaries for each Yi. Classifying neutral hashtags is difficult and could affect the result, so we ignored them during collection. In fact, a tweet that contains a neutral hashtag such as #modi could be either negative or positive. Therefore, we construct the neutral dictionary based on the word occurrence Occ(wj) for every word across the different classes. This allows us to construct the final dictionaries using Algorithm 1.
We conducted an empirical test over a range of threshold values (between 0.5 and 0.8) in order to find the limit that classifies sentiment words with the smallest error rate; in our case, 0.7 was the best value. Finally, we assign a score to each sentiment word: 1, 0, and −1 for positive, neutral, and negative, respectively.
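The idea behind the dictionary refinement (Algorithm 1 is not reproduced in this excerpt) can be sketched as follows: a word joins posSW or negSW only when at least a threshold fraction (0.7 above) of its occurrences fall in that class; otherwise it is treated as neutral. Function and variable names here are illustrative, not the paper's.

```python
from collections import Counter

def build_dictionaries(pos_words, neg_words, threshold=0.7):
    # Count how often each word occurs under positive vs. negative hashtags.
    pos_occ, neg_occ = Counter(pos_words), Counter(neg_words)
    posSW, negSW, neutSW = set(), set(), set()
    for w in set(pos_occ) | set(neg_occ):
        total = pos_occ[w] + neg_occ[w]
        if pos_occ[w] / total >= threshold:
            posSW.add(w)        # dominantly positive occurrences
        elif neg_occ[w] / total >= threshold:
            negSW.add(w)        # dominantly negative occurrences
        else:
            neutSW.add(w)       # mixed usage -> neutral
    return posSW, negSW, neutSW

def score(word, posSW, negSW):
    # Final word scores: 1 positive, -1 negative, 0 neutral.
    return 1 if word in posSW else -1 if word in negSW else 0
```

For example, a word appearing three times under positive hashtags and never under negative ones lands in posSW, while a word split evenly between the two classes is filed as neutral.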
• Step 4 In order to classify tweets, we use the polarity score p(t), computed as the sum of the scores of the n sentiment words in tweet t: p(t) = Σ (k=1..n) score(wk). We classify scored tweets into seven classes according to the polarity degree: C+3, C+2, C+1, C0, C−1, C−2, and C−3, symbolizing highly positive, moderately positive, lightly positive, neutral, lightly negative, moderately negative, and highly negative tweets, respectively. We conducted an empirical test to determine the limit of each class. Case 1: if 0 < p(t) ≤ 3, the tweet is classified as lightly positive. Case 2: if 4 ≤ p(t) ≤ 6, it is moderately positive. Case 3: if p(t) ≥ 7, it is highly positive. Case 4: if −3 ≤ p(t) < 0, it is lightly negative. Case 5: if −6 ≤ p(t) ≤ −4, it is moderately negative. Case 6: if p(t) ≤ −7, it is highly negative. If the sentiment score equals 0, the tweet is classified as neutral.
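The seven-class mapping above translates directly into a small Python function (p is an integer, being a sum of word scores of 1, 0, and −1):

```python
# Map the tweet polarity score p(t) onto the seven classes C+3 ... C-3
# using the thresholds given in Step 4.
def classify(p: int) -> str:
    if p >= 7:
        return "highly positive"      # C+3
    if 4 <= p <= 6:
        return "moderately positive"  # C+2
    if 0 < p <= 3:
        return "lightly positive"     # C+1
    if p == 0:
        return "neutral"              # C0
    if -3 <= p < 0:
        return "lightly negative"     # C-1
    if -6 <= p <= -4:
        return "moderately negative"  # C-2
    return "highly negative"          # C-3 (p <= -7)

print(classify(5))   # -> moderately positive
print(classify(-8))  # -> highly negative
```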
3.3 THIRD STAGE: PREDICTION
Several researchers have considered three classes, i.e. positive, negative, and neutral, to determine the sentiment of a document based on its words and/or emoticons; only a few, such as Khatua et al., have examined the polarity degree (i.e. highly, moderately, and weakly positive and negative classes). However, those authors considered only two indicators, strongly positive and strongly negative.
IV. CLASSIFICATION ACCURACY EVALUATION
To assess the ability to classify tweets based on the automatically constructed dynamic dictionary, we randomly selected a subset of 210 tweets from the political Twitter corpora: 30 for each class. The tweets were manually inspected and labeled as lightly positive, moderately positive, highly positive, lightly negative, moderately negative, highly negative, or neutral for each candidate. Then, the same data was processed as described above, by removing stop words and applying tokenization, stemming, and various filters. This was done with the help of TreeTagger, a tool for annotating text with part-of-speech and lemma information. TreeTagger was also modified to handle negation, URLs, usernames, Twitter mentions, hashtags, and intensifiers.
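The shape of that preprocessing pipeline can be sketched in plain Python. This is a simplified stand-in, not the paper's TreeTagger setup: the stop-word list is a tiny hypothetical sample, and the suffix stripper is a crude substitute for a real stemmer such as Porter's.

```python
import re

# Small illustrative stop-word list (a real pipeline would use a full list).
STOP_WORDS = {"the", "is", "a", "an", "of", "to", "and", "in"}

def tokenize(text: str) -> list:
    # Keep hashtags and mentions whole; split everything else on non-word chars.
    return re.findall(r"[#@]?\w+", text.lower())

def stem(word: str) -> str:
    # Crude suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tweet: str) -> list:
    # Tokenize, drop stop words, then stem what remains.
    return [stem(t) for t in tokenize(tweet) if t not in STOP_WORDS]

print(preprocess("The crowds are cheering #modi in the streets"))
# -> ['crowd', 'are', 'cheer', '#modi', 'street']
```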
V. CONCLUSION AND DISCUSSION
Sentiment analysis has proven effective in predicting people's reactions and opinions by analyzing big social data on a particular topic. The proposed technique consists of several steps: building a dictionary of word polarities from a very small set of positive and negative hashtags related to a given subject, then classifying posts into several classes while balancing the sentiment weight with new metrics such as uppercase words and the repetition of more than two consecutive letters in a word.
However, the proposed approach still has some limitations. First, it cannot understand emoticons. Second, we used only Twitter data. Third, we could not access large volumes of data for this algorithm. For further improvement, we wish to address these three limitations by proposing a more efficient and global model that can work on larger volumes of data.
REFERENCES
1. Balasubramanyan R, Routledge BR, Smith NA. From tweets to polls: linking text sentiment to public opinion time
series. Icwsm. 2010; 11:1–2.
2. Benamara F, Cesarano C, Picariello A, Recupero DR, Subrahmanian VS. Sentiment analysis: adjectives and adverbs
are better than adjectives alone. In: Proceedings of ICWSM conference. 2007.
3. Bermingham A, Smeaton A. On using Twitter to monitor political sentiment and predict election results. In:
Proceedings of the workshop on sentiment analysis where AI meets psychology. 2011.
4. Bhatt R, Chaoji V, Parekh R. Predicting product adoption in large-scale social networks. In: Proceedings of the 19th
ACM international conference on Information and knowledge management. New York: ACM; 2010. p. 1039–48.
5. Chesley P, Vincent B, Xu L, Srihari RK. Using verbs and adjectives to automatically classify blog sentiment. In:
AAAI symposium on computational approaches to analyzing weblogs (AAAI-CAAW). 2006. p. 27–9.
6. Conover MD, Goncalves B, Ratkiewicz J, Flammini A, Menczer F. Predicting the political alignment of twitter users.
In: 2011 IEEE third international conference on privacy, security, risk and trust and 2011 IEEE third international
conference on social computing. 2011. p. 192–9.
7. De Choudhury M. Predicting depression via social media. ICWSM. 2013; 13:1.
8. Delenn C, Jessica Z, Zappone A. Analyzing Twitter sentiment of the 2016 presidential candidates. Stanford:
Stanford University; 2016.
9. DiGrazia J, McKelvey K, Bollen J, Rojas F. More tweets, more votes: social media as a quantitative indicator of
political behavior. PLOS ONE. 2013; 8(11):e79449.
10.Ekaterina O, Jukka TO, Hannu K. Conceptualizing big social data. J Big Data. 2017;4:3.
11.Finn S, Mustafaraj E, Metaxas PT. The co-retweeted network and its applications for measuring the perceived
political polarization. Faculty Research and Scholarship. 2014.
12.Gayo-Avello D. No, you cannot predict elections with Twitter. IEEE Internet Comput. 2012;16(6):91–4.
13.Hansen LK, Arvidsson A, Nielsen FA, Colleoni E, Etter M. Good friends, bad news-affect and virality in twitter. In:
Future information technology, communications in computer and information science. Berlin: Springer; 2011. p.
34–43. https://doi.org/10.1007/978-3-642-22309-9_5.
14.Hu M, Liu B. Mining and summarizing customer reviews. In: Proceedings of the tenth ACM SIGKDD international
conference on knowledge discovery and data mining, KDD’04. New York: ACM; 2004. p. 168–77.
15.Jahanbakhsh K, Moon Y. The predictive power of social media: on the predictability of US presidential elections using Twitter. arXiv:1407.0622 [physics]. 2014.
16.Jose R, Chooralil VS. Prediction of election result by enhanced sentiment analysis on twitter data using classifier
ensemble Approach. In: 2016 international conference on data mining and advanced computing (SAPIENCE).
2016. p. 64–7.
17.Khatua A, Khatua A, Ghosh K, Chaki N. Can #Twitter_trends predict election results? Evidence from 2014 Indian
general election. In: 2015 48th Hawaii international conference on system sciences. 2015. p. 1676–85.
18.Livne A, Simmons M, Adar E, Adamic L. The party is over here: structure and content in the 2010 election. In: Fifth
International AAAI conference on weblogs and social media. 2011.
19.Mahmood T, Iqbal T, Amin F, Lohanna W, Mustafa A. Mining Twitter big data to predict 2013 Pakistan election
winner. In: INMIC. 2013. p. 49–54.
20.Medhat W, Hassan A, Korashy H. Sentiment analysis algorithms and applications: a survey. Ain Shams Eng J.
2014;5(4):1093–113.
21.Pang B, Lee L, Vaithyanathan S. Thumbs up? Sentiment classification using machine learning techniques. In:
Proceedings of the ACL-02 conference on empirical methods in natural language processing, vol. 10. Stroudsburg:
EMNLP’02, Association for Computational Linguistics; 2002. p. 79–86.
22.Pääkkönen P. Feasibility analysis of AsterixDB and Spark streaming with Cassandra for stream-based processing.
J Big Data. 2016;3:6. https://doi.org/10.1186/s40537-016-0041-8.
23.Ramanathan V, Meyyappan T. Survey of text mining. In: International conference on technology and business and
management. 2013. p. 508–14.
24.Ramteke J, Shah S, Godhia D, Shaikh A. Election result prediction using Twitter sentiment analysis. In: 2016
international conference on inventive computation technologies (ICICT), vol. 1. 2016. p. 1–5.
25.Razzaq MA, Qamar AM, Bilal HSM. Prediction and analysis of Pakistan election 2013 based on sentiment analysis.
In: 2014 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM
2014). 2014. p. 700–3.
26.Ruths D, Pfeffer J. Social media for large studies of behavior. Science. 2014;346(6213):1063–4.
27.Shi L, Agarwal N, Agrawal A, Garg R, Spoelstra J. Predicting US primary elections with Twitter. Stanford: Stanford University; 2012.
28.Smailović J, Kranjc J, Grčar M, Žnidaršič M, Mozetič I. Monitoring the Twitter sentiment during the Bulgarian elections. In: 2015 IEEE international conference on data science and advanced analytics (DSAA). 2015. p. 1–10.
29.Soler JM, Cuartero F, Roblizo M. Twitter as a tool for predicting elections results. In: 2012 IEEE/ACM
international conference on advances in social networks analysis and mining. 2012. p. 1194–200.
30.Speriosu M, Sudan N, Upadhyay S, Baldridge J. Twitter polarity classification with label propagation over lexical
links and the follower graph. In: Proceedings of the first workshop on unsupervised learning in NLP, EMNLP’11.
Stroudsburg: Association for Computational Linguistics. p. 53–63.
31.Stavrianou A, Brun C, Silander T, Roux C. NLP-based feature extraction for automated tweet classification. In:
Proceedings of the 1st international conference on interactions between data mining and natural language
processing, vol. 1202, DMNLP’14. Aachen: CEUR-WS.org; 2011. p. 145–146.
32.Tumasjan A. Predicting elections with Twitter: what 140 characters reveal about political sentiment. In: Fourth
international AAAI conference on weblogs and social media. 2010.
33.Tumitan D, Becker K. Sentiment-based features for predicting election polls: a case study on the Brazilian
scenario. In: 2014 IEEE/WIC/ACM international joint conferences on web intelligence (WI) and intelligent agent
technologies (IAT), vol. 2. 2014. p. 126–33.
34.Tunggawan E, Soelistio YE. And the winner is...: Bayesian Twitter-based prediction on 2016 US presidential election. arXiv:1611.00440 [cs]. 2016.
35.Wang H, Can D, Kazemzadeh A, Bar F, Narayanan S. A system for real-time Twitter sentiment analysis of 2012 US
presidential election cycle. In: Proceedings of the ACL 2012 system demonstrations, ACL’12. Stroudsburg:
Association for Computational Linguistics; 2012. p. 115–20.
36.Wang H, Castanon JA. Sentiment expression via emoticons on social media. In: 2015 IEEE international
conference on Big Data (Big Data). 2015. p. 2404–8.
37.Wicaksono AJ, Suyoto P. A proposed method for predicting US presidential election by analyzing sentiment in
social media. In: 2016 2nd international conference on science in information technology (ICSITech). 2016. p.
276–80.
38.Wong FMF, Tan CW, Sen S, Chiang M. Quantifying political leaning from tweets, retweets, and retweeters. IEEE
Trans Knowl Data Eng. 2016; 28(8):2158–72.
39.Xie Z, Liu G, Wu J, Wang L, Liu C. Wisdom of fusion: prediction of 2016 Taiwan election with heterogeneous big
data. In: 2016 13th international conference on service systems and service management (ICSSSM). 2016. p. 1–6.
40.Xing F, Justin ZP. Sentiment analysis using product review data. J Big Data. 2015; 2:5.
41.Yu H, Hatzivassiloglou V. Towards answering opinion questions: separating facts from opinions and identifying
the polarity of opinion sentences. In: Proceedings of the 2003 conference on empirical methods in natural
language processing, EMNLP’03. Stroudsburg: Association for Computational Linguistics; 2003. p. 129–36.