Investigating Sentiment Analysis From Social Media Data: Received: Review: Accepted: Published
First Author a,*, Second Author a, b, Third Author b
a
First affiliation institution
First affiliation address, City, Country, e-mail
b
Second affiliation institution
Second affiliation address, City, Country, e-mail
This article is an open-access article distributed under the terms and conditions of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/4.0)
Abstract
Sentiment analysis, a subfield of Natural Language Processing (NLP), aims to extract sentiment and
opinions from text input. This thesis aims to provide a web platform that supports sentiment
analysis. More precisely, the developed and deployed platform is built around a machine learning
model that classifies input text as positive, negative, or neutral. A web application that accepts
tweets and movie reviews as input puts this model to use. The software makes it easy to search for
and categorize comments from social networking sites such as Reddit, Twitter, and YouTube, using
any relevant phrase. To help readers better understand how individuals feel about specific issues,
the results are presented in a format that differs from conventional surveys. Python and JavaScript
were used to implement the different components of the application. Details of the training data
set and the machine learning model are provided. The training method and the steps applied to each
piece of textual input are also described, along with the data's final classification. Finally,
some recommendations are made for improving the platform so that it offers users more options and
features.
aspects, like syntax, grammar, and phonology at present, but it is also aimed at processing other
elements of the input data, such as prosody and sentiment. NLP applications include text-analysis
platforms that produce a wide range of results and help clients obtain useful outputs from textual
data.

II. OBJECTIVES
The current project centers on an NLP application with connections to a number of intriguing
topics: the identification of "sentiment" in written text input. For example, recommendation
systems, and data on the number of evaluations of specific items, could benefit greatly from
extracting mood and emotion from passages. The application is designed to be accessible to a wide
range of users, including those without any programming background. "A machine learning model that
understands the sentiment of a sentence" is a generic description of a fairly broad application.
In particular, this thesis's application aims to do two things: (1) make it easy for users to
analyze their own sentences through a graphical user interface, and (2) use various online sources
to test groups of opinions and comments for specific keywords. The three online sources are
Twitter, a microblogging service; Reddit, a social news and entertainment website that allows users
to submit and vote on a wide variety of content types; and YouTube, a platform for sharing videos
online. Several bespoke, probability-based machine learning models, each trained on a separate
dataset, are used to conduct the sentiment analysis of the input data.
The suggested application offers the following benefits. First, a graphical user interface lets
users input text and run sentiment analysis. Second, two machine learning models, one for tweets
and one for movie reviews, are used to achieve accurate results. Lastly, topic-related data can be
retrieved from social media platforms.

A. Sentiment Analysis
Within the past few years, sentiment analysis in social networks has become an increasingly hot
topic, drawing attention from academics, businesses, and other organizations. Through social media,
these organizations have discovered novel approaches to managing their communication and marketing
initiatives. Modern social network analysis apps allow companies to do double duty: transmit and
distribute marketing and commercial content, and gather market and audience data that is both
valuable and easy to understand. Businesses can take advantage of this two-way communication
paradigm by putting their trust in social networks, especially because many commercial
applications and solutions are available, either for purchase or as open source, that cater to
their unique needs.
There is a lot of evidence for this, including the number of applications and investments made by
large organizations in this area, the dynamics of the sentiment analysis sector, activity on social
networks, and patents pertaining to sentiment analysis. The authors evaluated this through several
Google Patents searches using the terms "Sentiment Analysis" and "Social Network," two of the most
prominent subfields of organizational science. To be included in the evaluation, a patent had to
have a grant year between 2016 and 2021 and be compatible with the technologies considered; patents
that had not yet been granted were excluded. From 2016 to 2021, the search terms appeared in 8,349
published patents belonging to over 3,000 distinct companies (see Figure 2). Although roughly 3,000
distinct entities hold patents in this area, just fifteen of them hold seventy-five percent of
those patents. This indicates the level of interest and investment made by those fifteen entities
in the field. The following companies have the most patents: Google (447), Facebook (283),
Microsoft (260), Apple (217), IBM (149), Samsung (136), Oracle (105), Commvault (101), Amazon (86),
Visa (68), Tencent (61), OneTrust (61), Salesforce (60), and Intel (54).

B. Problem Statement
Volume and Variety: Social media yields a huge volume of information in text, image, and video
formats. Manually analyzing such a large and varied data set is not feasible.

Subjectivity: Social media content is typically worded in ways that are more a matter of degree
than of kind, which complicates the identification of sentiment.
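As an illustration of the probability-based models mentioned in the objectives, the following is a minimal sketch of a three-class text classifier. It assumes a multinomial Naive Bayes model with add-one smoothing; the source does not name the model actually used, and the class name `NaiveBayesSentiment`, the whitespace tokenizer, and the toy training sentences are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesSentiment:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing.

    Assumption: chosen only as one example of a probability-based model.
    """

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            # log P(label) + sum over tokens of log P(token | label)
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data, invented for illustration only.
train_texts = [
    "i love this movie",
    "what a great film",
    "i hate this movie",
    "what a terrible film",
    "the movie starts at nine",
    "the film is two hours long",
]
train_labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

model = NaiveBayesSentiment().fit(train_texts, train_labels)
print(model.predict("i love this great film"))
```

A real model would be trained on a much larger labeled corpus; the six sentences here only show how class priors and per-class word frequencies combine into a prediction.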
3 Authors/ Mechatronics, Electrical Power, and Vehicular TechnologyXX(20XX) XX-XX
involve breadth and depth, the subjectivity of language and, in particular, the problem of over- or
underestimating the given positive or negative sentiment. While the current study is mainly based
on feature engineering, future work may concentrate on enhancing this framework by including
sophisticated NLP methodologies along with big data analysis for more refined sentiment analysis.

IV. Implementation
In this section, we explain the results of applying the sentiment analysis framework described in
Section III to text data samples from social media platforms. The process involved several key
steps: data collection, preprocessing, feature extraction, model training, real-time analysis, and
evaluation.

A. Data Collection

Tweets from Twitter, posts from Facebook walls, and content from Instagram hashtag feeds were
collected. This meant using APIs and scraping to gather a diverse selection of publicly created
posts and comments related to the study's area of interest.

B. Preprocessing

The collected data were first preprocessed to cleanse and standardize the gathered text. This
involved cleaning the data by:
– handling noise,
– handling special characters and emojis, and
– normalizing text formatting,
so that the sentiment analysis of the data was accurate.

C. Feature Extraction

For feature extraction, statistical weighting techniques such as TF-IDF were used. TF-IDF assigns
a numerical weight to each word according to how significant it is within a document relative to
the rest of the collection.

D. Model Training

The text data went through operations such as stop-word removal, tokenization, and stemming before
being labeled as positive, negative, or neutral. More advanced methods, such as Support Vector
Machines (SVM) and neural networks, were used to improve sentiment classification.

E. Real-time Analysis

Real-time feed monitoring of the various social networking sites was incorporated so that the
quantitative sentiment scores were updated on screen as new data arrived.

F. Evaluation

The performance of the sentiment analysis models was assessed using commonly used metrics:
accuracy, precision, recall, and F1-score. The evaluation compared the models' sentiment
classifications against ground-truth or human-annotated data.

G. Integration

Findings from sentiment analysis were incorporated into other business-related processes to show
that the proposed framework is usable in critical decision-making and strategy formulation within
organisations.

V. Results
· Accuracy: Evaluate the sentiment predictions against the ground truth or, more often, against
judgments made by people.

· Timeliness: Assess the time taken to produce sentiment analysis results, so that time-sensitive
investigative requirements can be met.

· Relevance: Determine whether the insights provided by sentiment analysis are reasonable and
whether they can be acted on to achieve the aims and goals of an investigation.
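The preprocessing and TF-IDF steps described under Preprocessing and Feature Extraction can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: the cleaning rules (dropping URLs, @-mentions, punctuation, and emojis), the unsmoothed `log(N / df)` form of the inverse document frequency, and the sample posts are all hypothetical choices, since the source does not specify them.

```python
import math
import re
from collections import Counter

# Assumed cleaning rules; the source only names noise, special characters, and emojis.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")  # punctuation, emojis, other symbols


def preprocess(text):
    """Lowercase the text, strip noise, and return a list of tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    return text.split()


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    tf  = term count / document length
    idf = log(N / df), where df is the number of documents containing the term
    """
    tokenized = [preprocess(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        weights.append(
            {
                term: (count / length) * math.log(n_docs / doc_freq[term])
                for term, count in counts.items()
            }
        )
    return weights


# Hypothetical posts, invented for illustration only.
posts = [
    "Loved the new phone!!! 😍 https://example.com",
    "@shop the new phone is awful :(",
    "New phone arrived today",
]
weights = tf_idf(posts)
print(preprocess(posts[0]))
print(weights[0])
```

Words that occur in every post, such as "new" and "phone" here, receive a weight of zero, while words that occur in only one post, such as "loved", receive the highest weights.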
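The evaluation metrics named under Evaluation (accuracy, precision, recall, and F1-score) can be computed as in the sketch below. The function name `evaluate`, the per-class (one-vs-rest) treatment of the three labels, and the example label lists are assumptions made for illustration; the source does not state how the metrics were averaged across classes.

```python
def evaluate(y_true, y_pred, labels=("positive", "negative", "neutral")):
    """Return overall accuracy plus per-class precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    report = {"accuracy": correct / len(pairs)}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)  # true positives
        fp = sum(t != label and p == label for t, p in pairs)  # false positives
        fn = sum(t == label and p != label for t, p in pairs)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Hypothetical human annotations and model outputs, invented for illustration only.
human = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
predicted = ["positive", "negative", "negative", "negative", "neutral", "positive"]

report = evaluate(human, predicted)
print(report)
```

Reporting the metrics per class, rather than accuracy alone, shows whether a model is systematically over- or under-predicting one sentiment.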
individuals is not made apparent in the story.
1. Supervised learning: In supervised learning, the goal is to produce an equation for the dataset
that relates the input variables to the predicted result. Before the algorithm processes the data
vectors, the user supplies training samples that include feature values and their correct class.
The next step is to apply the algorithm to unlabeled data in order to label it. It is clear from
the problem statement that this is a classification problem requiring a supervised solution.
2. Unsupervised learning: In contrast, unsupervised algorithms differ from their supervised
counterparts in that they do not require developer input pertaining to targets. These algorithms
examine the structure of the input data and group it into patterns based on the data's
similarities and differences. Clustering techniques are implemented using unsupervised approaches.
3. Semi-supervised learning: Semi-supervised learning combines supervised and unsupervised
learning, and it involves learning guided by a limited set of labels. The semi-supervised training
approach makes use of both large amounts of unlabeled data and a small amount of labeled data. The
shortcomings of the first two learning approaches inspired its development: supervised learning
requires expensive data labeling, while unsupervised learning has a very limited range of
applications. In this case, the semi-supervised method classifies and groups the unlabeled data by
making predictions about the unlabeled items based on the labeled data.

VII. Conclusion
Sentiment analysis (SA) is the process of using techniques from NLP, linguistics, statistics, and
related fields of computational linguistics to extract the sentiment, emotion, and opinion
expressed in a text document. Concisely, sentiment analysis aims to capture the opinion or leaning
of a person regarding a particular issue, or else the overall sentiment of a given text. Within
Natural Language Processing, sentiment analysis is regarded as one of the most difficult tasks to
implement, since numerous factors contribute to the emotionality of textual input. There are
several definitions of Sentiment Analysis and Opinion Mining. Most treat the two names as
interchangeable, but a few give them somewhat different meanings: Sentiment Analysis confines
itself to the feeling of the sentence, while Opinion Mining concerns the feeling of the person who
wrote the sentence.

VIII. APPENDIX

A. Figures

Figure 1. Social media data results

B. Tables
The table compares the accuracy of a model using different text normalization techniques:
stemming, which reduces a word to its root, and lemmatization, which reduces a word to its base
form. With no normalization, the model reaches an accuracy of 74.97%. Stemming improves this
slightly, to 75.12%, and lemmatization gives the highest accuracy, 75.52%. This shows that, among
the normalization methods compared, lemmatization is the most suitable for the specific model and
dataset used.

Table 1: Normalization results (sentiment140)
Normalization    Accuracy
None             74.97%
Stemming         75.12%
Lemmatization    75.52%

C. Mathematical expressions
The data collection algorithm used is as follows.
Search Number. Tweet number and tweet type (1)

IX. References
[1] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, "Sentiment Analysis of Twitter
Data," 2011.
I hereby confirm that the manuscript was prepared in accordance with the instructions
for authors of scientific publications, and that the content of this manuscript, or most
of it, was not published in the journal indicated, and the manuscript was not submitted
for publication elsewhere.
Copyright Agreement
Manuscript title:
Full names of all authors:_______________________________________________________
______________________________________________________________
______________________________________________________________
License Agreement
(1) Authors own all the copyright rights for the paper.
(2) Submitted manuscript is an original paper.
(3) Authors hereby grant the Issues of Journal of LGURJCSIT with an exclusive, royalty-free, worldwide license to email the
paper to all who will ask for it.
(4) All authors have made a significant contribution to the research and are ready to assume joint responsibility for the paper.
(5) All authors have seen and approved the manuscript in the final form as it is submitted for publication.
(6) This manuscript has not been published and has neither been submitted nor considered for publication elsewhere.
(7) The text, illustrations and any other materials, included into the manuscript, do not infringe any existing
intellectual property rights or other rights of any person or entity.
(8) The editors of the Issues of Journal of LGURJCSIT, its personnel or the Editorial Board members accept no
responsibility for the quality of the idea expressed in this publication.
I am the Corresponding author and have full authority to enter into this agreement.