Sentiment Analysis Using Supervised Machine Learning
ABSTRACT
This paper introduces an approach to sentiment analysis that uses various text normalization techniques from Natural Language Processing (NLP) to convert text into vectors, and briefly explains the importance of these normalization techniques and how they are applied in Python with the help of its Natural Language Toolkit (NLTK) library. Finally, the paper analyzes different supervised machine learning algorithms by comparing two machine learning models, Logistic Regression and Naive Bayes, for linear classification on a data set.
Keywords: Natural Language Processing; Text Normalization; Supervised Machine Learning; Logistic
Regression; Naive Bayes.
1 INTRODUCTION
Movie reviews help users decide whether a movie is worth their time. A summary of all reviews for a movie can help users make this decision without spending time reading every review. Movie rating websites are regularly used by critics to post comments and rate movies, which helps viewers decide whether a movie is worth watching. Sentiment analysis can determine the attitude of critics based on their reviews. Sentiment analysis of a movie review can rate how positive or negative the review is, and hence contribute to the overall rating for the movie. Accordingly, the process of understanding whether a review is positive or negative can be automated, as the machine learns through training and testing on the data [1].
Natural Language Processing (NLP) is the interaction between computers and human language. Natural language refers to the analysis of both audible speech and written text. NLP systems capture meaning from an input of words (sentences, paragraphs, pages, etc.). This project intends to apply various text processing techniques from NLP and then build a machine learning model in order to classify a given review as positive or negative.
2 TEXT NORMALIZATION
The process of transforming a text into a canonical (standard) form is called text normalization. A few steps must be performed in order to normalize the text and convert it into a suitable form, since raw text cannot be given to the computer directly as input to the machine learning (ML) model. Normalization also reduces the amount of distinct information that the computer has to deal with, which improves efficiency. The library used for this is given in [2]. The steps involved in this process are shown in Figure 1 and explained in the following sections.
2.2 Tokenization
Tokenization is a method of splitting a piece of text into smaller units called tokens. Here, tokens can be words, subwords, or characters. Accordingly, tokenization can be broadly classified into three kinds: word, character, and subword tokenization. For instance, consider the sentence "Never surrender". The most common way of forming tokens is based on whitespace [4]. Taking space as the delimiter, tokenizing the sentence yields two tokens, "Never" and "surrender". As every token is a word, this is an instance of word tokenization. Similarly, tokens can be characters or subwords, as shown in Example 1.
Example 1. Consider the word "smarter". Its character tokens are s-m-a-r-t-e-r and its subword tokens are smart-er.
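As a minimal sketch, word and character tokenization can be tried in Python (word_tokenize assumes the nltk package and its 'punkt' tokenizer models are available):

from nltk.tokenize import word_tokenize

print(word_tokenize('Never surrender'))  # word tokens: ['Never', 'surrender']
print(list('smarter'))                   # character tokens: ['s', 'm', 'a', 'r', 't', 'e', 'r']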
SExprTokenizer: This is the Symbolic Expressions Tokenizer. It splits the string into tokens based on parenthesized expressions and whitespace, keeping each balanced parenthesized expression together as a single token. Here is an example:
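A minimal sketch using NLTK's SExprTokenizer (assuming the nltk package is installed; the input string is illustrative):

from nltk.tokenize import SExprTokenizer

tokenizer = SExprTokenizer()
# Each parenthesized expression stays together as a single token
print(tokenizer.tokenize('(a b (c d)) e f (g)'))
# ['(a b (c d))', 'e', 'f', '(g)']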
RegexpTokenizer: This tokenizer splits a string into substrings using a regular expression which matches either the tokens or the separators between tokens. A pattern parameter is used to build this tokenizer. It can be an ideal tokenizer for classification tasks like sentiment analysis, since it is more flexible and gives us more control over how tokens are formed. A regular expression (sometimes called a rational expression) is a sequence of characters that defines a search pattern, mainly for use in pattern matching with strings, or string matching, i.e., "find and replace"-like operations [6]. A regular expression that can be used is '\w+', where \w matches any word character. Essentially, only alphanumeric characters are kept, and special characters such as ",", "!", "%", "$" are excluded. As another example, the regular expression ab+c matches abc, abbc, abbbc, and so on.
For example, consider the review: "Movie was awesome!. I would love to watch it 100 times and it costed me 50$ for one show!"
After applying RegexpTokenizer with the pattern '\w+', the tokens generated are:
Tokens: ['Movie', 'was', 'awesome', 'I', 'would', 'love', 'to', 'watch', 'it', '100', 'times', 'and', 'it', 'costed', 'me', '50', 'for', 'one', 'show']
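As a minimal sketch, this step can be reproduced with NLTK's RegexpTokenizer (assuming nltk is installed):

from nltk.tokenize import RegexpTokenizer

review = "Movie was awesome!. I would love to watch it 100 times and it costed me 50$ for one show!"
tokenizer = RegexpTokenizer(r'\w+')  # keep only runs of word characters
print(tokenizer.tokenize(review))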
2.3 Stemming and Lemmatization
Stemming reduces a word to its stem by stripping suffixes; the resulting stem need not be a valid word. For example, the stem of "studies" is 'studi', the stem of "studying" is 'studi', and the stem of "cries" is 'cri'.
Lemmatization, in contrast, is a more powerful operation that takes the morphological analysis of the words into account. It returns the lemma, which is the base form of all the word's inflectional forms. In-depth linguistic knowledge is needed to build the dictionaries and look up the correct form of the word.
The lemma of "studies" is 'study', the lemma of "studying" is 'study', the lemma of "cries" is 'cry', and the lemma of "cry" is 'cry'.
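A minimal sketch of these lemmas using NLTK's WordNetLemmatizer (assuming nltk and its 'wordnet' corpus are available; pos='v' tells the lemmatizer to treat the words as verbs):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
for word in ['studies', 'studying', 'cries', 'cry']:
    print(word, '->', lemmatizer.lemmatize(word, pos='v'))
# studies -> study, studying -> study, cries -> cry, cry -> cry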
The NLTK library provides two commonly used stemmers for English:
1. PorterStemmer
2. LancasterStemmer
There are also other, non-English stemmers.
PorterStemmer: Here the algorithm does not follow linguistics; rather, it applies a set of five rules for various cases, in phases (step by step), to generate stems. It is known for its simplicity and speed. It uses suffix stripping to produce stems [9]. For example:
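A minimal sketch of PorterStemmer on the words from the stemming examples above (assuming nltk is installed):

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ['studies', 'studying', 'cries']:
    print(word, '->', porter.stem(word))
# studies -> studi, studying -> studi, cries -> cri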
LancasterStemmer: This is an iterative algorithm with one table containing about 120 rules, indexed by the last letter of a suffix. In each iteration, it tries to find an applicable rule by the last character of the word. Each rule specifies either a deletion or a replacement of an ending. If there is no such rule, it terminates. It also terminates if a word starts with a vowel and only two letters are left, or if a word starts with a consonant and only three characters are left. Otherwise, the rule is applied and the process repeats. LancasterStemmer is likewise simple, but it stems heavily because of the iterations, and over-stemming may occur. Over-stemming makes the stems non-linguistic, and they may have no meaning. For instance:
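A minimal sketch contrasting the two stemmers (assuming nltk is installed; the word list is illustrative, chosen to show where Lancaster stems more aggressively than Porter):

from nltk.stem import PorterStemmer, LancasterStemmer

porter, lancaster = PorterStemmer(), LancasterStemmer()
for word in ['studies', 'maximum', 'presumably']:
    # print both stems side by side to compare aggressiveness
    print(word, '| porter:', porter.stem(word), '| lancaster:', lancaster.stem(word))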
2.4 Feature Extraction
After normalization, the text must be converted into numerical feature vectors. Two common techniques for this are:
1. Bag-of-Words
2. TF-IDF
Bag-of-Words: This is a method to extract features from text documents; these features can be used for training ML algorithms. It creates a vocabulary of all the unique words occurring in all the documents of the training set. From this we build a feature matrix based on one-hot encoding. A significant drawback of this model is that it leads to a high-dimensional feature vector due to the large size of the vocabulary, V. Bag-of-words does not leverage co-occurrence statistics between words. It produces highly sparse vectors, since there are nonzero values only in the dimensions corresponding to words that occur in the sentence. The order of occurrence of words is also lost, as we create a vector of tokens in randomized order: 'a good movie', 'not a good movie', 'did not like'. One solution for this is to consider N-grams (mostly bigrams) instead of individual words, i.e., unigrams.
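A minimal bag-of-words sketch on the three phrases above, using scikit-learn's CountVectorizer (scikit-learn is an assumption here; the paper itself names only NLTK):

from sklearn.feature_extraction.text import CountVectorizer

docs = ['a good movie', 'not a good movie', 'did not like']
# ngram_range=(1, 2) adds bigrams, partially preserving word order
vectorizer = CountVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # the sparse count matrix, shown dense for illustration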
TF-IDF Vectorizer: TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. It is a very common algorithm for transforming text into a meaningful representation of numbers, which is then used to fit a machine learning algorithm for prediction. Term frequency specifies how frequently a term appears in a document. It can be thought of as the probability of finding a word within the document and can be expressed as
$$ tf(w_i, r_j) = \frac{\text{No. of times } w_i \text{ occurs in } r_j}{\text{Total no. of words in } r_j} $$

The inverse document frequency (IDF) measures how rare a term is across the corpus; it is low for terms that appear in many documents. The two scores are multiplied together:

$$ \text{TF-IDF} = \text{TF} \times \text{IDF} $$
Therefore, a high TF-IDF score is obtained by a term that has a high frequency in a document, and low document frequency in the
corpus [11].
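A minimal TF-IDF sketch on the same three phrases, using scikit-learn's TfidfVectorizer (again an assumption; scikit-learn's variant of the formulas above additionally applies smoothing and normalization):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['a good movie', 'not a good movie', 'did not like']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
print(X.toarray())  # rare, document-specific terms receive higher weights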
3 CLASSIFICATION ALGORITHMS IN ML
In machine learning and statistics, classification is a supervised learning approach in which the computer program learns from the
input data and then uses this learning to classify new observations. A few types of classification algorithms in machine learning are:
1. Logistic Regression
2. Naive Bayes
The Naive Bayes classifier is based on Bayes' theorem:

$$ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} $$

where P(A | B) is the probability of A given B. Here, A is the dependent variable (which is to be predicted) and B is the independent variable (the features).
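As a small worked illustration of the theorem with made-up numbers: if P(A) = 0.5, P(B | A) = 0.8, and P(B) = 0.6, then P(A | B) = (0.8 × 0.5) / 0.6 ≈ 0.67.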
4 APPLICATION
In this paper we compare the Logistic Regression and Naive Bayes methods for sentiment analysis and present the analysis.
4.1 Comparison of the Models
1. Use cases
Though both algorithms (Naive Bayes and Logistic Regression) are mainly used for solving problems that involve classification tasks, the main difference is that Logistic Regression is limited to binary classification, e.g., predicting whether a person is infected with a disease or not, whether a mail is spam or not spam, whether a given sentiment is positive or negative, etc., whereas the Naive Bayes algorithm can also be applied to multiclass classification problems.
2. Algorithm's working
The workings of the two algorithms are very different. The Naive Bayes algorithm is a probabilistic approach, whereas in Logistic Regression we use the sigmoid function, σ(z) = 1/(1 + e^{-z}), as the activation function to map the result between 0 and 1. The Naive Bayes classification calculation involves P(y|x), the probability of y given x. When there are multiple features, the Naive Bayes classifier assumes that all features are independent of each other, which does not hold true when dealing with real-world problems. Therefore, when binary classification is considered, it is generally observed that Logistic Regression gives better results than the Naive Bayes classifier.
3. Model assumptions
Naive Bayes assumes that all the features are independent of each other; when the features are in fact strongly dependent on one another, its accuracy suffers. Logistic Regression separates the classes linearly, so its accuracy should be judged with that in mind. A comparison sketch is given after this list.
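A minimal sketch of this comparison in code, using scikit-learn (an assumption; the file name movie_reviews.csv and the column names review and sentiment are hypothetical placeholders for the data set of Section 4.2):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

df = pd.read_csv('movie_reviews.csv')  # hypothetical file with 'review' and 'sentiment' columns
X = TfidfVectorizer().fit_transform(df['review'])
y = df['sentiment']  # 0 = positive, 1 = negative, as in Section 4.2

# 80/20 train-test split, matching the paper's setup
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

for name, model in [('Logistic Regression', LogisticRegression(max_iter=1000)),
                    ('Naive Bayes', MultinomialNB())]:
    model.fit(X_train, y_train)
    print(name, 'train accuracy:', accuracy_score(y_train, model.predict(X_train)))
    print(name, 'test accuracy:', accuracy_score(y_test, model.predict(X_test)))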
4.2 Results
The data shown in Figure 2, consisting of movie reviews, was taken from [14]; the implementation was done in the Spyder IDE using Python. The data was used to train and test two machine learning classification models, Logistic Regression and Naive Bayes, to linearly classify whether a review is positive or negative. 80% of the data was used for training and the remaining 20% for testing, to check whether the model was accurate. A model is said to be accurate if its output matches the given sentiment, where '0' means positive and '1' means negative. Hence, an accuracy score was calculated for each of the two ML models. The obtained results are shown in Figure 3 and Figure 4.
From the results shown in the figures, we see that Logistic Regression classification for the movie reviews achieved a score of 93.59% on the training data and 89.76% on the testing data. The classification accuracy scores with Naive Bayes were 90.6% on the training data and 86.14% on the testing data. The findings indicate that Logistic Regression achieved a higher classification accuracy score than Naive Bayes, as the data is linearly separable and the classification task in this case is binary, i.e., positive or negative. Therefore, Logistic Regression is found to be more efficient and accurate than the Naive Bayes classification algorithm on this task.
5 CONCLUSION
In this paper we have reviewed different standardization methods for Natural Language Processing, their importance, and their implementation in Python using its NLTK library. These standardization methods are used for normalization of text, which can also be called data preprocessing, before feeding the text into machine learning models for linear classification.
A comparison of the Logistic Regression and Naive Bayes models for linear classification has also been discussed and implemented on a given dataset of movie reviews. After a brief comparison, it turned out that the Logistic Regression model was more accurate than the Naive Bayes model for the given movie-reviews dataset.
References
[1] Igor Mozetic, Miha Grcar, and Jasmina Smailovic. Multilingual twitter sentiment classification: The role of human annotators.
PLoS One, 11(5):1–26, 2016.
[2] Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit. O'Reilly Media, 2009.
[3] Hassan Saif, Miriam Fernandez, Yulan He, and Harith Alani. On stopwords, filtering and data sparsity for sentiment analysis
of Twitter. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), May
2014.
[4] Stanford NLP Group. CoreNLP. Stanford University, Stanford USA. https://stanfordnlp.github.io/CoreNLP/index.html.
[5] S. Vijayarani and R. Janani. Text mining: Open source tokenization tools – an analysis. Advanced Computational Intelligence: An International Journal (ACII), 3(1):37–47, January 2016.
[6] M. Erwig and R. Gopinath. Explanations for regular expressions. In J. de Lara and A. Zisman, editors, Fundamental Approaches to Software Engineering, volume 7212 of Lecture Notes in Computer Science, pages 394–408, Berlin, 2012. Springer.
[7] Leveraging Inflection Tables for Stemming and Lemmatization, volume 1, Berlin, January 2016. Association for Computational
Linguistics.
[8] Amri Samir and Zenkouar Lahbib. Stemming and lemmatization for information retrieval systems in amazigh language. In Y. Tabii, M. Lazaar, M. Al Achhab, and N. Enneya, editors, Big Data, Cloud and Applications. BDCA 2018, volume 872 of Communications in Computer and Information Science, pages 222–233, Kenitra, Morocco, August 2018. Springer, Cham.
[9] Anjali Ganesh Jivani. A comparative study of stemming algorithms. International Journal of Comp. Tech. Appl, 2(6):1930–
1938, 2011.
[10] Anna Stavrianou, Caroline Brun, Tomi Silander, and Claude Roux. Nlp-based feature extraction for automated tweet clas-
sification. In Proceedings of the 1st International Conference on Interactions between Data Mining and Natural Language
Processing, volume 1202, pages 145–146, September 2014.
[11] Ammar Ismael Kadhim. Term weighting for feature extraction on twitter: A comparison between bm25 and tf-idf. In Interna-
tional Conference on Advanced Science and Engineering, pages 1–6, Iraq, April 2019. IEEE.
[12] Chao-Ying Joanne Peng, Kuk Lida Lee, and Gary M. Ingersoll. An introduction to logistic regression analysis and reporting. The Journal of Educational Research, 96(1):3–14, 2002.
[13] I. Rish. An empirical study of the naive bayes classifier. Technical report, T.J.Watson Research Center, Hawthorne, NY, 2001.
[14] Kaggle. Sentiment analysis on movie reviews. https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews/data, March 2014.