Sentiment Analysis based on vector embedding
Abstract: Word vector representation is a major stage in Natural Language Processing (NLP). It can be applied in various applications such as sentiment analysis, text mining, topic detection, document summarization, and information retrieval, and it has an impact on performance. In the literature, different proposed methods focus on enhancing the word representation model by N-gram, TF-IDF, and word embedding. This paper investigates several word vector representations for Vietnamese sentiment analysis, including TF-IDF, Word2Vec, GloVe, and Doc2Vec. The experiment is evaluated on five common classifiers and two Vietnamese sentiment analysis datasets.

Index Terms: Word embedding, Sentiment analysis, Feature extraction, Word representation

I. INTRODUCTION

Sentiment Analysis (SA) is the task of analyzing comments or reviews in order to uncover the hidden opinion of customers. In recent years, SA has received a lot of attention due to its various potential applications such as stock market analysis [1], customer review analysis [2], travel [3], hotel booking [4], education [5], and political communication [6]. SA is a subfield of NLP and can be considered a sentiment classification problem. There are many methods to classify sentiment, such as dictionary-based and lexicon-based approaches. A vectorization step is needed to characterize the text before the training stage. Word vector representation is also a feature extraction step. Among the traditional feature extraction methods such as Bag-of-words (BoW), Bag-of-ngrams (N-gram), and Term Frequency-Inverse Document Frequency (TF-IDF), the latter is applied in many works [2], [4] for SA. Nguyen et al. [4] combined BoW and TF-IDF for feature extraction. Dzisevic et al. [7] fused TF-IDF with Latent Semantic Analysis (LSA) and Linear Discriminant Analysis (LDA) to reduce the dimension of the feature space. Ahuja et al. [8] indicated that word-level TF-IDF is more efficient than feature extraction by N-gram.

TF-IDF is applied in different works; however, it has the limitation that it produces a high-dimensional space. In a large corpus, this might have an impact on performance. Additionally, this method ignores the position of words in each sentence, so the relationships between words and their meanings are also affected in each paragraph. Word2Vec was introduced by Mikolov et al. [9] in 2013. This approach is widely applied in NLP and SA due to its efficiency [10], [11]. In [12], the authors demonstrated that Word2Vec is better than BoW for sentiment analysis. Hitesh et al. [10] demonstrated that text representation by Word2Vec achieved good performance in comparison with BoW and TF-IDF. Additionally, methods based on the word embeddings Doc2Vec [13] and GloVe [14] are widely applied in many works [15], [16], [17]. Seyed Mahdi Rezaeinia et al. [18] combined several NLP techniques such as lexicon approaches, a word position algorithm, and Word2Vec/GloVe methods. The proposed method improved accuracy by more than 1.5%. Djaballah et al. [11] and Othman et al. [19] combined Word2Vec word embedding with the computation of a weighted average TF-IDF. A clustering-based approach has been considered to reduce the feature space in combination with word embedding [20].

However, the word embedding method is not the only factor affecting performance, which mainly depends on an appropriate word representation. Avinash and Sivasankar [21] compared TF-IDF and Doc2Vec and evaluated them on 5 distinct datasets. TF-IDF only achieved good results on the second and fifth datasets. The same conclusion can be found in [22], [23]. In this paper, we propose to investigate different word vector representations for SA in the Vietnamese language. The rest of this paper is organized as follows. Section 2 describes the word vector representation techniques. Section 3 illustrates our proposed method. Then, Section 4 shows the experimental results. Finally, the conclusion is discussed in Section 5.

II. WORD VECTOR REPRESENTATION TECHNIQUES

Word embedding is an essential step to vectorize text into a continuous vector space so that a classifier can be trained on it. There are two main kinds of word embedding methods [24], [25]: count-based embedding and prediction-based embedding. The count-based embedding method counts the words that
2022 9th NAFOSTED Conference on Information and Computer Science (NICS)
gives good results in cases of limited training data and also reduces the feature space by using the Skip-gram model.

III. METHODOLOGY

The whole process of comparing different word representation methods is illustrated in Figure 3. All collected text goes through a pre-processing stage and is then represented by a word vector representation. Finally, the considered classifier is applied to predict its label.

datasets are divided into training data and testing data at a ratio of 70:30. We use accuracy to measure the classification performance.

TABLE I: Summary of the datasets used in the experiments

No  Name       Emotional polarity  Comments  Total words
1   Dataset 1  Positive            15,000    2,962,235
               Negative            15,000
2   Dataset 2  Positive            5,000     1,003,237
               Negative            5,000
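As an illustration of the evaluation protocol described in this section (vectorize the text, split the data 70:30, classify, and measure accuracy), a minimal Python sketch is given below. It is not the implementation used in this study: the function names, the smoothed IDF variant, the random seed, and the value of k are all assumptions made for the example, and a small kNN stands in for the five classifiers.

```python
import math
import random
from collections import Counter


def tf_idf_vectors(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumption; the paper does not state its variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[t] / total * idf[t] for t in vocab])
    return vocab, vectors


def split_70_30(items, labels, seed=0):
    """Shuffle, then split into 70% training and 30% testing data."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    cut = int(round(0.7 * len(idx)))
    train, test = idx[:cut], idx[cut:]
    return ([items[i] for i in train], [labels[i] for i in train],
            [items[i] for i in test], [labels[i] for i in test])


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k nearest training vectors (Euclidean)."""
    nearest = sorted(
        ((math.dist(x, query), y) for x, y in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(y for _, y in nearest[:k])
    return votes.most_common(1)[0][0]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

A typical use would be to call `tf_idf_vectors` on the pre-processed comments, pass the vectors and polarity labels to `split_70_30`, predict each test vector with `knn_predict`, and report `accuracy` on the test labels. In practice the vocabulary and IDF weights would normally be fitted on the training portion only; the sketch computes them on the full corpus purely for brevity.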
[Figure: "Results of word vector representation". Bar chart of accuracy (%) by classifier (LR, NB, kNN, RF, SVM) for TF-IDF, Word2Vec, Doc2Vec, and GloVe.]
Fig. 4: The comparison of different word embedding methods and classifiers on dataset 1.

[Figure: "Results of word vector representation". Bar chart of accuracy (%) by classifier (LR, NB, kNN, RF, SVM) for TF-IDF, Word2Vec, Doc2Vec, and GloVe.]
Fig. 5: The comparison of different word embedding methods and classifiers on dataset 2.

GloVe. For classification, the study used five well-known classifiers: Logistic Regression, Naive Bayes, kNN, Random Forest, and SVM. The highest result is obtained with the SVM classifier and the TF-IDF word representation method. The results are expected to improve when a larger corpus is available for building the word embeddings.

DATA AVAILABILITY

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

REFERENCES

[1] Dattatray P. Gandhmal and K. Kumar. Systematic analysis and review of stock market prediction techniques. 34:100190, 2019.
[2] Tanjim Ul Haque, Nudrat Nawal Saber, and Faisal Muhammad Shah. Sentiment analysis on large scale amazon product reviews. In 2018 IEEE International Conference on Innovative Research and Development (ICIRD), pages 1–6. IEEE, 2018.
[3] Ana Valdivia, Emiliya Hrabova, Iti Chaturvedi, M. Victoria Luzón, Luigi Troiano, Erik Cambria, and Francisco Herrera. Inconsistencies on TripAdvisor reviews: A unified index between users and sentiment analysis methods. 353:3–16, 2019.
[4] Thuy Nguyen-Thanh and Giang T.C. Tran. Vietnamese sentiment analysis for hotel review based on overfitting training and ensemble learning. In Proceedings of the Tenth International Symposium on Information and Communication Technology - SoICT 2019, pages 147–153. ACM Press, 2019.
[5] Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. Truong, and Ngan Luu-Thuy Nguyen. UIT-VSFC: Vietnamese students’ feedback corpus for sentiment analysis. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pages 19–24. IEEE, 2018.
[6] Martin Haselmayer and Marcelo Jenny. Sentiment analysis of political communication: combining a dictionary approach with crowdcoding. 51(6):2623–2646, 2017.
[7] Robert Dzisevic and Dmitrij Sesok. Text classification using different feature extraction approaches. In 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), pages 1–4. IEEE, 2019.
[8] Ravinder Ahuja, Aakarsha Chug, Shruti Kohli, Shaurya Gupta, and Pratyush Ahuja. The impact of features extraction on the sentiment analysis. 152:341–348, 2019.
[9] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. 2013.
[10] Msr Hitesh, Vedhosi Vaibhav, Y.J Abhishek Kalki, Suraj Harsha Kamtam, and Santoshi Kumari. Real-time sentiment analysis of 2019 election tweets using word2vec and random forest model. In 2019 2nd International Conference on Intelligent Communication and Computational Techniques (ICCT), pages 146–151. IEEE, 2019.
[11] Kamel Ahsene Djaballah, Kamel Boukhalfa, and Omar Boussaid. Sentiment analysis of twitter messages using word2vec by weighted average. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 223–228. IEEE, 2019.
[12] Elena Rudkowsky, Martin Haselmayer, Matthias Wastian, Marcelo Jenny, Stefan Emrich, and Michael Sedlmair. More than bags of words: Sentiment analysis with word embeddings. 12(2):140–157, 2018.
[13] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. 2014.
[14] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543. Association for Computational Linguistics, 2014.
[15] Metin Bilgin and Izzet Fatih Senturk. Sentiment analysis on twitter data with semi-supervised doc2vec. In 2017 International Conference on Computer Science and Engineering (UBMK), pages 661–666. IEEE, 2017.
[16] Md. Tazimul Hoque, Ashraful Islam, Eshtiak Ahmed, Khondaker A. Mamun, and Mohammad Nurul Huda. Analyzing performance of different machine learning approaches with doc2vec for classifying sentiment of bengali natural language. In 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), pages 1–5. IEEE, 2019.
[17] Y. Sharma, G. Agrawal, P. Jain, and T. Kumar. Vector representation of words for sentiment analysis using glove. pages 279–284, 2017.
[18] Seyed Mahdi Rezaeinia, Rouhollah Rahmani, Ali Ghodsi, and Hadi Veisi. Sentiment analysis based on improved pre-trained word embeddings. 117:139–147, 2019.
[19] Rania Othman, Youcef Abdelsadek, Kamel Chelghoum, Imed Kacem, and Rim Faiz. Improving sentiment analysis in twitter using sentiment specific word embeddings. In 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), pages 854–858. IEEE, 2019.
[20] Eissa M. Alshari, Azreen Azman, Shyamala Doraisamy, Norwati Mustapha, and Mustafa Alkeshr. Improvement of sentiment analysis based on clustering of word2vec features. In 2017 28th International Workshop on Database and Expert Systems Applications (DEXA), pages 123–126. IEEE, 2017.