
2022 9th NAFOSTED Conference on Information and Computer Science (NICS)

Sentiment Analysis based on word vector representation for short comments in Vietnamese language

1st Thien Ho Huong, Ho Chi Minh City Open University, Vietnam ([email protected])
2nd Daphne Teck Ching Lai, Universiti Brunei Darussalam, Brunei Darussalam ([email protected])
3rd Kiet Tran-Trung, Ho Chi Minh City Open University, Vietnam ([email protected])
4th Vinh Truong Hoang, Ho Chi Minh City Open University, Vietnam ([email protected])

Abstract—Word vector representation is a major stage in Natural Language Processing (NLP). It is used in various applications such as sentiment analysis, text mining, topic detection, document summarization, and information retrieval, and it has a direct impact on their performance. In the literature, different proposed methods focus on enhancing the word representation model through N-grams, TF-IDF, and word embeddings. This paper investigates several word vector representations for Vietnamese sentiment analysis, including TF-IDF, Word2Vec, GloVe, and Doc2Vec. The experiment is evaluated with five common classifiers on two Vietnamese sentiment analysis datasets.

Index Terms—Word embedding, Sentiment analysis, Feature extraction, Word representation

I. INTRODUCTION

Sentiment Analysis (SA) is the task of analyzing comments or reviews in order to extract the hidden opinion of customers. In recent years, SA has received a lot of attention due to its various potential applications such as stock market analysis [1], customer review analysis [2], travel [3], hotel booking [4], education [5], and political communication [6]. SA is a sub-field of NLP and can be framed as a sentiment classification problem. There are many methods to classify sentiment, for example dictionary-based and lexicon-based approaches. A vectorization step is needed to characterize the text before the training stage; word vector representation therefore also acts as a feature extraction step. Among the traditional feature extraction methods such as Bag-of-Words (BoW), Bag-of-N-grams (N-gram), and Term Frequency-Inverse Document Frequency (TF-IDF), the latter is applied in many SA works [2], [4]. Nguyen et al. [4] combined BoW and TF-IDF for feature extraction. Dzisevic et al. [7] fused TF-IDF with Latent Semantic Analysis (LSA) and Linear Discriminant Analysis (LDA) to reduce the dimension of the feature space. Ahuja et al. [8] indicated that word-level TF-IDF is more efficient than feature extraction by N-grams.

Although TF-IDF is applied in many works, it has the limitation of producing a high-dimensional space, which can hurt performance on a large corpus. Additionally, this method ignores the position of words within each sentence, so the relationships between words and their meaning within each paragraph are lost. Word2Vec was introduced by Mikolov et al. [9] in 2013 and is widely applied in NLP and SA due to its efficiency [10], [11]. In [12], the authors demonstrated that Word2Vec is better than BoW for sentiment analysis. Hitesh et al. [10] showed that text representation by Word2Vec achieves good performance compared with BoW and TF-IDF. Additionally, the word embedding methods Doc2Vec [13] and GloVe [14] are widely applied in many works [15], [16], [17]. Seyed Mahdi Rezaeinia et al. [18] combined several NLP techniques such as lexicon approaches, a word position algorithm, and Word2Vec/GloVe methods; the proposed method improved accuracy by more than 1.5%. Djaballah et al. [11] and Othman et al. [19] combined Word2Vec embeddings with a weighted-average TF-IDF computation. A clustering-based approach has also been combined with word embeddings to reduce the feature space [20].

However, the embedding method is not the only factor affecting performance, which mainly depends on choosing an appropriate word representation. Avinash and Sivasankar [21] compared TF-IDF and Doc2Vec on five distinct datasets; TF-IDF achieved good results only on the second and fifth datasets. Similar conclusions can be found in [22], [23]. In this paper, we investigate different word vector representations for SA in the Vietnamese language. The rest of this paper is organized as follows. Section 2 describes the word vector representation techniques. Section 3 illustrates our proposed method. Section 4 shows the experimental results. Finally, the conclusion is discussed in Section 5.

II. WORD VECTOR REPRESENTATION TECHNIQUES

Word embedding is an essential step that vectorizes text into a continuous vector space so that a classifier can be trained on it. There are two main families of word embedding methods [24], [25]: count-based embeddings and prediction-based embeddings. Count-based methods count the words that appear in the text and represent them as vectors; they include Bag-of-Words (BoW), Bag-of-N-grams (N-gram), co-occurrence matrices, and TF-IDF.
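For a quick illustration of the count-based family, scikit-learn's CountVectorizer (our tooling choice here, not one named by the paper) produces BoW and Bag-of-N-gram vectors from raw comments:

```python
from sklearn.feature_extraction.text import CountVectorizer

comments = ["món ăn rất ngon", "phục vụ rất tệ"]

# Unigram bag-of-words: each row holds the word counts of one comment.
bow = CountVectorizer()
X = bow.fit_transform(comments)
print(bow.get_feature_names_out())  # learned vocabulary
print(X.toarray())                  # count vectors, one row per comment

# Bag-of-N-grams is the same idea with ngram_range, e.g. unigrams + bigrams.
ngrams = CountVectorizer(ngram_range=(1, 2)).fit_transform(comments)
```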
A. TF-IDF

TF-IDF is a simple way to represent textual data as a feature vector. The Term Frequency (TF) is the number of times a word appears in a document divided by the total number of words in that document. Let t be a word, f(t, d) the number of occurrences of t in document d, and T the total number of words in the document:

$$\mathrm{TF}(t) = \frac{f(t, d)}{T} \qquad (1)$$

The Inverse Document Frequency (IDF) scores the importance of words. Some words appear in most documents but carry no meaning for sentiment classification, for example "thì" (to be), "mà" (yet), and "nhưng" (but). IDF is calculated as the logarithm of the number of documents in the corpus divided by the number of documents containing the specific word:

$$\mathrm{IDF}(t, D) = \log \frac{N}{|\{d \in D : t \in d\}|} \qquad (2)$$

where N is the total number of documents and the denominator is the number of documents containing the word t. Finally, TF-IDF is calculated as:

$$\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t) \times \mathrm{IDF}(t, D) \qquad (3)$$
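The three formulas above translate directly into code. Below is a minimal Python sketch of Eqs. (1)-(3), assuming pre-tokenized documents; it is an illustration, not the implementation used in the paper:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF vectors following Eqs. (1)-(3).

    docs: list of tokenized documents (lists of words).
    Returns one {word: weight} dict per document.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for doc in docs for word in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)  # f(t, d)
        total = len(doc)       # T
        vectors.append({
            # TF(t) * IDF(t, D); a word occurring in every document gets 0.
            t: (f / total) * math.log(n_docs / df[t])
            for t, f in counts.items()
        })
    return vectors

# Toy example with already-segmented Vietnamese comments.
docs = [["món_ăn", "rất", "ngon"], ["phục_vụ", "rất", "tệ"]]
print(tf_idf(docs))
```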
B. Word2Vec embedding

Mikolov et al. [9] introduced Word2Vec in 2013. It is a neural network consisting of an input layer, an output layer, and a single hidden layer. Each word in the corpus is represented as a corresponding one-hot vector: the input to Word2Vec is a vector of the form w1, w2, ..., wv, where v is the vocabulary size; each word is marked as 1 at the word's position, while every other position in the vector has a value of 0. Word2Vec includes two models (illustrated in Fig. 1): the Continuous Bag of Words (CBOW) and the Skip-gram. The CBOW model predicts the probability of a word based on the words next to it. In contrast, the Skip-gram model predicts the words close to a center word based on that word; how many nearby words are considered is controlled by a window-size parameter.

Fig. 1: Word2Vec CBOW model and Word2Vec Skip-gram model.
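Both Word2Vec variants can be trained, for example, with the gensim library; the hyperparameters below (vector size, window) are illustrative assumptions, since the paper does not report its settings:

```python
import numpy as np
from gensim.models import Word2Vec

# Tokenized training sentences (toy Vietnamese examples).
sentences = [["món_ăn", "rất", "ngon"], ["phục_vụ", "rất", "tệ"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram; window is the context size.
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# A whole comment can then be represented e.g. by averaging its word vectors.
doc_vec = np.mean([cbow.wv[w] for w in sentences[0]], axis=0)
```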
C. Global Vectors for Word Representation

Global Vectors for Word Representation (GloVe) was introduced by Pennington et al. [14]. It is an unsupervised learning method based on the construction of a word-word co-occurrence matrix, created from the input text and the probabilities of word occurrences. This is a symmetric square matrix in which each row or column is a vector representing the corresponding word in the corpus; its dimension equals the number of distinct words in the original text collection. A matrix X is created whose rows and columns correspond to the words appearing in the text, and the value X_ij is the number of co-occurrences of words i and j in the entire text. The probability of word j appearing given word i is then:

$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i} \qquad (4)$$

The GloVe text representation achieves good performance with small datasets and feature vector spaces [14]. In addition, this method performs well on several tasks such as finding similar words, semantic similarity, and identifying entity names.
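As a small sketch of the statistic behind Eq. (4), the following Python code builds the co-occurrence counts X_ij with a sliding window and derives P(j|i); note that this is only the input statistic of GloVe, not the full GloVe training procedure:

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count X[i][j]: co-occurrences of words i and j within a window."""
    X = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        for pos, word in enumerate(sent):
            for ctx in sent[max(0, pos - window):pos + window + 1]:
                if ctx != word:
                    X[word][ctx] += 1
    return X

def p_j_given_i(X, i, j):
    """Eq. (4): P(j|i) = X_ij / X_i, where X_i is the row sum for word i."""
    return X[i][j] / sum(X[i].values())

X = cooccurrence([["món_ăn", "rất", "ngon"], ["phục_vụ", "rất", "tệ"]])
print(p_j_given_i(X, "rất", "ngon"))
```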

D. Paragraph Vector embedding

Le and Mikolov [13] developed the Paragraph Vector embedding method (Doc2Vec), which is mainly based on the Word2Vec approach. Instead of representing text word by word like Word2Vec, in Doc2Vec each paragraph is represented as a single vector in a matrix D and each word from the text is represented as a unique vector in a matrix W (as illustrated in Fig. 2). There are two Doc2Vec models: Distributed Memory (DM), corresponding to CBOW, and Distributed Bag of Words (DBOW), corresponding to Skip-gram.

Fig. 2: Paragraph Vector embedding model.

A major advantage of the Doc2Vec method is that it can be learned from unlabeled data. Therefore, it often gives good results when training data are limited, and the DBOW variant, based on the Skip-gram model, also reduces the feature space.
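A brief gensim-based sketch of the two Doc2Vec variants (the toolkit and hyperparameters are our assumptions; dm=1 gives Distributed Memory, dm=0 gives DBOW):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

comments = [["món_ăn", "rất", "ngon"], ["phục_vụ", "rất", "tệ"]]
# Each paragraph (comment) gets a tag, so it receives its own vector in D.
corpus = [TaggedDocument(words, [i]) for i, words in enumerate(comments)]

dm = Doc2Vec(corpus, vector_size=100, min_count=1, dm=1)    # Distributed Memory
dbow = Doc2Vec(corpus, vector_size=100, min_count=1, dm=0)  # Distributed BoW

# Vector for an unseen comment, usable as a classifier feature.
vec = dm.infer_vector(["món_ăn", "tệ"])
```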
III. METHODOLOGY

The whole process of comparing the different word representation methods is illustrated in Fig. 3. All collected text goes through a pre-processing stage and is then represented by a word vector representation. Finally, the considered classifier is applied to predict the label.

Fig. 3: Overview of the word vector representation comparison.

Text pre-processing: As these are online comments from social media, the content may contain less meaningful tokens, which need to be removed from the datasets. Text pre-processing is a necessary step to clean the comments and reduce noise. The basic pre-processing steps are tokenization, punctuation removal, removing URLs, removing emails, removing hashtags, removing @user mentions, emoticon handling, removing numbers, and removal of duplicated letters.
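A minimal sketch of such a cleaning pipeline using Python's re module follows; the paper does not list its exact rules, so these patterns are illustrative assumptions:

```python
import re

def clean_comment(text: str) -> str:
    """Basic noise removal for social-media comments (assumed rules)."""
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"\S+@\S+", " ", text)       # email addresses
    text = re.sub(r"[#@]\w+", " ", text)       # hashtags and @user mentions
    text = re.sub(r"\d+", " ", text)           # numbers
    text = re.sub(r"(.)\1{2,}", r"\1", text)   # duplicated letters: "ngonnnn" -> "ngon"
    text = re.sub(r"[^\w\s]", " ", text)       # punctuation, emoticons included
    return re.sub(r"\s+", " ", text).strip().lower()

print(clean_comment("Món này ngonnnn quá!!! 10/10 http://t.co/abc @shop"))
# -> "món này ngon quá"
```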
The Vietnamese language differs from other languages in that a word can have a completely different meaning when used individually or in a phrase. For example, "đất" (soil) and "nước" (water), when combined, form "đất nước" (country). In this study, the pyvi library is applied for Vietnamese word segmentation. We also remove Vietnamese stopwords, since they contribute little meaning for sentiment analysis; some Vietnamese stopwords are "thì" (to be), "nhưng" (but), "là" (to be), and "vì" (because).
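For illustration, segmentation with the pyvi tokenizer and removal of a few stopwords might look like the sketch below; the stopword list is a small assumed sample, not the full list used in the study:

```python
from pyvi import ViTokenizer

STOPWORDS = {"thì", "nhưng", "là", "vì"}  # small assumed sample list

def segment(text: str) -> list[str]:
    """Segment Vietnamese text; compound words are joined by underscores."""
    tokens = ViTokenizer.tokenize(text).split()
    return [t for t in tokens if t.lower() not in STOPWORDS]

# "đất nước" is kept as one token "đất_nước" instead of two separate words.
print(segment("đất nước Việt Nam thì rất đẹp"))
```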
Word vector representation: In this study, we compare the following word vector representation techniques:
• Term Frequency - Inverse Document Frequency (TF-IDF)
• Word2Vec embedding
• GloVe embedding
• Doc2Vec embedding

Classification: Several classifiers commonly used for natural language processing and sentiment analysis tasks [26], [27] are considered: Logistic Regression, Naive Bayes, kNN, Random Forest, and Support Vector Machine.
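To make this stage concrete, the following scikit-learn sketch trains the five classifiers on TF-IDF features of toy comments, using the 70:30 split described in Section IV; classifier settings are library defaults, not the paper's reported configuration:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC

# Toy segmented comments; 1 = positive, 0 = negative.
texts = ["món_ăn rất ngon", "phục_vụ quá tệ", "rất hài_lòng", "sẽ không quay lại"]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42, stratify=labels)  # 70:30 split

classifiers = {
    "LR": LogisticRegression(),
    "NB": MultinomialNB(),
    "1-NN": KNeighborsClassifier(n_neighbors=1),
    "RF": RandomForestClassifier(),
    "SVM": LinearSVC(),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```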

IV. EXPERIMENTS AND RESULTS

Following [27], we used two datasets for the experiments. They consist of comments and reviews on food, collected by streetcodevn.com, with two classes. The characteristics of each dataset are presented in Table I. Each dataset is divided into training and testing data with a 70:30 ratio, and we use accuracy to measure classification performance.

TABLE I: Summary of the experimental datasets.

No  Name       Emotional polarity  Comments  Total words
1   Dataset 1  Positive            15,000    2,962,235
               Negative            15,000
2   Dataset 2  Positive            5,000     1,003,237
               Negative            5,000

Due to the informal, loosely structured, and brief language found in the comments, models that hold local information about the documents, such as TF-IDF, perform better than global approaches such as GloVe, Word2Vec, and Doc2Vec. Models with a global structure may not represent well comments that are expressed informally, loosely, or briefly. Tables II and III present the classification results on datasets 1 and 2, respectively. For dataset 1, the best accuracy is obtained with the SVM classifier and the TF-IDF method. The average accuracy over the five classifiers is shown in the last column for each word embedding technique; ranked by this average, the order is TF-IDF, GloVe, Word2Vec, Doc2Vec. Figures 4 and 5 illustrate the performance of the five classifiers with the four word embedding techniques on datasets 1 and 2, respectively. We observe that TF-IDF clearly outperforms the other methods on both datasets for all classifiers.

TABLE II: Results of word vector representations applied on dataset 1 (accuracy, %).

Technique   LR     NB     1-NN   RF     SVM    Average
TF-IDF      86.77  83.37  68.66  84.89  86.80  82.09
GloVe       68.26  49.68  61.92  68.68  69.04  63.51
Word2Vec    49.86  49.68  50.57  49.84  49.88  50.00
Doc2Vec     49.73  49.78  50.41  48.71  49.87  49.70

TABLE III: Results of word vector representations applied on dataset 2 (accuracy, %).

Technique   LR     NB     1-NN   RF     SVM    Average
TF-IDF      85.67  82.83  70.63  83.27  84.83  81.44
GloVe       63.93  50.87  60.70  66.67  63.67  61.16
Word2Vec    49.30  50.87  50.93  49.40  49.37  49.97
Doc2Vec     48.67  49.23  51.87  48.90  48.73  49.48

Fig. 4: The comparison of different word embedding methods and classifiers on dataset 1.

Fig. 5: The comparison of different word embedding methods and classifiers on dataset 2.

V. CONCLUSION

In this study, we have compared several word vector representation techniques for Vietnamese-language sentiment analysis, namely TF-IDF, Word2Vec, Doc2Vec, and GloVe. For classification, the study used five well-known classifiers: Logistic Regression, Naive Bayes, kNN, Random Forest, and SVM. The highest result is obtained by the SVM classifier with the TF-IDF word representation method. The results may improve further with a larger corpus for building the word embeddings.

DATA AVAILABILITY

The datasets generated and/or analysed during the current study are available from the corresponding author on reasonable request.

REFERENCES

[1] Dattatray P. Gandhmal and K. Kumar. Systematic analysis and review of stock market prediction techniques. 34:100190, 2019.
[2] Tanjim Ul Haque, Nudrat Nawal Saber, and Faisal Muhammad Shah. Sentiment analysis on large scale Amazon product reviews. In 2018 IEEE International Conference on Innovative Research and Development (ICIRD), pages 1-6. IEEE, 2018.
[3] Ana Valdivia, Emiliya Hrabova, Iti Chaturvedi, M. Victoria Luzón, Luigi Troiano, Erik Cambria, and Francisco Herrera. Inconsistencies on TripAdvisor reviews: A unified index between users and sentiment analysis methods. 353:3-16, 2019.
[4] Thuy Nguyen-Thanh and Giang T.C. Tran. Vietnamese sentiment analysis for hotel review based on overfitting training and ensemble learning. In Proceedings of the Tenth International Symposium on Information and Communication Technology (SoICT 2019), pages 147-153. ACM Press, 2019.
[5] Kiet Van Nguyen, Vu Duc Nguyen, Phu X. V. Nguyen, Tham T. H. Truong, and Ngan Luu-Thuy Nguyen. UIT-VSFC: Vietnamese students' feedback corpus for sentiment analysis. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pages 19-24. IEEE, 2018.
[6] Martin Haselmayer and Marcelo Jenny. Sentiment analysis of political communication: combining a dictionary approach with crowdcoding. 51(6):2623-2646, 2017.
[7] Robert Dzisevic and Dmitrij Sesok. Text classification using different feature extraction approaches. In 2019 Open Conference of Electrical, Electronic and Information Sciences (eStream), pages 1-4. IEEE, 2019.
[8] Ravinder Ahuja, Aakarsha Chug, Shruti Kohli, Shaurya Gupta, and Pratyush Ahuja. The impact of features extraction on the sentiment analysis. 152:341-348, 2019.
[9] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. 2013.
[10] Msr Hitesh, Vedhosi Vaibhav, Y.J Abhishek Kalki, Suraj Harsha Kamtam, and Santoshi Kumari. Real-time sentiment analysis of 2019 election tweets using Word2Vec and random forest model. In 2019 2nd International Conference on Intelligent Communication and Computational Techniques (ICCT), pages 146-151. IEEE, 2019.
[11] Kamel Ahsene Djaballah, Kamel Boukhalfa, and Omar Boussaid. Sentiment analysis of Twitter messages using Word2Vec by weighted average. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security (SNAMS), pages 223-228. IEEE, 2019.
[12] Elena Rudkowsky, Martin Haselmayer, Matthias Wastian, Marcelo Jenny, Stefan Emrich, and Michael Sedlmair. More than bags of words: Sentiment analysis with word embeddings. 12(2):140-157, 2018.
[13] Quoc V. Le and Tomas Mikolov. Distributed representations of sentences and documents. 2014.
[14] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543. Association for Computational Linguistics, 2014.
[15] Metin Bilgin and Izzet Fatih Senturk. Sentiment analysis on Twitter data with semi-supervised Doc2Vec. In 2017 International Conference on Computer Science and Engineering (UBMK), pages 661-666. IEEE, 2017.
[16] Md. Tazimul Hoque, Ashraful Islam, Eshtiak Ahmed, Khondaker A. Mamun, and Mohammad Nurul Huda. Analyzing performance of different machine learning approaches with Doc2Vec for classifying sentiment of Bengali natural language. In 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE), pages 1-5. IEEE, 2019.
[17] Y. Sharma, G. Agrawal, P. Jain, and T. Kumar. Vector representation of words for sentiment analysis using GloVe. pages 279-284, 2017.
[18] Seyed Mahdi Rezaeinia, Rouhollah Rahmani, Ali Ghodsi, and Hadi Veisi. Sentiment analysis based on improved pre-trained word embeddings. 117:139-147, 2019.
[19] Rania Othman, Youcef Abdelsadek, Kamel Chelghoum, Imed Kacem, and Rim Faiz. Improving sentiment analysis in Twitter using sentiment specific word embeddings. In 2019 10th IEEE International Conference on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications (IDAACS), pages 854-858. IEEE, 2019.
[20] Eissa M. Alshari, Azreen Azman, Shyamala Doraisamy, Norwati Mustapha, and Mustafa Alkeshr. Improvement of sentiment analysis based on clustering of Word2Vec features. In 2017 28th International Workshop on Database and Expert Systems Applications (DEXA), pages 123-126. IEEE, 2017.
[21] M. Avinash and E. Sivasankar. A study of feature extraction techniques for sentiment analysis. In Ajith Abraham, Paramartha Dutta, Jyotsna Kumar Mandal, Abhishek Bhattacharya, and Soumi Dutta, editors, Emerging Technologies in Data Mining and Information Security, volume 814 of Advances in Intelligent Systems and Computing, pages 475-486. Springer Singapore, 2019.
[22] Xiaofang Jin and Ying Xu. Research on the sentiment analysis based on machine learning and feature extraction algorithm. In 2019 IEEE 10th International Conference on Software Engineering and Service Science (ICSESS), pages 366-369. IEEE, 2019.
[23] Helmi Imaduddin, Widyawan, and Silmi Fauziati. Word embedding comparison for Indonesian language sentiment analysis. In 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT), pages 426-430. IEEE, 2019.
[24] K.S. Kalaivani, S. Uma, and C.S. Kanimozhiselvi. A review on feature extraction techniques for sentiment classification. In 2020 Fourth International Conference on Computing Methodologies and Communication (ICCMC), pages 679-683. IEEE, 2020.
[25] Tamara Katic and Nemanja Milicevic. Comparing sentiment analysis and document representation methods of Amazon reviews. In 2018 IEEE 16th International Symposium on Intelligent Systems and Informatics (SISY), pages 000283-000286. IEEE, 2018.
[26] Huu-Thanh Duong and Vinh Truong Hoang. A survey on the multiple classifier for new benchmark dataset of Vietnamese news classification. In 2019 11th International Conference on Knowledge and Smart Technology (KST), pages 23-28. IEEE, 2019.
[27] Thien Ho Huong and Vinh Truong Hoang. A data augmentation technique based on text for Vietnamese sentiment analysis. In Proceedings of the 11th International Conference on Advances in Information Technology (IAIT 2020). Association for Computing Machinery, New York, NY, USA, 2020.