Sentiment Analysis using Machine Learning and Deep Learning Models on Movies Reviews

Yomna Eid Rizk
Information Technology and Computer Science School, Center for Informatics Sciences
Nile University, Giza, Egypt
[email protected]

Walaa Medhat Asal
Information Technology and Computer Science School, Center for Informatics Sciences
Nile University, Giza, Egypt
[email protected]

Abstract— The huge amount of data generated and transferred each day on the Internet increases the need to automate knowledge-extraction tasks. Sentiment analysis is an ongoing research field in knowledge extraction that faces many challenges. In this paper, different machine learning, neural network, and deep learning models were evaluated over the IMDB benchmark dataset of movie reviews. Moreover, various word-embedding techniques were tested. Among all the presented models, the results of this work showed that the Long Short-Term Memory (LSTM) model with Bidirectional Encoder Representations from Transformers (BERT) embeddings achieved the highest results, with an accuracy of 93%.

Keywords—Sentiment Analysis, Natural Language Processing, Deep Learning, Machine Learning.

I. INTRODUCTION

Today's data-driven world has given rise to the automation of most tasks that were previously performed manually. Data is generated from different sources and in many formats, varying from structured to semi-structured to unstructured. Moreover, the data generation rate is ever-increasing, which introduces challenges such as data storage, data access, data preprocessing, and computational cost. The characteristics of the dataset also play a crucial role in the effectiveness of the models [1].

The data generated on different social media platforms is considered the backbone of many Natural Language Processing (NLP) tasks. Hence, monitoring social media data is crucial for many NLP applications, such as recommendation systems, machine translation, text summarization, and sentiment analysis.

Sentiment analysis, or opinion mining, is one of the most essential NLP applications in the field of text mining. In sentiment analysis, context is extracted to reach the meaning behind the written words; "sentiment" refers to the real meaning and intuition behind written or spoken words. Sentiment analysis, then, is the act of inferring the opinion related to someone or something. Automating the sentiment analysis task saves a great deal of time and effort, and there are various techniques to achieve this goal. These techniques vary from rule-based approaches to machine learning approaches and, recently, deep learning approaches [2][3].

The subtasks of sentiment analysis are as follows. First, the text that contains sentiment is read in a proper format [4]. Second, the input text is preprocessed, e.g., normalization to lower case, stemming, stop-word removal, and tokenization. Then, features are extracted by encoding the text and representing it as numbers instead of characters. This encoded data is fed to a sentiment classification algorithm to be trained on. After all these steps, the sentiment polarity is detected using the proper classification model.

In this article, various machine learning, neural network, deep learning, and word-embedding models are presented and evaluated. According to the obtained results, the highest test accuracy was achieved by the LSTM model with BERT embeddings.

The main contribution of this paper is that different implementations of classic machine learning, neural network, and recent deep learning models are presented for the sentiment analysis task, and their achieved results are reported. Moreover, various word-embedding techniques were tested along with these different architectures.

The paper is organized as follows. First, an overview of recent related work is presented in Section II. The conducted experiments are then discussed in detail in Section III. Section IV shows the results of the tested architectures, along with a discussion. Finally, Section V states the conclusion of the work done in this study.

II. RELATED WORK

Customers' opinions constitute powerful and informative data that should be retrieved. Hence, sentiment analysis using automated frameworks is a crucial mission for companies. These frameworks can assist companies in guiding customers, recommending suitable products, and resolving negative feedback. Moreover, sentiment analysis can be applied to assess competitors and to avoid repeating their mistakes. Different models can be applied to extract the sentiment of a sentence.

These models vary from rule-based to machine learning and recent deep learning models. Moreover, there are different challenges that hinder these models, such as training speed, context capturing, and model complexity. This work presents a novel technique that tries to overcome these limitations, with the highest accuracy achieved using LSTM models along with BERT embeddings.
The authors of [1] stated that combining deep learning architectures with word embeddings outperformed the traditional Term Frequency-Inverse Document Frequency (TF-IDF) based models. In [5], the authors investigated different classification models over binary and multi-class label datasets; this work tried to overcome the high computational cost of Recursive Neural Tensor Networks (RNTN) by implementing a low-rank RNTN. In [6], the Random Forest algorithm proved to be the best algorithm for classifying the IMDB dataset, while Ripper rule learning achieved the worst results among the tested models. In [7], the author relied on machine learning techniques to classify IMDB movie reviews, leveraging the Logistic Regression model to classify the sentiment; the achieved result was 0.6806 in the kappa statistic.

With regard to word embedding as a numerical representation technique for words, Word2Vec was used in [8] with different deep learning and hybrid models. The results of that research showed that the hybrid approach outperformed the rest of the models, with an accuracy of 89.2%. A fine-tuned, fast, and small pre-trained language model called DistilBERT was introduced in [9]. The contribution of that research was that DistilBERT proved to be 40% smaller than the original BERT model while retaining 97% of BERT's gained information; moreover, it was 60% faster than the original BERT.

The authors of [10] relied on direct word embedding for small text regions, instead of using word vectors or even a traditional bag of n-grams. The best performance was achieved by a seq2-bown-CNN model with three parallel layers, two of which had 1000 neurons per layer while the other had just 20 neurons.

Long text data faces many challenges, one of which is the quadratic complexity of self-attention. To overcome this limitation, the authors of [11] proposed the BP-Transformer, which tries to balance the trade-off between model capacity and computational complexity; the proposed model obtained an accuracy of 92.12% on the IMDB dataset. In [12], a novel and stable RNN-based architecture for complex sequential data processing was presented, developed from the time-discretization of second-order systems of non-linear ordinary differential equations.

III. EXPERIMENTAL WORK

A. Dataset

IMDB is a benchmark dataset collected by Stanford researchers for sentiment polarity classification tasks [13]. The IMDB website can be considered a professional movie reviews repository. The classification task is to detect whether the text that contains sentiment is positive or negative, based on the extracted features.

The IMDB dataset is balanced, with 12.5K positive sentiment reviews and 12.5K negative sentiment reviews. However, negative sentiment reviews tend to be shorter than positive ones. All reviews are written in English and contain various symbols, such as hashtags and exclamation marks, that are truncated later.

B. Data Preprocessing

In this study, different methods were used to perform the sentiment classification task. As mentioned before, sentiment analysis follows a sequence of steps that varies from data preparation and preprocessing to data processing. First, the dataset is represented as a data frame using the Pandas Python library. Then, the data preprocessing steps are carried out as follows (a sketch of these steps is given after the list):

• Tokenization: All reviews are split into individual words to form a vector of words for the word-embedding techniques. The logistic regression and random forest models use the PorterStemmer, while the rest of the models use the Keras tokenizer.

• Data Cleaning and Normalization: Performed over the whole dataset for proper interpretation. The aim is to remove non-alphabetic characters, HTML tags, and stop words, which can be truncated using the NLTK wordnet [14]. Punctuation marks are also deleted, although they can play an important role in some NLP tasks, especially sarcasm detection [15]. Multiple spaces are removed as well.

• Data Encoding: One of the breakthroughs of deep learning is representing words as real-valued vectors. Different word-embedding techniques can be used for text representation, varying from the traditional Bag of Words (BOW) with TF-IDF, to the famous Word2Vec model by Google [16], the FastText representation by Facebook [17], GloVe by Stanford University [18], and finally the cutting-edge language model BERT [19].
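A minimal sketch of the preparation and cleaning steps is shown below. The file name imdb_reviews.csv, the column names, and the exact regular expressions are illustrative assumptions; the paper only states that the reviews are loaded into a Pandas data frame, cleaned with NLTK, and stemmed with the PorterStemmer.

```python
import re

import pandas as pd
from nltk.corpus import stopwords          # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize    # requires nltk.download("punkt")

# Hypothetical file and column names used only for illustration.
df = pd.read_csv("imdb_reviews.csv")        # columns: "review", "sentiment"

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean_review(text: str) -> str:
    """Lower-case, strip HTML tags and non-alphabetic characters,
    drop stop words, and stem the remaining tokens."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)     # keep alphabetic characters only
    text = re.sub(r"\s+", " ", text).strip()  # collapse multiple spaces
    tokens = [stemmer.stem(t) for t in word_tokenize(text) if t not in stop_words]
    return " ".join(tokens)

df["clean_review"] = df["review"].apply(clean_review)
```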
In this research, BOW, GloVe, and BERT embeddings were used. BOW is an algorithm for feature extraction from text; it is called a "bag" because word order is neglected [20]. The TF-IDF method builds on BOW and tries to assign each word a weight that reflects its importance [21].
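As an illustration, BOW counts and TF-IDF weights over the cleaned reviews can be produced with scikit-learn as sketched below; the 5,000-feature limit mirrors the most-frequent-words setting reported in Table I, and the variable names are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Plain bag-of-words counts: word order is discarded.
bow_vectorizer = CountVectorizer(max_features=5000)
X_bow = bow_vectorizer.fit_transform(df["clean_review"])

# TF-IDF re-weights the counts so that frequent but uninformative
# words receive lower weights.
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf_vectorizer.fit_transform(df["clean_review"])
```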
GloVe (Global Vectors) is an unsupervised learning algorithm that obtains vector representations for words by counting word-word co-occurrence statistics over a corpus.
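Pre-trained GloVe vectors are commonly loaded into an embedding matrix that initializes a Keras Embedding layer. The sketch below assumes the publicly available glove.6B.100d.txt file and 100-dimensional vectors; neither the file nor the dimensionality is specified in the paper.

```python
import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(df["clean_review"])

EMBED_DIM = 100
glove_vectors = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.split()
        glove_vectors[word] = np.asarray(values, dtype="float32")

# Each row of this matrix initializes one vocabulary entry of the
# Keras Embedding layer.
vocab_size = min(5000, len(tokenizer.word_index) + 1)
embedding_matrix = np.zeros((vocab_size, EMBED_DIM))
for word, idx in tokenizer.word_index.items():
    if idx < vocab_size and word in glove_vectors:
        embedding_matrix[idx] = glove_vectors[word]
```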
BERT is a pre-trained language model for high-quality feature extraction. The BERT-based setup used here consists of three layers, with a maximum sequence length of 500. The input layer takes three variables: token IDs, masks, and segments. The second layer is a hidden layer with 768 units and a ReLU activation function. The third layer is the output layer, a dense layer with two units and a softmax activation function.
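A hedged sketch of this three-layer setup is given below, using a TensorFlow Hub BERT encoder as the backbone. The hub URL, the frozen encoder, and the sparse categorical cross-entropy loss are assumptions; the paper only specifies the layer structure described above.

```python
import tensorflow as tf
import tensorflow_hub as hub

MAX_LEN = 500  # maximum sequence length for the BERT model

# Assumed backbone: a TF-Hub BERT encoder returning a pooled 768-d vector.
bert_layer = hub.KerasLayer(
    "https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/2",
    trainable=False)

input_word_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_word_ids")
input_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_mask")
segment_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="segment_ids")

pooled_output, _ = bert_layer([input_word_ids, input_mask, segment_ids])

# Hidden layer (768 units, ReLU) followed by a 2-unit softmax output,
# matching the three-layer description above.
hidden = tf.keras.layers.Dense(768, activation="relu")(pooled_output)
output = tf.keras.layers.Dense(2, activation="softmax")(hidden)

bert_classifier = tf.keras.Model(
    inputs=[input_word_ids, input_mask, segment_ids], outputs=output)
bert_classifier.compile(optimizer="adam",
                        loss="sparse_categorical_crossentropy",
                        metrics=["accuracy"])
```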
C. Data Processing

To classify the movie reviews, different approaches and word-embedding methods were tested. The models tested throughout this paper are illustrated in Figure 1, and they were as follows (illustrative sketches of the classic and the deep models are given after the list and the paragraph that follows it):

Figure 1. Framework of presented models.

• Logistic Regression classifier: A statistical machine learning model for binary classification tasks. It is used in this research to classify the review sentiment based on the previously extracted features.

• Naïve Bayes classifier: It applies Bayes' theorem under the assumption that there are no dependencies between the variables given the outcome. Recent research has shown it to be a competitive text classification algorithm.

• Random Forest classifier: A decision-tree-based classification algorithm. In this task, each decision tree leaf node indicates the predicted label.

• Simple Neural Network (NN) classifier: A technique that mimics the human brain in performing different tasks. In this research, a simple NN was built from three basic layers: an embedding layer with an input length of 100 and an output vector dimension of 100, a flattening layer for text vectorization, and finally a dense layer with a sigmoid activation function. The total number of parameters in the dense layer is 10,001. The model is compiled with the Adam optimizer and a binary cross-entropy loss.

• Convolutional Neural Network (CNN) classifier: A deep learning algorithm that treats its input as an image. The tested CNN architecture consists of, first, an embedding layer, then a 1D convolutional layer with 128 filters, a kernel of size 5, and a sigmoid activation function. These two layers are followed by a max-pooling layer that scales down the amount of generated information; the maximum sentence length in the max-pooling layer is 100. The last layer is a dense layer with one unit and a sigmoid activation function.

• LSTM classifier: A form of Recurrent Neural Network (RNN) that can learn long-term dependencies to catch the context behind sentences. The proposed LSTM architecture consists of an embedding layer, followed by a 128-unit LSTM layer, and then a dense layer with a sigmoid activation function.
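A minimal sketch of the three classic classifiers over the TF-IDF features is shown below. Scikit-learn defaults are assumed wherever the paper does not state hyper-parameters, MultinomialNB is used as a common Naïve Bayes variant for text, and the features are assumed to be already split into training and test sets (a split sketch appears in Section IV).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

classic_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Naive Bayes": MultinomialNB(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

# X_train / X_test hold TF-IDF features, y_train / y_test the polarity labels.
for name, model in classic_models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(name, accuracy_score(y_test, predictions))
```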
All the preceding models were optimized using the Adam optimizer. The experiments were conducted on Colab, a powerful cloud notebook developed by Google that facilitates research work, as there is no need to install most of the packages [22]. The personal computer specifications were a Core i5-5200U processor and 6 GB of RAM, running Windows 10 with Python 3. The libraries used for the implementation were Keras, TensorFlow, scikit-learn, BERT, NLTK, NumPy, and Pandas.
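The three Keras architectures described above can be sketched as follows. The layer sizes follow the descriptions in Section III-C; the shared vocabulary size, the reuse of the same embedding settings across all three models, and the exact pooling variant are assumptions.

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv1D, Dense, Embedding, Flatten,
                                     GlobalMaxPooling1D, LSTM)

VOCAB_SIZE = 5000   # most frequent words kept after tokenization
MAX_LEN = 100       # maximum padded sentence length
EMBED_DIM = 100     # embedding output dimension

def simple_nn():
    # Embedding -> Flatten -> single sigmoid unit (10,001 dense parameters).
    return Sequential([
        Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
        Flatten(),
        Dense(1, activation="sigmoid"),
    ])

def cnn():
    # Embedding -> Conv1D (128 filters, kernel 5) -> max pooling -> sigmoid unit.
    return Sequential([
        Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
        Conv1D(128, 5, activation="sigmoid"),
        GlobalMaxPooling1D(),   # one possible pooling choice; the paper states only "max pooling"
        Dense(1, activation="sigmoid"),
    ])

def lstm():
    # Embedding -> 128-unit LSTM -> sigmoid unit.
    return Sequential([
        Embedding(VOCAB_SIZE, EMBED_DIM, input_length=MAX_LEN),
        LSTM(128),
        Dense(1, activation="sigmoid"),
    ])

for build in (simple_nn, cnn, lstm):
    model = build()
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```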
IV. RESULTS AND DISCUSSION

After applying the required preprocessing techniques, the training step is performed. The data is split into 80% for training (20K movie reviews) and 20% for testing (5K movie reviews); this split ratio is used because it is common in the machine learning field. The training and testing data are balanced, with an equal number of positive and negative sentiment reviews. The total number of instances used in the modeling process is shown in Table I.
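A sketch of such a split is shown below; the stratify argument and fixed random seed are assumptions used to keep both partitions class-balanced and reproducible, and the feature matrix and label column come from the earlier sketches.

```python
from sklearn.model_selection import train_test_split

# 80% training / 20% testing, keeping positive and negative reviews balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X_tfidf, df["sentiment"], test_size=0.2,
    stratify=df["sentiment"], random_state=42)
```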
TABLE I. Total number of items in the dataset.

Positive reviews                              12500
Vocabulary size after tokenization            67196
Most frequent words                            5000
Maximum sentence length in padding layer        100
Maximum sequence length (BERT model)            500

The results obtained by all tested models are shown in Figure 2. According to the reported results, the LSTM model with BERT embeddings outperformed all other models with an accuracy of 93%.

Figure 2. Models test comparison.

This result is competitive with other developed models. The reason behind the BERT embedding's outperformance is its dynamic word representation. The combination of BERT with LSTM achieved the best performance because BERT, being a pre-trained model, solves many of the LSTM's limitations, such as slow training. Moreover, BERT learns context from both directions simultaneously.

The detailed test results achieved in this research for all models are illustrated in Table II. All these results are stated considering the different architectures and word-embedding techniques. For the use case presented in this work, accuracy is a suitable evaluation metric, as the dataset is balanced.

TABLE II. Machine Learning (ML) and Deep Learning (DL) Models Results.

Model                 Type    BOW     GloVe   BERT
Logistic Regression   ML      0.860   0.789   0.790
Naïve Bayes           ML      0.711   0.871   0.769
Random Forest         ML      0.836   0.653   0.864
Simple NN             DL      0.801   0.732   0.745
CNN                   DL      0.836   0.837   0.711
LSTM                  DL      0.721   0.827   0.930

Table III illustrates the embedding that achieved the highest test accuracy for each architecture. It shows that the BOW embedding achieved the highest results when used with the Logistic Regression and simple NN models. On the other hand, the GloVe embedding obtained the highest scores when utilized with the Naïve Bayes and CNN classifiers. Finally, the BERT embedding proved to be the best choice for the Random Forest and LSTM models.

TABLE III. Highest Embedding Results.

Model                 BOW    GloVe   BERT
Logistic Regression   ✓      -       -
Naïve Bayes           -      ✓       -
Random Forest         -      -       ✓
Simple NN             ✓      -       -
CNN                   -      ✓       -
LSTM                  -      -       ✓

V. CONCLUSION

In this paper, various machine learning and deep learning models were investigated on the IMDB benchmark dataset. Different word-embedding methods were also tested on all architectures. The BERT word-embedding model achieved the highest score compared to all other embedding models. The obtained results showed that the LSTM model with BERT embeddings achieved state-of-the-art results, with an accuracy of 93%. The future work of this research is to apply a more versatile set of algorithms and word-embedding techniques, to leverage the continuous development of machine learning models, deep learning models, and embedding techniques, especially over different languages.

REFERENCES

[1] N. C. Dang, M. N. Moreno-García, and F. De la Prieta, "Sentiment analysis based on deep learning: A comparative study," Electron.
[2] M. Birjali, M. Kasri, and A. Beni-Hssane, "A comprehensive survey on sentiment analysis: Approaches, challenges and trends," Knowledge-Based Systems, vol. 226, 107134, 2021.
[3] P. Yang and Y. Chen, "A survey on sentiment analysis by using machine learning methods," 2017 IEEE 2nd Information Technology, Networking, Electronic and Automation Control Conference, 2017.
[4] W. Medhat, A. Hassan, and H. Korashy, "Sentiment analysis algorithms and applications: A survey," Ain Shams Eng. J., vol. 5.
[5] H. Pouransari, "Deep learning for sentiment analysis of movie reviews," CS224N Proj., pp. 1–8, 2014. [Online]. Available: http://web.stanford.edu/class/cs224d/reports/PouransariHadi.pdf
[6] M. Yasen and S. Tedmori, "Movies reviews sentiment analysis and classification," 2019 IEEE Jordan Int. Jt. Conf. Electr. Eng. Inf. Technol. (JEEIT), pp. 860–865, 2019.
[7] A. Li, "Sentiment Analysis for IMDb Movie Review," December 2019.
[8] N. Mohamed Ali, M. M. A. El Hamid, and A. Youssif, "Sentiment Analysis for Movies Reviews Dataset Using Deep Learning Models," Int. J. Data Min. Knowl. Manag. Process, vol. 9, no. 3.
[9] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter," 2019. [Online]. Available: http://arxiv.org/abs/1910.01108
[10] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," NAACL HLT 2015 - Proc. Conf., pp. 103–112, 2015.
[11] Z. Ye, Q. Guo, Q. Gan, X. Qiu, and Z. Zhang, "BP-Transformer: Modelling Long-Range Context via Binary Partitioning," 2019.
[12] T. K. Rusch and S. Mishra, "Coupled Oscillatory Recurrent Neural Network (coRNN): An accurate and (gradient) stable architecture for learning long time dependencies," 2020. [Online].
[13] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts, "Learning word vectors for sentiment analysis," ACL-HLT 2011 - Proc. 49th Annu. Meet. Assoc. Comput. Linguist. Hum. Lang. Technol., 2011.
[14] M. Mäntylä, D. Graziotin, and M. Kuutila, "The Evolution of Sentiment Analysis - A Review of Research Topics, Venues, and Top Cited Papers," Computer Science Review, 2018.
[15] P. Tungthamthiti, K. Shirai, and M. Mohd, "Recognition of sarcasm in tweets based on concept level sentiment analysis and supervised learning approaches," Proc. 28th Pacific Asia Conf. Lang. Inf. Comput. (PACLIC 2014), pp. 404–413, 2014.
[16] T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean, "Distributed Representations of Words and Phrases and their Compositionality," in Proc. Advances in Neural Information Processing Systems 26, pp. 3111–3119, 2013.
[17] A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov, "Bag of tricks for efficient text classification," 15th Conf. Eur. Chapter Assoc. Comput. Linguist. (EACL 2017), pp. 427–431, 2017.
[18] J. Pennington, R. Socher, and C. Manning, "GloVe: Global Vectors for Word Representation," in Proc. 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," NAACL HLT 2019 - Proc. Conf., 2019.
[20] D. Mohey, "Enhancement Bag-of-Words Model for Solving the Challenges of Sentiment Analysis," Int. J. Adv. Comput. Sci. Appl., vol. 7, no. 1, pp. 244–252, 2016, doi: 10.14569/ijacsa.2016.070134.
[21] B. Das and S. Chakraborty, "An Improved Text Sentiment Classification Model Using TF-IDF and Next Word Negation," 2018.
[22] T. Carneiro, R. V. M. Da Nobrega, T. Nepomuceno, G. Bin Bian, V. H. C. De Albuquerque, and P. P. R. Filho, "Performance Analysis of Google Colaboratory as a Tool for Accelerating Deep Learning Applications," IEEE Access, vol. 6, pp. 61677–61685, 2018.