
Sentiment analysis for up-to-date film reviews

January 2024, HUST
Data Science

Pham Duc Thanh
20210795
Vu Nhat Minh
20214919
Bui Anh Nhat
20210657
Dao Ha Xuan Mai
20210562
Ta Quang Duy
20214884
Table of Contents
1 Abstract

2 Introduction

3 Dataset
3.1 Data source
3.2 Data preprocessing
3.3 Exploratory Data Analysis
3.3.1 Analyzing text statistics
3.3.1.1 Number of words in reviews
3.3.1.2 Number of stop words in reviews
3.3.2 N-gram exploration and word cloud
3.3.3 Topic modeling exploration with pyLDAvis
3.4 TF-IDF
3.5 Word Embedding (word2vec)
3.5.1 One-Hot Vectors
3.5.2 The Skip-Gram Model
3.5.3 The Continuous Bag of Words Model

4 Methodology
4.1 Machine learning methods
4.1.1 XGBoost
4.1.2 SVM
4.2 Deep learning methods
4.2.1 Convolutional neural network
4.2.2 RNN
4.2.3 LSTM
4.2.4 GRU
4.2.5 BERT

5 Results
5.1 Metrics used
5.2 Model selection
5.3 Results

6 Implementing in real life

7 Conclusion and future work

8 References


1 Abstract

Nowadays, the number of movies being released has skyrocketed thanks to the development
of the film industry. As a result, there are numerous good movies, but also many
low-quality ones. On the other hand, only a few websites review up-to-date
movies. Therefore, our project aims to gather opinions (comments, ratings) on previously released
films by scraping data from Rotten Tomatoes and IMDb. This gives us a dataset of
reviews. We then clean and preprocess the dataset and apply sentiment analysis techniques
to correctly classify the reviewers’ opinions, which reflect the quality of those movies. By
doing so, we aim to create a program that can assess up-to-date movies based on their reviews.

2 Introduction
In the vibrant realm of movies, where stories unfold and emotions run high, the voices of
film enthusiasts reverberate through the digital corridors of reviews. This project sets out to
explore the beating heart of audience reactions in up-to-date film critiques. Using sentiment
analysis, we aim to tap into the diverse sentiments expressed by viewers, unraveling the threads
of excitement, disappointment, and everything in between. As we navigate the ever-changing
landscape of cinema, this analysis promises to unveil the authentic emotions that shape our
cinematic experiences.
Movies, with their ability to transport us to different worlds and evoke a myriad of feelings,
inspire conversations that resonate across online platforms. In this project, we take a closer
look at the sentiments embedded in contemporary film reviews, embracing the vast spectrum
of opinions shared by audiences. Through the lens of sentiment analysis, our goal is to capture
the essence of what viewers truly feel about the latest cinematic releases. In a world buzzing
with instant reactions, this exploration promises to uncover the collective emotions that define
our relationship with the silver screen.
In a world where every film elicits a unique emotional response, the digital landscape
becomes a canvas for audiences to paint their opinions. This project immerses itself in the rich
tapestry of up-to-date film reviews, employing sentiment analysis as a compass to navigate the
sea of sentiments. As we embark on this journey, our aim is to understand the ebb and flow
of audience reactions, from the highs of cinematic delight to the lows of critique. Through the
natural language processing lens, we endeavor to capture the genuine emotions that breathe
life into the words of film enthusiasts in our interconnected world.

3 Dataset
3.1 Data source
Our dataset comprises over 140,000 film reviews gathered from prominent movie-review
platforms. The data was extracted from IMDb and Rotten Tomatoes, both globally
recognized platforms. Selenium, a browser automation tool, was used for
the crawling process. The crawling procedure can be outlined as follows:
- Find the list of movie titles and their corresponding URLs.


- For each movie URL, we crawl the reviewer’s name, the rating, the date the review was
posted, the review body and the movie name.
- We assign each review a class based on its rating score: (1-2: Very Bad, 3-4: Bad,
5-6: Decent, 7-8: Recommend, 9-10: Exceptional)
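
The rating-to-class rule above can be sketched in a few lines of Python. This is only an illustration of the mapping; the function name `rating_to_class` is hypothetical and not taken from the project code.

```python
def rating_to_class(rating: int) -> str:
    """Map a 1-10 rating to one of the five sentiment classes."""
    if not 1 <= rating <= 10:
        raise ValueError(f"rating must be in 1..10, got {rating}")
    labels = ["Very Bad", "Bad", "Decent", "Recommend", "Exceptional"]
    # Ratings 1-2 -> index 0, 3-4 -> index 1, ..., 9-10 -> index 4.
    return labels[(rating - 1) // 2]


print([rating_to_class(r) for r in (1, 4, 5, 8, 10)])
```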
After crawling, our dataset has over 140,000 samples, with the distribution of ratings shown
in the figures below:

Fig. 1: Distribution of ratings (1)

Fig. 2: Distribution of ratings (2)

Overview of the data:

• The data is skewed: there are nearly 40,000 instances for the Recommend and
Exceptional sentiments, but only just over 15,000 instances for Bad and 20,000 instances
for Very Bad.

• There is a lot of irrelevant data (e.g., a user types unrelated text just to fill in
the comment field).


• The data contains many misspelled words, emojis, emoticons, etc., which need to be
preprocessed.

• These problems make our dataset noisy, and cleaning the data so that it can be used
by the models is a major challenge.

3.2 Data preprocessing

Data preprocessing is one of the critical steps in any machine learning project. It includes
cleaning and formatting the data before feeding it into a machine learning algorithm. We first
split the data into training, validation, and testing sets with a ratio of 0.68:0.12:0.2.
Preprocessing is then developed on the training set and later applied to the validation and
testing sets. The following steps are applied during preprocessing:

• Remove the URLs in sentences: Many reviews contain URLs (e.g., links to actors’
profiles). If these URLs are not handled, they add noise to the data and hinder learning,
since they contain irrelevant words.

• Lowercasing and Punctuation Handling: Lowercasing, a standard technique in text data
preprocessing, aids in consistent word understanding by the model. Regarding punctuation,
after conducting tests, it was decided to retain commas, dots, question marks, and
exclamation points, while removing other punctuation. The retained punctuation marks are
deemed valuable for conveying sentiment information (e.g., exclamation points expressing
surprise).

• Remove Duplicate Characters in Words and Duplicate Words in Sentences: Handling these
duplicates ensures that models recognize the words correctly, preventing the loss of
information from sentences.

• Stemming: The purpose of Stemming is to reduce the size of our vocabulary by converting
a word to its most general form, or stem.

• Handle emoticons and emojis: Since body language and verbal tone do not translate into
text messages or e-mails, people have developed alternate ways to convey nuanced
meaning. The most prominent change to online style has been emoticons and emojis.
Emoticons are punctuation marks, letters, and numbers used to create pictorial icons that
generally display an emotion or sentiment. Emojis, on the other hand, are pictographs of
faces, objects, and symbols. Our dataset contains many emojis and emoticons, and
handling them properly is crucial, as they may carry the sentiment information we need.
We tested two options: removing all emojis/emoticons, and replacing them with
words. Replacing them with words using the Emoji library gave better
results.
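
The cleaning steps listed above can be sketched with the standard `re` module. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the `EMOTICON_WORDS` table is a hypothetical stand-in for the emoji library, the regular expressions are one possible choice, non-ASCII letters are dropped for simplicity, and stemming is left out because it would need an external stemmer.

```python
import re

# Hypothetical emoticon/emoji-to-word table, for illustration only.
EMOTICON_WORDS = {":)": "smile", ":(": "sad", "😍": "love"}


def clean_review(text: str) -> str:
    # 1. Remove URLs.
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    # 2. Replace emoticons/emojis with words before punctuation is stripped.
    for symbol, word in EMOTICON_WORDS.items():
        text = text.replace(symbol, f" {word} ")
    # 3. Lowercase.
    text = text.lower()
    # 4. Keep letters, digits, whitespace and , . ? ! and drop everything else.
    text = re.sub(r"[^a-z0-9\s,.?!]", " ", text)
    # 5. Collapse any character repeated 3+ times to two ("sooooo" -> "soo").
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    # 6. Drop consecutive duplicate words ("good good" -> "good").
    text = re.sub(r"\b(\w+)(\s+\1\b)+", r"\1", text)
    # 7. Normalize whitespace.
    return re.sub(r"\s+", " ", text).strip()


print(clean_review("Sooooo good good!!! :) see https://example.com/actor"))
# -> soo good!! smile see
```

Emoticons are replaced before the punctuation filter runs, otherwise ":)" would be stripped and its sentiment lost.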

3.3 Exploratory Data Analysis


3.3.1 Analyzing text statistics
Text statistics visualizations are simple but very insightful techniques.


3.3.1.1 Number of words in reviews


For this analysis, we mostly use histograms (continuous data) and bar charts (categorical
data).
First, let’s take a look at the number of characters in each review. This
gives us a rough idea of the reviews’ length.

The histogram shows that reviews range from 0 to 6,000 characters, with most
between 0 and 2,000 characters. In terms of words, reviews range from 0 to 1,000 words,
with most between 0 and 250 words.
The figure below presents the kernel density estimation of the lengths of reviews. Kernel density
estimation involves applying kernel smoothing for the purpose of probability density estimation.
Consider a set of independent and identically distributed samples $x_1, x_2, \ldots, x_n$ drawn from
some univariate distribution with an unknown density $f$ at any given point $x$. The kernel
density estimator for $f$ is expressed as:
\[
\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right)
\]

In this formula:
• n represents the number of samples.

• $x_1, x_2, \ldots, x_n$ are independent and identically distributed samples.


• K is the kernel, a non-negative function.

• h is the bandwidth, a smoothing parameter.
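
As a rough illustration of the estimator, the sketch below evaluates it directly with a Gaussian kernel. The Gaussian choice, the sample values and the bandwidth are assumptions made for the example; the report does not state which kernel or bandwidth was used.

```python
import math


def gaussian_kernel(u: float) -> float:
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde(x: float, samples: list[float], h: float) -> float:
    """Kernel density estimate at x: (1 / (n h)) * sum_i K((x - x_i) / h)."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


# Hypothetical review lengths (in words) and bandwidth.
lengths = [120.0, 180.0, 200.0, 260.0, 900.0]
h = 50.0
for x in (150.0, 250.0, 900.0):
    print(x, kde(x, lengths, h))
```

A larger `h` gives a smoother curve; a smaller `h` follows individual samples more closely.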

The distribution of review lengths has a mean of approximately 254 and a variance of about
42,297. Our objective is to determine whether this distribution follows a normal distribution.
To achieve this, we employ two numerical indicators of shape, namely skewness and kurtosis.

• Skewness serves as a measure of the asymmetry in the probability distribution of a random
variable concerning its mean. It is computed using the formula:
\[
\mathrm{skewness} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{3/2}}
\]

• Kurtosis functions as an indicator of whether the dataset exhibits heavy-tailed or
light-tailed characteristics in comparison to a normal distribution. High kurtosis in the data
suggests the presence of heavy tails or outliers. Conversely, datasets with low kurtosis
tend to have light tails or a lack of outliers.
\[
\mathrm{kurtosis} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left( \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)^{2}}
\]

For this particular distribution, the skewness is 1.86, indicating a positive skew. This implies
that the distribution is skewed to the right. The kurtosis is 4.36, suggesting that the distribution
has more values concentrated in the tails or is more peaked compared to a normal distribution
with the same variance. Based on these measures, it can be inferred that the distribution of
review lengths deviates from a normal distribution.
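
Both shape indicators are ratios of central moments, so they can be computed with one small helper. The sketch below follows the two formulas above; the function names and the sample list are illustrative only.

```python
def central_moment(xs: list[float], k: int) -> float:
    """k-th central moment: (1/n) * sum_i (x_i - mean)^k."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** k for x in xs) / len(xs)


def skewness(xs: list[float]) -> float:
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def kurtosis(xs: list[float]) -> float:
    # Non-excess kurtosis, as in the formula above (no "- 3" correction).
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2


# Hypothetical right-skewed sample: a few very long reviews pull the tail right.
sample = [50.0, 80.0, 100.0, 120.0, 150.0, 900.0]
print(skewness(sample), kurtosis(sample))
```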


We want to test whether there is a relationship between the number of words and the
sentiment. To do so, we use the point-biserial correlation to measure the relation between
the reviews’ lengths and the sentiment. The point-biserial correlation coefficient $r_{pb}$ is a special
case of Pearson’s correlation coefficient that measures the relationship between two variables:
one continuous variable and one naturally binary variable. It has the formula:
\[
r_{pb} = \frac{\bar{Y}_1 - \bar{Y}_0}{s_y} \sqrt{\frac{N_1 N_0}{N(N - 1)}}
\]

where $\bar{Y}_1$ and $\bar{Y}_0$ are the means of the metric observations coded 1 and 0, respectively;
$s_y$ is the standard deviation of all the metric observations; $N_1$ and $N_0$ are the numbers of
observations coded 1 and 0, respectively; and $N$ is the total number of observations.

The figure shows the box plots of the reviews’ lengths for all sentiments.
A value of $r_{pb}$ that is significantly different from zero is completely equivalent to a significant
difference in means between the two groups. Thus, an independent-groups t-test with $N - 2$
degrees of freedom may be used to test whether $r_{pb}$ is nonzero. The relation between the
t-statistic for comparing two independent groups and $r_{pb}$ is given by:
\[
t = \sqrt{N - 2} \, \frac{r_{pb}}{\sqrt{1 - r_{pb}^2}}
\]

We conduct a one-tailed t-test with the null hypothesis “There is no correlation
between the number of words and the sentiment”.
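
The coefficient and its t-statistic can be computed with the standard library alone. The sketch below assumes the sentiment has already been reduced to a binary label (1 or 0); how the five classes are binarized is not stated in the text, and the data here are hypothetical.

```python
import math
import statistics


def point_biserial(values: list[float], labels: list[int]) -> float:
    """r_pb = (mean_1 - mean_0) / s_y * sqrt(N1 * N0 / (N * (N - 1)))."""
    y1 = [v for v, b in zip(values, labels) if b == 1]
    y0 = [v for v, b in zip(values, labels) if b == 0]
    n1, n0, n = len(y1), len(y0), len(values)
    # Sample standard deviation (n - 1 denominator), matching the N(N - 1) term.
    s_y = statistics.stdev(values)
    mean_diff = statistics.mean(y1) - statistics.mean(y0)
    return mean_diff / s_y * math.sqrt(n1 * n0 / (n * (n - 1)))


def t_statistic(r_pb: float, n: int) -> float:
    """t = sqrt(N - 2) * r_pb / sqrt(1 - r_pb^2), with N - 2 degrees of freedom."""
    return math.sqrt(n - 2) * r_pb / math.sqrt(1.0 - r_pb ** 2)


# Hypothetical review lengths and binary sentiment labels.
word_counts = [100.0, 150.0, 200.0, 300.0, 400.0, 500.0]
is_positive = [0, 0, 1, 0, 1, 1]
r = point_biserial(word_counts, is_positive)
print(r, t_statistic(r, len(word_counts)))
```

The resulting t value would then be compared against the critical value of a t distribution with N - 2 degrees of freedom.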


3.3.1.2 Number of stop words in reviews

We can clearly see that stopwords such as “the”, “and” and “a” dominate in movie
reviews.

3.3.2 N-gram exploration and word cloud


N-grams are simply contiguous sequences of n words,
for example “river bank” or “The three musketeers”. A sequence of two words is
called a bigram, and a sequence of three words is called a trigram.
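
Counting n-grams only needs a sliding window over the token list. The sketch below is a minimal version; the whitespace tokenization and the example sentence are simplifications for illustration.

```python
from collections import Counter


def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return all contiguous n-word sequences in order of appearance."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "the three musketeers and the three friends".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
# -> [(('the', 'three'), 2)]
```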

Fig. 3: Figure of 1-gram


Fig. 4: Figure of bi-gram

The figure below shows the word clouds for all sentiments combined and for three individual
sentiments (Recommend, Decent, Bad) after removing stop words. The bigger a word is, the more
often it appears in the dataset. Overall, the common words in the dataset are mostly verbs and
adjectives.

3.3.3 Topic modeling exploration with pyLDAvis


Topic modeling is a technique used in natural language processing to discover latent topics
or themes present in a large corpus of text. The goal of topic modeling is to uncover the
underlying structure of the text by grouping similar words together into topics, allowing for
more efficient information retrieval and analysis.
Latent Dirichlet Allocation (LDA) is a generative probabilistic model used for topic modeling
in Natural Language Processing. The model assumes that each document in a collection can
be represented as a mixture of latent topics, where each topic is a probability distribution over
words. The goal of LDA is to discover the underlying topics present in a document collection
and to estimate the topic distribution for each document and the word distribution for each
topic.
In this project, we apply the LDA model and choose the model with the lowest perplexity score.
The final model has 5 topics, equal to the number of sentiment classes.
Mathematically, the generative process of LDA can be expressed as follows:
• Topic distribution for each document: For each document $d$ in the collection, we first
sample a topic distribution $\theta_d \sim \mathrm{Dirichlet}(\alpha)$, where $\alpha$ is a hyperparameter
representing the prior belief about the distribution of topics.
• Word distribution for each topic: For each topic $z$, we sample a word distribution
$\beta_z \sim \mathrm{Dirichlet}(\beta)$, where $\beta$ is a hyperparameter representing the prior belief about
the distribution of words for each topic.
• Generation of words in a document: For each word $w$ in document $d$, we sample a topic
$z \sim \mathrm{Multinomial}(\theta_d)$ and a word $w \sim \mathrm{Multinomial}(\beta_z)$.

Given the observed document collection $\mathbf{w}$, the goal of LDA is to estimate the topic
distributions $\theta_d$ and the word distributions $\beta_z$ that maximize the likelihood of
observing $\mathbf{w}$:
\[
p(\mathbf{w} \mid \alpha, \beta) = \prod_{d=1}^{D} \prod_{i=1}^{N_d} \left( \sum_{z=1}^{Z} \theta_{d,z} \, \beta_{z, w_{d,i}} \right)
\]

where $D$ is the number of documents, $N_d$ is the number of words in document $d$, $Z$ is the
number of topics, $\theta_{d,z}$ is the probability of topic $z$ in document $d$, and $\beta_{z, w_{d,i}}$
is the probability of word $w_{d,i}$ (the $i$-th word of document $d$) under topic $z$.
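
The generative story can be simulated directly, which may help make the roles of the document-topic and topic-word distributions concrete. The sketch below only generates synthetic documents; it does not fit a model. All sizes and hyperparameter values are hypothetical, and Dirichlet draws are obtained by normalizing gamma samples.

```python
import random


def sample_dirichlet(alpha: list[float], rng: random.Random) -> list[float]:
    """Draw from Dirichlet(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(n_words, topic_word, alpha, rng):
    """Sample one document: theta_d ~ Dirichlet, then (topic, word) per position."""
    n_topics, vocab_size = len(topic_word), len(topic_word[0])
    theta = sample_dirichlet([alpha] * n_topics, rng)
    words = []
    for _ in range(n_words):
        z = rng.choices(range(n_topics), weights=theta)[0]          # topic
        w = rng.choices(range(vocab_size), weights=topic_word[z])[0]  # word id
        words.append(w)
    return theta, words


rng = random.Random(0)
Z, V = 3, 6  # hypothetical number of topics and vocabulary size
topic_word = [sample_dirichlet([0.5] * V, rng) for _ in range(Z)]
theta, doc = generate_document(10, topic_word, alpha=0.5, rng=rng)
print(theta, doc)
```

Fitting LDA runs this process in reverse: given only the word ids, it infers plausible `theta` and `topic_word` values.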

This estimation can be performed using the Expectation – Maximization algorithm or other
inference methods such as Variational Bayes.
To evaluate an LDA model, we can use the perplexity and coherence scores:

• Perplexity is a measure of how well the model fits the data. It is calculated as the
exponential of the negative average log-likelihood of the held-out (unseen) data. The formula
for perplexity is given by:
\[
\mathrm{Perplexity} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i \mid w_1, \ldots, w_{i-1}) \right)
\]
where $N$ is the total number of words in the held-out data, $w_i$ is the $i$-th word in the
held-out data, and $p(w_i \mid w_1, \ldots, w_{i-1})$ is the probability of the $i$-th word given the
previous words.

• Coherence score is a measure of the semantic similarity between the words in a topic.
A high coherence score indicates that the words in a topic are semantically similar and
form a coherent topic. The formula for the coherence score depends on the specific coherence
measure being used. The formula of $C_v$ coherence is:
\[
C_v(t) = \sum_{i,j} \frac{\log p(w_i \mid w_j) + \log p(w_j \mid w_i)}{2}, \quad \forall \, w_i, w_j \in t
\]
where $t$ is a topic, and $p(w_i \mid w_j)$ and $p(w_j \mid w_i)$ are the probabilities of the words
$w_i$ and $w_j$ given each other.
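
The perplexity formula is easy to check on a toy case. The sketch below takes the per-word probabilities assigned by some model as given (the values are hypothetical) and applies the definition.

```python
import math


def perplexity(word_probs: list[float]) -> float:
    """exp of the negative average log-probability of the held-out words."""
    avg_log_prob = sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(-avg_log_prob)


# A model that assigns probability 1/4 to every held-out word is as
# uncertain as a uniform choice among 4 words.
print(perplexity([0.25] * 8))
```

Lower values mean the model is less "surprised" by the held-out text.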

In this project, we apply the LDA model and choose the model with the lowest perplexity score.
The final model has 5 topics, equal to the number of sentiments, with a perplexity score of
−6.18. The coherence score is 0.66, indicating that the words in a topic are semantically similar.
To get a better understanding of the LDA model, we use the pyLDAvis library to obtain the
complete visualization of the model in Figure 5. Overall, there are some notable elements in
the visualization:

• Intertopic Distance Map: The map displays a two-dimensional representation of the
topics found by the LDA model. Topics that are close together are considered to be
similar to each other in terms of the words they contain. We can observe that there are
5 topics with significant differences.


• Topic Terms: The list of words in the right-side panel shows the most relevant words for
each selected topic. The size of the word indicates its relevance to the topic.

• Topic and Term Frequency: The term frequency is shown in the table below the plot.
We can observe the frequency of each term in the corpus, as well as its frequency in the
selected topic.

3.4 TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic that reflects
the importance of a word in a document relative to a collection of documents, often used in
natural language processing and information retrieval. It is a combination of two components:
Term Frequency (TF) and Inverse Document Frequency (IDF).

\[
\mathrm{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d}
\]


TF measures the frequency of a term within a document. It is calculated by dividing the
number of times a term appears in a document by the total number of terms in that document.
The idea is to emphasize words that occur frequently within a specific document, indicating
their relevance to the document’s content.
On the other hand, IDF measures the uniqueness of a term across a collection of documents.
It is calculated by dividing the total number of documents by the number of documents
containing the term, followed by taking the logarithm of the result. The purpose is to highlight
terms that are rare and distinctive, as they may carry more significance.
\[
\mathrm{IDF}(t, D) = \log\left( \frac{\text{Total number of documents in the collection } D}{\text{Number of documents containing term } t} \right)
\]
The TF-IDF score for a term in a specific document is obtained by multiplying its TF and
IDF values:
\[
\mathrm{TF\text{-}IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)
\]
In summary, TF-IDF is a powerful tool for evaluating the importance of terms in a document
within the context of a larger collection. It allows for the identification of key terms that are
both frequent within a document and rare across the entire document set, thus aiding in tasks
such as text mining, information retrieval, and document categorization.
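
The three formulas translate almost line by line into code. The sketch below uses the plain, unsmoothed definitions given above on a tiny hypothetical corpus; library implementations (for example scikit-learn's `TfidfVectorizer`) typically apply a smoothed IDF and normalization, so their numbers will differ.

```python
import math
from collections import Counter


def tf(term: str, doc: list[str]) -> float:
    """Share of the document's tokens that are `term`."""
    return Counter(doc)[term] / len(doc)


def idf(term: str, docs: list[list[str]]) -> float:
    """log(N / df); raises ZeroDivisionError if no document contains `term`."""
    doc_freq = sum(1 for d in docs if term in d)
    return math.log(len(docs) / doc_freq)


def tf_idf(term: str, doc: list[str], docs: list[list[str]]) -> float:
    return tf(term, doc) * idf(term, docs)


# Hypothetical three-review corpus, tokenized on whitespace.
docs = [
    "the movie was great".split(),
    "the movie was boring".split(),
    "the plot was great fun".split(),
]
print(tf_idf("the", docs[0], docs))     # appears in every review -> 0.0
print(tf_idf("boring", docs[1], docs))  # appears in one review -> positive
```

A word present in every document gets an IDF of log(1) = 0, which is why very common words such as "the" contribute nothing to the score.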

3.5 Word Embedding (word2vec)


Word embedding is a term used in natural language processing (NLP) to describe a technique
where words or phrases from a vocabulary are mapped to vectors of real numbers. This mapping
is done in a way that words with similar meanings are close to each other in the vector space,
which helps with text analysis and understanding the semantic relationships between words.

3.5.1 One-Hot Vectors


One-hot vectors are easy to understand and construct. Imagine we have a dictionary with
N unique words, each assigned a unique integer index from 0 to N-1. To create a one-hot
vector for a word with index i, we construct a vector of length N, filled with zeros, and set the
i-th element to 1. Consequently, each word is represented as an N-length vector, which can
be directly utilized by neural networks. However, one-hot vectors cannot express
the similarity between different words: the cosine similarity that we often use is 0 for
any two distinct words. For this reason, they are usually not a good choice.
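
The limitation is easy to see numerically: any two different one-hot vectors are orthogonal. The sketch below uses a hypothetical three-word vocabulary.

```python
import math


def one_hot(index: int, size: int) -> list[float]:
    """Length-`size` vector of zeros with a single 1 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vocab = {"good": 0, "great": 1, "bad": 2}  # hypothetical vocabulary
good = one_hot(vocab["good"], len(vocab))
great = one_hot(vocab["great"], len(vocab))
print(cosine(good, great))  # 0.0, although the words are near-synonyms
print(cosine(good, good))   # 1.0
```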

3.5.2 The Skip-Gram Model


The Skip-Gram model is a straightforward neural network with a single hidden layer,
designed to estimate the probability of a particular word appearing in the context of a given
input word. This model works in reverse to the Continuous Bag of Words (CBOW) model.
It takes the current word as input and aims to predict the words that come before and after
it. In essence, it learns and predicts the context words around the input word. Evaluations
of this model have indicated that the quality of the word vectors improves with a larger
context range, although this also increases computational complexity.


The Skip-Gram objective function sums the log probabilities of the surrounding n
words to the left and right of the target word $w_t$. It can be expressed as follows:
\[
J_\theta = \frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-n \le j \le n \\ j \ne 0}} \log p(w_{t+j} \mid w_t)
\]

• θ: all variables (the model parameters).

• n: size of the training context window.

• T : number of words.
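
The double sum in the objective runs over (center word, context word) pairs. The sketch below enumerates those pairs for a short token list; positions that fall outside the sentence are skipped, which is one common convention and an assumption here.

```python
def skipgram_pairs(tokens: list[str], n: int) -> list[tuple[str, str]]:
    """All (center, context) pairs with offset j in [-n, n], j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-n, n + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs


print(skipgram_pairs(["a", "great", "movie"], n=1))
# -> [('a', 'great'), ('great', 'a'), ('great', 'movie'), ('movie', 'great')]
```

Each pair contributes one log p(context | center) term to the objective.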

3.5.3 The Continuous Bag of Words Model


The Continuous Bag of Words (CBOW) model is similar to the skip-gram model, but instead
of predicting the surrounding words given a center word, it predicts the center word given its
surrounding context words. This means that the CBOW model takes the words around a position as
input, and then tries to predict the most likely word to appear at that center position.
To do this, the CBOW model uses a neural network that takes the vectors of the surrounding
words as input, and then outputs a vector that represents the predicted center word. The
neural network is trained by minimizing the distance between the predicted vector and the
actual vector of the center word.
The objective function for CBOW:
\[
J_\theta = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_{t-n}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+n})
\]

• θ: all variables.

• T : number of words.

The CBOW model has a number of advantages over the skip-gram model. First, it is able to
learn more complex relationships between words, as it is able to take into account the context
of the center word. Second, it is more efficient to train, as it only needs to calculate the gradient
for the center word, rather than for all of the surrounding words.
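
A single CBOW forward pass can be sketched as follows: average the context word vectors, score every vocabulary word against that average, and apply a softmax. The sizes and random vectors are hypothetical, and the full softmax is used only for clarity; practical word2vec training usually replaces it with hierarchical softmax or negative sampling.

```python
import math
import random


def softmax(scores: list[float]) -> list[float]:
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cbow_probabilities(context_ids, in_vecs, out_vecs):
    """p(center word | context) for every word in the vocabulary."""
    dim = len(in_vecs[0])
    # Hidden layer: average of the input vectors of the context words.
    hidden = [sum(in_vecs[c][k] for c in context_ids) / len(context_ids)
              for k in range(dim)]
    # Score each candidate center word with a dot product.
    scores = [sum(h * o for h, o in zip(hidden, out)) for out in out_vecs]
    return softmax(scores)


rng = random.Random(0)
V, D = 5, 3  # hypothetical vocabulary size and embedding dimension
in_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]
out_vecs = [[rng.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]

probs = cbow_probabilities([0, 1, 3, 4], in_vecs, out_vecs)
loss = -math.log(probs[2])  # negative log-likelihood of the true center word (id 2)
print(probs, loss)
```

Training would adjust `in_vecs` and `out_vecs` to lower this loss over all positions in the corpus.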


4 Methodology
4.1 Machine learning methods
4.1.1 XGBoost
XGBoost, or eXtreme Gradient Boosting, is an advanced machine learning algorithm that
has gained prominence for its exceptional performance in supervised learning tasks. It operates
within the realm of ensemble learning, a methodology that combines the predictive strength of
multiple models to enhance overall accuracy and generalization. What sets XGBoost apart is its
sequential construction of decision trees, where each subsequent tree focuses on correcting the
errors of the preceding ones. This process, known as boosting, allows XGBoost to incrementally
refine its predictions, producing a robust and accurate model.
The algorithm’s optimization hinges on a carefully crafted objective function that comprises
two crucial components: a loss function and a regularization term. The loss function quantifies
the disparity between predicted and actual values, while the regularization term helps prevent
the model from becoming overly complex and overfitting the training data. Through an
iterative process, XGBoost minimizes this combined objective function, striking a balance between
precision and model simplicity. One key feature that contributes to XGBoost’s popularity is
its interpretability. The algorithm provides valuable insights into the importance of each
feature in the dataset, offering a clear understanding of how these features influence the model’s
predictions. This interpretability is especially advantageous in scenarios where understanding
the underlying decision-making process is as crucial as predictive accuracy.
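
For reference, the two-part objective described above is commonly written as follows. The symbols are introduced here for illustration and do not appear elsewhere in this report: $l$ is the loss between the label $y_i$ and the prediction $\hat{y}_i$, $f_k$ is the $k$-th tree, $T_f$ is the number of leaves of tree $f$, $w$ is its vector of leaf weights, and $\gamma$ and $\lambda$ are regularization hyperparameters.

\[
\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k), \qquad \Omega(f) = \gamma T_f + \frac{1}{2} \lambda \lVert w \rVert^2
\]

The first sum rewards accurate predictions, while the second penalizes trees with many leaves or large leaf weights.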
XGBoost’s versatility is evident in its applicability to a wide range of machine learning tasks.
Whether tackling classification problems, regression analyses, or ranking challenges, XGBoost
has proven itself as a reliable and efficient solution. Its adaptability to diverse datasets and
ability to handle large-scale, high-dimensional data make it a favored choice among data
scientists and machine learning practitioners seeking powerful models for real-world applications.
In essence, XGBoost stands as a sophisticated and versatile tool, making it a cornerstone in
the toolkit of machine learning professionals.

4.1.2 SVM
Support Vector Machines (SVM) are a versatile class of supervised learning algorithms with
a robust mathematical foundation. At their core, SVMs are designed to find the hyperplane
that best separates classes in the feature space. The algorithm achieves this by identifying
support vectors—data points crucial for defining the decision boundary. The concept of a
margin, representing the distance between the hyperplane and the nearest data point of any
class, is central to SVM. SVMs excel in scenarios where the goal is to maximize this margin,
as it leads to a more resilient and generalizable model.
One notable feature of SVM is its ability to handle non-linear decision boundaries. This
is accomplished through the use of kernel functions, such as radial basis function (RBF) or
polynomial kernels, which implicitly map input features into a higher-dimensional space where
a linear separation is feasible. The RBF kernel has the formula:
\[
K(x, z) = \exp\left( -\frac{\lVert x - z \rVert^2}{2\sigma^2} \right), \quad \text{where } \sigma > 0
\]
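
A minimal sketch of the RBF kernel, written in its standard Gaussian form exp(-||x - z||^2 / (2 sigma^2)); the input vectors and the value of sigma are hypothetical.

```python
import math


def rbf_kernel(x: list[float], z: list[float], sigma: float = 1.0) -> float:
    """Gaussian (RBF) kernel: 1.0 for identical points, decaying with distance."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


print(rbf_kernel([0.0, 0.0], [0.0, 0.0]))             # identical points -> 1.0
print(rbf_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0))  # distance 5 -> exp(-0.5)
```

A small sigma makes the kernel fall off quickly, so only very close points count as similar; a large sigma gives a smoother decision boundary.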


The regularization parameter (C) in SVM controls the trade-off between achieving a smooth
decision boundary and correctly classifying training data. A smaller C value encourages a
broader margin, potentially allowing for some misclassifications, while a larger C emphasizes
accurate classification, potentially leading to a narrower margin.
SVMs find application in diverse domains, including image recognition, text categorization,
and bioinformatics. Their efficacy in high-dimensional spaces, even when the number of features
exceeds the number of samples, makes SVMs particularly suitable for tasks with complex data
structures. While SVMs are powerful, their performance can be influenced by the choice of
kernel and tuning parameters, requiring careful consideration in practice. Nevertheless, SVMs
remain a foundational tool in machine learning, celebrated for their ability to handle various
types of data and produce robust decision boundaries.

4.2 Deep learning methods


4.2.1 Convolutional neural network
Convolutional Neural Networks (CNNs) consist of multiple layers involving convolutions,
where non-linear activation functions such as ReLU or tanh are applied to the outcomes. In
CNNs, convolutions are employed on the input layer to compute the resulting output.
Throughout the training phase, the CNN autonomously learns the filter values based on the intended
task, with CNNs gaining renown for addressing Computer Vision challenges.
In contrast to dealing with image pixels, most Natural Language Processing (NLP) tasks
utilize sentences or documents presented as matrices. Each matrix row corresponds to a token,
essentially forming a vector that represents that particular token. Typically, these vectors take
the form of word embeddings, such as word2vec, which are low-dimensional representations.
Alternatively, they could be one-hot vectors indexing the word within a vocabulary. For instance,
in a 10-word sentence with a 100-dimensional embedding, the input would be represented as a
10×100 matrix, akin to an "image" in this context.
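
To make the "sentence as a matrix" idea concrete, the sketch below slides one filter over consecutive token vectors, applies ReLU, and then takes the maximum over positions. The tiny 4×2 sentence matrix and the filter values are hypothetical, and the max-over-time pooling step is a common choice for text CNNs rather than something stated in this report.

```python
def conv1d_over_tokens(sentence, filt):
    """Slide a (width x dim) filter over a (n_tokens x dim) sentence matrix."""
    width = len(filt)
    features = []
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        score = sum(w * x
                    for filt_row, token_row in zip(filt, window)
                    for w, x in zip(filt_row, token_row))
        features.append(max(0.0, score))  # ReLU activation
    return features


# 4 tokens with 2-dimensional embeddings (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filt = [[1.0, -1.0], [0.5, 0.5]]  # one filter covering 2 tokens at a time

feature_map = conv1d_over_tokens(sentence, filt)
pooled = max(feature_map)  # max-over-time pooling
print(feature_map, pooled)
# -> [1.5, 0.0, 0.0] 1.5
```

Unlike image filters, the filter spans the full embedding dimension and moves only along the token axis.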

Fig. 5: Example of a CNN model for NLP

4.2.2 RNN
An RNN is a class of powerful deep neural networks that uses internal memory with loops to
process sequence data. In this project, we use the many-to-one RNN architecture.
One challenge associated with Recurrent Neural Networks (RNNs) is the exploding
gradient problem, wherein gradients grow excessively large, leading to numerical instability and
hindering the optimization process. This problem can be addressed by implementing effective
solutions such as proper weight initialization, careful selection of activation functions, and the
application of gradient clipping. These measures collectively work to mitigate the issues
associated with the exploding gradient problem and contribute to more stable and effective training
of RNNs.
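
Gradient clipping by norm, one of the remedies mentioned above, can be sketched in a few lines. The function name and the threshold are illustrative; deep learning frameworks ship their own clipping utilities.

```python
import math


def clip_by_norm(grads: list[float], max_norm: float) -> list[float]:
    """Rescale the gradient vector so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)  # small gradients are left unchanged
    scale = max_norm / norm
    return [g * scale for g in grads]


print(clip_by_norm([3.0, 4.0], max_norm=1.0))  # norm 5 -> rescaled to norm 1
print(clip_by_norm([0.3, 0.4], max_norm=1.0))  # norm 0.5 -> unchanged
```

Clipping keeps the direction of the update and limits only its size, which prevents a single exploding step from destabilizing training.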
Another version of RNN we use is the Bi-RNN. A Bidirectional Recurrent Neural Network
(Bi-RNN) is a neural network architecture that incorporates two RNNs operating in distinct
directions. The forward RNN processes the input sequence from the beginning to the end, while
the backward RNN processes it in the reverse direction, from end to start. These two RNNs
are stacked on top of each other, and their states are commonly combined by concatenating the
two resulting vectors. This bidirectional approach allows the network to capture information
from both the past and future context of each element in the input sequence, enhancing its
ability to understand and model sequential data.
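
The many-to-one bidirectional setup can be sketched as follows. This is an illustrative toy with random, untrained weights and no bias terms, not the model trained in this project; the sizes and the helper names `random_matrix` and `rnn_last_state` are assumptions.

```python
import math
import random

def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def rnn_last_state(inputs, w_in, w_rec):
    """Run a tanh RNN over the sequence and return only its final hidden state."""
    hidden = [0.0] * len(w_rec)
    for x in inputs:
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(u * hk for u, hk in zip(w_rec[j], hidden))
            )
            for j in range(len(w_rec))
        ]
    return hidden

rng = random.Random(0)
embed_dim, hidden_dim = 4, 3
# Hypothetical sequence of 6 token embeddings.
sequence = [[rng.uniform(-1.0, 1.0) for _ in range(embed_dim)] for _ in range(6)]

# Each direction has its own parameters.
fwd_in, fwd_rec = random_matrix(hidden_dim, embed_dim, rng), random_matrix(hidden_dim, hidden_dim, rng)
bwd_in, bwd_rec = random_matrix(hidden_dim, embed_dim, rng), random_matrix(hidden_dim, hidden_dim, rng)

forward_state = rnn_last_state(sequence, fwd_in, fwd_rec)         # reads start -> end
backward_state = rnn_last_state(sequence[::-1], bwd_in, bwd_rec)  # reads end -> start
sentence_vector = forward_state + backward_state                  # concatenation
print(len(sentence_vector))
```

The concatenated vector has twice the hidden size and would be passed to a classification layer.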

4.2.3 LSTM
Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) that is
capable of learning long-term dependencies in sequence data. This is particularly useful for
tasks involving sequential inputs like natural language processing, speech recognition, and time
series prediction. LSTM models overcome the limitations of traditional RNNs, such as the vanishing
gradient problem, by using a unique gating mechanism that controls the flow of information
between cells in the network.
The LSTM layers take into account not only the word order but also the contextual
significance of each word within the sentence. Through this process, the model discerns key patterns
within the sequences and identifies their correlation with specific emotions. The final layer of
the model is typically a softmax layer that produces a probability distribution over the possible
emotions, and the emotion with the highest probability is selected as the prediction. A key
advantage of this model is its adaptability to varying sentence lengths, an inherent property
of LSTM networks, which makes it versatile and effective at classifying the conveyed emotion.
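
The final softmax-and-select step can be sketched as below. The label set and the logits are hypothetical placeholders for the output of the last LSTM step; they are not taken from our trained model.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["negative", "neutral", "positive"]  # hypothetical label set
logits = [0.2, 1.1, 2.7]                      # hypothetical scores from the last layer
probs = softmax(logits)
prediction = labels[probs.index(max(probs))]  # class with the highest probability
print(prediction)
```

Subtracting the maximum logit does not change the result but avoids overflow for large scores.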

4.2.4 GRU
Introduced in 2014 by Kyunghyun Cho et al., GRU has established itself as a well-known
type of recurrent neural network. It was designed as a simpler alternative to Long Short-Term
Memory (LSTM) networks, with fewer parameters, making it more computationally efficient.
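
To make the "fewer parameters" claim concrete, the sketch below counts the weights of one recurrent layer under the textbook formulation, in which every gate has input weights, recurrent weights, and one bias vector. The layer sizes are hypothetical, and some framework implementations add an extra bias per gate, so exact counts may differ slightly in practice.

```python
def recurrent_params(input_dim, hidden_dim, num_gates):
    """Parameter count of one recurrent layer with the given number of gates."""
    per_gate = hidden_dim * input_dim + hidden_dim * hidden_dim + hidden_dim
    return num_gates * per_gate

input_dim, hidden_dim = 100, 128  # hypothetical embedding and hidden sizes
lstm = recurrent_params(input_dim, hidden_dim, num_gates=4)  # input, forget, output, candidate
gru = recurrent_params(input_dim, hidden_dim, num_gates=3)   # update, reset, candidate
print(lstm, gru, gru / lstm)
```

Under this formulation a GRU layer always has three quarters of the parameters of an LSTM layer of the same size.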
GRU has gained popularity in the field of deep learning, particularly in tasks involving
sequential data like natural language processing, speech recognition, and time-series prediction.
Its simpler architecture compared to LSTM makes it faster to train, which can be advantageous
in projects where computational resources or time are limiting factors.
However, it’s important to note that while GRU has been widely adopted, LSTM is still
more prevalent in certain applications, especially those that require modeling longer sequences
and more complex dependencies. The choice between GRU and LSTM often depends on the
specific requirements of the task at hand. Despite its relative simplicity, GRU has proven to
be a powerful tool in the deep learning toolkit.

4.2.5 BERT

Fig. 6: Procedure of BERT

One of the biggest challenges in natural language processing (NLP) is the shortage of training
data. Because NLP is a diversified field with many distinct tasks, most task-specific datasets
contain only a few thousand or a few hundred thousand human-labeled training examples.
However, modern deep learning-based NLP models see benefits from much larger amounts of
data, improving when trained on millions, or billions, of annotated training examples. To help
close this gap in data, researchers have developed a variety of techniques for training general
purpose language representation models using the enormous amount of unannotated text on
the web (known as pre-training). The pre-trained model can then be fine-tuned on small-data
NLP tasks like question answering and sentiment analysis, resulting in substantial accuracy
improvements compared to training on these datasets from scratch.

Fig. 7: Input of BERT

In 2018, Google researchers open-sourced a new technique for NLP pre-training called
Bidirectional Encoder Representations from Transformers, or BERT. According to the release
announcement, anyone can train their own state-of-the-art question answering system (or a
variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a
single GPU. The release includes source code built on top of TensorFlow and a number of
pre-trained language representation models. The associated paper reports state-of-the-art
results on 11 NLP tasks, including the very competitive Stanford Question Answering Dataset
(SQuAD v1.1).
Input: The input of BERT consists of three components: the token embedding, which represents
each token of the input; the segment embedding, which helps the model distinguish the different
sentences in an input; and the positional embedding, which indicates the position of each token.
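
In BERT, these three embeddings are summed element-wise at every position. The sketch below illustrates this with small random lookup tables; the dimension, the example tokens, and the helper names `make_table` and `bert_input` are assumptions for illustration, not BERT's real vocabulary or weights.

```python
import random

def make_table(num_rows, dim, rng):
    """Random lookup table; a stand-in for learned embeddings."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def bert_input(token_ids, segment_ids, tok_table, seg_table, pos_table):
    """Sum token, segment, and position embeddings for each input position."""
    rows = []
    for position, (tok, seg) in enumerate(zip(token_ids, segment_ids)):
        rows.append([
            t + s + p
            for t, s, p in zip(tok_table[tok], seg_table[seg], pos_table[position])
        ])
    return rows

# Two hypothetical sentences packed into one input.
tokens = ["[CLS]", "great", "movie", "[SEP]", "loved", "it", "[SEP]"]
vocab = {tok: i for i, tok in enumerate(dict.fromkeys(tokens))}
token_ids = [vocab[t] for t in tokens]
segment_ids = [0, 0, 0, 0, 1, 1, 1]  # sentence A vs. sentence B

rng = random.Random(0)
dim = 8
inputs = bert_input(
    token_ids,
    segment_ids,
    make_table(len(vocab), dim, rng),
    make_table(2, dim, rng),
    make_table(len(tokens), dim, rng),
)
print(len(inputs), len(inputs[0]))
```

The two `[SEP]` tokens share a token embedding but receive different input vectors, because their segment and position embeddings differ.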

MLM: The first idea used in BERT is Masked Language Modeling (MLM). BERT tries to
predict words in a sentence that have been randomly masked beforehand. About 15 percent
of the tokens are selected, and a selected token is typically replaced by the special token
[MASK]. Because BERT uses a bidirectional approach, it can look at both the previous and the
next tokens and use the full context of the sentence to predict the masked words.
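
A simplified version of the masking step might look like this. It is only a sketch: it always substitutes `[MASK]`, whereas the original BERT recipe sometimes keeps the token or swaps in a random one, and the function name `mask_tokens` and the example sentence are hypothetical.

```python
import random

MASK_TOKEN = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels holds the original word at masked positions and None elsewhere,
    so the loss is computed only on the masked positions.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

tokens = ("the plot was predictable but the acting and the soundtrack "
          "made this film worth watching on a big screen tonight").split()
masked, labels = mask_tokens(tokens)
print(" ".join(masked))
```

With 20 tokens and a 15 percent rate, three positions are hidden and become the prediction targets.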

Fig. 8: MLM task

NSP: The second idea in BERT is Next Sentence Prediction (NSP). This technique is used
to learn the relationship between two sentences. BERT receives two sentences as input and
tries to predict whether the second sentence actually follows the first one. During
training, half of the time the true next sentence is paired with the first sentence, and half
of the time the second sentence is a random sentence.
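
The construction of NSP training pairs can be sketched as follows. The example sentences and the function name `make_nsp_examples` are hypothetical, and the sketch assumes the corpus has at least three distinct sentences so that a negative example can always be drawn.

```python
import random

def make_nsp_examples(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) triples, roughly half positive."""
    rng = random.Random(seed)
    examples = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the true next sentence.
            examples.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence other than the current one or its successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            examples.append((sentences[i], rng.choice(candidates), False))
    return examples

sentences = [
    "The movie opens with a long chase.",
    "It never slows down after that.",
    "The lead actor carries every scene.",
    "The soundtrack is forgettable.",
    "Overall it is worth a watch.",
]
examples = make_nsp_examples(sentences)
for first, second, is_next in examples:
    print(is_next, "|", first, "->", second)
```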

Fig. 9: NSP task

5 Results
5.1 Metrics used
Let:

• $TP_i$ (true positive) is the number of instances that are correctly assigned to class $c_i$.

• $TN_i$ (true negative) is the number of instances outside $c_i$ that are not assigned to $c_i$.

• $FP_i$ (false positive) is the number of instances that are incorrectly assigned to class $c_i$.

• $FN_i$ (false negative) is the number of instances inside $c_i$ that are incorrectly assigned to
another class.

• Accuracy is the percentage of correct predictions on the testing data:

$$\mathrm{Accuracy} = \frac{TP_i + TN_i}{TP_i + TN_i + FP_i + FN_i}$$

• Precision is the percentage of correct instances among all that are assigned to $c_i$:

$$\mathrm{Precision}(c_i) = \frac{TP_i}{TP_i + FP_i}$$

• Recall is the percentage of instances in $c_i$ that are correctly assigned to $c_i$:

$$\mathrm{Recall}(c_i) = \frac{TP_i}{TP_i + FN_i}$$

• F1-score is the harmonic mean of precision and recall. It provides a unified view of
the performance of a classifier and is computed as:

$$F_1(c_i) = \frac{2 \cdot \mathrm{Precision}(c_i) \cdot \mathrm{Recall}(c_i)}{\mathrm{Precision}(c_i) + \mathrm{Recall}(c_i)}$$
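
The definitions above translate directly into code. The following sketch computes the four counts and the derived metrics for one class; the labels and predictions are made-up examples, and `per_class_metrics` is a hypothetical helper rather than the evaluation script used in this project.

```python
def per_class_metrics(y_true, y_pred, target):
    """Compute TP/TN/FP/FN and the derived metrics for the class `target`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up ground truth and predictions for illustration only.
y_true = ["pos", "pos", "pos", "neg", "neg", "neu", "neu", "neg"]
y_pred = ["pos", "pos", "neg", "pos", "neg", "neu", "pos", "neg"]
metrics = per_class_metrics(y_true, y_pred, target="pos")
print(metrics)
```

For the class "pos" in this toy example, two of the four "pos" predictions are correct (precision 0.5) and two of the three true "pos" instances are recovered (recall 2/3).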

5.2 Model selection


To evaluate and choose the best hyperparameters for each model, we used grid search with
a 5-fold cross-validation strategy. We then compared the cross-validation results by applying
a t-test to each pair. The models with tuned hyperparameters have been saved in the directory.
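
The two building blocks of this procedure can be sketched as follows. This is an illustration under assumptions: contiguous, unshuffled folds and a paired t-test on per-fold scores. The report does not specify the exact test variant, and the fold scores below are invented for the example.

```python
import math
import statistics

def k_fold_indices(num_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [num_samples // k + (1 if i < num_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, num_samples))
        yield train_idx, val_idx
        start += size

def paired_t_statistic(scores_a, scores_b):
    """t statistic of the per-fold score differences (degrees of freedom: n - 1)."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

folds = list(k_fold_indices(23, k=5))
print([len(val) for _, val in folds])

# Invented per-fold scores of two candidate configurations.
scores_a = [0.61, 0.63, 0.60, 0.64, 0.62]
scores_b = [0.58, 0.59, 0.59, 0.60, 0.58]
t_stat = paired_t_statistic(scores_a, scores_b)
print(round(t_stat, 3))
```

The t statistic would then be compared against the critical value of a t distribution with four degrees of freedom to decide whether the difference between the two candidates is significant.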
Overall, all our deep learning models share the following training techniques:

• Loss function: cross-entropy:

$$L_{CE} = -\sum_{i=1}^{n} t_i \log(p_i)$$

where $t_i$ is the truth label and $p_i$ is the softmax probability for the $i$-th class.

• Weight initialization: Xavier initialization. All the weights of a layer $L$ are picked
randomly from a normal distribution with mean $\mu = 0$ and variance $\sigma^2 = \frac{1}{n^L}$,
where $n^L$ is the number of neurons in layer $L - 1$. This helps prevent the gradients of the
network's activations from vanishing or exploding.

• Optimizer: Adam. It combines the advantages of both Momentum and RMSprop: like RMSprop,
it uses the squared gradients to scale the learning rate, and like Momentum, it uses a
moving average of the gradient. Its advantages are memory efficiency and low
computational cost.

• Regularization technique: Dropout.
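
The first three techniques in the list above can be written out in a few lines. This is a plain-Python sketch for illustration, not the framework code used for training: the initializer follows the variance formula stated above (variance 1 / number of input neurons), the Adam hyperparameters are the commonly used defaults, and all example values are hypothetical.

```python
import math
import random

def cross_entropy(targets, probs, eps=1e-12):
    """L_CE = -sum_i t_i * log(p_i); eps guards against log(0)."""
    return -sum(t * math.log(p + eps) for t, p in zip(targets, probs))

def xavier_init(fan_in, fan_out, rng):
    """Draw a fan_out x fan_in weight matrix from N(0, 1 / fan_in)."""
    std = math.sqrt(1.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad         # moving average of the gradient (Momentum-like)
    v = beta2 * v + (1 - beta2) * grad * grad  # moving average of the squared gradient (RMSprop-like)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

loss = cross_entropy([0, 0, 1], [0.1, 0.2, 0.7])  # one-hot target, softmax output
weights = xavier_init(fan_in=256, fan_out=64, rng=random.Random(0))
param, m, v = adam_step(param=1.0, grad=0.5, m=0.0, v=0.0, t=1)
print(round(loss, 4), len(weights), len(weights[0]), param)
```

On the very first step, the bias-corrected ratio is close to the sign of the gradient, so Adam moves the parameter by roughly one learning rate regardless of the gradient's magnitude.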

5.3 Results
Figure 10 shows the results of our models with the best embedding technique. Here, Pre
stands for pretrained embedding:

Fig. 10: Results of the models on our test dataset

We can make the following observations:

• SVM, despite being a simple model, achieves a fairly good result of 56.05. This shows that
TF-IDF combined with machine learning methods still performs quite well on the sentiment
analysis task.

• LSTM architecture’s performance is slightly better than CNN in this case and just lower.

• BERT, which brought major improvements to NLP, outperforms all other models
by a large margin.

6 Implementing in real life


To make our project more practical in real life, we have built a model called “Movies rating”.

Fig. 11: Movies rating model

Users simply input the movie they wish to assess, and our model promptly crawls up to 100
reviews for that movie.
The sentiment of these reviews is then determined using our pretrained BERT model.
After aggregating the sentiment scores from all the reviews and applying predefined
criteria, the system generates a result, categorizing the film as a "Bad movie," "Decent
movie," "Good movie," or "Must-watch movie."
Users can easily try our work in this notebook:
https://www.kaggle.com/great23u5/project-movie-ratings
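
The aggregation step could be sketched as follows. The thresholds and the assumption that each review yields a score between 0 (negative) and 1 (positive) are hypothetical; the actual predefined criteria are those implemented in the notebook, and `rate_movie` is an illustrative name.

```python
# Hypothetical cutoffs on the mean sentiment score, checked from best to worst.
THRESHOLDS = (
    (0.75, "Must-watch movie"),
    (0.55, "Good movie"),
    (0.35, "Decent movie"),
)

def rate_movie(review_scores, thresholds=THRESHOLDS):
    """Map the mean sentiment score of the crawled reviews to a verdict."""
    if not review_scores:
        raise ValueError("no reviews to aggregate")
    mean_score = sum(review_scores) / len(review_scores)
    for cutoff, verdict in thresholds:
        if mean_score >= cutoff:
            return verdict
    return "Bad movie"

print(rate_movie([0.9, 0.8, 1.0]))
print(rate_movie([0.1, 0.2]))
```

Averaging is only one possible aggregation; a majority vote over predicted labels would fit the same interface.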

7 Conclusion and future work

Fig. 12: Table of Contribution

In this project, we gathered a dataset of more than 140,000 film reviews from
diverse platforms. We then applied a range of preprocessing techniques. Our
Exploratory Data Analysis encompassed various approaches, including hypothesis testing for
word count, analysis of stop words, and the implementation of topic modeling using the LDA
model.
For word embedding, we employed TF-IDF, CBOW, Skip-gram, and pretrained embeddings.
In the modeling phase, we explored traditional Machine Learning models such as Random For-
est and Gradient Tree Boosting, alongside various Deep Learning methods like Convolutional
Neural Network, Long Short-Term Memory, Gated Recurrent Unit in bidirectional form. Ad-
ditionally, we integrated BERT, a powerful model in Natural Language Processing (NLP).
Looking ahead, our future plans involve an extended focus on data preparation, Exploratory
Data Analysis (EDA), and modeling. Our strategy includes broadening the dataset by collecting
additional data, experimenting with diverse preprocessing and augmentation techniques, and
applying further EDA methods to gain deeper insights. We also aim to create a user-friendly
website to make our project easily accessible.


