Cricket Sentiment Analysis from Bangla Text Using Recurrent Neural Network with Long Short Term Memory Model
Md Jahid Hasan, Shahin Alom

International Conference on Bangla Speech and Language Processing (ICBSLP)

International Conference on Bangla Speech and Language Processing (ICBSLP), 27-28 September 2019

Md. Ferdous Wahid, Md. Jahid Hasan, Md. Shahin Alom
Dept. of Electrical and Electronic Engineering
Hajee Mohammad Danesh Science and Technology University, Dinajpur 5200, Bangladesh

978-1-7281-5242-4/19 ©2019 IEEE

Abstract— Nowadays, people express their feelings, thoughts, suggestions and opinions on different social platforms and video-sharing media. Many discussions take place on Twitter, Facebook and various forums about sports, especially cricket and football. An opinion may express criticism in different manners and notations and may carry different polarities such as positive, negative or neutral, so understanding the sentiment of each opinion is a challenging and time-consuming task even for humans. This problem can be solved by analyzing the sentiment of the respective comments through natural language processing (NLP). Following its success in many deep learning domains, the Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) is popularly used in NLP tasks such as sentiment analysis. We have prepared a dataset of cricket comments in Bangla text that reflects real people's sentiments in three categories, i.e. positive, negative and neutral, and processed it by removing unnecessary words. We then used a word embedding method to vectorize each word and LSTM to capture long-term dependencies. This approach achieves an accuracy of 95%, which exceeds the accuracy of the previous methods.

Keywords— sentiment analysis, natural language processing, deep learning, word embedding, RNN, LSTM.

I. INTRODUCTION
In the present era, people across the globe express their opinions or feelings through social media and the web on different entities such as events, products, social issues and organizations. Hence, a massive amount of text data about various entities is generated on the Internet at every instant. By analyzing these data, business organizations can understand the sentiment of people about their products and find new opportunities, governments can understand people's perception of elections and manage their reputation, event organizers can understand people's expectations of public events, and so on. Thus, there is a strong need to summarize the unstructured data created by people on social media and extract relevant insights in order to understand people's thoughts. Therefore, sentiment analysis, which mines text data for the context that conveys the emotions, sentiments or opinions of an individual, has become a major point of focus in the field of NLP.

In recent times, cricket has gained the utmost popularity in Bangladesh, so people have diverse emotions about this sport. They like to express their opinions and emotions regarding this sport, most often through social networks in the Bangla language. By processing these reviews, it is possible to understand the sentiment of people about cricket. However, very few attempts have been made at sentiment analysis from Bangla text because of the unavailability of well-structured resources for Bangla language processing. Hence, cricket sentiment analysis on Bangla text reflecting real people's sentiments has become an exciting field for us. Nowadays, deep learning is widely used to analyze the sentiment of text and has proven to be an effective tool in terms of accuracy, as it considers the words before and after a target word for text classification. This motivated us to classify cricket sentiment from Bangla text using a deep learning technique.

In this research, a Recurrent Neural Network with an LSTM model is proposed to identify cricket sentiment from Bangla texts. We collected real people's sentiments about cricket from different social media and news portals and categorized them as positive, negative or neutral. Then, each word was vectorized by a word embedding method and LSTM was used to capture long-term dependencies. Finally, an accuracy of 95% was attained in cricket sentiment analysis using the proposed model.

II. RELATED WORK
Sentiment analysis from Bangla text has become a major point of focus in NLP for researchers with the increasing use of social media. Hassan et al. [1] proposed a model utilizing LSTM with binary cross-entropy and categorical cross-entropy loss functions for sentiment analysis of Bangla and Romanized Bangla Text (BRBT). Sharfuddin et al. [2] developed an approach combining a deep RNN with bidirectional LSTM to classify the sentiment of Bengali text, which achieved 85.67% accuracy on a dataset containing 10000 comments from Facebook statuses. Baktha et al. [3] explored RNN architectures on three datasets and obtained the best accuracy from Gated Recurrent Units (GRU) for sentiment analysis. Tripto et al. [4] suggested deep learning based models for detecting multilabel sentiment and emotions from Bengali YouTube comments. They used Convolutional Neural Network (CNN), LSTM, Support Vector Machine (SVM) and Naïve Bayes (NB) architectures to identify three-label (positive, negative, neutral) and five-label (strongly positive, positive, neutral, negative, strongly negative) sentiment as well as emotions, with SVM and NB as their baseline methods. Term Frequency Inverse Document Frequency (TF-IDF) with n-gram tokens was used to extract a set of features from each sentence. They obtained 65.97% and 54.24% accuracy for the three- and five-label
sentiments respectively. A sentiment polarity detection approach was investigated on a Bengali tweet dataset by Sarkar et al. [5] by applying multinomial NB and SVM. A character-level supervised RNN approach was used to classify Bengali sentiment as positive, negative or neutral [6]. Aspect-Based Sentiment Analysis was evaluated on cricket comments in Bangla text in [7], where the best accuracy on the ABSA dataset was 71% using an SVM classifier. Mahtab et al. [8] employed SVM, Decision Tree and Multinomial NB for sentiment analysis on Bangladesh cricket comments. They obtained their best accuracy, about 74%, using SVM. These results motivated us to develop a model for analyzing the sentiment of real people by utilizing an RNN with an LSTM model.

III. DATASET PREPARATION
The preparation of the dataset was divided into two parts, i.e. A. Gathering of Bangla comments and B. Preprocessing of Bangla comments.

A. Gathering of Bangla comments
We have collected a dataset named ‘ABSA’ [7] that contains cricket-related comments for cricket sentiment analysis from Bangla texts. The dataset comprises 2979 records with 5 columns, of which we selected only two, i.e. the comment column and the target column. The target column contains 3 classes: positive, negative and neutral. However, the proposed LSTM-RNN model may suffer from high variance on such a small dataset. To overcome this issue, we supplemented the existing ABSA dataset with more data. The extended data were picked from various online resources such as Facebook, YouTube, Prothom-Alo, BBC Bangla and Bdnews24.com and labelled manually. The extended ABSA dataset contains 10000 comments in total, of which 8000 comments are used for training and the rest for testing.

B. Preprocessing of Bangla comments
Data preprocessing plays a significant role in natural language processing (NLP), as real-world data are messy and often contain errors, unnecessary information and duplication. To obtain good analytic results, punctuation and unimportant words are typically removed, words are stemmed to their roots, missing values are replaced with some value and text is converted to a single case, depending on the requirements of the application. Therefore, we process our data step by step to remove content that does not carry much weight in the context of the text. All preprocessing steps are described below.

1) Stopwords Removal: Stopwords are the most common words in a language, and most of them have no impact on the sentiment of a sentence. Common words such as এবং, এবার, এ, এটা, কী, হয়, র, পর, ওরা, ক, কউ, ইত ািদ have no impact on sentiment analysis using the proposed model. However, some words such as না, নাই, নই, নয় have an important impact on negative sentiment, and some words such as হ া, , কের, কাজ, কােজ have an important impact on positive sentiment. So, we keep these positive and negative sentiment-bearing words out of the stopword list as a whitelist. We then programmatically removed all remaining stopwords from our dataset. We use two resources as a benchmark for removing Bangla stopwords.

2) Text Process: Text processing is of great importance in NLP tasks. Unwanted text such as links, URLs, user tags and mentions in comments, hashtags and punctuation marks has no impact on sentiment analysis. Therefore, we remove these to give annotators unbiased text. Only the content is given priority when deciding among the three classes: positive, negative and neutral.

3) Name Process: Name processing is another text compression technique, in which proper and common nouns in Bangla are substituted by common words. Example:
১. তািমম আজেক খুব সু র ব া টং কেরেছ।
২. িলটন দাশেক আজেক দেল নয়া উিচত িছল।
Here তািমম and িলটন are two different words, but they occur in the same context and contribute a similar impact on sentiment analysis using the proposed LSTM-RNN model. So, we replace all proper nouns with a common word, which does not reduce the accuracy of the model but compresses the dataset and makes it more robust for classification. Thus, we replace all country names such as (বাংলােদশ, ভারত, পািক ান, অে িলয়া) with the word “ দশ” and all player names, including different spellings and nicknames, such as (তািমম, তািমম ইকবাল, মাশরািফ, ম াশ, সািকব, িবরাট কাহিল, মুশিফকুর রিহম, মুশিফক) with the word “ স”.

4) Manual validation: In manual validation, each extended data sample was manually annotated by three annotators into one of three categories: (i) positive, (ii) negative and (iii) neutral. Each annotator validated the data without knowing the decisions made by the others. This ensures that the validations were unbiased and personal. Furthermore, elongated words often carry more sentiment information for multiclass categorization. For example, “বাহ্হহ ্ হ অেনক ভােলা!!” certainly conveys a stronger positive feeling. Therefore, we kept elongated words instead of applying lemmatization.

IV. METHODOLOGY
A Recurrent Neural Network (RNN) performs better than a convolutional neural network (CNN) on sequential input data such as speech, music and text (for example, in named entity recognition), because it takes into consideration the current input as well as previous inputs. However, it generally does not work well on long-term dependencies due to the vanishing and exploding gradient problem [9]. Hence, in order to learn long-term dependencies, LSTM is added, which enables the RNN to remember its inputs over a long period of time. Here, the LSTM layer is fed with a numerical vector representation of each word, which is generated by word embedding. To generate the word embedding, we employed the word2vec algorithm, which forms a weight matrix from the text corpus. Finally, the output of the LSTM layer is sent to a fully connected softmax layer to classify the sentiment of a comment.

A. Word Embedding
The preprocessed text data is first tokenized in order to split a sentence into smaller parts such as words. Then we implemented the skip-gram [10] model of the word2vec algorithm to produce a numerical vector for each
split word. These vectors can be used to evaluate the similarity between words, since similar words are placed close together in the vector space, and they are computationally more effective for learning than a one-hot vector representation. The output of the word2vec model is called an embedding matrix. The complete word embedding process is shown in Fig. 1.

[Figure: the input sentence "বাংলােদশ েকট দলেক অিভন ন" is tokenized into words, the words are indexed as 0, 1, 2, 3, and each index is looked up in the learned embedding matrix to give embedding weights such as 0.25, 0.38, 0.46, 0.28.]
Fig. 1. Complete word embedding process

B. Model Architecture for sentiment analysis
The proposed architecture, built to classify the sentiment of cricket comments, consists of 4 layers. These layers are
1. Input layer (length 1)
2. Embedding layer (output of length 300)
3. LSTM layer (100 neurons)
4. Output layer (103 neurons, including 100 dense neurons and 3 sigmoid neurons)

A maximum comment length of 42 words is used, and zero padding is added to the right of a comment when it is shorter than 42 words. The model takes an input of 42 integers, where each integer represents a word. So there are 42 time steps, and one word is given to the model at each time step. The word first enters the embedding layer with one neuron. The embedding layer transforms the word into a numerical vector representation of length 300 (the embedding size). The embedding weights are initially set to very small values and are updated using back-propagation during training. In this way 300 feature values are created. Then the output of the embedding layer is fed into an LSTM layer with 100 neurons, and each of the feature values is multiplied by a weight of each LSTM cell, where an LSTM cell contains four gates for memorizing long-term dependencies. Besides those 300 weighted features, the output of the previous time step (the output values from the 100 neurons) is also used as an input to the LSTM cells. Finally, the weighted sum of the dense layer outputs is taken as the input of a softmax activation function, which predicts the probability of a cricket comment being positive, negative or neutral. The complete architecture is shown in Fig. 2.

[Figure: input layer (Xt), embedding layer (X1 ... X300), LSTM layer (h1 ... h100), output layer (dense units D1 ... D100 followed by three σ units), output yt.]
Fig. 2. LSTM network for cricket comments sentiment classification

V. RESULT AND DISCUSSION
We used an LSTM layer to construct and train a many-to-one RNN architecture. The architecture takes a sequence of words (a sentence) as input and provides a sentiment value (positive, negative or neutral) as output. We split the prepared dataset of 10,000 sentences in the ratio 80:20 for the training and test phases of the proposed RNN-LSTM architecture. During training, the learning rate was set to a small value of 0.001, and different combinations of epochs and batch size were tried to attain high prediction performance with minimum training time. The architecture attained its best training accuracy after 15 epochs with batch size 30. Finally, the performance of the proposed architecture was evaluated on the test dataset, where it obtained 95% prediction accuracy. The accuracy and loss of the proposed model were recorded during training. Fig. 3 and Fig. 4 show the model accuracy and loss graphs respectively.

Fig. 3. Optimal accuracy of proposed model
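The training setup reported in this section can be made concrete with a small amount of arithmetic. The following Python sketch is illustrative only: the dataset size, split ratio, batch size and epoch count are taken from the text, while the seeded shuffle, the helper names `split_indices` and `accuracy`, and the demo labels are assumptions introduced here and are not the authors' implementation.

```python
import math
import random

TOTAL_COMMENTS = 10_000   # size of the extended dataset reported in the paper
TRAIN_PERCENT = 80        # 80:20 train/test ratio reported in the paper
BATCH_SIZE = 30           # batch size reported for the best run
EPOCHS = 15               # number of epochs reported for the best run


def split_indices(n, train_percent, seed=0):
    """Shuffle the indices 0..n-1 and split them into train and test lists.

    Shuffling with a fixed seed is an assumption; the paper does not say
    how the 80:20 split was drawn.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = n * train_percent // 100
    return indices[:cut], indices[cut:]


def accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the actual one."""
    if len(predicted) != len(actual):
        raise ValueError("label lists must have the same length")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


train_idx, test_idx = split_indices(TOTAL_COMMENTS, TRAIN_PERCENT)

# The last batch of each epoch is smaller when the batch size does not
# divide the training set evenly, hence the ceiling.
batches_per_epoch = math.ceil(len(train_idx) / BATCH_SIZE)
total_updates = batches_per_epoch * EPOCHS  # assumes one weight update per batch

# Hypothetical labels, used only to exercise the accuracy metric.
demo_actual = ["positive", "negative", "neutral", "negative"]
demo_predicted = ["positive", "negative", "neutral", "neutral"]
demo_accuracy = accuracy(demo_predicted, demo_actual)
```

Under these assumptions, the 8,000 training comments give 267 batches per epoch at batch size 30 (the last batch holds 20 comments), so 15 epochs correspond to 4,005 weight updates, and the reported accuracy is measured on the 2,000 held-out comments.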
Fig. 4. Optimal loss of proposed model

Fig. 3 and Fig. 4 show that the training and validation accuracy increase, and the training and validation loss decrease, as the number of epochs increases. Some example classification results are shown in TABLE I. The misclassified result is marked with an asterisk (*) in TABLE I.

TABLE I. SOME TEST RESULT

Sentence | Predict | Actual
বাংলােদশ জতেব ইনশাআ াহ। | positive | positive
ট েকেট রান আউট খুবই দুঃখজনক। | negative | negative
শাহািরয়ার নািফস ক ও ফরােনা হাক। | positive | positive
ওরা ২০০ করেছ তামরা ১০০ করেত পারেব না | negative | negative
* টস হারেছ মােন, হারার স বনাই বশী। | neutral | negative
মাসাে ক, সা র থেক ভাল। | positive | positive
সিমেত আমােদর িতপ ভারত ( ায় িন ত)। | neutral | neutral

A comparative study of classification accuracy among several models, including the proposed LSTM-RNN model, is presented in TABLE II. It shows that the proposed architecture performs better than the previously reported models listed there, as the RNN-LSTM model has a great capacity to capture contextual information in a more fine-grained way.

TABLE II. ACCURACY MATRIX AMONG DIFFERENT BANGLA NLP TASKS BASED ON MODEL PERFORMANCE

Ref. No. | Dataset | Model | Prediction Accuracy
[7] | ABSA | SVM | 71%
[8] | ABSA_EXTENDED | SVM | 73.49%
Proposed method | ABSA_EXTENDED | LSTM | 95%

VI. CONCLUSIONS
In this paper we present an approach to analyze the sentiment of cricket comments in Bangla text. The model is based on a deep learning variant named RNN. To retain the recurrent property and contextual meaning of a sentence, we used LSTM, which makes the model effective and yields a prediction accuracy of about 95%. Spell-checking and stemming are not included in the preprocessing of our collected dataset. In future, we will include more preprocessing steps along with these two in order to improve the structure of our dataset. We also plan to increase the number of target classes to build a more accurate NLP model within this problem domain.

REFERENCES
[1] A. Hassan, M. Amin, A. Azad and N. Mohammed, “Sentiment analysis on Bangla and Romanized Bangla text using deep recurrent models”, 2016 International Workshop on Computational Intelligence (IWCI), 2016.
[2] A. Aziz Sharfuddin, M. Nafis Tihami and M. Saiful Islam, “A Deep Recurrent Neural Network with BiLSTM model for Sentiment Classification”, 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), 2018.
[3] K. Baktha and B. K. Tripathy, “Investigation of recurrent neural networks in the field of sentiment analysis”, 2017 International Conference on Communication and Signal Processing (ICCSP), 2017.
[4] N. I. Tripto and M. E. Ali, “Detecting Multilabel Sentiment and Emotions from Bangla YouTube Comments”, 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), 2018.
[5] K. Sarkar and M. Bhowmick, “Sentiment polarity detection in Bengali tweets using multinomial Naïve Bayes and support vector machines”, 2017 IEEE Calcutta Conference (CALCON), 2017.
[6] M. S. Haydar, M. A. Helal and S. A. Hossain, “Sentiment Extraction From Bangla Text: A Character Level Supervised Recurrent Neural Network Approach”, 2018 International Conference on Computer, Communication, Chemical, Material and Electronic Engineering (IC4ME2), 2018.
[7] M. Rahman and K. E. Dey, “Datasets for Aspect-Based Sentiment Analysis in Bangla and Its Baseline Evaluation”, Data, vol. 3, issue 2, 2018.
[8] S. Arafin Mahtab, N. Islam and M. Mahfuzur Rahaman, “Sentiment Analysis on Bangladesh Cricket with Support Vector Machine”, 2018 International Conference on Bangla Speech and Language Processing (ICBSLP), Sylhet, pp. 1-4, 2018.
[9] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory”, Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[10] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality”, NIPS'13 Proceedings of the 26th International Conference on Neural Information Processing Systems, vol. 2, pp. 3111-3119, 2013.
