Sentiment Analysis for a Resource Poor Language—Roman Urdu
Sentiment analysis is an important sub-task of Natural Language Processing that aims to determine the po-
larity of a review. Most of the work done on sentiment analysis is for the resource-rich languages of the
world, but very limited work has been done on resource-poor languages. In this work, we focus on develop-
ing a Sentiment Analysis System for Roman Urdu, which is a resource-poor language. To this end, a dataset
of 11,000 reviews has been gathered from six different domains. Comprehensive annotation guidelines were
defined and the dataset was annotated using the multi-annotator methodology. Using the annotated dataset,
state-of-the-art algorithms were used to build a sentiment analysis system. To improve the results of these
algorithms, four different studies were carried out, based on word-level features, character-level features,
feature union, and deep learning. The best results reduced the error rate by 12% from the baseline (80.07%
accuracy). To check whether the improvements are statistically significant, we applied a t-test and confidence
intervals to the obtained results and found that the best result of each study is a statistically significant
improvement over the baseline.
CCS Concepts: • Computing methodologies → Language resources;
Additional Key Words and Phrases: Resource poor language, Roman Urdu, Roman Urdu sentiment analysis
ACM Reference format:
Khawar Mehmood, Daryl Essam, Kamran Shafi, and Muhammad Kamran Malik. 2019. Sentiment Analysis
for a Resource Poor Language—Roman Urdu. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 19, 1, Article
10 (August 2019), 15 pages.
https://doi.org/10.1145/3329709
1 INTRODUCTION
Sentiment analysis (SA) is a process for measuring the affective states of a topic or review. People
share their opinions and experiences about topics, products, social events, and political and eco-
nomic issues using social media. The proliferation of such information has made it a challenging
task to manually distill relevant details from this data, which necessitates the development of an
automated system, such as an SA system, that can intelligently extract sentiments from it.
Authors’ addresses: K. Mehmood, D. Essam, and K. Shafi, University of New South Wales, Northcott Drive, Campbell,
ACT 2600, Australia; emails: [email protected], {d.essam, k.shafi}@adfa.edu.au; M. K. Malik, Punjab
University College of Information Technology, University of the Punjab, Old Campus, The Mall, Lahore, Pakistan; email:
[email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
2375-4699/2019/08-ART10 $15.00
https://doi.org/10.1145/3329709
ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 19, No. 1, Article 10. Publication date: August 2019.
Most of the work conducted on SA has been for major languages such as English and Chinese
[3, 7], with very little on Roman Urdu (RU)/Hindi, which is a resource-poor language.1 The development
of a robust Roman Urdu Sentiment Analysis (RUSA) system is necessary for two major reasons.
Firstly, Urdu/Hindi is the third most-spoken language in the world, with more than 500 million
speakers [4]. Secondly, it is increasingly being used because people prefer to communicate on the
web using the Latin script (i.e., RU uses the 26-letter English alphabet) instead of typing in their
language using a language-specific keyboard. The diversity and enormity of its user base motivate
work on RUSA.
The main contributions of this work are as follows.
(1) Developing the largest ever RUSA corpus of 11,000 reviews (at the time of writing) ob-
tained from six different domains and making it publicly available.2
(2) Identifying features while considering the complexities of RU, that is, (a) word-level fea-
tures, (b) character-level features, and (c) feature union (word + character level).
(3) Applying machine-learning (ML) algorithms from seven different categories to the task of
developing a RUSA system and comparing the results to determine their statistical significance.
(4) Proposing a confidence-based voting technique for RUSA.
2 RELATED WORK
SA is one of the most active areas of computer science [5]. The three main approaches to SA are
ML [6], lexicon-based [36], and hybrid approaches [38]. In terms of granularity, SA classification is
broadly performed at the document [26], sentence [5, 12], and aspect levels [8, 27].
Different state-of-the-art algorithms have been used to improve the accuracy of SA. In Ref. [11],
the authors introduced a neural tensor network that models the relationship between two entities,
representing each entity as the average of its word vectors, which allows statistical strength to be
shared among the words describing each entity. To further improve the accuracy of predicting the relationship
between two entities, they trained word vectors on the large unlabeled text of WordNet and Free-
base and used them in their model. In Ref. [35], a recurrent convolutional neural network (RCNN)
was proposed and applied to the task of text classification. As a first component, RCNN uses a bidi-
rectional recurrent structure and its second layer is based on max-pooling. To further improve the
results of their model, they used pre-trained word embeddings. In Ref. [33], the authors trained a
simple convolutional neural network (CNN) with one layer of convolution on top of word vectors
pre-trained on 100 billion words of Google News. The model could not perform well when used
without pre-trained word embeddings, but achieved excellent results on multiple benchmarks with
pre-trained word embeddings.
In Ref. [6], the authors performed SA on a dataset of movie reviews. They considered
different features, used three standard ML algorithms, and achieved an accuracy of 82.9%.
In Ref. [9], the authors performed an emotion detection task at the document level in which they
exploited the human psychology of summarizing the emotions of a complete document in the last
sentence. In Ref. [1], the author collected 1,000 tweets written in Arabic and Arabizi. The author
sought to convert Arabizi to Standard Arabic and used tokenization, stemming, and stop word
removal as preprocessing. It was concluded that better performance could be achieved without
stemming and stop word removal. In that study, Naive Bayes (NB) performed best, with an accuracy
of 76.78%. In Ref. [14], the author performed aspect-based SA on Twitter data consisting
of pure English, pure RU, and mixed tweets. As well as applying standard ML algorithms, the
1 Lacks publicly available annotated datasets and linguistic resources (stemmers, lemmatizers, POS taggers, etc.).
2 This dataset can be accessed by emailing the corresponding author, Khawar Mehmood ([email protected]).
author also used a customized RU lexicon and a SentiWordNet list to assign weights to the words
in tweets. Comparing the performances of the different ML algorithms, it was concluded that
bagging outperformed the others.
SA is not limited to only classifying reviews into positive and negative classes but has a wider
range of applications [45]. In Ref. [13], the authors compared the literature of SA with that of
human-computer interactions in order to propose an interaction model that considered SA for
socio-emotional embodied conversational agents (ECAs). While all the above studies, along with
many others, have been conducted for major languages [3, 7], only a limited number have considered
RU [17]. Table 1 summarizes the work on Urdu/RU SA. However, the techniques developed for
other major languages may not generalize well to RU, which has its roots in Urdu, a free-word-
order language with a complex morphology [18]. This problem is further aggravated by the user-
dependent selection of word spelling in RU, which is due to the lack of a standard approved method
for representing the Urdu language in Latin script [19].
4 DATASET
This section describes the steps taken to collect the RU data required to develop an SA system.
4 Using https://scrapy.org/.
5 https://www.facebook.com/help/203805466323736?ref=dp Last Visited on: 3-1-2019.
6 https://harvardlawreview.org/2014/12/data-mining-dog-sniffs-and-the-fourth-amendment/ Last Visited on: 3-1-2019.
4.6 Data Annotation, the Guidelines, and How They Were Decided
To develop the RUSA using supervised ML algorithms, we required an annotated dataset con-
taining reviews and their sentiments. A systematic approach was used to define the annotation
guidelines and to annotate the dataset. Annotation guidelines were defined as a two-step proce-
dure. In the first step, extensive study of the existing work on annotation guidelines was done
[32, 39, 40, 43] and the baseline guidelines were defined for the unambiguous cases. The second
step was to improve the guidelines. Two of the three annotators were asked to independently
annotate 1,500 reviews under the provided guidelines. The annotators reported back whenever they
had difficulty deciding the labels of some reviews. Discussions were conducted in the presence
of the third annotator (involved only for discrepancy resolution) and the guidelines were enhanced to
cater for those cases. This process continued until the annotators could annotate the provided dataset
without difficulty. The ambiguous reviews included reviews with two sentiments in one review and
the reviews mentioned in the special cases. For example, the case of “itna acha mobile ha
ky banda muft bi na lay” (It is such a good mobile that no one would take it even for free),
which was not covered in the initial guidelines, was discussed and marked as negative because it
was considered a sarcastic remark. Similarly, the case of “qeimat achi hy or battery bohat hi kam hy”
(The price is good but the battery life is very poor) was marked as “others” by one annotator and
“negative” by the other. In the ensuing discussion, it was decided that in all the reviews with both
positive and negative sentiments, the modifiers (bohat (much), zayada (more), kam (less), etc.) attached
to the sentiment words would decide the polarity of the review. Therefore, this review was marked as
“negative” because its strongest sentiment, “bohat hi kam”, is negative.
While formulating the guidelines, we focused on the following three main aspects.
(1) What to annotate (reviews containing positive and negative terms, positive and negative
events, and positive and negative emotional states of the speaker).
(2) What not to annotate (neutral instances and those containing both sentiments).
(3) How to handle special cases (quotations, negations, sarcasm, one side denouncing the
other).
The following are the final guidelines given to the annotators for manual annotation of the
dataset.
(1) Reviews bearing positive terms: All those reviews that explicitly contain feelings of
joy, appreciation, optimism, excitement, and hope will be marked as positive. Such reviews
should contain positive terms like “acha” (good), “khobsoorat” (beautiful), and “saaf” (clean)
without being modified by any other term like “na”, “nahi”, and “mat”, as these terms invert
the polarity.
(2) Reviews bearing negative terms: All those reviews that express feelings of sadness,
anger, and violence will be marked as negative. Such reviews should contain negative
terms like “bura” (bad), “bakwass” (rubbish), “zulam” (cruelty), “ganda” (dirty), and
“mayosi” (dejection), without being modified by any other term like “na”, “nahi”, and
“mat”, as these terms invert the polarity.
(3) Positive or negative events and situations: Sometimes the explicit sentiment of a speaker
talking about an event may not be evident. In that case, the review will be marked as
per the polarity of the event, for example, “Jang mulkon ko bohat bara jani aur mali
nuqsan pohanchati ha” (War causes great losses to countries in terms of lives and economy).
This review will be marked as negative since the speaker is talking about a negative event,
i.e., “Jang” (war).
(4) Positive or negative emotional state of the speaker: In some cases, the polarity of a
review may not conform to the emotional state of the speaker. For example, the review
“ham kab tak aysa logon ko vote dyty rahyn gy?” (How long will we keep voting for such
people?) does not contain any negative terms but reflects that the speaker uttered it in
deep frustration, and so all such reviews will be marked as negative. Another example,
“wah kya baat ha! Qalanders na Peshawar Zelmi ko hara dya” (Wow! Qalanders have
defeated Peshawar Zelmi), contains the negative term “hara” (defeat), but the review will
be marked as positive, as the word “wah!” (wow) identifies that the speaker is in a positive
emotional state.
(5) Neutral reviews: All those reviews that do not contain any sentiment bearing words or
present only facts (objectivity) will be considered neutral and marked as “others”.
(6) Reviews containing both positive and negative terms: Reviews containing positive and
negative terms of the same intensity7 [16] will be ignored. For example, “OnePlus
mobile ki awaz to achi ha lakin is ki battery kam chalty ha” (The OnePlus mobile has
good sound but its battery lasts less time than others) will be ignored. In contrast, “is
hotel ki chay achi nahi ha lakin is ki coffee bohat hi achi ha” (This restaurant has bad
tea but its coffee is extremely good) will be marked as positive, since its strongest
sentiment is positive. Similarly, a review containing more positive than negative terms
will be marked as positive, and vice versa. For example, “Is dramay ki kahani bohat
achi thee, adakaron ki adakari bhe aala thee lakin daramy ka akhir bura hoa” (This
drama had a good storyline, the acting was also marvelous, but its ending was bad) will
be marked as positive, as it has two positive terms and only one negative term.
(7) Special cases.
(a) Quotations: Reviews containing, or consisting entirely of, quotations, idioms, and
proverbs will be marked as per the polarity borne by them. For example, “Agar app
dosron ki qadar pamaye kerna chahty han to phir pehly apny aap ko dekhyn kyon ka
Aap Bhaly To Jug Bhala” (If you want to judge the worth of other people, then first
evaluate yourself, because good mind, good find) will be marked as positive since it
contains a positive proverb.
(b) Sarcasm: This is speech or writing that actually means the opposite of what it appears
to and is usually intended to mock or insult someone (definition from the Collins
English Dictionary). Reviews that contain mockery, scoffing, and ridicule will be
treated as sarcasm as per our annotation guidelines and marked as negative. For
example, “MashaAllah se koe to position aie ha na hamari hockey team ki . . . aakhri
he sahi” (By the grace of God, our hockey team got at least some position . . . may it
be the last).
(c) Negations: These are words that flip the polarity of a sentiment. All such reviews
should be carefully identified and analyzed during the annotation process and
marked according to the inferred sentiment. For example, “punjab ki hakomat acchi
nahi ha” (The government of Punjab is not good) will be
7 In this work, the intensity is determined by the modifiers; for example, the intensity of “kam” is less than that of “bohat
kam”, while that of “bohat zayada” is greater than that of “zyada”, where “bohat” acts as a modifier.
marked as negative and “sarhad ki hakomat bury nahi ha” (The government of Sarhad
is not bad) as positive.
(d) Miscellaneous.
(i) Reviews containing supplications will be marked as positive. For example,
“Khuda hamary mulk per reham farmay” (May God shower his mercy on our
country).
(ii) Questions showing frustration and dismay will be marked as negative. For
example, “is mulk ka kya bany ga?” (What will happen to this country?).
(iii) Reviews in which one party wins over another will be marked as per the incli-
nation of the speaker toward a group or party. For example, “Aala! Is election
ma PTI na N-league ko hara dya” (Great! In this election, PTI defeated N-league)
will be marked as positive because the speaker is in favor of the PTI (as shown
by the word “Aala”, Great) and against the N-league. If the speaker’s inclination
is missing, the annotator will look for positive or negative terms and mark the
review as per their polarity. For example, since “Is election ma PTI na N-league
ko hara dya” (In this election, the PTI defeated the N-league) is missing the
inclination of the speaker toward any group, it will be marked as negative due
to the presence of the word “hara” (defeat).
(iv) Since, in RU, positive and negative terms may have different spellings, the
annotator will consider all their variations.
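The term-counting and negation rules above were applied by human annotators, but they can be caricatured in code. The following is an illustrative sketch only: the word lists and the `toy_polarity` function are hypothetical samples, not part of the annotation pipeline. It counts positive and negative terms and flips a term's polarity when a negator such as “nahi” follows it, roughly mirroring guidelines (1), (2), (6), and (7c).

```python
# Tiny hypothetical word lists; real annotation used full human judgment.
POSITIVE = {"acha", "achi", "khobsoorat", "saaf", "aala"}
NEGATIVE = {"bura", "bury", "bakwass", "zulam", "ganda", "mayosi", "hara"}
NEGATORS = {"na", "nahi", "mat"}  # these invert the polarity of a sentiment term

def toy_polarity(review):
    tokens = review.lower().split()
    pos = neg = 0
    for i, tok in enumerate(tokens):
        # In RU the negator typically follows the sentiment word, e.g., "achi nahi".
        negated = i + 1 < len(tokens) and tokens[i + 1] in NEGATORS
        if tok in POSITIVE:
            pos, neg = (pos, neg + 1) if negated else (pos + 1, neg)
        elif tok in NEGATIVE:
            pos, neg = (pos + 1, neg) if negated else (pos, neg + 1)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "others"  # neutral, or an equal-intensity mix, per guideline (6)

print(toy_polarity("sarhad ki hakomat bury nahi ha"))  # negator flips "bury" -> positive
```

Such a scorer obviously cannot capture sarcasm or the speaker's emotional state, which is precisely why manual multi-annotator labeling was needed.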
5 ALGORITHMS USED
In this section, the algorithms used for the development of the RUSA and their technical details
are provided.
(1) Instance-Based Learning [21]: We used k-nearest neighbors (KNN) with n_neighbors = 5
and weights = “uniform.”
(2) Tree-Based Algorithms: Decision trees (DT) and Random Forest (RF) were used with de-
fault parameters.
(3) Probabilistic Learning Algorithms [22]: Logistic Regression (LR) with C = 1.0 and NB were
used.
(4) Perceptron Based [23]: We used Multi-Layer Perceptron (ANN) with one hidden layer
(100 units and activation = “relu”).
(5) Large Margin Classifiers [24]: We used SVM with C = 1.0 and kernel = “rbf” for our ex-
perimentation.
(6) Ensembles [25]: We used AdaBoost (ABNB, ABLR), Bagging (BNB, BLR, BANN), and ma-
jority and weighted voting (wVoting).
(7) Deep Learning: We used long short-term memory (LSTM) networks and a CNN. The
LSTM consists of one embedding layer (embed_dim = 128), one LSTM layer (196 units),
and an output layer. The CNN consists of an embedding layer (embed_dim = 128), one
convolutional layer (filters = 64 and kernel_size = 3), one dense layer (units = 180 and
activation = “tanh”), and an output layer.
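The parameter names above match scikit-learn's conventions, so the classical models can be sketched roughly as follows. This is an assumption: the paper does not name its toolkit, the `classifiers` dictionary is ours, and any parameter not quoted in the text is left at a library default.

```python
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

classifiers = {
    "KNN": KNeighborsClassifier(n_neighbors=5, weights="uniform"),
    "DT": DecisionTreeClassifier(),            # default parameters, per the text
    "RF": RandomForestClassifier(),
    "LR": LogisticRegression(C=1.0),
    "NB": MultinomialNB(),
    "ANN": MLPClassifier(hidden_layer_sizes=(100,), activation="relu"),
    "SVM": SVC(C=1.0, kernel="rbf"),
    # Ensembles built from the base learners (ABNB and BLR shown; others analogous).
    "ABNB": AdaBoostClassifier(MultinomialNB()),
    "BLR": BaggingClassifier(LogisticRegression(C=1.0)),
}
```

Each estimator can then be dropped into the same train/test loop unchanged, which is what makes comparing seven algorithm categories on one dataset practical.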
6 FEATURES USED
In this work, three broad categories of features were used, namely: word level features, character
level features, and feature union.
(1) Word level features: Word level N-gram features used in this work consist of unigram (Uni),
bigram (Bi), unigram+bigram (Uni-Bi), and unigram+bigram+trigram (Uni-Bi-Tri). To com-
pute the value of each N-gram feature, we have used feature presence (FP), feature fre-
quency (Freq), and Term Frequency-Inverse Document Frequency (TFIDF).
(2) Character level features: Character level features used in this work are bigram, trigram,
4-gram, 5-gram, and 6-gram for “with word boundary” and “without word boundary.”
(3) Feature union: In feature union, we have combined different word level and character level
features.
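Assuming a scikit-learn-style pipeline (an assumption; the text does not name a toolkit), the three feature categories can be sketched as below: `binary=True` gives feature presence, raw counts give feature frequency, TFIDF comes from `TfidfVectorizer`, and the `char_wb` versus `char` analyzers toggle the word boundary for character n-grams.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

reviews = ["ye mobile bohat acha ha", "ye drama bohat bakwas ha"]  # toy RU reviews

# Word-level n-grams with the three value schemes.
word_fp    = CountVectorizer(ngram_range=(1, 2), binary=True)   # feature presence (FP)
word_freq  = CountVectorizer(ngram_range=(1, 2))                # feature frequency (Freq)
word_tfidf = TfidfVectorizer(ngram_range=(1, 2))                # TFIDF

# Character-level 5-grams: 'char_wb' pads with word boundaries, 'char' does not.
char5_wb = CountVectorizer(analyzer="char_wb", ngram_range=(5, 5))
char5    = CountVectorizer(analyzer="char",    ngram_range=(5, 5))

# Feature union: character 5-grams concatenated with Uni-Bi-Tri feature presence.
union = FeatureUnion([
    ("char5", CountVectorizer(analyzer="char_wb", ngram_range=(5, 5))),
    ("word",  CountVectorizer(ngram_range=(1, 3), binary=True)),
])
X = union.fit_transform(reviews)
print(X.shape[0])  # one row per review
```

The union simply concatenates the two sparse matrices column-wise, which is consistent with the large feature counts reported for the combined settings later in the paper.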
7 METHODOLOGY
This section explains the proposed methodology, whose steps involve extensive experimentation
with different features (see Section 6) to improve the accuracies of the ML algorithms.
Firstly, the source websites, blogs, and social media links, which contained user reviews in RU,
were identified, the reviews extracted using a semi-automatic methodology, and then the data
cleaned by removing unwanted information and stored in an Excel file. Since the extracted reviews
did not contain polarity (positive or negative), in the next step, the data was annotated using the
multi-annotator methodology described in Section 4. For consistency, we trained and tested our
model on mutually exclusive training and testing sets. To do this, we executed 30 iterations of each
experiment; in each, the entire dataset was randomly split, with stratification, into training (80%)
and testing (20%) sets, ML algorithms [28] from each high-level category (see Section 5) were
applied, and the average accuracy over all the iterations was computed. Also,
during each iteration, we computed the minimum and maximum accuracies and standard deviation
of each feature achieved by each classifier. Since all the experiments were performed on the entire
dataset, which was balanced (50.69% positive and 49.31% negative reviews), we used accuracy8
as the metric for assessing performance [6, 41, 42]. Further, while the reviews are categorized by
domain, considering those domains is left to later research.
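The evaluation protocol just described can be sketched as follows. This is a minimal illustration: the synthetic data, the `evaluate` helper, and the single NB classifier (standing in for the full battery of algorithms) are ours, not the authors'.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

def evaluate(texts, labels, n_iters=30):
    """Repeat a stratified 80/20 split n_iters times; return mean, min, max, std accuracy."""
    accs = []
    for seed in range(n_iters):
        X_tr, X_te, y_tr, y_te = train_test_split(
            texts, labels, test_size=0.20, stratify=labels, random_state=seed)
        vec = CountVectorizer(binary=True)   # unigram feature presence
        clf = MultinomialNB().fit(vec.fit_transform(X_tr), y_tr)
        accs.append(accuracy_score(y_te, clf.predict(vec.transform(X_te))))
    accs = np.array(accs)
    return accs.mean(), accs.min(), accs.max(), accs.std()

# Toy illustration on synthetic reviews (the real dataset has 11,000).
pos = [f"review {i} bohat acha ha" for i in range(10)]
neg = [f"review {i} bohat bakwas ha" for i in range(10)]
mean_acc, min_acc, max_acc, std_acc = evaluate(pos + neg, [1] * 10 + [0] * 10)
print(round(mean_acc, 2))
```

Stratifying each split keeps the near-balanced class ratio intact in both partitions, and re-seeding per iteration yields the per-feature mean, min, max, and standard deviation the paper reports.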
Three different studies were carried out on three different categories of features (see Section 6),
namely, word and character level, and feature union, with a fourth conducted using deep learning.
The first study consisted of the word-level features in which we performed experiments on three
sub-categories, namely, Freq, FP, and TFIDF. For each subcategory, we tested Uni, Bi, Uni-Bi, and
Uni-Bi-Tri features. The second study consisted of the character-level features, in which we
performed experiments in two sub-categories, namely, with and without a word boundary. For
each subcategory, bigram, trigram, 4-gram, 5-gram, and 6-gram were used as features. In the third
study, we did feature union by considering different combinations of the top-performing features
from the first and second studies. The main features selected for this study were character (5-gram
and 6-gram) and word (Uni-Bi and Uni-Bi-Tri - FP and Freq) level. The fourth study was performed
on the entire dataset using CNN and LSTM. To further improve accuracies, we used an ensemble
of ML algorithms, followed by majority voting and wVoting, using the top three best-performing
standard ML algorithms.
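A minimal sketch of the voting step, assuming scikit-learn's `VotingClassifier` over the top three models (LR, NB, ANN): hard voting implements majority voting, while soft voting with per-model weights approximates wVoting. The confidence-based weighting scheme itself is not specified in this excerpt, so the weights below are purely illustrative.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neural_network import MLPClassifier

# Toy bag-of-words counts (8 documents x 3 features) with binary labels.
X = np.array([[3, 0, 1], [2, 0, 0], [3, 1, 0], [4, 0, 1],
              [0, 3, 1], [0, 2, 0], [1, 3, 0], [0, 4, 1]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

estimators = [
    ("lr", LogisticRegression(C=1.0)),
    ("nb", MultinomialNB()),
    ("ann", MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000, random_state=0)),
]

majority = VotingClassifier(estimators, voting="hard").fit(X, y)  # majority voting
weighted = VotingClassifier(estimators, voting="soft",
                            weights=[2, 1, 1]).fit(X, y)          # weights illustrative
print(majority.predict([[3, 0, 1]])[0])
```

Soft voting averages the models' predicted class probabilities, so a confident model can outvote two uncertain ones, which is the intuition behind weighting by confidence.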
Details of the results are discussed in the next section. The main aim of these studies was to
determine which types of features perform best given the existing complexities and challenges of
RU (see Section 3), because selecting correct and relevant features plays an important role in
increasing the robustness of an SA system, as their impact varies from language to language.
9 For Voting and wVoting LR, NB, and ANN were selected.
10 The error rate reduction is calculated as ((Old error − New error) / Old error) × 100.
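Footnote 10's formula can be checked numerically: plugging in the 80.07% baseline accuracy and the best reported accuracy of 82.46% reproduces the quoted 12% error-rate reduction (the function name is ours).

```python
def error_rate_reduction(old_acc, new_acc):
    """Relative error-rate reduction (%) between two accuracies, per footnote 10."""
    old_err, new_err = 100.0 - old_acc, 100.0 - new_acc
    return (old_err - new_err) / old_err * 100.0

# Baseline accuracy 80.07% (error 19.93%) vs. best accuracy 82.46% (error 17.54%).
print(round(error_rate_reduction(80.07, 82.46), 1))  # -> 12.0
```

Note that the reduction is relative to the error, not the accuracy, which is why a 2.39-point accuracy gain corresponds to a 12% figure.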
phase, as it could reduce the error rate by 5.32% from the baseline, and by 0.2% from the highest
achieved results of the last phase, which supported the hypothesis. Since none of the ensemble
techniques, DT, SVM, or KNN outperformed the others, they were not included in the remaining
experiments.
Table 5 shows the results for the third sub-category of word-level features (TFIDF) using uni-
gram, bigram, Uni-Bi, and Uni-Bi-Tri features from performing the same set of experiments. How-
ever, TFIDF was not able to improve on the previous results. A possible reason for these unsatis-
factory results is that, in the SA task, sentiment-bearing words tend to be frequent, whereas TFIDF
assigns lower weights to the most frequent terms.
The results of the second study are shown in Tables 6 and 7 for the sub-categories of “character-
level features without a word boundary” and “character level features with a word boundary,”
respectively.
Table 6 shows that the highest accuracy was achieved by 5-gram followed by 4-gram and 6-gram.
After analyzing the results, it was observed that 5-gram outperformed all the previous results using
the wVoting technique. 5-gram could reduce the error rate by 5.52% from the baseline, and reduce
Table 8. Results of the feature-union study (accuracies in %)

| Types of Features | Char 5 + Uni-Bi FP | Char 5 + Uni-Bi-Tri FP | Char 5 + Uni-Bi-Tri Freq | Char 6 + Uni-Bi FP | Char 6 + Uni-Bi-Tri FP | Char 6 + Uni-Bi-Tri Freq | Char 5,6 + Uni-Bi FP | Char 5,6 + Uni-Bi-Tri FP | Char 5,6 + Uni-Bi-Tri Freq |
| No. of Features | 68,287 | 92,119 | 92,119 | 92,119 | 114,756 | 114,756 | 133,113 | 156,945 | 156,945 |
| LR | 81.07 | 81.07 | 80.93 | 80.61 | 80.51 | 80.47 | 80.98 | 81.12 | 80.94 |
| NB | 80.30 | 80.41 | 80.28 | 80.00 | 80.17 | 79.99 | 80.01 | 80.09 | 80.03 |
| ANN | 81.51 | 81.60 | 81.51 | 81.07 | 81.16 | 80.95 | 81.22 | 81.41 | 81.33 |
| Voting | 82.07 | 82.24 | 82.10 | 81.68 | 81.77 | 81.56 | 81.97 | 82.17 | 82.04 |
| wVoting | 82.32 | 82.46 | 82.29 | 81.97 | 82.09 | 81.84 | 82.14 | 82.21 | 82.16 |
the error rate by 0.21% from the highest achieved results of the last phases. Table 7 shows the results
for “character-level features with word boundary,” where the highest accuracy was achieved by
5-gram followed by 4-gram and 6-gram. After analysis, it was observed that 5-gram could reduce
the error rate by 6.62% from the baseline, and reduce the error rate by 1.38% from the highest
achieved results of the last study.
The reason for performing the second study was based on one of the challenges of RU mentioned
in Section 3, i.e., variations in word spellings; for example, “Play” and “Pley” and “words” and
“wards.” When dealing with word-level N-gram features, “words” and “wards” are represented
by two different features, whereas with character-level N-gram features, some parts of “words” and
“wards” match. For example, using character-level bigrams, the word “words” is divided into six
features, {_w, wo, or, rd, ds, s_}, and the word “wards” into six features, {_w, wa, ar, rd, ds, s_}.
Four of these features match while two from each word do not, which may help the classifier
to predict better. An error analysis shows that the example “Ye drama bohat bakwasssss hai”
(This drama is very pathetic) was wrongly classified using uni-bigram features and correctly
classified using character-level features.
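The boundary-padded bigram decomposition for spelling variants can be reproduced directly (a sketch; `char_ngrams` is our helper, not the authors' code, with “_” marking the word boundary):

```python
def char_ngrams(word, n=2):
    """Character n-grams of a word, padded with '_' to mark word boundaries."""
    padded = f"_{word}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

a, b = set(char_ngrams("words")), set(char_ngrams("wards"))
print(sorted(a & b))  # bigrams shared despite the spelling variation
```

The shared bigrams {_w, rd, ds, s_} give the classifier overlapping evidence for the two spellings, which word-level features cannot provide.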
The results of the third study are shown in Table 8. This study consisted of the three top-
performing features from each of the word-level and character level N-grams. It can be observed
that all the results have improved from the previous two studies, where the highest accuracy was
achieved by wVoting using Char 5+Uni-Bi-Trigram FP as a feature. The error rate was reduced
by 12% from the baseline and 7.05% and 5.75% from the best-performing features of the first and
second studies, respectively. The main reason for such improvement in all the experiments of this
study was that we combined the strength of the top best-performing word level and character level
features.
Table 9. Accuracies (%) of the deep-learning models by number of training epochs

| Algorithms | 2 | 3 | 4 | 5 | 6 | 7 |
| LSTM (%) | 78.24 | 77.68 | 77.29 | 76.92 | 76.23 | 76.46 |
| CNN (%) | 75.53 | 76.25 | 76.39 | 76.46 | 76.51 | 76.20 |
In the last study, we used deep-learning techniques to evaluate their impact on the RU dataset
because these techniques have produced good results for similar tasks using pre-trained word
embeddings [33, 35]. After evaluating these techniques on our dataset without pre-trained word
embeddings, they could not outperform the earlier approaches, as shown in Table 9. We believe
that the major reason for the poor results in our case was the unavailability of pre-trained word
embeddings for RU, due to the absence of a huge dataset like Google News or Wikipedia. Previous
work also showed that deep-learning techniques run on a dataset without pre-trained word
embeddings did not produce good results [33]; hence, our results are in line with the state of the
art in deep learning.
As previously discussed, by conducting different studies, we improved on the results from the
established baseline. To determine if these improvements were also statistically significant, we
applied the following t-test (t) and confidence interval (CI), with a significance level of 0.05, to the
obtained results:

t = (x̄ − µ) / (s / √n)    and    CI = [ x̄ − z·s/√n , x̄ + z·s/√n ],

where x̄ and µ are the mean values, s is the standard deviation, z is the number of standard
errors to be added and subtracted to achieve the desired confidence level, and n is the sample size.
They showed that the best results of each study were statistically significant improvements over
the baseline.
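A plain-Python sketch of this significance check follows, using hypothetical per-iteration accuracies clustered near the best result (the real 30 per-iteration scores are not listed in this excerpt); the z-based confidence interval mirrors the formula above, and 2.045 is the 0.05 critical t value for 29 degrees of freedom.

```python
import math

def t_and_ci(sample, mu, z=1.96):
    """One-sample t statistic against a baseline mean mu, plus a z-based confidence interval."""
    n = len(sample)
    x = sum(sample) / n
    s = math.sqrt(sum((v - x) ** 2 for v in sample) / (n - 1))  # sample standard deviation
    t = (x - mu) / (s / math.sqrt(n))
    half = z * s / math.sqrt(n)
    return t, (x - half, x + half)

# Hypothetical accuracies near the best wVoting result (82.46) vs. the 80.07 baseline.
sample = [82.46 + 0.3 * math.sin(i) for i in range(30)]
t, (lo, hi) = t_and_ci(sample, mu=80.07)
print(t > 2.045, lo > 80.07)  # significant if t exceeds the critical value
```

When the CI for the new mean accuracy lies entirely above the baseline, the two criteria agree, which is the pattern the paper reports for the best result of each study.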
REFERENCES
[1] R. M. Duwairi, R. Marji, N. Sha’ban, and S. Rushaidat. 2014. Sentiment analysis in Arabic tweets. In 2014 5th Interna-
tional Conference on Information and Communication Systems (ICICS). IEEE, 1–6.
[2] B. Anwar. 2009. Urdu-English code switching: The use of Urdu phrases and clauses in Pakistani English (A non-native
variety). Int. J. Lang. Stud. 3, 4 (2009), 409–424.
[3] A. Pak and P. Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In Proceedings of
LREC, Vol. 10. 1320–1326.
[4] Gary F. Simons and Charles D. Fennig (Eds.). 2017. Ethnologue: Languages of the World, 20th edition. Dallas, Texas:
SIL International. Retrieved from http://www.ethnologue.com.
[5] R. Feldman. 2013. Techniques and applications for sentiment analysis. Communications of the ACM 56, 4 (2013), 82–89.
[6] B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques.
In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Vol. 10. 79–86.
[7] W. Medhat, A. Hassan, and H. Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams
Eng. J. 5, 4 (2014), 1093–1113.
[8] A. Abbasi, H. Chen, and A. Salem. 2008. Sentiment analysis in multiple languages: Feature selection for opinion
classification in web forums. ACM Trans. Inf. Syst. 26, 3 (2008), Article 12.
[9] C. Yang, K. H. Y. Lin, and H. H. Chen. 2007. Emotion classification using web blog corpora. In IEEE/WIC/ACM Inter-
national Conference on Web Intelligence. IEEE, 275–278.
[10] K. Mehmood, D. Essam, and K. Shafi. 2018. Sentiment analysis system for Roman Urdu. In Science and Information
Conference. Springer, Cham, 29–42.
[11] R. Socher, D. Chen, C. D. Manning, and A. Ng. 2013. Reasoning with neural tensor networks for knowledge base
completion. In Advances in Neural Information Processing Systems. 926–934.
[12] C. Zhang, D. Zeng, J. Li, F. Y. Wang, and W. Zuo. 2009. Sentiment analysis of Chinese documents: From sentence to
document level. J. Assoc. Inf. Sci. Tech. 60, 12 (2009), 2474–2487.
[13] C. Clavel and Z. Callejas. 2016. Sentiment analysis: From opinion mining to human-agent interaction. IEEE Trans.
Affective Comput. 7, 1 (2016), 74–93.
[14] S. Ahmed, S. Hina, and R. Asif. 2018. Detection of sentiment polarity of unstructured multi-language text from social
media. Int. J. Adv. Comput. Sci. Appl. 9, 7 (2018), 199–203.
[15] M. Daud, R. Khan, and A. Daud. 2015. Roman Urdu opinion mining system (RUOMiS). arXiv preprint
arXiv:1501.01386.
[16] A. Z. Syed, M. Aslam, and A. M. Martinez-Enriquez. 2010. Lexicon based sentiment analysis of Urdu text using Sen-
tiUnits. In Mexican International Conference on Artificial Intelligence. Springer, Berlin, 32–43.
[17] N. Mukhtar, M. A. Khan, and N. Chiragh. 2017. Effective use of evaluation measures for the validation of best classifier
in Urdu sentiment analysis. Cognitive Computation (2017), 1–11.
[18] N. Mukhtar and M. A. Khan. 2018. Urdu sentiment analysis using supervised machine learning approach. Int. J. Pattern
Recognit. Artif. Intell. 32 (2018).
[19] S. Mukund and R. K. Srihari. 2012. Analyzing Urdu social media for sentiments using transfer learning with controlled
translations. In Proceedings of the Second Workshop on Language in Social Media. ACL, 1–8.
[20] J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960),
37–46.
[21] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas. 2007. Supervised machine learning: A review of classification techniques.
Emerging Artificial Intelligence Applications in Computer Engineering 160 (2007), 3–24.
[22] T. Hastie, R. Tibshirani, and J. Friedman. 2009. Overview of supervised learning. In The Elements of Statistical Learning.
Springer New York, 9–41.
[23] S. I. Gallant. 1990. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1, 2 (1990), 179–191.
[24] B. E. Boser, I. M. Guyon, and V. N. Vapnik. 1992. A training algorithm for optimal margin classifiers. In Proceedings
of the 5th Annual Workshop on Computational Learning Theory. ACM, 144–152.
[25] G. Zenobi and P. Cunningham. 2001. Using diversity in preparing ensembles of classifiers based on different feature
subsets to minimize generalization error. In Machine Learning: ECML 2001. Springer, 576–587.
[26] A. Yessenalina, Y. Yue, and C. Cardie. 2010. Multi-level structured models for document-level sentiment classification.
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. ACL, 1046–1056.
[27] W. Medhat, A. Hassan, and H. Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams
Eng. J. 5, 4 (2014), 1093–1113.
[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, and J. Vanderplas. 2011. Scikit-learn:
Machine learning in Python. J. Mach. Learn. Res. 12 (Oct. 2011), 2825–2830.
[29] P. H. Shahana and B. Omman. 2015. Evaluation of features on sentimental analysis. Procedia Comp. Sci. 46 (2015),
1585–1592.
[30] C. W. Hsu and C. J. Lin. 2002. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural
Networks 13, 2 (2002), 415–425.
[31] E. Cambria, B. Schuller, Y. Xia, and C. Havasi. 2013. New avenues in opinion mining and sentiment analysis. IEEE
Intell. Syst. 28, 2 (2013), 15–21.
[32] K. Oouchida, J. D. Kim, T. Takagi, and J. I. Tsujii. 2009. GuideLink: A corpus annotation system that integrates the
management of annotation guidelines. In Proceedings of 23rd Pacific Asia Conference on Language, Information, and
Computation. Vol. 2.
[33] Y. Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, 1746–
1751.
[34] M. Bilal, H. Israr, M. Shahid, and A. Khan. 2016. Sentiment classification of Roman-Urdu opinions using Naïve
Bayesian, Decision Tree, and KNN classification techniques. J. King Saud Univ. Comput. Inf. Sci. 28, 3 (2016), 330–344.
[35] S. Lai, L. Xu, K. Liu, and J. Zhao. 2015. Recurrent convolutional neural networks for text classification. In AAAI,
Vol. 333. 2267–2273.
[36] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede. 2011. Lexicon-based methods for sentiment analysis. Com-
put. Ling. 37, 2 (2011), 267–307.
[37] R. M. Duwairi, R. Marji, N. Sha’ban, and S. Rushaidat. 2014. Sentiment analysis in Arabic tweets. In 2014 5th Interna-
tional Conference on Information and Communication Systems (ICICS), IEEE. 1–6.
[38] A. D'Andrea, F. Ferri, P. Grifoni, and T. Guzzo. 2015. Approaches, tools, and applications for sentiment analysis
implementation. Int. J. Comput. Appl. 125, 3 (2015), 26–33.
[39] M. K. Malik. 2017. Urdu named entity recognition and classification system using artificial neural network. ACM
Trans. Asian Low-Resour. Lang. Inf. Process. 17, 1 (2017), 2.
[40] S. Mohammad. 2016. A practical guide to sentiment annotation: Challenges and solutions. In Proceedings of the 7th
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 174–179.
[41] Y. Sun, A. K. Wong, and M. S. Kamel. 2009. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif.
Intell. 23, 4 (2009), 687–719.
[42] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. 2013. Recursive deep models for
semantic compositionality over a sentiment treebank. In Proceedings of 2013 Conference on Empirical Methods in
Natural Language Processing.
[43] Z. Lu, M. Bada, P. V. Ogren, K. B. Cohen, and L. Hunter. 2006. Improving biomedical corpus annotation guidelines.
In Proceedings of the Joint BioLink and 9th Bio-ontologies Meeting. 89–92.
[44] Z. Sharf and S. U. Rahman. 2018. Performing natural language processing on Roman Urdu datasets. Int. J. Comput. Sci.
Network Secur. 18, 1 (2018), 141–148.
[45] K. Ravi and V. Ravi. 2015. A survey on opinion mining and sentiment analysis: Tasks, approaches and
applications. Knowledge-Based Syst. 89 (2015), 14–46.