
Sentiment Analysis for a Resource Poor Language—Roman Urdu

KHAWAR MEHMOOD, DARYL ESSAM, and KAMRAN SHAFI, University of New South Wales, Australia
MUHAMMAD KAMRAN MALIK, University of the Punjab, Pakistan

Sentiment analysis is an important sub-task of Natural Language Processing that aims to determine the
polarity of a review. Most of the work on sentiment analysis has been done for the resource-rich languages
of the world, while very limited work has been done on resource-poor languages. In this work, we focus on
developing a sentiment analysis system for Roman Urdu, which is a resource-poor language. To this end, a
dataset of 11,000 reviews was gathered from six different domains. Comprehensive annotation guidelines were
defined and the dataset was annotated using a multi-annotator methodology. Using the annotated dataset,
state-of-the-art algorithms were used to build a sentiment analysis system. To improve the results of these
algorithms, four different studies were carried out, based on word-level features, character-level features,
feature union, and deep learning. The best results reduced the error rate by 12% from the baseline (80.07%).
Also, to check whether the improvements are statistically significant, we applied a t-test and confidence
intervals to the obtained results and found that the best result of each study is a statistically significant
improvement over the baseline.
CCS Concepts: • Computing methodologies → Language resources;
Additional Key Words and Phrases: Resource poor language, Roman Urdu, Roman Urdu sentiment analysis
ACM Reference format:
Khawar Mehmood, Daryl Essam, Kamran Shafi, and Muhammad Kamran Malik. 2019. Sentiment Analysis
for a Resource Poor Language—Roman Urdu. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 19, 1, Article
10 (August 2019), 15 pages.
https://doi.org/10.1145/3329709

1 INTRODUCTION
Sentiment analysis (SA) is a process for measuring the affective state of a topic or review. People
share their opinions about, and experiences with, topics, products, social events, and political and eco-
nomic issues using social media. The proliferation of such information has made it challenging to
manually distill relevant details from this data, which necessitates the development of an automated
system, such as an SA system, that can intelligently extract sentiments from it.

Authors’ addresses: K. Mehmood, D. Essam, and K. Shafi, University of New South Wales, Northcott Drive, Campbell,
ACT 2600, Australia; emails: [email protected], {d.essam, k.shafi}@adfa.edu.au; M. K. Malik, Punjab
University College of Information Technology, University of the Punjab, Old Campus, The Mall, Lahore, Pakistan; email:
[email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
2375-4699/2019/08-ART10 $15.00
https://doi.org/10.1145/3329709

Most of the work conducted on SA has been for major languages such as English and Chinese
[3, 7], with very little on Roman Urdu (RU)/Hindi, which is a resource-poor language.1 The devel-
opment of a robust Roman Urdu Sentiment Analysis (RUSA) system is necessary for two major reasons.
Firstly, Urdu/Hindi is the third most-spoken language in the world, with more than 500 million
speakers [4]. Secondly, it is increasingly being used because people prefer to communicate on the
web using the Latin script (i.e., RU uses the 26-letter English alphabet) instead of typing in their
own language using a language-specific keyboard. The diversity and enormity of its user base motivate
work on RUSA.
The main contributions of this work are as follows.
(1) Developing the largest ever RUSA corpus of 11,000 reviews (at the time of writing) ob-
tained from six different domains and making it publicly available.2
(2) Identifying features while considering the complexities of RU, that is, (a) word-level fea-
tures, (b) character-level features, and (c) feature union (word + character level).
(3) Applying machine-learning (ML) algorithms from seven different categories to the task of
developing a RUSA system and comparing the results to determine their statistical significance.
(4) Proposing a confidence-based voting technique for RUSA.

2 RELATED WORK
SA is one of the most active areas of computer science [5]. The three main approaches to SA are
ML [6], lexicon-based [36], and hybrid approaches [38]. In addition, SA classification is broadly
done at the document [26], sentence [5, 12], and aspect [8, 27] levels.
Different state-of-the-art algorithms have been used to improve the accuracy of SA. In Ref. [11],
the authors introduced a neural tensor network that could model the relationship between two
entities, representing each entity as the average of its word vectors, which allowed the sharing of statistical strength between
the words describing each entity. To further improve the accuracy of predicting the relationship
between two entities, they trained word vectors on the large unlabeled text of WordNet and Free-
base and used them in their model. In Ref. [35], a recurrent convolutional neural network (RCNN)
was proposed and applied to the task of text classification. As a first component, RCNN uses a bidi-
rectional recurrent structure and its second layer is based on max-pooling. To further improve the
results of their model, they used pre-trained word embeddings. In Ref. [33], the authors trained a
simple convolutional neural network (CNN) with one layer of convolution on top of word vectors
pre-trained on 100 billion words of Google News. The model could not perform well when used
without pre-trained word embeddings, but achieved excellent results on multiple benchmarks with
pre-trained word embeddings.
In Ref. [6], the authors performed SA on a dataset of movie reviews. They considered
different features and used three standard ML algorithms, achieving an accuracy of 82.9%.
In Ref. [9], the authors performed an emotion detection task at the document level in which they
exploited the human tendency to summarize the emotions of a complete document in its last
sentence. In Ref. [1], the authors collected 1,000 tweets written in Arabic and Arabizi. They
converted Arabizi to Standard Arabic and used tokenization, stemming, and stop-word
removal as preprocessing, and concluded that better performance could be achieved without
stemming and stop-word removal. In that study, Naive Bayes (NB) performed best, with an accu-
racy of 76.78%. In Ref. [14], the author performed an aspect-based SA on Twitter data consisting
of pure English, pure RU, and mixed tweets. As well as applying standard ML algorithms, the
author also used a customized RU lexicon and a SentiWordNet list to assign weights to the words
in tweets, concluding from a comparison of the different ML algorithms that bagging outperformed
the others.

1 Lacks publicly available annotated datasets and linguistic resources (stemmers, lemmatizers, POS taggers, etc.).
2 This dataset can be accessed by emailing the corresponding author, Khawar Mehmood ([email protected]).

Table 1. Comparison of Urdu/RUSA

Author(s) Dataset (No. of Reviews) Publicly Available Language Accuracies (%)


[15] 1,620 No RU Approximately 21.1% falsely categorized
[34] 300 No RU 97.5
[10] 779 Yes RU 76.98
[18] 6,025 Yes Pure Urdu 67.0185
(Current work) 11,000 Yes RU 82.46

SA is not limited to only classifying reviews into positive and negative classes but has a wider
range of applications [45]. In Ref. [13], the authors compared the literature of SA with that of
human-computer interactions in order to propose an interaction model that considered SA for
socio-emotional embodied conversational agents (ECAs). While all the above studies, along with
many others, have been conducted for major languages [3, 7], only a limited number have considered
RU [17]. Table 1 summarizes the work on Urdu/RU SAs. However, the techniques developed for
other major languages may not generalize well to RU, which has its roots in Urdu, a free-word-
order language with a complex morphology [18]. This problem is further aggravated by the user-
dependent selection of word spelling in RU, which is due to the lack of a standard approved method
for representing the Urdu language in Latin script [19].

3 COMPLEXITIES OF ROMAN URDU
The following are a few of the complexities of RU, which make the development of an SA system
for it more challenging.
(1) There are no standards for representing the Urdu language in Latin script; for example,
both “Who bhot acha ha” and “Who bohat acha hai” represent “he is very good.”
(2) One word in RU may point to two or more words in Urdu with different sounds; for ex-
ample, “Khawar” meaning “Dawn” and “Khawar” meaning “miserable” are pronounced
differently.
(3) Urdu is a free-phrase-order language; for example, both “ali ne paani ka aik glass piya”
and “paani ka aik glass ali ne piya” translate to “Ali drank a glass of water.”
(4) Urdu, and hence RU, is morphologically rich, with “acha” (masculine), “achi”
(feminine), and “achay” (plural) all pointing to the one English word “good.”
(5) Capitalization is not used consistently in RU; for example, in “ali ne paani ka aik glass
piya,” some people write “Ali” as “ali” while others capitalize it.
(6) The handling of negation is a more complex task in RU; for example, both “Yeh mery lya
theek nahi ha” and “Ye mery lya nahi theek” translate to “This is not good for me.”
(7) There is a serious lack of labeled datasets in RU, with very few RUSA ones [10, 44].3
(8) Since our focus is on RU, only relevant reviews were considered. However, as we live in a
multilingual world, “borrowing” could not be avoided [2]; for example, “Ye drama to acha
tha lakin is ka end bohat bura hoa” (This drama was good, but its end was very bad).

3 https://archive.ics.uci.edu/ml/datasets/Roman+Urdu+Data+Set# Last visited on: 5-1-2019.

4 DATASET
This section describes the steps taken to collect the RU data required to develop an SA system.

4.1 Links Identification
Identifying the data resources was the primary task in creating the RUSA dataset. Therefore, as a
first step, different sites, blogs, and the like, where users post data in RU, were identified. The
selected sites for data collection were “hamariweb”, “youtube”, “dramaonline”, “facebook”,
“filmy-wap”, “video.genfk”, “vdos.tv”, “siasat.pk”, “whatmobile.com”, “masala.tv”, “pakistan.web”,
and “twitter”.

4.2 Data Extraction Method
After the identification of relevant URLs, data extraction was done using both automatic and man-
ual techniques. The websites “hamariweb,” “youtube,” “dramaonline,” “facebook,” and “whatmo-
bile” were crawled using a custom-written crawler4 and the available data was extracted. In the other
cases, the RU data was manually read and extracted from “video.genfk”, “filmy-wap”, “vdos.tv”,
“siasat.pk”, “whatmobile.com”, “masala.tv”, “pakistan.web”, and “twitter”.
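
For illustration, the following is a minimal sketch of such a crawler using Scrapy (the tool cited in footnote 4); the spider name, start URL, and CSS selectors are hypothetical placeholders, since the actual page structures of the crawled sites are not described here.

```python
import scrapy


class ReviewSpider(scrapy.Spider):
    """Sketch of a review crawler; the URL and selectors are hypothetical."""
    name = "ru_reviews"
    start_urls = ["https://www.example.com/reviews"]  # placeholder URL

    def parse(self, response):
        # Yield the text of each user comment on the page (selector is hypothetical).
        for comment in response.css("div.comment-text::text").getall():
            yield {"url": response.url, "review": comment.strip()}
        # Follow the pagination link, if one exists (selector is hypothetical).
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```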

4.3 Ethical Aspects
Since the privacy of user data is a major concern in this information age, only those reviews
were collected that were publicly available on the Internet. None of the collected
reviews contained any Personally Identifiable Information (PII).5,6

4.4 Data Related Information
The data was extracted from six different domains and contained four types of reviews, i.e., English,
Urdu, RU, and a mix of RU and English. To keep track of the extracted data, the URLs of the websites,
the domains of the reviews, and the reviews themselves were stored in an Excel file. Since the focus
was on creating an SA system for RU, only the RU reviews were kept, and any technical artifacts, such as
HTML tags, were removed.

4.5 Domains Selected
Data was taken from six different domains, namely: Politics, Drama/Movie/Telefilm, Mobile Re-
views, Sports, Food, and Miscellaneous (Misc). While extracting the data, the URLs and the sites
indicated the domain of the data. For example, the data gathered from “whatmobile.com” contained
reviews in the mobile phone domain and the data gathered from “masala.tv” contained the
reviews for the food domain. Reviews in the Misc domain include people’s opinions on questions
like “kitab sy doory kyon?” (Why the distance from books?), “Zalzalay aur tsunami, wajuhat kya?”
(Earthquakes and tsunamis: what are the causes?), and “hamary talba chuteian kis tarha guzarain?”
(How should our students spend their holidays?). In addition, the reviewers also reconfirmed these
domains while annotating the dataset. At the end of this process, a total of 18,000 RU reviews were
gathered from all six domains. There are two reasons for selecting these domains. Firstly, in the
Indian Subcontinent, people have strong interests in these domains. Secondly, data from one domain
contains limited vocabulary, while data from different domains contains a variety of vocabulary from
different genres. Therefore, by building an SA system on such data, people can be informed about
positive and negative trends in their areas of interest, and this will also help in the development of
a robust SA system.

4 Using https://scrapy.org/.
5 https://www.facebook.com/help/203805466323736?ref=dp Last visited on: 3-1-2019.
6 https://harvardlawreview.org/2014/12/data-mining-dog-sniffs-and-the-fourth-amendment/ Last visited on: 3-1-2019.

4.6 Data Annotation, the Guidelines, and How They Were Decided
To develop the RUSA using supervised ML algorithms, we required an annotated dataset con-
taining reviews and their sentiments. A systematic approach was used to define the annotation
guidelines and to annotate the dataset. The annotation guidelines were defined in a two-step proce-
dure. In the first step, an extensive study of the existing work on annotation guidelines was conducted
[32, 39, 40, 43] and baseline guidelines were defined for the unambiguous cases. The second
step was to improve the guidelines. Two of the three annotators were asked to independently
annotate 1,500 reviews under the provided guidelines. The annotators reported back whenever
they found it difficult to decide the labels of some reviews. Discussions were then conducted in the
presence of the third annotator (involved only for discrepancy resolution) and the guidelines were
enhanced to cater for those cases. This process continued until the annotators could annotate the
provided dataset without difficulty. The ambiguous reviews included reviews with two sentiments
in one review and the reviews mentioned in the special cases below. For example, the case of “itna
acha mobile ha ky banda muft bi na lay” (It is such a good mobile that no one would take it even
for free), which was not covered in the initial guidelines, was discussed and marked as negative
because it was considered a sarcastic remark. Similarly, the case of “qeimat achi hy or battery bohat
hi kam hy” (The price is good but the battery life is too short) was marked as “others” by one
annotator and “negative” by the other. In the ensuing discussion, it was decided that in all the
reviews with both positive and negative sentiments, the modifiers (bohat (much), zayada (more),
kam (less), etc.) attached to the sentiment words would decide the polarity of the review. Therefore,
this review was marked as “negative” because its strongest sentiment, “bohat hi kam” (very little),
is negative.
While formulating the guidelines, we focused on the following three main aspects.
(1) What to annotate (reviews containing positive and negative terms, positive and negative
events, and positive and negative emotional states of the speaker).
(2) What not to annotate (neutral instances and those containing both sentiments).
(3) How to handle special cases (quotations, negations, sarcasm, one side denouncing the
other).
The following are the final guidelines given to the annotators for manual annotation of the
dataset.
(1) Reviews bearing positive terms: All reviews that explicitly contain feelings of
joy, appreciation, optimism, excitement, or hope will be marked as positive. Such reviews
should contain positive terms like “acha” (good), “khobsoorat” (beautiful), and “saaf”
(clean) without being modified by any other term like “Na,” “Nahi,” or “mat,” as these
terms invert the polarity.
(2) Reviews bearing negative terms: All reviews that express feelings of sadness,
anger, or violence will be marked as negative. Such reviews should contain nega-
tive terms like “bura” (bad), “bakwass” (rubbish), “zulam” (cruelty), “ganda” (dirty),
and “mayosi” (dejection), without being modified by any other term like “Na,” “Nahi,”
or “mat,” as these terms invert the polarity.
(3) Positive or negative events and situations: Sometimes the explicit sentiment of a speaker
talking about an event may not be evident. In that case, the review will be marked as
per the polarity of the event; for example, “Jang mulkon ko bohat bara jani aur mali
nuqsan pohanchati ha” (War causes great losses in terms of lives and a country’s econ-
omy) will be marked as negative since the speaker is talking about a negative event, i.e.,
“Jang” (war).

(4) Positive or negative emotional state of the speaker: In some cases, the polarity of a
review may not conform to the emotional state of the speaker. For example, the re-
view “ham kab tak aysa logon ko vote dyty rahyn gy?” (Till when will we keep voting
for such people?) doesn’t contain any negative terms but reflects that the speaker
uttered it in deep frustration, so all such reviews will be marked as negative. Another
example, “wah kya baat ha! Qalanders na Peshawar Zelmi ko hara dya” (Wow! Qalanders
have defeated Peshawar Zelmi), contains the negative term “hara” (defeat), but the
review will be marked as positive, as the word “wah!” (wow) identifies that the
speaker is in a positive emotional state.
(5) Neutral reviews: All those reviews that do not contain any sentiment bearing words or
present only facts (objectivity) will be considered neutral and marked as “others”.
(6) Reviews containing both positive and negative terms: Reviews containing the same
intensity7 [16] of positive and negative terms will be ignored. For example, “OnePlus
mobile ki awaz to achi ha lakin is ki battery kam chalty ha” (The OnePlus mobile has
good sound but its battery lasts less time than others) will be ignored. In contrast,
“is hotel ki chay achi nahi ha lakin is ki coffee bohat hi achi ha” (This restaurant’s tea
is not good but its coffee is extremely good) will be marked as positive, since its
strongest sentiment is positive. Similarly, a review containing more positive than
negative terms will be marked as positive, and vice versa. For example, “Is dramay ki
kahani bohat achi thee, adakaron ki adakari bhe aala thee lakin daramy ka akhir bura
hoa” (This drama had a good storyline and the acting was also marvelous, but its end
was bad) will be marked as positive, as it has two positive terms and only one negative
term.
(7) Special cases.
(a) Quotations: Reviews containing, or consisting entirely of, quotations, idioms, and
proverbs will be marked as per the polarity borne by them. For example, “Agar app
dosron ki qadar pamaye kerna chahty han to phir pehly apny aap ko dekhyn kyon ka
Aap Bhaly To Jug Bhala” (If you want to judge the worth of other people, then first
evaluate yourself, because good mind, good find) will be marked as positive since it
contains a positive proverb.
(b) Sarcasm: This is speech or writing that actually means the opposite of what it ap-
pears to and is usually intended to mock or insult someone (definition from the
Collins English Dictionary). Reviews that contain mockery, scoffing, and ridicule will
be treated as sarcasm as per our annotation guidelines and marked as negative. For
example, “MashaAllah se koe to position aie ha na hamari hockey team ki . . . aakhri
he sahi” (By the grace of God, our hockey team at least secured some position . . .
even if it is the last one).
(c) Negations: These are words which flip the polarity of a sentiment. All such re-
views should be carefully identified and analyzed during the annotation process and
marked according to the inferred sentiment. For example, “punjab ki hakomat acchi
nahi ha” (The government of Punjab is not good) will be marked as negative and
“sarhad ki hakomat bury nahi ha” (The government of Sarhad is not bad) as positive.
7 In this work, the intensity is determined by the modifiers; for example, the intensity of “kam” is less than that of “bohat
kam,” while that of “bohat zayada” is greater than that of “zyada,” where “bohat” acts as a modifier.

Table 2. Details of Dataset

Domains No. of Reviews Positive Negative


Misc. 3,568 2,116 1,452
Food 837 658 179
Mobile Reviews 1,109 693 416
Drama/ Movie/Telefilm (DMT) 2,543 1,335 1,208
Politics 2,481 646 1,835
Sports 462 238 224
Total 11,000 5,686 5,314

(d) Miscellaneous.
(i) Reviews containing supplications will be marked as positive. For example,
“Khuda hamary mulk per reham farmay” (May God shower his mercy on our
country).
(ii) Questions showing frustration and dismay will be marked as negative. For
example, “is mulk ka kya bany ga?” (What will happen to this country?).
(iii) Reviews in which one party wins over another will be marked as per the incli-
nation of the speaker toward a group or party. For example, “Aala! Is election
ma PTI na N-league ko hara dya” (Great! In this election, PTI defeated N-league)
will be marked as positive because the speaker is in favor of the PTI (as shown
by the word “Aala”—Great) and against the N-league. If the speaker’s inclination
is missing, the annotator will look for positive or negative terms and mark the
review as per their polarity. For example, since “Is election ma PTI na N-league
ko hara dya” (In this election, the PTI defeated the N-league) is missing the
inclination of the speaker toward any group, it will be marked as negative due
to the presence of the word “hara” (defeat).
(iv) As positive and negative terms may have different spellings in RU, the annotator
will consider all their variations.

4.7 Inter-annotator Agreement
To verify the completeness of the annotation guidelines, we gave 1,500 reviews to two different
annotators and asked them to annotate them and mention which reviews fell under which guidelines.
They independently annotated these reviews into the “positive (pos),” “negative (neg),” and “others”
categories based on the annotation guidelines. The inter-annotator agreement, calculated using the
kappa statistic [20], was 97.15%, which indicated almost perfect agreement between the two annotators
and the completeness of the annotation guidelines. The distribution of the 1,500 reviews reported by
the annotators was 427, 401, 79, 51, 412, 93, and 37 against guidelines 1, 2, 3, 4, 5, 6, and 7, respectively.
Then, to speed up the annotation process, the remaining data (16,500 reviews) was divided equally
between the two annotators. Details of the finalized dataset of 11,000 reviews (after ignoring all
those marked as “others”) are shown in Table 2.
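
For reference, Cohen's kappa can be computed with scikit-learn [28]; the following is a minimal sketch, with illustrative labels rather than the actual annotations.

```python
from sklearn.metrics import cohen_kappa_score

# Labels assigned independently by the two annotators (illustrative values only).
annotator_1 = ["pos", "neg", "others", "pos", "neg", "pos"]
annotator_2 = ["pos", "neg", "others", "pos", "pos", "pos"]

kappa = cohen_kappa_score(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.4f}")
```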

4.8 Data Analysis
When analyzing the data with respect to the sentiment vocabulary, it was observed that the most
frequent positive terms were “acha” (good), “achi” (good), “achay” (good), “kamal” (perfect),
“pasand” (like), “zabardast” (great), “wah” (wow), and “khoob” (good); and the most frequent
negative ones were “bakwaas” (rubbish), “bekaar” (useless), “ghalat” (wrong), “bura” (bad), “buri”
(bad), “buray” (bad), “ghatiya” (cheap), “jhoota” (liar), “jhooty” (liars), and “jhooti” (liar). Also,
most of the sentiment-bearing vocabulary was shared by all the domains; for example, “acha”
(good), “zabardast” (great), and “bakwas” (rubbish). However, there were some polarity words
specific to a domain; for example, Sports contained “sifarshi” (recommended out of the way),
“jawary” (gambler), “mainaz” (superior), and “pur josh” (enthusiastic); Food “mazidar” (tasty),
“lazeez” (tasty), “baasi” (rotten), and “badzaika” (bad taste); Politics “fradia” (fraud person),
“corrupt,” “aymandar” (honest), and “mukhlis” (sincere); DMT “rawayti” (traditional), “surely”
(appealing voice), “mazahia” (funny), and “bayhooda” (vulgar); and Mobile “mehnga” (costly),
“paidar” (long-lasting), “mulaaem” (soft), and “thoori” (less). On further analysis, there were some
words that conveyed a positive sense but, in practice, were used with both polarities. The polarities
of words like “Mashallah” (as God willed) and “Inshallah” (God willing) were decided based on the
context in which they were used. The respective distributions of the most frequent lexical variations
of “Mashallah” and “Inshallah” in negative and positive reviews were 4 and 52, and 15 and 78.
Regarding spelling variations, 13 different variations of the single-polarity word “Bakwaas” were
observed in the data: “bakwas,” “bakwaass,” “bakwaas,” “bakwass,” “bakwasssss,” “bkwaas,”
“bkwas,” “bkwaaaaaas,” “bkwaaas,” “bkws,” “bkwaaaas,” “bukwaas,” and “bukwas,” as was the
case with many other words, including polarity ones.

5 ALGORITHMS USED
In this section, the algorithms used for the development of the RUSA system, and their technical
details, are provided.

(1) Instance-Based Learning [21]: We used k-nearest neighbors (KNN) with n_neighbors = 5
and weights = “uniform.”
(2) Tree-Based Algorithms: Decision trees (DT) and Random Forest (RF) were used with de-
fault parameters.
(3) Probabilistic Learning Algorithms [22]: Logistic Regression (LR) with C = 1.0 and NB were
used.
(4) Perceptron Based [23]: We used Multi-Layer Perceptron (ANN) with one hidden layer
(100 units and activation = “relu”).
(5) Large Margin Classifiers [24]: We used SVM with C = 1.0 and kernel = “rbf” for our ex-
perimentation.
(6) Ensembles [25]: We used AdaBoost (ABNB, ABLR), Bagging (BNB, BLR, BANN), and ma-
jority and weighted voting (wVoting).
(7) Deep Learning: We used long short-term memory (LSTM) networks and CNNs. The structure
of the LSTM is one embedding layer (embed_dim = 128), one LSTM layer (196 units), and an
output layer. In the CNN, we used an embedding layer (embed_dim = 128), one convolutional
layer (filters = 64 and kernel_size = 3), one dense layer (units = 180 and activation =
“tanh”), and an output layer.
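
All of the standard (non-deep) algorithms above are available in scikit-learn [28]. The following sketch shows how they might be instantiated with the stated parameters; the ensemble compositions and the soft-voting weights are illustrative assumptions, and the paper's confidence-based wVoting is approximated here by scikit-learn's weighted soft voting.

```python
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

knn = KNeighborsClassifier(n_neighbors=5, weights="uniform")        # instance-based
dt = DecisionTreeClassifier()                                       # tree-based (defaults)
rf = RandomForestClassifier()                                       # tree-based (defaults)
lr = LogisticRegression(C=1.0)                                      # probabilistic
nb = MultinomialNB()                 # NB variant assumed; suits sparse count features
ann = MLPClassifier(hidden_layer_sizes=(100,), activation="relu")   # perceptron-based
svm = SVC(C=1.0, kernel="rbf")                                      # large-margin

abnb = AdaBoostClassifier(MultinomialNB())                          # ABNB: AdaBoost over NB
blr = BaggingClassifier(LogisticRegression(C=1.0))                  # BLR: Bagging over LR

# Majority voting and a weighted (soft) voting ensemble over the top performers.
voting = VotingClassifier([("lr", lr), ("nb", nb), ("ann", ann)], voting="hard")
wvoting = VotingClassifier([("lr", lr), ("nb", nb), ("ann", ann)],
                           voting="soft", weights=[2, 1, 2])        # illustrative weights
```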

6 FEATURES USED
In this work, three broad categories of features were used, namely: word-level features, character-
level features, and feature union.
(1) Word-level features: The word-level N-gram features used in this work consist of unigram
(Uni), bigram (Bi), unigram+bigram (Uni-Bi), and unigram+bigram+trigram (Uni-Bi-Tri). To com-
pute the value of each N-gram feature, we used feature presence (FP), feature fre-
quency (Freq), and Term Frequency-Inverse Document Frequency (TFIDF).
(2) Character-level features: The character-level features used in this work are bigram, trigram,
4-gram, 5-gram, and 6-gram, each “with word boundary” and “without word boundary.”
(3) Feature union: In feature union, we combined different word-level and character-level
features.
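
As a sketch of how these feature categories might be realized with scikit-learn [28]: feature presence corresponds to binary counts, and, under the assumption that “with word boundary” maps to the char_wb analyzer and “without word boundary” to the char analyzer, the three categories could be built as follows.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion

# Word-level Uni-Bi features as frequency (Freq), presence (FP), and TFIDF.
word_freq = CountVectorizer(analyzer="word", ngram_range=(1, 2))
word_fp = CountVectorizer(analyzer="word", ngram_range=(1, 2), binary=True)
word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))

# Character-level 5-grams, without and with word boundaries.
char_5 = CountVectorizer(analyzer="char", ngram_range=(5, 5))
char_5_wb = CountVectorizer(analyzer="char_wb", ngram_range=(5, 5))

# Feature union: character 5-grams combined with word Uni-Bi-Tri presence.
union = FeatureUnion([
    ("char5", CountVectorizer(analyzer="char_wb", ngram_range=(5, 5))),
    ("word_fp", CountVectorizer(analyzer="word", ngram_range=(1, 3), binary=True)),
])
X = union.fit_transform(["qmobile acha nahi ha", "ye drama bohat bakwas hai"])
```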

7 METHODOLOGY
In this section, the proposed methodology is explained, in which different steps involving extensive
experimentation with different features (see Section 6) were taken to improve the accuracies of the
ML algorithms.
Firstly, the source websites, blogs, and social media links which contained user reviews in RU
were identified, the reviews were extracted using a semi-automatic methodology, and the data was
then cleaned by removing unwanted information and stored in an Excel file. Since the extracted
reviews did not contain polarity labels (positive or negative), in the next step, the data was annotated
using the multi-annotator methodology described in Section 4. For consistency, we trained and tested
our models on mutually exclusive training and testing sets. To do this, we executed 30 iterations of
each experiment; in each iteration, cross-validation was performed by randomly splitting (with
stratification) the entire dataset into training (80%) and testing (20%) sets and applying ML algorithms
[28] from each high-level category (see Section 5), and the average accuracies over all the iterations
were computed. We also computed the minimum and maximum accuracies and the standard deviation
achieved by each classifier for each feature. Since all the experiments were performed on the entire
dataset, which was balanced (51.69% positive and 48.31% negative reviews), we used accuracy8
as the metric for assessing performance [6, 41, 42]. Further, while the reviews are categorized by
domain, consideration of those domains is left to later research.
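
A sketch of this evaluation protocol, assuming scikit-learn [28], a feature matrix X, and a label vector y (the function name and seeding scheme are illustrative):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def evaluate(clf, X, y, iterations=30, test_size=0.20):
    """Mean, min, max, and standard deviation of accuracy over repeated
    stratified 80/20 train/test splits."""
    scores = []
    for seed in range(iterations):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y, random_state=seed)
        clf.fit(X_tr, y_tr)
        scores.append(accuracy_score(y_te, clf.predict(X_te)))
    scores = np.asarray(scores)
    return scores.mean(), scores.min(), scores.max(), scores.std()
```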
Three different studies were carried out on the three categories of features (see Section 6),
namely, word-level, character-level, and feature union, with a fourth conducted using deep learning.
The first study consisted of the word-level features, for which we performed experiments on three
sub-categories, namely, Freq, FP, and TFIDF. For each subcategory, we tested Uni, Bi, Uni-Bi, and
Uni-Bi-Tri features. The second study consisted of the character-level features, for which we
performed experiments in two sub-categories, namely, with and without a word boundary. For
each subcategory, bigram, trigram, 4-gram, 5-gram, and 6-gram were used as features. In the third
study, we performed feature union by considering different combinations of the top-performing
features from the first and second studies. The main features selected for this study were character-
level (5-gram and 6-gram) and word-level (Uni-Bi and Uni-Bi-Tri, FP and Freq). The fourth study
was performed on the entire dataset using CNN and LSTM. To further improve accuracies, we used
ensembles of ML algorithms, followed by majority voting and wVoting using the top three best-
performing standard ML algorithms.
Details of the results are discussed in the next section. The main aim of these studies was to
determine which types of features perform best given the existing complexities and challenges of

8 The degree to which the estimated and actual results conform.

Table 3. Accuracies Using Word-level Features (Freq-based)

Types of Features Unigram Bigram Uni-Bigram Uni-Bi-Tri


Number of Features 30,382 141,588 171,970 361,393
Algorithms (%) (%) (%) (%)
LR 79.17 69.37 79.22 78.73
NB 78.56 73.75 79.42 79.47
SVM 53.54 52.04 52.07 52.04
DT 68.80 63.25 69.07 69.17
KNN 58.38 52.77 54.97 53.89
ANN 78.50 68.29 79.67 78.92
BNB 78.37 73.24 79.09 79.11
BLR 78.91 67.40 78.39 77.69
BANN 79.38 66.12 79.37 78.08
RF 72.60 63.59 71.47 71.13
ABNB 53.64 50.52 52.94 53.10
ABLR 72.68 60.30 72.88 72.94
Voting 79.88 69.98 80.41 80.07
wVoting 80.07 71.45 81.09 80.95

RU (see Section 3), because selecting the correct and relevant features plays an important role in
increasing the robustness of an SA system, as their impact varies from language to language.

8 RESULTS AND DISCUSSION
This section presents and analyzes the results obtained from the four studies, with Table 3 showing
those for the first sub-category of word-level features (Freq-based). Since, using unigram, the maxi-
mum accuracy was achieved by wVoting,9 it was selected as the baseline (80.07%). Unigram also
outperformed bigram, which is in line with the results obtained by other state-of-the-art works in
the field of SA for the English language [6, 29]. As more features were added to unigram, the accura-
cies improved, although the difference between Uni-Bi and Uni-Bi-Tri was very small. A possible
reason could be that the strength of one feature was complemented by that of the other when they
were combined. Overall, wVoting outperformed all other algorithms in all the experiments and,
by combining different features, the error rate10 was reduced by 5.12% from the baseline. An error
analysis showed that, when using unigram as a feature, “qmobile acha nahi ha” (Q Mobile is not
good) was wrongly classified as positive, but when uni-bigram was used, it was correctly classified
because “acha nahi” (not good) was then also a feature.
After the word level N-gram freq-based features, the second subcategory of word-level features
(FP) was used for further experimentation. Although the frequency of a feature may be helpful
in other language processing tasks, such as keyword extraction and topic modeling [31], the hy-
pothesis was that, in an SA task, the presence of a feature would be sufficient to determine the
polarity (positive or negative) of a review. Table 4 shows the results obtained from the second set
of experiments. It was observed that although the FP unigram performed better than the Freq-
based unigram, it could not improve on the results achieved by the best-performing algorithm using
Uni-Bigram features in the previous phase. Similarly, we selected Bigram, Uni-Bi, and Uni-Bi-Tri
features (FP) to compute accuracies. The results of Uni-Bi outperformed those of the last

9 For Voting and wVoting, LR, NB, and ANN were selected.
10 The reduction in error rate is calculated as ((Old error − New error) / Old error) × 100.

Table 4. Accuracies Using Word-level Features (FP)

Types of Features Unigram Bigram Uni-Bigram Uni-Bi-Tri


Number of Features 30,382 141,588 171,970 361,393
Algorithms (%) (%) (%) (%)
LR 79.39 69.71 79.39 78.82
NB 78.69 73.89 79.68 79.91
ANN 78.55 68.27 80.17 79.36
Voting 80.01 70.18 80.68 80.16
wVoting 80.15 71.60 81.13 81.01

Table 5. Accuracies Using Word-level Features (TFIDF-based)

Types of Features Unigram Bigram Uni-Bigram Uni-Bi-Tri


Number of Features 30,382 141,588 171,970 361,393
Algorithms (%) (%) (%) (%)
LR 79.57 72.15 78.66 77.98
NB 79.48 73.40 80.05 79.86
ANN 78.61 72.38 80.80 80.62
Voting 80.29 73.19 80.48 80.18
wVoting 79.74 72.67 80.88 80.59

Table 6. Accuracies Using Character-level Features (Without Word Boundary)

Types of Features Bigram Trigram 4-Gram 5-Gram 6-Gram


Number of Features 3,707 23,739 86,271 207,474 368,125
Algorithms % % % % %
LR 73.90 77.85 79.81 79.61 78.92
NB 71.77 76.89 78.86 79.42 79.83
ANN 75.93 78.21 80.40 80.17 79.26
Voting 75.85 78.93 80.91 80.74 80.10
wVoting 76.30 79.38 81.10 81.17 81.01

phase, as it could reduce the error rate by 5.32% from the baseline, and by 0.2% from the highest
achieved results of the last phase, which proved that the hypothesis was correct. Since none of
the ensemble techniques, DT, SVM, and KNN, outperformed the others, therefore, they were not
included in the remaining experiments.
Table 5 shows the results for the third sub-category of word-level features (TFIDF) using uni-
gram, bigram, Uni-Bi, and Uni-Bi-Tri features in the same set of experiments. However, TFIDF
was not able to improve on the previous results. A possible reason for the unsatisfactory results
is that, in an SA task, the sentiment-bearing words tend to be frequent, and TFIDF assigns a lower
weight to the most frequent terms.
The results of the second study are shown in Tables 6 and 7 for the sub-categories of “character-
level features without a word boundary” and “character level features with a word boundary,”
respectively.
Table 6 shows that the highest accuracy was achieved by 5-gram followed by 4-gram and 6-gram.
After analyzing the results, it was observed that 5-gram outperformed all the previous results using
the wVoting technique. 5-gram could reduce the error rate by 5.52% from the baseline, and reduce

Table 7. Accuracies Using Character-level Features (With Word Boundary)

Types of Features Bigram Trigram 4-Gram 5-Gram 6-Gram


Number of Features 3,696 20,384 53,836 82,561 89,089
Algorithms % % % % %
LR 73.92 78.19 80.04 80.27 79.80
NB 72.07 77.07 79.12 79.35 78.91
ANN 75.67 78.22 79.76 79.78 79.24
Voting 75.81 79.09 80.67 80.96 80.63
wVoting 76.45 79.46 81.09 81.39 81.01

Table 8. Accuracies Using Feature Union

Types of Features            No. of Features  LR (%)  NB (%)  ANN (%)  Voting (%)  wVoting (%)
Char 5 + Uni-Bi FP           68,287           81.07   80.30   81.51    82.07       82.32
Char 5 + Uni-Bi-Tri FP       92,119           81.07   80.41   81.60    82.24       82.46
Char 5 + Uni-Bi-Tri Freq     92,119           80.93   80.28   81.51    82.10       82.29
Char 6 + Uni-Bi FP           92,119           80.61   80.00   81.07    81.68       81.97
Char 6 + Uni-Bi-Tri FP       114,756          80.51   80.17   81.16    81.77       82.09
Char 6 + Uni-Bi-Tri Freq     114,756          80.47   79.99   80.95    81.56       81.84
Char 5,6 + Uni-Bi FP         133,113          80.98   80.01   81.22    81.97       82.14
Char 5,6 + Uni-Bi-Tri FP     156,945          81.12   80.09   81.41    82.17       82.21
Char 5,6 + Uni-Bi-Tri Freq   156,945          80.94   80.03   81.33    82.04       82.16

the error rate by 0.21% from the highest result achieved in the previous phases. Table 7 shows the
results for “character-level features with word boundary,” where the highest accuracy was again
achieved by 5-gram, followed by 4-gram and 6-gram. After analysis, it was observed that this
5-gram could reduce the error rate by 6.62% from the baseline, and by 1.38% from the highest
result achieved in the previous study.
The second study was motivated by one of the challenges of RU mentioned in Section 3, i.e.,
variations in word spellings; for example, “Play” and “Pley,” or “words” and “wards.” With
word-level N-gram features, “words” and “wards” are represented by two different features,
whereas with character-level N-gram features, some parts of “words” and “wards” match. For
example, using character-level bigrams, the word “words” is divided into six features, {_w, wo,
or, rd, ds, s_}, and the word “wards” into six features, {_w, wa, ar, rd, ds, s_}. Four of these
features match while two in each word do not, which may help the classifier to predict better.
An error analysis showed that the example “Ye drama bohat bakwasssss hai” (This drama is
very pathetic) was wrongly classified using uni-bigram features but correctly classified using
character-level features.
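
A small sketch making this overlap concrete (the boundary marker “_” is added explicitly):

```python
def char_bigrams(word):
    """Character bigrams of a word padded with boundary markers."""
    padded = f"_{word}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

# Four of the six bigrams are shared between the two spelling variants.
print(char_bigrams("words") & char_bigrams("wards"))  # {'_w', 'rd', 'ds', 's_'} (set order may vary)
```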
The results of the third study are shown in Table 8. This study consisted of the three top-
performing features from each of the word-level and character level N-grams. It can be observed
that all the results have improved from the previous two studies, where the highest accuracy was
achieved by wVoting using Char 5+Uni-Bi-Trigram FP as a feature. The error rate was reduced
by 12% from the baseline and 7.05% and 5.75% from the best-performing features of the first and
second studies, respectively. The main reason for such improvement in all the experiments of this
study was that we combined the strengths of the best-performing word-level and character-level
features.

Table 9. Accuracies Using Deep Learning

Algorithms Epochs
2 3 4 5 6 7
LSTM (%) 78.24 77.68 77.29 76.92 76.23 76.46
CNN (%) 75.53 76.25 76.39 76.46 76.51 76.20

In the last study, we used deep-learning techniques to evaluate their impact on the RU dataset,
because these techniques have produced good results for similar tasks using pre-trained word
embeddings [33, 35]. When evaluated on our dataset without pre-trained word embeddings, these
techniques could not outperform the others, as shown in Table 9. We believe that the major reason
for the poor results in our case was the unavailability of pre-trained word embeddings for RU, due
to the absence of a huge corpus like Google News or Wikipedia. Previous work also showed that
deep-learning techniques run on a dataset without pre-trained word embeddings do not produce
good results [33]; hence, our results are in line with the state of the art in deep learning.
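
For reference, the following is a minimal Keras sketch of the LSTM and CNN structures described in Section 5; the vocabulary size, sequence length, pooling layer, and optimizer are assumptions, as they are not specified above.

```python
from tensorflow.keras.layers import (Conv1D, Dense, Embedding,
                                     GlobalMaxPooling1D, LSTM)
from tensorflow.keras.models import Sequential

VOCAB_SIZE, MAX_LEN = 20000, 100  # assumed values

# LSTM: embedding layer (dim 128), one LSTM layer (196 units), output layer.
lstm_model = Sequential([
    Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
    LSTM(196),
    Dense(1, activation="sigmoid"),
])

# CNN: embedding layer (dim 128), one convolutional layer (64 filters,
# kernel size 3), one dense layer (180 units, tanh), output layer.
cnn_model = Sequential([
    Embedding(VOCAB_SIZE, 128, input_length=MAX_LEN),
    Conv1D(filters=64, kernel_size=3, activation="relu"),
    GlobalMaxPooling1D(),  # pooling choice is an assumption
    Dense(180, activation="tanh"),
    Dense(1, activation="sigmoid"),
])

for model in (lstm_model, cnn_model):
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
```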
As previously discussed, by conducting the different studies, we improved on the results from the
established baseline. To determine whether these improvements were also statistically significant,
we applied the following t-test (t) and confidence interval (CI), with a significance level of 0.05, to
the obtained results:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \quad \text{and} \quad CI = \left(\bar{x} - z\,\frac{s}{\sqrt{n}},\ \bar{x} + z\,\frac{s}{\sqrt{n}}\right),$$

where $\bar{x}$ and $\mu$ are the mean values, $s$ the standard deviation, $z$ the number of standard
errors to be added and subtracted to achieve the desired confidence level, and $n$ the sample size.
These tests showed that the best results of each study were statistically significantly different from
the baseline.
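
As a sketch, assuming the 30 per-iteration accuracies of a study are collected in an array, the t statistic and CI could be computed with NumPy/SciPy as follows (the helper name and the use of a one-sample test against the baseline mean are assumptions):

```python
import numpy as np
from scipy import stats

def significance(accuracies, baseline_mean, alpha=0.05):
    """One-sample t-test of per-iteration accuracies against the baseline mean,
    plus a normal-approximation confidence interval for the mean accuracy."""
    accuracies = np.asarray(accuracies)
    t_stat, p_value = stats.ttest_1samp(accuracies, baseline_mean)
    n, mean, s = len(accuracies), accuracies.mean(), accuracies.std(ddof=1)
    z = stats.norm.ppf(1 - alpha / 2)  # z = 1.96 for a 95% CI
    ci = (mean - z * s / np.sqrt(n), mean + z * s / np.sqrt(n))
    return t_stat, p_value, ci

# Illustrative use: accuracies of the best feature-union run vs. the 80.07% baseline.
# t, p, ci = significance(run_accuracies, baseline_mean=80.07)
```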

9 CONCLUSION AND FUTURE WORK
In this work, the largest ever RUSA dataset was developed and a number of steps were taken
for the development and improvement of a RUSA system. Different ML algorithms, along with
different types of word-level features, character-level features, and the combination of both (feature
union), were identified and used to improve accuracy. It was observed that word-level features
could not perform as well as the character-level features and feature union did; the main reason
for this is the spelling-variation complexity of RU. With regard to the performance of the ML
algorithms, wVoting (an ensemble technique) performed best most of the time. State-of-the-art
deep-learning techniques were also used to further improve the results, but they did not perform
well due to the lack of pre-trained RU word embeddings. Further, t-tests and CIs showed that the
improvements made were statistically significant. Potential paths for future work include analyzing
and improving the results obtained by these studies using different feature-reduction techniques
and developing pre-trained RU word embeddings.

REFERENCES
[1] R. M. Duwairi, R. Marji, N. Sha’ban, and S. Rushaidat. 2014. Sentiment analysis in Arabic tweets. In 2014 5th Interna-
tional Conference on Information and Communication Systems (ICICS). IEEE, 1–6.
[2] B. Anwar. 2009. Urdu-English code switching: The use of Urdu phrases and clauses in Pakistani English (A non-native
variety). Int. J. Lang. Stud. 3, 4 (2009), 409–424.
[3] A. Pak and P. Paroubek. 2010. Twitter as a corpus for sentiment analysis and opinion mining. In LREc, (Vol. 10,
No. 2010), 1320–1326.
[4] Gary F. Simons and Charles D. Fennig (Eds.). 2017. Ethnologue: Languages of the World, 20th edition. Dallas, Texas:
SIL International. Retrieved from http://www.ethnologue.com.

[5] R. Feldman. 2013. Techniques and applications for sentiment analysis. Communications of the ACM 56, 4 (2013), 82–89.
[6] B. Pang, L. Lee, and S. Vaithyanathan. 2002. Thumbs up?: Sentiment classification using machine learning techniques.
In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing. Vol. 10. 79–86.
[7] W. Medhat, A. Hassan, and H. Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams
Eng. J. 5, 4 (2014), 1093–1113.
[8] A. Abbasi, H. Chen, and A. Salem. 2008. Sentiment analysis in multiple languages: Feature selection for opinion
classification in web forums. ACM Trans. Inf. Syst. 26, 3 (2008), 12:11–12.34.
[9] C. Yang, K. H. Y. Lin, and H. H. Chen. 2007. Emotion classification using web blog corpora. In IEEE/WIC/ACM Inter-
national Conference on Web Intelligence. IEEE, 275–278.
[10] K. Mehmood, D. Essam, and K. Shafi. 2018. Sentiment analysis system for Roman Urdu. In Science and Information
Conference. Springer, Cham, 29–42.
[11] R. Socher, D. Chen, C. D. Manning, and A. Ng. 2013. Reasoning with neural tensor networks for knowledge base
completion. In Advances in Neural Information Processing Systems. 926–934.
[12] C. Zhang, D. Zeng, J. Li, F. Y. Wang, and W. Zuo. 2009. Sentiment analysis of Chinese documents: From sentence to
document level. J. Assoc. Inf. Sci. Tech. 60, 12 (2009), 2474–2487.
[13] C. Clavel and Z. Callejas. 2016. Sentiment analysis: From opinion mining to human-agent interaction. IEEE Trans.
Affective Comput. 7, 1 (2016), 74–93.
[14] S. Ahmed, S. Hina, and R. Asif. 2018. Detection of sentiment polarity of unstructured multi-language text from social
media. Int. J. Adv. Comput. Sci. Appl. 9, 7 (2018), 199–203.
[15] M. Daud, R. Khan, and A. Daud. 2015. Roman Urdu opinion mining system (RUOMiS). arXiv preprint
arXiv:1501.01386.
[16] A. Z. Syed, M. Aslam, and A. M. Martinez-Enriquez. 2010. Lexicon based sentiment analysis of Urdu text using Sen-
tiUnits. In Mexican International Conference on Artificial Intelligence. Springer, Berlin, 32–43.
[17] N. Mukhtar, M. A. Khan, and N. Chiragh. 2017. Effective use of evaluation measures for the validation of best classifier
in Urdu sentiment analysis. Cognitive Computation (2017), 1–11.
[18] N. Mukhtar and M. A. Khan. 2018. Urdu sentiment analysis using supervised machine learning approach. Int. J. Pattern
Recognit. Artif. Intell. (2018), 32.
[19] S. Mukund and R. K. Srihari. 2012. Analyzing Urdu social media for sentiments using transfer learning with controlled
translations. In Proceedings of the Second Workshop on Language in Social Media. ACL, 1–8
[20] J. Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20, 1 (1960),
37–46.
[21] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas. 2007. Supervised machine learning: A review of classification techniques.
Emerging Artificial Intelligence Applications in Computer Engineering 160 (2007), 3–24.
[22] T. Hastie, R. Tibshirani, and J. Friedman. 2009. Overview of supervised learning. In The Elements of Statistical Learning.
Springer New York, 9–41.
[23] S. I. Gallant. 1990. Perceptron-based learning algorithms. IEEE Trans. Neural Networks 1, 2 (1990), 179–191.
[24] B. E. Boser, I. M. Guyon, and V. N. Vapnik. 1992. A training algorithm for optimal margin classifiers. In Proceedings
of the 5th Annual Workshop on Computational Learning Theory. ACM, 144–152.
[25] G. Zenobi and P. Cunningham. 2001. Using diversity in preparing ensembles of classifiers based on different feature
subsets to minimize generalization error. Machine Learning: ECML 2001, 576–587.
[26] A. Yessenalina, Y. Yue, and C. Cardie. 2010. Multi-level structured models for document-level sentiment classification.
In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing. ACL, 1046–1056.
[27] W. Medhat, A. Hassan, and H. Korashy. 2014. Sentiment analysis algorithms and applications: A survey. Ain Shams
Eng. J. 5, 4 (2014), 1093–1113.
[28] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, and J. Vanderplas. 2011. Scikit-learn: Ma-
chine learning in Python. J. Mach. Learn. Res. 12, (Oct. 2011), 2825–2830.
[29] P. H. Shahana and B. Omman. 2015. Evaluation of features on sentimental analysis. Procedia Comp. Sci. 46 (2015),
1585–1592.
[30] C. W. Hsu and C. J. Lin. 2002. A comparison of methods for multiclass support vector machines. IEEE Trans. Neural
Networks 13, 2 (2002), 415–425.
[31] E. Cambria, B. Schuller, Y. Xia, and C. Havasi. 2013. New avenues in opinion mining and sentiment analysis. IEEE
Intell. Syst. 28, 2 (2013), 15–21.
[32] K. Oouchida, J. D. Kim, T. Takagi, and J. I. Tsujii. 2009. GuideLink: A corpus annotation system that integrates the
management of annotation guidelines. In Proceedings of 23rd Pacific Asia Conference on Language, Information, and
Computation. Vol. 2.
[33] Y. Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on
Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, 1746–
1751.

[34] M. Bilal, H. Israr, M. Shahid, and A. Khan. 2016. Sentiment classification of Roman-Urdu opinions using Naïve
Bayesian, Decision Tree, and KNN classification techniques. J. King Saud Univ. Comp, Inf. Sci. 28, 3 (2016), 330–344.
[35] S. Lai, L. Xu, K. Liu, and J. Zhao. 2015. Recurrent convolutional neural networks for text classification. In AAAI,
Vol. 333. 2267–2273.
[36] M. Taboada, J. Brooke, M. Tofiloski, K. Voll, and M. Stede. 2011. Lexicon-based methods for sentiment analysis. Com-
put. Ling. 37, 2 (2011), 267–307.
[37] R. M. Duwairi, R. Marji, N. Sha’ban, and S. Rushaidat. 2014. Sentiment analysis in Arabic tweets. In 2014 5th Interna-
tional Conference on Information and Communication Systems (ICICS), IEEE. 1–6.
[38] D. Alessia, F. Ferri, P. Grifoni, and T. Guzzo. 2015. Approaches, tools, and applications for sentiment analysis imple-
mentation. Int. J. Comput. Appl. 125, 3 (2015), 26–33.
[39] M. K. Malik. 2017. Urdu named entity recognition and classification system using artificial neural network. ACM
Trans. Asian Low-Resour. Lang. Inf. Process. 17, 1 (2017), 2.
[40] S. Mohammad. 2016. A practical guide to sentiment annotation: Challenges and solutions. In Proceedings of the 7th
Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 174–179.
[41] Y. Sun, A. K. Wong, and M. S. Kamel. 2009. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif.
Intell. 23, 4 (2009), 687–719.
[42] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. 2013. Recursive deep models for
semantic compositionality over a sentiment treebank. In Proceedings of 2013 Conference on Empirical Methods in
Natural Language Processing.
[43] Z. Lu, M. Bada, P. V. Ogren, K. B. Cohen, and L. Hunter. 2006. Improving biomedical corpus annotation guidelines.
In Proceedings of the Joint BioLink and 9th Bio-ontologies Meeting. 89–92.
[44] Z. Sharf and S. U. Rahman. 2018. Performing natural language processing on roman urdu datasets. Int. J. Comput. Sci.
Network Secur. 18, 1 (2018), 141–148.
[45] Ravi Kumar and Ravi Vadlamani. 2015. A survey on opinion mining and sentiment analysis: tasks, approaches and
applications. Knowledge-Based Syst. 89 (2015), 14–46.

Received February 2018; revised February 2019; accepted April 2019
