Roman Urdu Multi-Class Offensive Text Detection - 2020
Rabia Gillani
Department of Computer Science,
National Cybercrime Forensics Lab
Air University, Sector E-9, Islamabad
Pakistan
[email protected]
Abstract—Hate content has become a significant issue worldwide due to the growth of social networking sites. Detecting hate content in a language other than English is challenging. We propose a new technique that automatically classifies Roman Urdu comments from YouTube videos into five classes: Religious Hate, Violence Promotion, Extremist (Racist), Threat/Fear, and Neutral. We generated a dataset by scraping Roman Urdu comments from YouTube videos and had them labeled by language experts. We use N-grams with TF-IDF values for feature extraction, followed by SVM classification. Some classes have relatively few instances, so we employed SMOTE for class balancing. The developed model achieves a classification performance of 77.45% using the 10-fold cross-validation technique, and the proposed approach offers superior classification results compared to others.

Keywords—hate speech, n-gram, tf-idf, machine learning, deep learning, youtube

I. INTRODUCTION

In Pakistan, most people use Roman Urdu for comments on video-sharing and social media platforms such as YouTube, Facebook, and Twitter. Over the last few years, Pakistan has seen tremendous growth in the number of people using YouTube; it is the second most visited website in the country [1]. People from different religions, cultures, and educational backgrounds use YouTube. Sometimes people upload videos that might be inappropriate for various cultures or religions, which may lead to verbal assaults in the comments because of differences in people's opinions. Such actions may create law and order problems in the country. People have freedom of speech, so they can comment on any content on YouTube, which leads to abusive language, racist comments, religious hate, and sometimes even threats. Hate speech has a terrible impact on society and damages people's mental health, in some cases driving people to suicide [2].

Hate speech is a huge issue. Although much work has been done on hate speech detection for English and some other languages, no work has been done to detect hate speech in Roman Urdu. This increases the importance of detecting hate speech in Roman Urdu so that such content can be removed and people can be protected from cyberbullying. Manually removing such content is also challenging, which further increases the importance of an automated system that can detect hate content in YouTube comments.

We divided hate speech into five different classes: Religious Hate, Violence Promotion, Extremist (Racist), Threat/Fear, and Neutral. Similar approaches were used to detect hate speech in previous research [3] and [4]. There is no international legal definition of hate speech. However, according to the UN, hate speech is any kind of communication in speech, writing, or behavior that attacks or uses abusive or discriminatory language with reference to a group or a single person based on religion, ethnicity, nationality, race, color, gender, or other identity factors [5].

In this research paper, we propose a solution that can classify hate speech into five classes. People use YouTube as a medium to spread hate speech in the country, so we decided to use YouTube as a data source. We built a scraper that collects comments from YouTube videos and annotated these comments with their respective classes. We train our models using n-grams weighted with L1- and L2-normalized term frequency-inverse document frequency (TF-IDF) values as features and classify the comments. We evaluate the models using metric scores and a confusion matrix. In this research, we perform a comparative analysis of Logistic Regression (LR), Support Vector Machines (SVM), SGDClassifier, and Naive Bayes (NB) using n-grams with L1- and L2-normalized TF-IDF values, and of document-to-vector features with Logistic Regression (LR), Support Vector Machines (SVM), and SGDClassifier as classifiers. Our results show that, on our Roman Urdu dataset, the Support Vector Machine (SVM) performs better than all other models on n-gram features with L2-normalized TF-IDF values. We also tuned the hyperparameters of our machine learning models using 10-fold cross-validation. Finally, we built a YT Monitor web interface that scrapes the comments for a given link or keyword and classifies them into their respective hate speech classes.

We organized this paper as follows. Section II gives an overview of related work and the approaches people have used to address the hate speech problem. Section III explains our methodology and the steps we follow to solve the hate speech problem. Section IV describes the results of the comparative analysis of machine learning models using different features, and Section V concludes our research.

II. RELATED WORK

Detecting and checking hate speech on social media is not an easy task. Every day, many people write text on social media using informal language. Different people use different language; that is why some words are a joke for some people but hate speech for others [3]. This distinction is difficult to draw.
Different machine learning and deep learning approaches have been used to detect hate speech. In some studies, sentence structure was used to capture hate speech [6]; many others used lexical features [7] and bag-of-words approaches [8]. Previous research observed that these features were not entirely effective at capturing hate speech in text and often failed. On the other hand, n-gram features with TF-IDF have also been used and showed better results [9] [10].

Lexical features follow two main approaches, dictionary-based and corpus-based, in which words with the same meaning are tagged in a static dictionary with polarity labels and semantic orientation scores. In a bag-of-words (BOW) model, the text is tokenized into words and represented by word frequencies. Since bag-of-words ignores word order, word semantics, and grammar, it is mostly used for basic natural language processing (NLP) tasks [11].

Linguistic features comprise sample length, parts of speech, average word length, number of periods, punctuation, URLs, capitalized letters, polite words, insult words, hate speech words, and one-letter words. These features have not proved very important in previous studies and did not show much improvement in classifier accuracy [12]. Sentiment analysis features have shown their importance in hate speech detection, and the two have been observed to be closely related; it is assumed that most negative sentiment concerns hate speech [13], and documents containing hate speech show higher negative polarity [14].

Recently, word embeddings have been proposed to detect hate speech [15]. With word embeddings, tokens are consumed sequentially in a matrix built by concatenating the token embeddings [16]. Deep learning algorithms such as Convolutional Neural Networks (CNN) and LSTMs have also recently been used to detect hate speech; a Convolutional Neural Network (CNN) was used to detect hate speech on Twitter in recent studies [17] and [18], which also performed further analysis with word embeddings to understand the effect of the feature selection process on different models.

III. METHODOLOGY

A. Data Collection

The crawling process was done using the scrapy library with AJAX requests. If we give the crawler the link of a video, it scrapes all comments from that video; if we provide a YouTube channel link, the scraper scrapes all comments from all videos uploaded on that channel. We can also search for videos by keyword: we input the keyword in the search box, the crawler scrapes the top videos for the given keyword, and these videos are shown on the YT Monitor interface. From the videos shown for a keyword search, we can select the video whose comments we want to scrape and simply click the scrape button, and the comments are scraped for that video. For this research, we queried different keywords and found videos whose comments contain offensive Roman Urdu. We also queried many offensive Roman Urdu words on YouTube to find videos with comments related to those keywords. In total, we scraped about 16,806 comments from YouTube.
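The crawler described above is built on scrapy with AJAX requests, whose exact endpoints are not reproduced here. Purely as an illustration of the comment-collection step, the sketch below fetches top-level comments for a single video through the official YouTube Data API v3 (google-api-python-client); the API key, video ID, and helper name are placeholders, and this is an alternative to the scrapy crawler rather than a reproduction of it.

    # Sketch: collect top-level comments for one video via the YouTube Data API v3.
    # Assumes google-api-python-client is installed; API_KEY and VIDEO_ID are placeholders.
    from googleapiclient.discovery import build

    API_KEY = "YOUR_API_KEY"
    VIDEO_ID = "VIDEO_ID_HERE"

    def fetch_comments(video_id, api_key):
        youtube = build("youtube", "v3", developerKey=api_key)
        comments, page_token = [], None
        while True:
            response = youtube.commentThreads().list(
                part="snippet",
                videoId=video_id,
                maxResults=100,
                textFormat="plainText",
                pageToken=page_token,
            ).execute()
            for item in response["items"]:
                comments.append(
                    item["snippet"]["topLevelComment"]["snippet"]["textOriginal"]
                )
            page_token = response.get("nextPageToken")
            if not page_token:
                return comments

    comments = fetch_comments(VIDEO_ID, API_KEY)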
B. Data Pre-processing

In pre-processing, we first filter the comments dataset and remove duplicate comments. To decide whether a given document is Roman Urdu or not, we use a pre-defined set of Roman Urdu words called the Roman dictionary; the dictionary contains Roman Urdu words together with their possible spellings. For example, some people write the word kafir as kafar, so we stored all the possible Roman spellings that a user can write in a document. The Roman dictionary was compiled by the language experts on our team. We also remove comments that contain languages other than Roman Urdu. After filtering the comments, we obtained 16,300 comments, which were labeled by the language experts.
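The filtering logic is not given in code in the paper; the following is a minimal sketch of one plausible implementation, assuming the Roman dictionary is available as a plain set of spellings and that a comment is kept when at least half of its tokens appear in that dictionary. The toy dictionary entries, the 0.5 threshold, and the function names are illustrative assumptions.

    # Sketch: duplicate removal and Roman Urdu filtering with a word-list dictionary.
    # ROMAN_DICTIONARY, the 0.5 threshold, and the helper names are assumptions.
    ROMAN_DICTIONARY = {"kafir", "kafar", "zindabad", "pak", "army", "nahi"}  # toy subset

    def is_roman_urdu(comment, dictionary=ROMAN_DICTIONARY, threshold=0.5):
        tokens = comment.lower().split()
        if not tokens:
            return False
        known = sum(1 for t in tokens if t in dictionary)
        return known / len(tokens) >= threshold

    def filter_comments(comments):
        seen, kept = set(), []
        for c in comments:
            key = c.strip().lower()
            if key and key not in seen and is_roman_urdu(key):
                seen.add(key)
                kept.append(c)
        return kept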
TABLE I. FEW SAMPLES FROM THE DATASET

Document                                                              Label
This Randi doesnt know the difference between fuel tank and bomb.     Violence Promotion
Pakistan zindabad pak army zindabad pakistan isi zindabad.            Neutral
Tu pak k nitale kuto hramiyo aagar bhart tum pe mut bhi de na to
tum to use amrit samaz kar pi jaoge.                                  Extremist
Sahaba se bughuz sirf harmi karskta hn.                               Religious
meri khawish hai me gun uthaon aur un sb ko maar do jo wahan
milen gaye muje.                                                      Threat

Fig. 2. Distribution of obtained dataset.

During pre-processing, we remove the following from each comment:
1) Bad symbols
2) Stop words
3) Non-ASCII characters
4) Punctuation
5) Uniform resource locators (URLs)
6) Emojis
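A regex-based sketch of the six cleaning steps listed above is shown below. The Roman Urdu stop-word set is an illustrative assumption, not the list used by the authors.

    # Sketch of the six cleaning steps; the stop-word subset is assumed.
    import re
    import string

    STOP_WORDS = {"ka", "ki", "ke", "ho", "hai", "to", "se", "me"}  # assumed subset

    def clean_comment(text):
        text = re.sub(r"http\S+|www\.\S+", " ", text)                       # URLs
        text = text.encode("ascii", "ignore").decode()                       # non-ASCII chars, emojis
        text = re.sub(r"[%s]" % re.escape(string.punctuation), " ", text)    # punctuation
        text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)                          # remaining bad symbols
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]    # stop words
        return " ".join(tokens)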
The next step is a challenge for us: in Roman Urdu, every person has a different writing style, so some people write the word kafir as Kfar, Kafr, or Kafar. We therefore have to convert each word into its original form, but no pre-defined stemming or lemmatization exists for Roman Urdu, and there was no dictionary for Roman Urdu that maps words back to their original form. For this challenging task, we started building a dictionary for Roman Urdu that converts all the different forms of the same word into one word. For this research, we built that dictionary and converted the different spellings of the same word in our data into a single word. Our dataset is not published online yet; we hope it will be available soon, as we are still performing experiments based on this data.
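The normalization dictionary itself is not listed in the paper; the sketch below only shows the kind of variant-to-canonical mapping described above, with the entries and function name being illustrative assumptions.

    # Sketch: mapping spelling variants to one canonical Roman Urdu form.
    # The entries are illustrative; the authors' dictionary is much larger.
    NORMALIZATION_DICTIONARY = {
        "kfar": "kafir",
        "kafr": "kafir",
        "kafar": "kafir",
        "zindabaad": "zindabad",
    }

    def normalize_comment(text, mapping=NORMALIZATION_DICTIONARY):
        return " ".join(mapping.get(token, token) for token in text.split())

    # Example: "kafr log zindabaad" -> "kafir log zindabad"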
C. Feature Extraction

For this research, we use n-gram features from uni-grams to tri-grams and weight them with TF-IDF values. We use TF-IDF to remove the bias of tokens that occur very frequently in the data but carry little information. TF-IDF is computationally and mathematically easy to apply to the problem at hand. Another important property of TF-IDF is that it makes it very simple to calculate the similarity between two or more documents; basic operations such as addition and subtraction are used to extract the most descriptive terms from the dataset. Common terms or words in the dataset (e.g. "is", "are", "am", etc.) do not affect the results because of the IDF component. When the feature extraction process is complete, we provide the data to the models for classification. The formula [4] used to compute TF-IDF is:

TF-IDF(t, d) = TF(t, d) × log(N / DF(t))      (1)

where TF(t, d) is the frequency of term t in document d, DF(t) is the number of documents containing t, and N is the total number of documents. We use the L1 and L2 norms of the TF-IDF vectors in our experiments. The L1 norm rescales a TF-IDF vector v = (v1, ..., vn) as

vi / (|v1| + |v2| + ... + |vn|)      (2)

while the L2 norm rescales it as

vi / sqrt(v1² + v2² + ... + vn²)      (3)

The dimensions of the uni-gram, bi-gram, and tri-gram vectors used for feature extraction from the textual data are shown in TABLE II below:

TABLE II. FEATURE EXTRACTION DIMENSIONS OF TF-IDF

Features            Dimensions
Uni-gram            (25865, 13897)
Uni+bi-gram         (25865, 26002)
Uni+bi+tri-gram     (25865, 28884)
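The n-gram TF-IDF features described above map directly onto scikit-learn's TfidfVectorizer; a minimal sketch follows, assuming the cleaned and normalized comments are held in a list of strings. Only ngram_range and norm are taken from the description above; everything else is a scikit-learn default, not a reported setting.

    # Sketch: uni- to tri-gram TF-IDF features with L2 (or L1) normalization.
    from sklearn.feature_extraction.text import TfidfVectorizer

    comments = ["pakistan zindabad", "kafir log nahi"]     # cleaned, normalized text
    vectorizer = TfidfVectorizer(ngram_range=(1, 3),        # uni+bi+tri-grams
                                 norm="l2")                  # "l1" for the L1-norm runs
    X = vectorizer.fit_transform(comments)                  # sparse matrix (n_docs, n_ngrams)
    print(X.shape)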
D. Classification Models and Evaluation

For this research, we use several classifiers: Logistic Regression (LR), Support Vector Machine (SVM), SGDClassifier, and Naive Bayes (NB) on the TF-IDF features, and document-to-vector features with Logistic Regression, Support Vector Machine (SVM), and SGDClassifier. We use Scikit-Learn and Keras to implement these models. Scikit-Learn is a popular library that provides highly efficient classification models and is now used by almost every researcher for text and other classification tasks. For better results, we tune our model parameters using GridSearchCV, and to keep our models from overfitting we evaluate them with 10-fold cross-validation, which is widely used in the research field. In 10-fold cross-validation, the dataset is divided into 10 parts, where 9/10 of the data is used for training and the remaining 1/10 for testing the model.
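A compact sketch of this comparative setup is given below: the four classifiers are scored with 10-fold cross-validation on the TF-IDF features, and, since the abstract notes that SMOTE was used for class balancing, an imblearn pipeline is assumed so that oversampling happens only inside each training fold. X and y are assumed to be the TF-IDF matrix from Section C and the expert-assigned class labels; the hyperparameters shown are library defaults, not the tuned values.

    # Sketch: 10-fold cross-validated comparison of the four classifiers, with
    # SMOTE oversampling applied inside each training fold (assumed setup).
    from sklearn.linear_model import LogisticRegression, SGDClassifier
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline

    # X: TF-IDF feature matrix from Section C; y: expert-assigned class labels.
    models = {
        "LR": LogisticRegression(max_iter=1000),
        "SVM": SVC(),
        "SGD": SGDClassifier(),
        "NB": MultinomialNB(),
    }
    for name, clf in models.items():
        pipeline = make_pipeline(SMOTE(random_state=42), clf)
        scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
        print(name, scores.mean())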
E. YT Monitor

YT Monitor is a web-based application developed to scrape comments and perform multi-class offensive text detection on Roman Urdu data. It has an input field where the user gives a URL or keyword for which to scrape comments of YouTube videos. The input is passed to the YouTube comment scraper we built, and the comments are scraped for the given input. After scraping the comments, we apply the pre-processing steps to them and pass the pre-processed comments to our machine learning model, which classifies them into their respective classes. We show different graphs as the result, such as a pie chart, line graph, and word cloud.
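The YT Monitor back end essentially chains the pieces above; the sketch below shows that flow for a single video, reusing the hypothetical helpers from the earlier sketches (fetch_comments, clean_comment, normalize_comment) together with an already fitted vectorizer and classifier, and summarizing the predictions as the class counts behind the pie chart.

    # Sketch of the YT Monitor back-end flow for one video, reusing the
    # hypothetical helpers defined in the previous sketches.
    from collections import Counter

    def classify_video(video_id, api_key, vectorizer, classifier):
        raw = fetch_comments(video_id, api_key)                        # data collection
        cleaned = [normalize_comment(clean_comment(c)) for c in raw]   # pre-processing
        features = vectorizer.transform(cleaned)                       # TF-IDF features
        labels = classifier.predict(features)                          # five-class prediction
        return Counter(labels)                                         # counts for the pie chart

    # Example: classify_video("VIDEO_ID_HERE", API_KEY, vectorizer, svm_model)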
IV. RESULTS

The comparative analysis results of the machine learning models Logistic Regression (LR), Support Vector Machine (SVM), SGDClassifier (SGD), and Naive Bayes (NB) using different combinations of TF-IDF feature parameters are shown in TABLE III, and the results of document-to-vector features with Logistic Regression (LR), Support Vector Machine (SVM), and SGDClassifier (SGD) are shown in TABLE IV.

TABLE III. COMPARISON OF MODELS WITH DIFFERENT N-GRAM FEATURES AND TF-IDF VALUES

                                                        Accuracy
N-gram with TF-IDF Norm                              LR       SVM      SGD      NB
word uni-gram with L1 norm                           0.6369   0.7075   0.6565   0.6565
word uni-gram + bi-gram with L1 norm                 0.6330   0.6956   0.6581   0.6690
word uni-gram + bi-gram + tri-gram with L1 norm      0.6338   0.6944   0.6607   0.6734
word uni-gram with L2 norm                           0.6879   0.7298   0.7038   0.6836
word uni-gram + bi-gram with L2 norm                 0.6970   0.7326   0.7179   0.7010
word uni-gram + bi-gram + tri-gram with L2 norm      0.6973   0.7318   0.7174   0.7017

TABLE III shows that the machine learning algorithms perform better on the L2 norm of TF-IDF than on the L1 norm. The best model is the Support Vector Machine, with an accuracy of 73.26% on uni-gram + bi-gram features with L2-normalized TF-IDF; on uni-gram + bi-gram + tri-gram features with the L2 norm it achieves 73.18%. SGDClassifier performs best on uni-gram + bi-gram features with the L2 norm, achieving 71.79% accuracy, and Naive Bayes (NB) performs best on uni-gram + bi-gram + tri-gram features with the L2 norm, achieving 70.17% accuracy.

TABLE IV. COMPARISON OF DOCUMENT-TO-VECTOR FEATURES WITH DIFFERENT MACHINE LEARNING MODELS

Models      Accuracy
LR          0.6142
SVM         0.6142
SGD         0.6069

TABLE IV demonstrates that Logistic Regression (LR) and Support Vector Machine (SVM) perform the same and better than SGDClassifier (SGD) using the document-to-vector features. We obtained 61.42% accuracy for Logistic Regression (LR) and Support Vector Machine (SVM) with these features, and 60.69% accuracy for SGDClassifier (SGD).
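The document-to-vector features behind TABLE IV are not specified in implementation terms; one common way to produce such features is gensim's Doc2Vec, sketched below under assumed settings (vector_size, window, epochs, and the variable names are illustrative, not the authors' configuration).

    # Sketch: document-to-vector features with gensim Doc2Vec (assumed settings).
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # comments: the pre-processed Roman Urdu comments as strings.
    tagged = [TaggedDocument(words=c.split(), tags=[i])
              for i, c in enumerate(comments)]
    d2v = Doc2Vec(tagged, vector_size=100, window=5,
                  min_count=2, epochs=40)                    # illustrative parameters
    doc_features = [d2v.infer_vector(c.split()) for c in comments]
    # doc_features can then be fed to LR, SVM, or SGDClassifier as in TABLE IV.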
Machine learning models perform better with n-gram features and the L2 norm of TF-IDF values on our dataset, so we tuned the machine learning models using 10-fold cross-validation on n-gram features with L2-normalized TF-IDF, as shown in TABLE V.

TABLE V. ACCURACY OF TUNED MODELS ON N-GRAM FEATURES WITH L2-NORMALIZED TF-IDF

                                                        Accuracy
N-gram with TF-IDF Norm                              LR       SVM      SGD      NB
uni-gram + bi-gram with L2 norm                      0.7606   0.7734   0.7179   0.7337
uni-gram + bi-gram + tri-gram with L2 norm           0.7614   0.7745   0.7174   0.7312

TABLE V shows that, after tuning the model parameters, the Support Vector Machine (SVM) performs best on our data. We tuned the Support Vector Machine (SVM) over the kernel and the regularization parameter C with different values. The best Support Vector Machine (SVM) parameters are C = 100 with the rbf kernel on uni-gram + bi-gram + tri-gram features with the L2 norm of TF-IDF values. The precision, recall, and F-score of the Support Vector Machine (SVM) can be seen in TABLE VI.

TABLE VI. FINAL TUNED SUPPORT VECTOR MACHINE (SVM) MODEL SCORES ON TEST DATA

                        Precision   Recall   F-Score
Religious Hate          0.78        0.71     0.74
Violence Promotion      0.69        0.74     0.72
Extremist (Racist)      0.86        0.76     0.81
Threat/Fear             0.83        0.96     0.89
Neutral                 0.72        0.69     0.70

TABLE VI shows that the precision of the Violence Promotion class is 0.69, which means that 31% of the comments the model predicts as Violence Promotion actually belong to the other four classes. The recall for Neutral comments is 0.69, comparatively lower than the recall of the other classes, which means that 31% of neutral comments are misclassified by the model. The recall for Threat/Fear is 0.96, which is better.

We also computed the confusion matrix for the tuned Support Vector Machine (SVM), which can be seen in Fig. 3. A confusion matrix, also known as an error matrix, is a table that visually describes the performance of a supervised classification algorithm. In Fig. 3 we can see that the model makes several misclassifications; improvements can be made in this area to increase the model's score. The final accuracy of the Support Vector Machine (SVM) model on the test data is 77.45%.
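For completeness, a sketch of the tuning and per-class evaluation reported above is given below, using GridSearchCV over C and the kernel and scikit-learn's classification_report and confusion_matrix. X and y are again assumed to be the TF-IDF features and labels; the parameter grid and the train/test split ratio are assumptions rather than the exact experimental settings.

    # Sketch: grid search over the SVM kernel and C, then per-class scores and the
    # confusion matrix on held-out data; grid values and split ratio are assumptions.
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report, confusion_matrix

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    grid = GridSearchCV(SVC(),
                        param_grid={"C": [1, 10, 100], "kernel": ["linear", "rbf"]},
                        cv=10, scoring="accuracy")
    grid.fit(X_train, y_train)                      # best reported here: C=100, kernel="rbf"
    y_pred = grid.best_estimator_.predict(X_test)

    print(classification_report(y_test, y_pred))    # precision/recall/F-score per class
    print(confusion_matrix(y_test, y_pred))         # basis of Fig. 3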