MANDOLA: A Big-Data Processing and Visualization Platform for Monitoring and Detecting Online Hate Speech
In recent years, the increasing propagation of hate speech in online social networks and the need for effec-
tive counter-measures have drawn significant investment from social network companies and researchers.
This has resulted in the development of many web platforms and mobile applications for reporting and mon-
itoring online hate speech incidents. In this article, we present MANDOLA, a big-data processing system
that monitors, detects, visualizes, and reports the spread and penetration of online hate-related speech using
big-data approaches. MANDOLA consists of six individual components that intercommunicate to consume,
process, store, and visualize statistical information regarding hate speech spread online. We also present a
novel ensemble-based classification algorithm for hate speech detection that can significantly improve the
performance of MANDOLA’s ability to detect hate speech. To present the functionality and usability of our
system, we present a use case scenario of real-life event annotation and data correlation. As shown by
the performance of the individual modules, as well as the usability and functionality of the whole system,
MANDOLA is a powerful system for reporting and monitoring online hate speech.
CCS Concepts: • Computing methodologies → Ensemble methods; Natural language processing;
• Social and professional topics → Hate speech; User characteristics; • Information systems → Data
stream mining;
Additional Key Words and Phrases: Hate speech, online social networks, deep learning, system approach,
big-data processing platform
ACM Reference format:
Demetris Paschalides, Dimosthenis Stephanidis, Andreas Andreou, Kalia Orphanou, George Pallis, Marios D.
Dikaiakos, and Evangelos Markatos. 2020. MANDOLA: A Big-Data Processing and Visualization Platform for
Monitoring and Detecting Online Hate Speech. ACM Trans. Internet Technol. 20, 2, Article 11 (March 2020),
21 pages.
https://doi.org/10.1145/3371276
This article was written with financial support of the Rights Equality and Citizenship (REC) programme of the European
Union. The contents of this publication are the sole responsibility of the authors and can in no way be taken to reflect the
views of the European Commission.
Authors’ addresses: D. Paschalides, D. Stephanidis, A. Andreou, K. Orphanou, G. Pallis, M. D. Dikaiakos, Computer
Science Department, University of Cyprus, Nicosia, Cyprus; emails: {dpasch01, dstefa02, aandre28, korfan01, gpallis,
mdd}@cs.ucy.ac.cy; E. Markatos, Department of Computer Science, University of Crete, Heraklion, Crete, Greece; email:
[email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2020 Association for Computing Machinery.
1533-5399/2020/03-ART11 $15.00
https://doi.org/10.1145/3371276
1 INTRODUCTION
Online social networks (OSNs) have revolutionized human communication by providing people
with a medium to freely and instantly share thoughts, opinions, and real-life experiences at a
large scale over the Internet. Unfortunately, certain properties of OSNs, such as their openness,
ease of use, scale, and anonymity, are being exploited by people who use these platforms to spread
hate speech and organize hateful activities. Although it is difficult to get accurate statistics about
the spread of hate speech in OSNs, the picture is becoming increasingly clear: social networking
platforms are alarmingly effective at spreading hate speech, and most users have encountered it
at some point. To make matters worse, hate speech usually targets the most vulnerable groups
within society, such as children, minorities, and immigrants. For example, in the United Kingdom,
there has been a significant increase in hate speech against the migrant and Muslim communi-
ties following recent events, including the Manchester and London terrorist attacks and the Brexit
campaign [56]. A Council of Europe online survey [20] shows a rise in hate speech and related
crimes in Europe, targeting victims because of their religious beliefs, ethnicity, or gender. Simi-
larly, statistics show that in the United States, hate speech and hate crime occurrences have been
growing in recent years [53]. The urgency of this matter has been increasingly recognized, as a
range of international initiatives have been launched toward the characterization of the problem and
the development of counter-measures [24].
Hate speech is defined as the expression of extreme dislike of a person or group of people be-
cause of race, ethnicity, religion, or gender orientation [14, 30]. On the legal side of hate speech,
different laws apply in different countries across the globe. People convicted of us-
ing hate speech can often face large fines and even imprisonment [38]. These laws extend to the
Internet and social media, leading many sites to create their own provisions against hate speech.
Facebook,1 for instance, defines the term hate speech as “direct and serious attacks on any protected
category of people based on their race, ethnicity, national origin, religion, sex, gender, sexual ori-
entation, disability, or disease.” Twitter2 does not provide its own definition but simply forbids users to
“publish or post direct, specific threats of violence against others.” YouTube3 clearly specifies that
it does not permit hate speech, which is defined therein as “speech which attacks or demeans a
group based on race or ethnic origin, religion, disability, gender, age, veteran status, and sexual
orientation/gender identity.” Google4 makes special mention of hate speech in its User Content
and Conduct Policy: “Do not distribute content that promotes hatred or violence towards groups
of people based on their race or ethnic origin, religion, disability, gender, age, veteran status, or
sexual orientation/gender identity.”
Different definitions of hate speech also appear in research works that focus on hate speech
detection via computational methods to guide their detection systems. Silva et al. [54] define hate
speech as “any offense motivated in whole or in a part by the offender’s bias against an aspect of a
group of people.” In the work of Gitari et al. [27], hate speech is defined in three parts: it targets a
group of people and not a single person, it may contain dangerous speech and threats, and it may
encourage violence against the target group.
There are two major difficulties in dealing with online hate speech: (i) lack of reliable data that
can show detailed online hate speech trends and (ii) limited awareness of how to deal with the issue,
since there is a fine line between hate speech, freedom of speech, and the respect of privacy of OSN
1 https://www.facebook.com/communitystandards/Hate speech.
2 https://help.twitter.com/en/rules-and-policies/hateful-conduct-policy.
3 https://support.google.com/youtube/answer/2801939?hl=en.
4 https://www.google.com/intl/en/+/policy/content.html.
users. Furthermore, the boundaries between legal and potentially illegal hate speech are sometimes
blurred and may vary between territories.
Following this motivation, we developed the MANDOLA system, which implements a compre-
hensive big-data approach to monitor, detect, and visualize online hate speech. A prototype imple-
mentation of the system is available online.5 In particular, MANDOLA provides (i) an automated
systematic approach for monitoring, detecting, analyzing, and visualizing online hate speech in a
privacy-preserving manner, and (ii) actionable information to policymakers, which can be used to
drive policies that mitigate the spread of online hate speech.
At the core of MANDOLA lies a hate speech detection module, which implements and com-
bines a variety of automated methods for hate speech detection in OSNs. These methods employ
natural language processing (NLP) techniques, focusing on sentiment analysis to detect highly po-
larized contexts that are often recognized in hate speech incidents. The output of NLP is used as
input to either traditional machine learning (ML) or deep learning (DL) models for the purpose of
classifying content as hate speech or not. To boost the performance of MANDOLA’s hate speech
detection, we introduce a three-layer stacked ensemble classifier that, among other models, also in-
cludes three deep neural network (DNN) models, namely word level, character level, and metadata
level, which are empirically shown to be effective feature extractors for hate speech detection.
The main contributions of this work can be summarized as follows:
• The design, architecture, and implementation of a systematic approach for monitoring, re-
porting, and detecting hate speech that can be easily adapted to big-data classification
analysis of any data source.
• A novel ensemble-based classification algorithm for hate speech detection, incorporated in
MANDOLA’s hate speech detection module. The proposed method uses NLP and sentiment
analysis techniques, and is able to outperform the current state of the art.
• The MANDOLA monitoring dashboard, equipped with functionalities and visualization
tools for a variety of uses, as well as correlations with real-life events.
The remainder of this article is organized as follows. Section 2 presents an overview of the
related work on hate speech detection. Section 3 describes the architecture of MANDOLA and
provides an overview of the system modules. In Section 4, we describe a use case scenario of
real-life event annotation and data correlation to demonstrate the functionality and usability of
the system. Finally, in Section 5, we conclude by outlining the contributions of the MANDOLA
system to hate speech detection and monitoring and stating our future plans to expand
the system.
2 RELATED WORK
The problem of hate speech is globally recognized, with several initiatives that
try to address it. Most of the state-of-the-art works on combating hate speech focus on developing
online platforms and visualization tools for reporting incidents of hate speech to relevant author-
ities and for monitoring and visualizing the spread and penetration of online hate speech. One
of the first developed platforms for monitoring and reporting hate speech is the Umati Kenyan
platform,6 which was launched in October 2012, 6 months before the Kenya general elections. Al-
though the second phase of the Umati platform, Umati II, looked at employing ML and NLP techniques
to collect and detect hate and dangerous speech incidents, it only collects data from the Kenyan
online space. Similarly, the COSMOS platform [9] is a free software tool that allows the collection
5 http://mandola.grid.ucy.ac.cy/dummy.
6 https://ihub.co.ke/ihubresearch/Umati%20Report2015_IntelligentUmatiMonitor.pdf.
of real-time social media posts by specifying keywords directly from Twitter’s streaming applica-
tion programming interface (API). Burnap et al. [9] pre-process and analyze hate speech incidents
from collected data using sentiment analysis, NLP methods, and language and tension detection.
Recently, they also introduced the Hate Lab Dashboard,7 a monitoring dashboard that tracks hate
speech and visualizes different patterns and correlations among hate speech incidents and a hate
crime event. Whereas the Umati and COSMOS platforms provide big-data collection, monitoring,
and visualization of hate speech incidents, MANDOLA additionally employs ML techniques to analyze
and monitor big data for hate or offensive speech classification. Another benefit of MANDOLA, in
contrast to the related platforms, is that it is compliant with the European Union’s General
Data Protection Regulation (GDPR) and processes information in real time and on the fly without
storing any user-specific information.
Other more recent related platforms are the Hatemeter platform8 and another platform that
monitors cyberbullying phenomena proposed in Menini et al. [40]. The Hatemeter platform is
part of an ongoing project and aims to detect anti-Muslim hate speech using ML and NLP tech-
niques, and also provides big-data analytics and visualization for monitoring hate speech incidents.
The platform of Menini et al. comprises two components: (i) an algorithm that automatically detects online com-
munities from geo-referenced pictures extracted from Instagram messages and (ii) a hate speech
classifier. Although both platforms employ ML methods for hate speech classification, in contrast
to MANDOLA, their functionality is tested only on specific hate speech sub-classes: racism and
sexism.
There are also several reporting platforms and websites that allow users to report hate speech
incidents, either for their own analyses or to submit them to the appropriate authorities. Known
examples of such reporting platforms are C.O.N.T.A.C.T.,9 which also provides a live data visualiza-
tion for reported hate speech incidents; the INACH reporting platform;10 INHOPE;11 and Galop.12
Other platforms only provide visualization of hate speech incidents and not any reporting, such
as the eMORE platform,13 which consists of a map depicting online hate speech, as well as info-
graphics regarding statistics on hate speech; the Guardian’s Datablog,14 which provides a visual
representation of hate crimes reported in the United Kingdom between 2011 and 2012; and the
Italian monitoring platform15 [10] for fighting hate against immigrants. The aforementioned ini-
tiatives either visualize hate speech or provide a reporting platform, whereas MANDOLA also
extensively uses ML techniques for hate or offensive speech classification.
The main novelty of the MANDOLA platform compared to the state-of-the-art platforms lies in
the combination of hate speech detection and inference of discussion topics of the tweets identified
as hate speech by employing a combination of ML and DL methods.
7 https://hatelab.net/.
8 http://hatemeter.eu/.
9 http://reportinghate.eu/en/.
10 http://www.inach.net/.
11 https://www.inhope.org.
12 http://www.galop.org.uk/online-report-form/.
13 https://www.emoreproject.eu/.
14 https://www.theguardian.com/news/datablog/interactive/2012/sep/13/hate-crime-map-england-wales.
15 https://mappa.controlodio.it.
2.1 Hate Speech Classification Methods
Existing works on hate speech classification can be divided into three main categories: (i) traditional methods, (ii) DL
methods, and (iii) ensemble methods.
2.1.1 Traditional Methods. Traditional methods rely on manual feature engineering. For the
raw data to be used as input by classifiers, features derived from the data must be designed
and encoded. Effective features for online hate speech detection and similar tasks can be
categorized into textual features, user features, and network features [11]. Textual features con-
sider common and basic data that can be extracted from text with NLP techniques, such as the
number of URLs and hashtags, as well as the average length of words. User features consider data
regarding the author of the post or comment, such as the user’s popularity, her activity, and the
age of her account [11, 22]. Finally, the network features are extracted from the analysis of the
user’s ego network, such as cascading of hate speech in a friendship network [11]. The features
extracted for detecting hate speech can be divided into the following six categories [52]: (i) sim-
ple surface features, such as bag of words (BoW) [69], word and character n-grams [7], word and
sentence length, the existence of hashtags, and the number of capital letters [12, 17, 43]; (ii) word
generalization features, using low-dimensional word vectors learned from clustering algorithms
[59], topic modeling [63, 72], look-up tables [15], and word embeddings [19, 43, 59, 64]; (iii) lin-
guistic features, including part-of-speech (PoS) tags [45] and dependency relations between words
[9, 12, 17, 23, 72]; (iv) sentiment analysis, with the polarity and subjectivity expressed in a message
[9, 17, 23, 58]; (v) lexical resources, often used to match specific words in messages, such as profane,
hateful, and modal words [9, 23, 43, 63]; and (vi) meta-information, referring to user-level data re-
garding the sender of a specific message, such as gender and age, as well as network-level data of
the sender’s ego network in the OSN [16, 60, 61, 63]. Known examples of traditional methods are
logistic regression [9, 19, 60, 61, 63, 64], random forest [63], naive Bayes [9, 39] and support vector
machine (SVM) [17, 29, 39, 59, 63].
2.1.2 DL Methods. DL methods revolve around the recent DL paradigm that is applied to many
different domains with promising results [22, 25, 43, 46, 58, 70]. Such methods automatically learn
representations of their inputs and generate abstract features by letting the data flow through a
set of layers. DL methods for hate speech detection learn abstract feature representations from
the input, either raw or encoded, by passing it through multiple stacked layers of neurons. The
main difference between DL and traditional methods is that the input is not directly used for
classification but rather to derive new abstract feature representations proven to be more effective
for learning. Thus, feature engineering is not an essential process for DL methods, where the main
focus is on the network structure that is designed in such a way as to extract useful features from a
simple input feature representation. Research in hate speech detection is gradually shifting toward
DL methods, which seem to outperform traditional methods [25, 46].
Existing works in hate speech detection using DL methods can be divided into three categories:
(i) text-based methods, (ii) metadata-based methods, and (iii) hybrid methods. Text-based methods
use textual features of text sequences by applying one-hot encoding to the input at word level,
such as word embeddings, or character level [3, 22, 25, 46, 58, 70]. Metadata-based methods use as
input numerical features similar to those used in traditional methods. The difference is that to per-
form the classification, the metadata-based methods generate abstract (high-level) features using
densely connected layers [22, 31, 55]. Hybrid methods utilize both text and metadata, generating
hybrid models and boosting the hate speech classification performance [22, 46, 58, 66]. The most
popular DNN structures used in hate speech detection are convolutional neural networks (CNNs)
[35] and recurrent neural networks (RNNs) [36]. In the literature, the CNN is considered to be
effective in extracting abstract features from the input [26], whereas the RNN works well with
orderly sequence learning problems [44]. Existing works on hate speech detection utilize CNNs
to extract word or character combinations [3, 25, 46] and RNNs to learn word or character depen-
dencies based on their order in the sequence [3, 58]. There are also hybrid approaches using both
the CNN and RNN, showing the benefits of combining both structures in such tasks [70].
2.1.3 Ensemble Methods. Aside from traditional and DL methods, a combination of these meth-
ods has also been used for hate speech detection, creating ensemble classifiers for this task. En-
semble learning is another ML process that is able to improve classification performance through
the strategic combination of different models and classifiers. Ensembles are known to reduce the
risk of selecting the wrong model by aggregating the candidate models, as well as to harness the in-
dividual strengths of the models so that they complement one another. Most of the errors in the learning process of a
model originate from three main factors: variance, noise, and bias. Using ensemble methods,
the stability of the final model increases and these errors are effectively reduced. There are three main
ensemble techniques describing the combination of various models into one more potent model:
(i) bagging [6], which aims to reduce the variance error by generating multiple bootstrap
training sets from the original training set and using each of them to generate a classifier for inclusion
in the ensemble; (ii) boosting [51], which targets the reduction of bias through the incre-
mental building of an ensemble; and (iii) stacking [62], in which a new classifier is used
to correct the errors of a previous classifier, and hence the classifiers are stacked on top of one an-
other. Ensemble models have also shown promising results in the task of hate speech classification
[3, 46, 48].
Most of the traditional ML and DL methods can be incorporated in the hate speech detection
module of the MANDOLA system. In addition to the existing methods, in Section 3.3, we propose
a novel three-layer stacked ensemble classifier.
it for the next component. Next, the data enters the hate speech detection module, where data items
are classified as hate speech or not, as well as annotated based on their discussion topic via the
hate topic inference module. At the end of the process, the system stores the extracted statis-
tical insights in the hate speech metadata storage, which is accessible via a RESTful API for data
querying and visualization on the visualization dashboard. As an alternative to the visualization dash-
board, monitoring can be performed and visualized by the MANDOLA mobile application or other
third-party services.
the classification task. For the task of hate speech classification, the text undergoes the following
cleaning procedure step-by-step (a Python sketch of this pipeline is given after the list):
(1) Remove URLs from the text.
(2) Remove the following characters: | : , ; & ! ? /.
(3) Suppress three or more repeated letters into one (e.g., hellooooo to hello).
(4) Replace slang words and phrases with their actual meaning using dictionaries.17
(5) Normalize hashtags into words, so #refugeesnotwelcome becomes refugees not welcome.
(6) Use lowercase and stemming to reduce word inflections.
(7) Remove user mentions from the text.
(8) Remove stop words from the text.
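To make the procedure concrete, the following is a minimal Python sketch of the preceding steps. It is an illustration rather than MANDOLA's exact implementation: the slang dictionary and stop-word list are tiny stand-ins for the full resources, and the wordsegment and NLTK packages are one possible choice for hashtag segmentation and stemming.

```python
import re

from nltk.stem import PorterStemmer      # one possible stemmer for step (6)
from wordsegment import load, segment    # one possible segmenter for step (5)

load()  # load wordsegment's corpus statistics once
stemmer = PorterStemmer()

# Tiny illustrative stand-ins; MANDOLA uses a full slang dictionary
# (footnote 17) and a standard stop-word list.
SLANG = {"u": "you", "ppl": "people"}
STOP_WORDS = {"a", "an", "and", "are", "be", "from", "is", "my", "the", "to"}

def clean_tweet(text: str) -> str:
    """Apply steps (1)-(8); the order is adjusted slightly for convenience."""
    text = re.sub(r"https?://\S+", " ", text)      # (1) remove URLs
    text = re.sub(r"@\w+", " ", text)              # (7) remove user mentions
    text = re.sub(r"#(\w+)",                       # (5) normalize hashtags
                  lambda m: " ".join(segment(m.group(1))), text)
    text = re.sub(r"[|:,;&!?/]", " ", text)        # (2) remove listed characters
    text = re.sub(r"(\w)\1{2,}", r"\1", text)      # (3) hellooooo -> hello
    tokens = [SLANG.get(t.lower(), t.lower())      # (4) expand slang, (6) lowercase
              for t in text.split()]
    tokens = [stemmer.stem(t) for t in tokens      # (6) stem, (8) drop stop words
              if t not in STOP_WORDS]
    return " ".join(tokens)

print(clean_tweet("Hellooooo @user!! #refugeesnotwelcome https://t.co/x u ppl"))
# -> roughly "hello refuge not welcom you peopl" (stemming truncates suffixes)
```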
Feature selection is another significant part of the data pre-processing process following the
cleaning procedure. In MANDOLA, only textual features are considered, as online hate speech
does not necessarily apply in platforms with a social network structure (e.g., followers). The textual
features used as input into the hate speech detection module are chosen based on several state-of-
the-art works that try to tackle online hate speech [3, 17, 22, 43, 58].
Simple surface features are the most obvious information to utilize for tasks such as hate speech
detection. These features are often reported to be highly predictive, and when combined with other
feature groups, they can further improve performance. Although simple surface features usually
yield good accuracy in hate speech detection, they might result in a data sparsity problem (con-
taining a lot of zeros). For that reason, such classification tasks also apply some form of word
generalization using TF-IDF vectors and word embeddings. Another significant category of fea-
tures for hate speech detection relates to sentiment analysis. MANDOLA makes use of the VADER
model [32] for sentiment analysis. Hate speech and sentiment analysis are closely related, as neg-
ative sentiment is usually present in hate speech messages [9, 17, 23, 58]. Hate speech is considered
an expression of opinion by an individual, and thus a hate speech message tends to have higher
subjectivity scores. Similarly, lexical resource features are specific words regarding the state of
the writer (e.g., angry, familiar with the subject, confident) or the context of the words used. The
presence of offensive or blacklisted words has proven to be helpful in the detection of hate speech.
Linguistic properties of the text also play an important role in hate speech detection. Linguistic
features such as PoS tags and dependency parsing try to capture long-range dependencies between
words, which may not be captured by n-grams [43]. For example, in the phrase Martians are lower-
class pigs, an n-gram model would not be able to connect Martians and pigs, whereas a dependency
parser would generate the tuple are-Martians-pigs, where Martians and pigs are the children of are.
Examples of these categories of features are also displayed in Table 1, based on the work of Schmidt
and Wiegand [52].
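As an illustration of how a few of these feature groups can be computed, the following Python sketch derives simple surface counts and VADER sentiment scores for a message and sets up a TF-IDF vectorizer as one form of word generalization. The feature subset and parameter values are illustrative assumptions, not MANDOLA's exact configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()  # the VADER model [32]

def surface_and_sentiment_features(tweet: str) -> dict:
    """A small, illustrative subset of the feature groups in Table 1."""
    scores = analyzer.polarity_scores(tweet)  # keys: neg, neu, pos, compound
    return {
        "num_words": len(tweet.split()),                  # simple surface
        "num_capitals": sum(c.isupper() for c in tweet),  # simple surface
        "num_hashtags": tweet.count("#"),                 # simple surface
        "sent_compound": scores["compound"],              # overall polarity
        "sent_negative": scores["neg"],                   # negative sentiment
    }

# Word generalization: TF-IDF over word n-grams reduces the sparsity of raw
# bag-of-words features (the n-gram range and vocabulary size are assumed).
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=5000)
```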
17 https://github.com/saurabhhjjain/SlangWordsDetectorCorrector/blob/master/data/slangdict.csv.
Table 1. Features Used in the Training of the Metadata-Based Model, Grouped
into Their Respective Categories
state-of-the-art models in hate speech detection. The architecture is illustrated in Figure 2. Next,
we present an overview of the slave (C1–C6) and master (CM) classifiers used in our ensemble (a
scikit-learn sketch of the second- and third-layer stacking follows the list):
• C1: A hybrid CNN and RNN with a gated recurrent unit (GRU) [13], an adaptation of the
model proposed in Zhang et al. [70]. Such a combination harnesses both the ability of
a CNN to extract word combinations and the ability of an RNN to learn word dependencies. The
intuition of its structure is to extract dependencies between words or phrases as features.
A very descriptive example used by Zhang et al. [70] is the following:
‘These muslim refugees are troublemakers and parasites, they should be
deported from my country.’
Fig. 2. Stacked ensemble classification architecture where circle nodes represent data, rectangles represent
classifiers, and arrows indicate dataflow.
The individual words muslim, refugee, troublemakers, parasites, and deported do not in-
dicate features of hate speech because they can be used in any possible context. The
combination of these words, however, can be more indicative of hate speech, such as mus-
lim refugees and troublemakers. Such dependencies cannot be captured by n-gram-based
features, whereas the combination of CNN and RNN in this model can. The model receives
as input word embeddings of the text, using GloVe embeddings [47].
• C2: A temporal CNN based on Zhang and LeCun [67] and Zhang et al. [68]. This model is
chosen because a CNN does not require knowledge of words, syntax, or semantic struc-
tures and performs equally well with characters. This renders word-based feature extrac-
tors such as word embeddings [41] or look-up tables [15] unnecessary. The input of this
network is a 1-of-m character encoding of the text. Such encoding is done by the use of an
alphabet with m characters. The sequence of the characters is transformed to a sequence of
m-sized vectors with fixed length l. Using the 1-of-m character encoding, the input is quite
sparse with multiple zeros, but the temporal CNN is able to learn the character embeddings
from this encoding without the need of normalization. The alphabet used consists of m = 70
characters, including 26 lowercase English letters, 10 digits, and 33 special characters.
• C3: A DNN that consists of multiple stacked layers trained over a set of descriptive fea-
tures regarding hate speech detection (see Table 1). The set of fully connected layers (dense
layers) forms a bottleneck [55] that has proven to work well in generating abstract feature
representations, supporting the network in learning.
• C4: Three-way classification using one-vs-all logistic regression with L2 regularization over
average word vectors.
• C5: Three-way classification using one-vs-all logistic regression with L2 regularization over
the standard-scaled features described in C3.
• C6: Three-way classification using one-vs-all logistic regression with L1 regularization
trained on the predicted outputs from C1, C2, and C3, respectively (ŷ1, ŷ2, and ŷ3).
Table 2. The Re-Implemented State-of-the-Art Models Used for the
Performance Comparison for Hate Speech Detection
• CM: One-vs-all linear SVM with L2 regularization trained on the prediction outputs from C4,
C5, and C6, respectively (ŷ4, ŷ5, and ŷ6).
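As a rough sketch of how the second and third stacking layers can be realized with scikit-learn, the fragment below assumes that the averaged word vectors (X_word), the standard-scaled metadata features (X_meta), and the concatenated predictions of C1–C3 (Y_dnn) are already available as arrays; in practice, the higher layers should be fit on out-of-fold predictions to avoid leakage.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

def fit_stacked_layers(X_word, X_meta, Y_dnn, y):
    # Second layer: the slaves C4-C6, one-vs-all logistic regression.
    c4 = OneVsRestClassifier(
        LogisticRegression(penalty="l2", max_iter=1000)).fit(X_word, y)
    c5 = OneVsRestClassifier(
        LogisticRegression(penalty="l2", max_iter=1000)).fit(X_meta, y)
    c6 = OneVsRestClassifier(
        LogisticRegression(penalty="l1", solver="liblinear")).fit(Y_dnn, y)

    # Third layer: the master C_M, a one-vs-all linear SVM with L2
    # regularization trained on the stacked predictions of C4-C6.
    Z = np.column_stack([c4.predict_proba(X_word),
                         c5.predict_proba(X_meta),
                         c6.predict_proba(Y_dnn)])
    c_m = LinearSVC(penalty="l2").fit(Z, y)
    return c4, c5, c6, c_m
```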
For the evaluation of the proposed classifier approach, seven state-of-the-art works were se-
lected, as shown in Table 2, to compare their performance to the proposed algorithm. The com-
parison methods include two traditional ML approaches [17, 43], one simple DL approach [70], two
hybrid DL approaches [22, 46], and one ensemble DL approach [3] to compare to all the possible
categories of classifiers presented in Section 2. The selected methods had the best performance
either using only metadata-level features [17, 43], only character-level features [46], or only word-
level features [3, 22, 70]. The methods were re-implemented and trained on the same dataset for
a fair comparison. There are a handful of datasets regarding online hate speech detection used in
the literature [17, 25, 46, 60, 61, 70], with Basile et al. [4] and Zampieri et al. [65] being the most
recent ones. Most of these provide their own version of a hate speech definition and focus on certain hate
topics, such as racism and sexism. The dataset from Davidson et al. [17] was selected to be used in
the MANDOLA hate speech detection module due to the specificity of the hate speech definition
provided, as well as the generalized hate topics and the differentiation between hate speech and
offensive speech, where the latter falls within the freedom of expression. The selected dataset is
divided into three classes: hate, offensive, and non-hate. The tweets do not revolve around a specific
topic (e.g., sexism or racism) but were gathered using a controlled vocabulary of abusive words
from Hatebase.18 The tweets were manually annotated by experts through the Figure Eight19 plat-
form, with each tweet coded by three or more people. The intercoder agreement provided by
Figure Eight is 92%. The final labels of the tweets as hate speech, offensive, or neither were se-
lected based on the majority decision for each tweet. Analysis of the dataset shows that it
is significantly imbalanced: hate tweets account for only 6% of the corpus,
77% are labeled as offensive, and the rest are labeled as non-hate. For this dataset,
hate speech is defined as language that is used to express hatred toward, insult, or humiliate a targeted
group or its members. Offensive language is defined less strictly, as speech that uses profanity but does
not necessarily carry a hateful meaning.
3.3.1 Evaluation of Hate Speech Classifier. All of the aforementioned models were re-
implemented using Python Keras20 with TensorFlow21 [1] as the back end, as well as the scikit-learn22
18 https://hatebase.org/.
19 https://www.figure-eight.com/.
20 https://keras.io/.
21 https://www.tensorflow.org/.
22 https://scikit-learn.org/.
Table 3. Performance Scores of Hate Speech Detection Models, with the p-Value of the Wilcoxon Rank-Sum
Test of the MANDOLA Approach Compared to the Other Works Shown in Parentheses
library. For the DL-based approaches, the training epochs were fixed at 100 with mini-batches of
128. During training, categorical cross-entropy [28] was used as loss function and Adam [34] as
the optimization function. In addition, to control the models’ over-fitting, an early stopping mech-
anism was used. Early stopping is responsible for interrupting the training if the validation loss
does not drop for 10 consecutive epochs. All experiments were run in a stratified 10-fold cross
validation. Stratified k-fold was used as another means to mitigate the effects of the imbalanced dataset. Finally,
all experiments were run on a VM with Ubuntu 16.04, 16 VCPUs, and 32 GB of RAM. To reduce
training time, some models were also trained using Google’s Colab,23 which offers free 12-hour
sessions on a Google Cloud VM with 13 GB of RAM and a Tesla K80 GPU.
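Under the stated regime, the Keras training setup looks roughly like the following; the placeholder model and the validation split are assumed details standing in for any of the re-implemented DNNs and the per-fold data preparation.

```python
from tensorflow import keras

# Placeholder standing in for any of the re-implemented DNN architectures.
model = keras.Sequential([
    keras.layers.Input(shape=(300,)),             # e.g., averaged word vectors
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(3, activation="softmax"),  # hate / offensive / non-hate
])

# Interrupt training if the validation loss does not drop for 10 epochs.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                           restore_best_weights=True)

model.compile(loss="categorical_crossentropy",    # loss function [28]
              optimizer="adam",                   # optimization function [34]
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=100, batch_size=128,
          validation_split=0.1,                   # assumed hold-out fraction
          callbacks=[early_stop])
```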
The evaluation metrics used for the comparison of the performance of the classifiers are F1-
score, accuracy, and area under the curve (AUC). More precisely, due to the imbalanced nature
of the dataset, the micro-averaged adaptations of precision, recall, and F1-score and the balanced
accuracy [8] are used. In addition, we used the Wilcoxon rank-sum test to assess the statistical
significance of our results using a significance level of p = .05. The overall results provided are
displayed in Table 3, which shows that our proposed classifier outperforms almost all other ap-
proaches on all evaluation metrics. The balanced accuracy values of CHASE [70] and HybridCNN
[46] are higher than that of our approach, and the difference is statistically significant (p < .05). However,
the balanced accuracy is defined as the average of the recall obtained on each class, and due to the
imbalanced nature of the data, the real performance of the model is masked by the model’s clas-
sification of the majority class (offensive). In contrast, the F1-score is the harmonic mean of both
precision and recall, and thus it is not affected by the imbalanced nature of the data. Moreover, the
performance of Logistic Regression [43] is very close to that of the MANDOLA approach, if not higher,
but the difference is statistically significant (p < .05).
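The metric computations map directly onto standard library calls; the following sketch assumes per-fold predictions (y_true, y_pred) and per-fold score arrays for the two systems being compared.

```python
from scipy.stats import ranksums  # Wilcoxon rank-sum test
from sklearn.metrics import balanced_accuracy_score, f1_score

micro_f1 = f1_score(y_true, y_pred, average="micro")  # micro-averaged F1
bal_acc = balanced_accuracy_score(y_true, y_pred)     # balanced accuracy [8]

# Statistical significance of MANDOLA vs. a baseline across the 10 folds.
stat, p_value = ranksums(mandola_fold_scores, baseline_fold_scores)
significant = p_value < 0.05
```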
23 https://colab.research.google.com/.
Fig. 4. Displays of the different visualizations available in the MANDOLA dashboard.
based on time, context (hate topic), and location (country/city level) of the data. Additionally, the
user can view specific events in all visualizations to find a possible correlation between any sudden
rises in hate speech activity.
Figure 4(a) depicts the Hate map, a global heat map visualization approach, where the heat
is an aggregated representation of hate speech in a certain location. Similarly to the Hate map,
Figure 4(c) depicts the Hotspot map, another map-based visualization that utilizes the hate-rate
metric to categorize each country into six levels of hate speech activity, from very low to very high.
The hate-rate metric hr_c of a particular country c is computed using the following formula:

$$hr_c = \frac{1}{t_c} \cdot \sum_{i=1}^{n} hs_{c_i}$$
To summarize, the hate rate is the total number of messages marked as hate
speech by the hate speech classifier module (hs_{c_i}), divided by the total number of messages (t_c)
that MANDOLA processed originating from the particular country c. After applying the hate-
rate metric, we normalize the hate rate between 0% and 100%, and then we sort the countries based
on their level of hate rate. Subsequently, we distribute the countries uniformly among the “bins”
(levels of hue gradients) of the Hotspot map.
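A compact sketch of the hate-rate computation and the uniform binning, assuming per-country counts of hate-labeled and total processed messages are available (the helper and its inputs are hypothetical):

```python
def hotspot_levels(hate_counts: dict, total_counts: dict, n_bins: int = 6) -> dict:
    """Hate rate per country, normalized to 0%-100% and split into six levels."""
    rates = {c: hate_counts[c] / total_counts[c] for c in total_counts}
    lo, hi = min(rates.values()), max(rates.values())
    # Normalize between 0 and 100 (assumes at least two distinct rates).
    norm = {c: 100 * (r - lo) / (hi - lo) for c, r in rates.items()}
    ranked = sorted(norm, key=norm.get)  # sort countries by hate rate
    # Distribute the countries uniformly among the hue-gradient bins.
    return {c: int(i * n_bins / len(ranked)) for i, c in enumerate(ranked)}
```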
In both the Hate and Hotspot maps, the user can filter the data rendered by both time and hate
topic. Figure 4(b) displays the Heat table, a context-focused visualization of the daily activity of
hate on the different hate topics. The intensity of the red color indicates the percentage of hate speech expressed on
the corresponding date for the corresponding topic. The darkest red represents the highest
daily activity of hate. A group of other descriptive and all-purpose visualizations can be found in
Figure 4. The time-line chart stands out in the group, presenting the trend of hate speech according
to the hate-rate metric.
3.6.1 Event Annotation Mechanism. An important functionality of the MANDOLA dashboard
is the ability to annotate global events that may correlate to online hate speech bursts. Such hate
speech outbreaks may indicate potential hateful activity, and they can be used by different au-
thorities to monitor incidents and be proactive. This functionality can also be used by journalists
and policymakers to state their position or correlate it to certain events. A user can select the
date of the event to be annotated and is required to complete a New Event form with an event
title, the date of the event, and its duration. The user is also prompted to optionally
add a location of the event so that it is visible in the map-based visualizations. The location can be
added manually by pin-pointing it on a provided map or by adding a related article URL from which it is
extracted automatically via a gazetteer index.28
28 https://clavin.bericotechnologies.com/.
and any related information such as an event title and a date range. For the event to be visible in
the geographical visualizations (Hate map and Hotspot map), it is required to have a geolocation.
For the aforementioned event, the date is October 1, 2017, the event title is Las Vegas Shooting, and
the date range is from October 1, 2017, at 10:05 pm to October 1, 2017, at 10:15 pm. There is also
the functionality of expanding the actual date range for the event to be visible before and after the
actual date. This is helpful for the monitoring and assessment of pre-event and post-event hate
speech traffic. The geolocation of the event can be assigned via three methods: (i) by selecting the
exact location of the event, (ii) by adding the name of the place the event occurred (e.g., Las Vegas
Strip, Paradise, Nevada, U.S.), and (iii) by adding a URL of an event-related news article from which the
geolocation will be automatically extracted.
Once the event annotation is complete, the event is visible for its configured duration.
By loading the Hate map visualization, the hate speech incidents are filtered based
on their observation time. For the purpose of this use case, the data is filtered to ±2 days around
the day the Las Vegas shooting took place. In Figure 4(a), it is visible that hate speech incidents are
aggregated near the event location, as well as in other nearby states of the United States. The Hotspot
map, illustrated in Figure 4(c), can provide a statistical view without, however, placing the event on
the map. By looking at the hate speech activity in the United States, using the same time interval
filtering applied in the Hate map, it is visible that Nevada and nearby states display a high percent-
age of hateful tweets. The Heat table visualization (Figure 4(b)), which displays the daily statistics
for the different hate topics, reports increased hate speech activity related to politics, eth-
nicity, and nationality between October 1 and October 2, 2017. These observations are in
line with the initial disputes that the event caused regarding two main issues: the debate about
gun laws in the United States and the racial dispute over labeling the event as domestic terrorism
because the shooter was not Muslim.
After using the Statistics page to get a better picture of the bursts of hate speech activity, a hate
speech surge is visible, probably caused by the Las Vegas shooting incident (Figure 4(d)). Shortly before
October 2, slightly after the occurrence of the event, there is a burst of hate speech
activity, which is highly correlated with the event. Furthermore, by observing the average hate
rate per category, it is visible that the politics, ethnicity, and nationality hate topics are the most
active ones, which are also related to the event.
5 CONCLUSION
In this article, we presented the MANDOLA Big-Data Processing System, a system for detect-
ing and monitoring online hate speech. The strength of this system is its ability to analyze and
store big data from online sources, using NLP and ML techniques. To the best of our knowledge,
MANDOLA is the first system that provides a systematic and integrated approach for detecting,
monitoring, and visualizing hate speech incidents by collecting real-time data on the fly from
online sources. The evaluation of MANDOLA included the evaluation of a proposed novel hate
speech classification approach that is incorporated into the hate speech classifier module. Con-
sidering the experimental results, the proposed classifier outperforms the current state of the art,
and it can improve the ability of our system to detect hate speech. In general, MANDOLA in
its entirety has been developed as a modular, extendable system for general-purpose monitoring
and visualization and can be easily adapted for other applications where spatio-temporal data are
available. Examples include brand monitoring for different companies and products, the
visualization of traffic accidents and their severity to inform policies and regulations, the use
by governments and other authorities to monitor their citizens’ opinions on public services,
and the analysis of OSN data for upcoming elections.
In the future, we plan to evaluate the performance of the proposed ensemble classifier by ap-
plying it to more recent benchmark datasets for hate speech detection. Example datasets include
HatEval 2019 [4], which targets hate speech against immigrants and women on Twitter,29 and Offens-
Eval@SemEval-2019 [65], a multilingual dataset released in the context of evaluation campaigns in
2018. At this stage, the MANDOLA system does not support a cross-lingual detection set-
ting and does not collect and process data in different languages. However, the platform has been
developed to easily integrate other languages, simply by incorporating classification models
that can be applied to datasets for specific languages. We plan on extending this work using
the multilingual aspect of the aforementioned datasets to expand MANDOLA to collect and analyze data
from different languages.
REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, et al. 2015.
TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Available at http://tensorflow.org (Software
available from tensorflow.org.)
[2] Imran Awan. 2014. Islamophobia and Twitter: A typology of online hate against Muslims on social media. Policy &
Internet 6, 2 (June 2014), 133–150. DOI:https://doi.org/10.1002/1944-2866.POI364
[3] Pinkesh Badjatiya, Shashank Gupta, Manish Gupta, and Vasudeva Varma. 2017. Deep learning for hate speech de-
tection in tweets. In Proceedings of the 26th International Conference on World Wide Web Companion (WWW’17 Com-
panion). ACM, New York, NY. DOI:https://doi.org/10.1145/3041021.3054223
[4] Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco Manuel Rangel Pardo, Paolo
Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual detection of hate speech against immigrants
and women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation. 54–63. https://www.
aclweb.org/anthology/S19-2007.
[5] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning
Research 3 (March 2003), 993–1022. http://dl.acm.org/citation.cfm?id=944919.944937.
[6] L. Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (Aug. 1996), 123–140.
[7] Andrei Z. Broder, Steven C. Glassman, Mark S. Manasse, and Geoffrey Zweig. 1997. Syntactic clustering of the web.
Computer Networks and ISDN Systems 29, 8 (1997), 1157–1166. DOI:https://doi.org/10.1016/S0169-7552(97)00031-7
[8] Kay Henning Brodersen, Cheng Soon Ong, Klaas Enno Stephan, and Joachim M. Buhmann. 2010. The balanced ac-
curacy and its posterior distribution. In Proceedings of the 2010 20th International Conference on Pattern Recognition
(ICPR’10). IEEE, Los Alamitos, CA, 3121–3124. DOI:https://doi.org/10.1109/ICPR.2010.764
[9] Peter Burnap, Omer Rana, Matthew Williams, William Housley, Adam Edwards, Jeffrey Morgan, Luke Sloan, and
Javier Conejero. 2015. COSMOS: Towards an integrated and scalable service for analysing social media on demand.
International Journal of Parallel, Emergent and Distributed Systems 30, 2 (2015), 80–100.
[10] Arthur T. E. Capozzi, Mirko Lai, Valerio Basile, Cataldo Musto, Marco Polignano, Fabio Poletto, Manuela Sanguinetti,
et al. 2019. Computational linguistics against hate: Hate speech detection and visualization on social media in the
“Contro L’Odio” project. In Proceedings of the 6th Italian Conference on Computational Linguistics (CLiC-it’19).
[11] Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn, Emiliano De Cristofaro, Gianluca Stringhini, and Athena
Vakali. 2017. Mean birds: Detecting aggression and bullying on Twitter. arXiv:1702.06877.
[12] Ying Chen, Yilu Zhou, Sencun Zhu, and Heng Xu. 2012. Detecting offensive language in social media to protect adoles-
cent online safety. In Proceedings of the 2012 International Conference on Privacy, Security, Risk, and Trust and the 2012
International Confernece on Social Computing. IEEE, Los Alamitos, CA. DOI:https://doi.org/10.1109/socialcom-passat.
2012.55
[13] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014.
Learning phrase representations using RNN encoder-decoder for statistical machine translation. arxiv:1406.1078.
[14] Raphael Cohen-Almagor. 2015. Viral hate: Containing its spread on the Internet by Abraham H. Foxman and Christo-
pher Wolf. Basingstoke: Palgrave Macmillan, 2013. 256pp., £17.99, ISBN 978 0230342170. Political Studies Review 13,
2 (2015), 281–282. DOI:https://doi.org/10.1111/1478-9302.12087_70
[15] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa. 2011. Natural language processing (al-
most) from scratch. Journal of Machine Learning Research 12 (Nov. 2011), 2493–2537. http://dl.acm.org/citation.cfm?
id=1953048.2078186.
29 https://competitions.codalab.org/competitions/19935.
[16] Maral Dadvar, Dolf Trieschnigg, Roeland Ordelman, and Franciska de Jong. 2013. Improving cyberbullying detection
with user context. In Proceedings of the 35th European Conference on Advances in Information Retrieval (ECIR’13).
693–696. DOI:https://doi.org/10.1007/978-3-642-36973-5_62
[17] Thomas Davidson, Dana Warmsley, Michael W. Macy, and Ingmar Weber. 2017. Automated hate speech detection
and the problem of offensive language. arxiv:1703.04009.
[18] Cong Ding, Yang Chen, and Xiaoming Fu. 2013. Crowd crawling: Towards collaborative data collection for large-scale
online social networks. In Proceedings of the 1st ACM Conference on Online Social Networks (COSN’13). ACM, New
York, NY, 183–188. DOI:https://doi.org/10.1145/2512938.2512958
[19] Nemanja Djuric, Jing Zhou, Robin Morris, Mihajlo Grbovic, Vladan Radosavljevic, and Narayan Bhamidipati. 2015.
Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide
Web (WWW’15 Companion). ACM, New York, NY. DOI:https://doi.org/10.1145/2740908.2742760
[20] EEANews. Countering Hate Speech Online. Retrieved February 15, 2020 from https://eeagrants.org/News/2012/
Countering-hate-speech-online.
[21] H. Efstathiades, D. Antoniades, G. Pallis, and M. D. Dikaiakos. 2016. Distributed large-scale data collection in online
social networks. In Proceedings of the 2016 IEEE 2nd International Conference on Collaboration and Internet Computing
(CIC’16). 373–380. DOI:https://doi.org/10.1109/CIC.2016.056
[22] Antigoni-Maria Founta, Despoina Chatzakou, Nicolas Kourtellis, Jeremy Blackburn, Athena Vakali, and Ilias Leon-
tiadis. 2018. A unified deep learning architecture for abuse detection. arxiv:1802.00385.
[23] D. G. Njagi, Z. Zhang, D. Hanyurwimfura, and J. Long. 2015. A lexicon-based approach for hate speech detection.
International Journal of Multimedia and Ubiquitous Engineering 10, 4 (April 2015), 215–230. DOI:https://doi.org/10.
14257/ijmue.2015.10.4.21
[24] Iginio Gagliardone, Danit Gal, Thiago Alves, and Gabriela Martinez. 2015. Countering Online Hate Speech. Retrieved
February 15, 2020 from https://unesdoc.unesco.org/ark:/48223/pf0000233231.
[25] Björn Gambäck and Utpal Kumar Sikdar. 2017. Using convolutional neural networks to classify hate-speech. In Pro-
ceedings of the 1st Workshop on Abusive Language Online. DOI:https://doi.org/10.18653/v1/w17-3013
[26] Dario Garcia-Gasulla, Ferran Parés, Armand Vilalta, Jonathan Moreno, Eduard Ayguadé, Jesús Labarta, Ulises Cortés,
and Toyotaro Suzumura. 2017. On the behavior of convolutional nets for feature extraction. arxiv:1703.01127.
[27] Njagi Dennis Gitari, Zhang Zuping, Hanyurwimfura Damien, and Jun Long. 2015. A lexicon-based approach for hate
speech detection. International Journal of Multimedia and Ubiquitous Engineering 10, 4 (2015), 215–230.
[28] I. Goodfellow, Y. Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press, Cambridge, MA. http://www.
deeplearningbook.org.
[29] Edel Greevy and Alan F. Smeaton. 2004. Classifying racist texts using a support vector machine. In Proceedings of
the 27th Annual International Conference on Research and Development in Information Retrieval (SIGIR’04). ACM, New
York, NY. DOI:https://doi.org/10.1145/1008992.1009074
[30] Jake Harwood. 2011. Book review: Waltman, M., & Haas, J. (2011). The communication of hate. New York, NY: Peter
Lang. vii + 202 pp. ISBN: 978-1433104473. Journal of Language and Social Psychology 30, 3 (2011), 350–352. DOI:https://
doi.org/10.1177/0261927X11407170
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Deep residual learning for image recognition.
arxiv:1512.03385.
[32] C. J. Hutto and Eric Gilbert. 2015. VADER: A parsimonious rule-based model for sentiment analysis of social media
text. In Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM’14).
[33] Joshua S. White, Jeanna N. Matthews, and John L. Stacy. 2012. Coalmine: An experience in building a system for social
media analytics. In Proceedings Volume 8408: Cyber Sensing 2012. SPIE. DOI:https://doi.org/10.1117/12.918933
[34] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arxiv:1412.6980.
[35] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. 1998. Gradient-based learning applied to document recognition. Pro-
ceedings of the IEEE 86, 11 (Nov. 1998), 2278–2324. DOI:https://doi.org/10.1109/5.726791
[36] Zachary Chase Lipton. 2015. A critical review of recurrent neural networks for sequence learning. arxiv:1506.00019.
[37] Walid Magdy, Kareem Darwish, and Norah Abokhodair. 2015. Quantifying public response towards Islam on Twitter
after Paris attacks. arXiv:1512.04570.
[38] Estelle De Marco. 2017. D2.1b: Definition of Illegal Hatred and Implications. Retrieved February 15, 2020 from http://
www.mandola-project.eu/publications/.
[39] Y. Mehdad and J. Tetreault. 2016. Do characters abuse more than words? In Proceedings of the 17th Annual Meeting of
the Special Interest Group on Discourse and Dialogue. DOI:https://doi.org/10.18653/v1/w16-3638
[40] Stefano Menini, Giovanni Moretti, Michele Corazza, Elena Cabrio, Sara Tonelli, and Serena Villata. 2019. A system to
monitor cyberbullying based on message classification and social network analysis. In Proceedings of the 3rd Workshop
on Abusive Language Online. 105–110. https://www.aclweb.org/anthology/W19-3511.
[41] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in
vector space. arxiv:1301.3781.
[42] Fred Morstatter, Jürgen Pfeffer, Huan Liu, and Kathleen M. Carley. 2013. Is the sample good enough? Comparing data
from Twitter’s streaming API with Twitter’s Firehose. arxiv:1306.5204.
[43] Chikashi Nobata, Joel Tetreault, Achint Thomas, Yashar Mehdad, and Yi Chang. 2016. Abusive language detection in
online user content. In Proceedings of the 25th International Conference on World Wide Web (WWW’16). ACM, New
York, NY. DOI:https://doi.org/10.1145/2872427.2883062
[44] Francisco Ordóñez and Daniel Roggen. 2016. Deep convolutional and LSTM recurrent neural networks for multimodal
wearable activity recognition. Sensors 16, 1 (Jan. 2016), 115. DOI:https://doi.org/10.3390/s16010115
[45] Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. 2013. Im-
proved part-of-speech tagging for online conversational text with word clusters. In Proceedings of the 2013 Conference
of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 380–
390. http://aclweb.org/anthology/N13-1039.
[46] Ji Ho Park and Pascale Fung. 2017. One-step and two-step classification for abusive language detection on Twitter.
In Proceedings of the 1st Workshop on Abusive Language Online.
[47] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation.
In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1532–1543.
http://www.aclweb.org/anthology/D14-1162.
[48] Georgios Pitsilis, Heri Ramampiaro, and Helge Langseth. 2018. Effective hate-speech detection in Twitter data
using recurrent neural networks. Applied Intelligence 48, 12 (Dec. 2018), 4730–4742. DOI:https://doi.org/10.1007/
s10489-018-1242-y
[49] Alan Ritter, Sam Clark, Mausam, and Oren Etzioni. 2011. Named entity recognition in tweets: An experimental study. In Proceedings
of the Conference on Empirical Methods in Natural Language Processing (EMNLP’11). 1524–1534.
[50] Alan Ritter, Mausam, Oren Etzioni, and Sam Clark. 2012. Open domain event extraction from Twitter. In Proceedings
of the 18th ACM International Conference on Knowledge Discovery and Data Mining (KDD’12). 1104–1112.
[51] Robert E. Schapire and Yoav Freund. 2012. Boosting: Foundations and Algorithms. MIT Press, Cambridge, MA.
[52] Anna Schmidt and Michael Wiegand. 2017. A survey on hate speech detection using natural language processing. In
Proceedings of the 5th International Workshop on Natural Language Processing for Social Media. DOI:https://doi.org/
10.18653/v1/w17-1101
[53] Mazin Sidahmed. 2016. Claims of Hate Crimes Possibly Linked to Trump’s Election Reported Across the US. Re-
trieved February 15, 2020 from https://www.theguardian.com/us-news/2016/nov/10/hate-crime-spike-us-donald-
trump-president.
[54] Leandro Araújo Silva, Mainack Mondal, Denzil Correa, Fabrício Benevenuto, and Ingmar Weber. 2016. Analyzing the
targets of hate in online social media. arxiv:1603.07709.
[55] Naftali Tishby and Noga Zaslavsky. 2015. Deep learning and the information bottleneck principle. arxiv:1503.02406.
[56] Alan Travis. 2017. Anti-Muslim Hate Crime Surges After Manchester and London Bridge Attacks. Retrieved Feb-
ruary 15, 2020 from https://www.theguardian.com/society/2017/jun/20/anti-muslim-hate-surges-after-manchester-
and-london-bridge-attacks.
[57] European Union. 2016. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016
on the Protection of Natural Persons with Regard to the Processing of Personal Data and on the Free Movement of
Such Data, and Repealing Directive 95/46/EC (General Data Protection Regulation). Retrieved February 15, 2020 from
http://data.europa.eu/eli/reg/2016/679/oj.
[58] Fabio Del Vigna, Andrea Cimino, Felice Dell’Orletta, Marinella Petrocchi, and Maurizio Tesconi. 2017. Hate me, hate
me not: Hate speech detection on Facebook. In Proceedings of the 1st Italian Conference on Cybersecurity (ITASEC’17).
86–95.
[59] William Warner and Julia Hirschberg. 2012. Detecting hate speech on the World Wide Web. In Proceedings of the 2nd
Workshop on Language in Social Media (LSM’12). 19–26. http://dl.acm.org/citation.cfm?id=2390374.2390377.
[60] Zeerak Waseem. 2016. Are you a racist or am I seeing things? Annotator influence on hate speech detection on
Twitter. In Proceedings of the 1st Workshop on NLP and Computational Social Science. DOI:https://doi.org/10.18653/
v1/w16-5618
[61] Zeerak Waseem and Dirk Hovy. 2016. Hateful symbols or hateful people? Predictive features for hate speech detec-
tion on Twitter. In Proceedings of the NAACL Student Research Workshop. 88–93. http://www.aclweb.org/anthology/
N16-2013.
[62] David H. Wolpert. 1992. Stacked generalization. Neural Networks 5 (1992), 241–259.
[63] Guang Xiang, Bin Fan, Ling Wang, Jason Hong, and Carolyn Rose. 2012. Detecting offensive tweets via topical feature
discovery over a large scale Twitter corpus. In Proceedings of the 21st ACM International Conference on Information
and Knowledge Management (CIKM’12). ACM, New York, NY. DOI:https://doi.org/10.1145/2396761.2398556
[64] Shuhan Yuan, Xintao Wu, and Yang Xiang. 2016. A two phase deep learning model for identifying discrimination from
tweets. In Proceedings of the 19th International Conference on Extending Database Technology (EDBT’16). 696–697.
[65] Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. SemEval-
2019 Task 6: Identifying and categorizing offensive language in social media (OffensEval). In Proceedings of the 13th
International Workshop on Semantic Evaluation. 75–86. DOI:https://doi.org/10.18653/v1/S19-2010
[66] Shiwei Zhang, Xiuzhen Zhang, and Jeffrey Chan. 2017. A word-character convolutional neural network for language-
agnostic Twitter sentiment analysis. In Proceedings of the 22nd Australasian Document Computing Symposium
(ADCS’17). DOI:https://doi.org/10.1145/3166072.3166082
[67] Xiang Zhang and Yann LeCun. 2015. Text understanding from scratch. arxiv:1502.01710.
[68] Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification.
arxiv:1509.01626.
[69] Yin Zhang, Rong Jin, and Zhi-Hua Zhou. 2010. Understanding bag-of-words model: A statistical framework.
International Journal of Machine Learning and Cybernetics 1, 1 (Dec. 2010), 43–52. DOI:https://doi.org/10.1007/
s13042-010-0001-0
[70] Z. Zhang, D. Robinson, and J. Tepper. 2018. Detecting hate speech on Twitter using a convolution-GRU based deep
neural network. In The Semantic Web. Springer International Publishing, 745–760.
[71] W. X. Zhao, J. Jiang, J. Weng, J. He, E. P. Lim, H. Yan, and X. Li. 2011. Comparing Twitter and traditional media using
topic models. In Advances in Information Retrieval, P. Clough, C. Foley, C. Gurrin, G. J. F. Jones, W. Kraaij, H. Lee,
and V. Murdoch (Eds.). Springer, Berlin, Germany, 338–349.
[72] H. Zhong, H. Li, A. Squicciarini, S. Rajtmajer, C. Griffin, D. Miller, and C. Caragea. 2016. Content-driven detection of
cyberbullying on the Instagram social network. In Proceedings of the 25th International Joint Conference on Artificial
Intelligence (IJCAI’16). 3952–3958.