Abusive Language Detection in Online Conversations by Combining Content-And Graph-Based Features
In recent years, online social networks have allowed users worldwide to meet and discuss. As guarantors of these communities, the administrators of these platforms must prevent users from adopting inappropriate behaviors. This verification task, mainly done by humans, is increasingly difficult due to the ever-growing number of messages to check. Methods have been proposed to automate this moderation process, mainly approaches based on the textual content of the exchanged messages. Recent work has also shown that features derived from the structure of conversations, in the form of conversational graphs, can help detect these abusive messages. In this paper, we propose to take advantage of both sources of information by proposing fusion methods integrating content- and graph-based features. Our experiments on raw chat logs show that not only the content of the messages but also their dynamics within a conversation contain partially complementary information, allowing performance improvements on an abusive message classification task, with a final F-measure of 93.26%.

Keywords: automatic abuse detection, content analysis, conversational graph, online conversations, social networks

Edited by: Sabrina Gaito, University of Milan, Italy
Reviewed by: Roberto Interdonato, Territoires, Environnement, Télédétection et Information Spatiale (TETIS), France; Eric A. Leclercq, Université de Bourgogne, France
*Correspondence: Vincent Labatut, [email protected]
Specialty section: This article was submitted to Data Mining and Management, a section of the journal Frontiers in Big Data
Received: 01 April 2019; Accepted: 14 May 2019; Published: 04 June 2019
Citation: Cécillon N, Labatut V, Dufour R and Linarès G (2019) Abusive Language Detection in Online Conversations by Combining Content- and Graph-Based Features. Front. Big Data 2:8. doi: 10.3389/fdata.2019.00008

1. INTRODUCTION

The internet has widely impacted the way we communicate. Online communities, in particular, have grown to become important places for interpersonal communications. They get more and more attention from companies wanting to advertise their products, and from governments interested in monitoring public discourse. Online communities come in various shapes and forms, but they are all exposed to abusive behavior. The definition of what exactly is considered abuse depends on the community, but it generally includes personal attacks, as well as discrimination based on race, religion, or sexual orientation.

Abusive behavior is a risk, as it is likely to make important community members leave, thereby endangering the community, and may even trigger legal issues in some countries. Moderation consists in detecting users who act abusively, and in taking actions against them. Currently, this moderation work is mainly a manual process, and since it implies high human and financial costs, companies have a keen interest in its automation. One way of doing so is to consider this task as a classification problem consisting in automatically determining whether a user message is abusive or not.

A number of works have tackled this problem, or related ones, in the literature. Most of them focus only on the content of the targeted message to detect abuse or similar properties. For instance, Spertus (1997) applies this principle to detect hostility, Dinakar et al. (2011) for cyberbullying, and Chen et al. (2012) for offensive language. These approaches rely on a mix of
standard NLP features and manually crafted application-specific resources (e.g., linguistic rules). We also proposed a content-based method (Papegnies et al., 2017a) using a wide array of language features (Bag-of-Words, tf-idf scores, sentiment scores). Other approaches are more machine-learning intensive, but require larger amounts of data. Recently, Wulczyn et al. (2017) created three datasets containing individual messages collected from Wikipedia discussion pages, annotated for toxicity, personal attacks, and aggression, respectively. They have been leveraged in recent works to train recurrent neural networks operating on word embeddings and character n-gram features (Pavlopoulos et al., 2017; Mishra et al., 2018). However, the quality of these direct content-based approaches is very often tied to the training data used to learn the abuse detection models. In the case of online social networks, the great variety of users, including very different language registers, spelling mistakes, as well as intentional obfuscation by users, makes it almost impossible to obtain models robust enough to be applied in all cases. Hosseini et al. (2017) have shown that it is very easy to bypass automatic toxic comment detection systems by making the abusive content difficult to detect (intentional spelling mistakes, uncommon negatives, etc.).

Because the reactions of other users to an abuse case are completely beyond the abuser's control, some authors consider the content of the messages occurring around the targeted message, instead of focusing only on the targeted message itself. For instance, Yin et al. (2009) use features derived from the sentences neighboring a given message to detect harassment on the Web. Balci and Salah (2015) take advantage of user features such as gender, number of in-game friends, or number of daily logins to detect abuse in the community of an online game.

In our previous work (Papegnies et al., 2019), we proposed a radically different method that completely ignores the textual content of the messages and relies only on a graph-based modeling of the conversation. It is, to our knowledge, the only graph-based approach ignoring the linguistic content proposed in the context of abusive message detection. Our conversational network extraction process is inspired by other works leveraging such graphs for other purposes: modeling interactions in chat logs (Mutton, 2004) or online forums (Forestier et al., 2011), and user group detection (Camtepe et al., 2004). Additional references on abusive message detection and conversational network modeling can be found in Papegnies et al. (2019).

In this paper, based on the assumption that the interactions between users and the content of the exchanged messages convey different information, we propose a new method to perform abuse detection while leveraging both sources. For this purpose, we take advantage of the content-based (Papegnies et al., 2017b) and graph-based (Papegnies et al., 2019) methods that we previously developed. We propose three different ways to combine them, and compare their performance on a corpus of chat logs originating from the community of a French multiplayer online game. We then perform a feature study, identifying the most informative features and discussing their role. Our contribution is twofold: the exploration of fusion methods, and more importantly the identification of discriminative features for this problem.

The rest of this article is organized as follows. In section 2, we describe the methods and strategies used in this work. In section 3, we present our dataset, the experimental setup we use for this classification task, and the performance we obtain. Finally, we summarize our contributions in section 4 and present some perspectives for this work.

2. METHODS

In this section, we summarize the content-based method from Papegnies et al. (2017b) (section 2.1) and the graph-based method from Papegnies et al. (2019) (section 2.2). We then present the fusion methods proposed in this paper, which aim at taking advantage of both sources of information (section 2.3). Figure 1 shows the whole process, and is discussed throughout this section.

2.1. Content-Based Method

This method corresponds to the bottom-left part of Figure 1 (in green). It consists in extracting certain features from the content of each considered message, and training a Support Vector Machine (SVM) classifier to distinguish abusive (Abuse class) from non-abusive (Non-abuse class) messages (Papegnies et al., 2017b). These features are quite standard in Natural Language Processing (NLP), so we only describe them briefly here.

We use a number of morphological features. We use the message length, average word length, and maximal word length, all expressed in number of characters. We count the number of unique characters in the message. We distinguish between six classes of characters (letters, digits, punctuation, spaces, and others) and compute two features for each: number of occurrences, and proportion of the characters in the message. We proceed similarly with capital letters. Abusive messages often contain a lot of copy/paste. To deal with such redundancy, we apply the Lempel–Ziv–Welch (LZW) compression algorithm (Batista and Meira, 2004) to the message and take the ratio of its raw to compressed lengths, expressed in characters. Abusive messages also often contain extra-long words, which can be identified by collapsing the message: extra occurrences of letters repeated more than two times consecutively are removed. For instance, "looooooool" would be collapsed to "lool". We compute the difference between the raw and collapsed message lengths.

We also use language features. We count the numbers of words, unique words, and bad words in the message. For the latter, we use a predefined list of insults and symbols considered abusive, and we also count them in the collapsed message. We compute two overall tf–idf scores corresponding to the sums of the standard tf–idf scores of each individual word in the message: one is computed relative to the Abuse class, and the other to the Non-abuse class. We proceed similarly with the collapsed message. Finally, we lower-case the text and strip punctuation, in order to represent the message as a basic Bag-of-Words (BoW). We then train a Naive Bayes classifier to detect abuse using this sparse binary vector (as represented in the very bottom part of Figure 1). The output of this simple classifier is then used as an input feature for the SVM classifier.
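To make the morphological features more concrete, here is a minimal Python sketch of three of them: message collapsing, the capital-letter ratio, and the LZW-based redundancy ratio. The function names and the exact LZW variant are our own illustrative choices, not the paper's implementation.

```python
import re

def collapse(message):
    # Collapse letters repeated more than twice in a row:
    # "looooooool" -> "lool"
    return re.sub(r"(.)\1{2,}", r"\1\1", message)

def capital_ratio(message):
    # Proportion of upper-case characters among the letters
    letters = [c for c in message if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

def lzw_length(message):
    # Minimal LZW compression; returns the number of output codes,
    # used as a proxy for the compressed length
    table = {chr(i): i for i in range(256)}
    w, n_codes = "", 0
    for c in message:
        if c not in table:          # handle characters outside Latin-1
            table[c] = len(table)
        if w + c in table:
            w += c
        else:
            n_codes += 1
            table[w + c] = len(table)
            w = c
    return n_codes + (1 if w else 0)

def redundancy_ratio(message):
    # Ratio of raw to compressed length: copy/pasted
    # (repetitive) messages score higher
    return len(message) / lzw_length(message) if message else 0.0
```

On a repetitive, copy/pasted message the redundancy ratio is noticeably higher than on non-repetitive text, which is precisely what the feature is meant to capture.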
2.2. Graph-Based Method

This method corresponds to the top-left part of Figure 1 (in red). It completely ignores the content of the messages, and focuses only on the dynamics of the conversation, based on the interactions between its participants (Papegnies et al., 2019). It consists of three steps: (1) extracting a conversational graph based on the considered message as well as the messages preceding and/or following it; (2) computing topological measures of this graph to characterize its structure; and (3) using these values as features to train an SVM to distinguish between abusive and non-abusive messages. The vertices of the graph model the participants of the conversation, whereas its weighted edges represent how intensely they communicate.

The graph extraction is based on a number of concepts illustrated in Figure 2, in which each rectangle represents a message. The extraction process is restricted to a so-called context period, i.e., a sub-sequence of messages including the message of interest, itself called the targeted message and represented in red in Figure 2. Each participant posting at least one message during this period is modeled by a vertex in the produced conversational graph. A mobile window is slid over the whole period, one message at a time. At each step, the network is updated either by creating new links, or by updating the weights of existing ones. This sliding window has a fixed length expressed in number of messages, which is derived from ergonomic constraints relative to the online conversation platform studied in section 3. It allows focusing on a smaller part of the context period. At a given time, the last message of the window (in blue in Figure 2) is called the current message, and its author the current author. The weight update method assumes that the current message is aimed at the authors of the other messages present in the window, and therefore connects the current author to them (or strengthens the corresponding edge weights if the edges already exist). It also takes chronology into account by favoring the most recent authors in the window. Three different variants of the conversational network are extracted for a given targeted message: the Before network is based on the messages posted before the targeted message, the After network on those posted after it, and the Full network on the whole context period.
FIGURE 1 | Representation of our processing pipeline. Existing methods refers to our previous work described in Papegnies et al. (2017b) (content-based method)
and Papegnies et al. (2019) (graph-based method), whereas the contribution presented in this article appears on the right side (fusion strategies). Figure available at
10.6084/m9.figshare.7442273 under CC-BY license.
FIGURE 2 | Illustration of the main concepts used during network extraction (see text for details). Figure available at 10.6084/m9.figshare.7442273 under CC-BY
license.
FIGURE 3 | Example of the three types of conversational networks extracted for a given context period: Before (Left), After (Center), and Full (Right). The author of
the targeted message is represented in red. Figure available at 10.6084/m9.figshare.7442273 under CC-BY license.
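The sliding-window extraction described above can be sketched as follows. This is a simplified illustration: the exact weight-update formula of Papegnies et al. (2019) differs, and the reciprocal-rank increment used here is only an assumption standing in for "favoring the most recent authors."

```python
def build_conversational_graph(messages, window=10):
    """messages: chronological list of (author, text) pairs.
    Returns a dict mapping (current_author, earlier_author) pairs
    to edge weights. Running it on the messages posted before
    (resp. after) the targeted message yields the Before (resp.
    After) network; the whole context period yields the Full one."""
    weights = {}
    for i, (author, _) in enumerate(messages):
        # Messages already present in the window, oldest first
        in_window = messages[max(0, i - window + 1):i]
        # The current message is assumed to address their authors,
        # with more recent authors weighted more heavily
        for rank, (other, _) in enumerate(reversed(in_window), start=1):
            if other != author:
                key = (author, other)
                weights[key] = weights.get(key, 0.0) + 1.0 / rank
    return weights

chat = [("ana", "hi"), ("bob", "yo"), ("cleo", "hey"), ("ana", "??")]
w = build_conversational_graph(chat, window=3)
```

In this toy run, the last message by "ana" links her more strongly to the most recent author ("cleo") than to the older one ("bob").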
Figure 3 shows an example of such networks, obtained for a message of the corpus described in section 3.1.

Once the conversational networks have been extracted, they must be described through numeric values in order to feed the SVM classifier. This is done through a selection of standard topological measures that describe a graph in a number of distinct ways, focusing on different scales and scopes. The scale denotes the nature of the characterized entity: in this work, the individual vertex and the whole graph are considered. When considering a single vertex, the measure focuses on the targeted author (i.e., the author of the targeted message). The scope can be either micro-, meso-, or macroscopic: it corresponds to the amount of information considered by the measure. For instance, the graph density is microscopic, the modularity is mesoscopic, and the diameter is macroscopic. All these measures are computed for each graph, and allow describing the conversation surrounding the message of interest. The SVM is then trained using these values as features. In this work, we use exactly the same measures as in Papegnies et al. (2019).

2.3. Fusion

We now propose a new method seeking to take advantage of both previously described ones. It is based on the assumption that the content- and graph-based features convey different information. Therefore, they could be complementary, and their combination could improve the classification performance. We experiment with three different fusion strategies, which are represented in the right-hand part of Figure 1.

The first strategy follows the principle of Early Fusion. It consists in constituting a global feature set containing all content- and graph-based features from sections 2.1 and 2.2, then training an SVM directly on these features. The rationale here is that the classifier has access to the whole raw data, and must determine which part is relevant to the problem at hand.

The second strategy is Late Fusion, and proceeds in two steps. First, we apply separately both methods described in sections 2.1 and 2.2, in order to obtain two scores corresponding to the output probability of each message being abusive, given by the content- and graph-based methods, respectively. Second, we feed these two scores to a third SVM, trained to determine whether a message is abusive or not. This approach relies on the assumption that these scores contain all the information the final classifier needs, without the noise present in the raw features.

Finally, the third fusion strategy can be considered as Hybrid Fusion, as it seeks to combine both previously proposed ones. We create a feature set containing the content- and graph-based features, as in Early Fusion, but also both scores used in Late Fusion. This whole set is used to train a new SVM. The idea is to check whether the scores fail to convey certain useful information present in the raw features, in which case combining scores and features should lead to better results.

3. EXPERIMENTS

In this section, we first describe our dataset and the experimental protocol (section 3.1). We then present and discuss our results, in terms of classification performance (section 3.2) and feature selection (section 3.3).

3.1. Experimental Protocol

The dataset is the same as in our previous publications (Papegnies et al., 2017b, 2019). It is a proprietary database containing 4,029,343 messages in French, exchanged on the in-game chat of SpaceOrigin1, a Massively Multiplayer Online Role-Playing Game (MMORPG). Among them, 779 have been flagged as abusive by at least one user in the game, and confirmed as such by a human moderator. They constitute what we call the Abuse class. Some inconsistencies in the database prevent us from retrieving the context of certain messages, which we remove from the set. After this cleaning, the Abuse class contains 655 messages. In order to keep a balanced dataset, we further extract the same number of messages at random from those that have not been flagged as abusive. This constitutes our Non-abuse class.

1 https://play.spaceorigin.fr/
Each message, whatever its class, is associated with its surrounding context (i.e., the messages posted in the same thread).

The graph extraction method used to produce the graph-based features requires setting certain parameters. We use the values matching the best performance, obtained during the greedy search of the parameter space performed in Papegnies et al. (2019). In particular, regarding the two most important parameters (see section 2.2), we fix the context period size to 1,350 messages and the sliding window length to 10 messages. Implementation-wise, we use the iGraph library (Csardi and Nepusz, 2006) to extract the conversational networks and process the corresponding features. We use the Sklearn toolkit (Pedregosa et al., 2011) to get the text-based features. We use the SVM classifier implemented in Sklearn under the name SVC (C-Support Vector Classification). Because of the relatively small dataset, we set up our experiments using a 10-fold cross-validation. Each fold is balanced between the Abuse and Non-abuse classes, with 70% of the dataset used for training and 30% for testing.

3.2. Classification Performance

Table 1 presents the Precision, Recall, and F-measure scores obtained on the Abuse class, for both baselines [Content-based (Papegnies et al., 2017b) and Graph-based (Papegnies et al., 2019)] and all three proposed fusion strategies (Early Fusion, Late Fusion, and Hybrid Fusion). It also shows the number of features used to perform the classification, the time required to compute the features and perform the cross-validation (Total Runtime), and the average time required to process one message (Average Runtime). Note that Late Fusion has only 2 direct inputs (the content- and graph-based SVMs), but these in turn have their own inputs, which explains the values displayed in the table.

TABLE 1 | Comparison of the performances obtained with the methods (Content-based, Graph-based, Fusion) and their subsets of Top Features (TF).

Our first observation is that we get higher F-measure values compared to both baselines when performing the fusion, independently of the fusion strategy. This confirms what we expected, i.e., that the information encoded in the interactions between the users differs from the information conveyed by the content of the messages they exchange. Moreover, this shows that both sources are at least partly complementary, since the performance increases when merging them. On a side note, the correlation between the scores of the graph- and content-based classifiers is 0.56, which is consistent with these observations.

Next, when comparing the fusion strategies, it appears that Late Fusion performs better than the others, with an F-measure of 93.26. This is a little surprising: we were expecting superior results from the Early Fusion, which has direct access to a much larger number of raw features (488). By comparison, the Late Fusion only gets 2 features, which are themselves the outputs of two other classifiers. This means that the content-based and graph-based classifiers do a good job of summarizing their inputs, without losing much of the information necessary to perform the classification task efficiently. Moreover, we assume that the Early Fusion classifier struggles to estimate an appropriate model when dealing with such a large number of features, whereas the Late Fusion one benefits from the pre-processing performed by its two predecessors, which act as if reducing the dimensionality of the data. This seems to be confirmed by the results of the Hybrid Fusion, which produces better results than the Early Fusion, but is still below the Late Fusion. This point could be explored by switching to a classification algorithm less sensitive to the number of features. Alternatively, the three SVMs used for the Late Fusion could be seen as a simpler form of a very basic Multilayer Perceptron, in which each neuron has been trained separately (without system-wide backpropagation). This could indicate that using a regular Multilayer Perceptron directly on the raw features could lead to improved results, especially if enough training data is available.

Regarding runtime, the graph-based approach takes more than 8 h to run on the whole corpus, mainly because of the feature computation step. This is due to the number of features, and to the compute-intensive nature of some of them. The content-based approach is much faster, with a total runtime of <1 min, for the exact opposite reasons. Fusion methods require computing both content- and graph-based features, so they have the longest runtimes.

3.3. Feature Study

We now want to identify the most discriminative features for all three fusion strategies. We apply an iterative method based on the Sklearn toolkit, which allows us to fit a linear-kernel SVM to the dataset and obtain a ranking of the input features reflecting their
importance in the classification process. Using this ranking, we identify the least discriminant feature, remove it from the dataset, and train a new model with the remaining features. The impact of this deletion is measured by the performance difference, in terms of F-measure. We reiterate this process until only one feature remains. We call Top Features (TF) the minimal subset of features allowing us to reach 97% of the original performance (i.e., that obtained with the complete feature set).

We apply this process to both baselines and all three fusion strategies. We then perform a classification using only their respective TF. The results are presented in Table 1. Note that the Late Fusion TF performance is obtained using the scores produced by the SVMs trained on the Content-based TF and Graph-based TF. These are also used as features when computing the TF for Hybrid Fusion (together with the raw content- and graph-based features). In terms of classification performance, by construction, the methods are ranked exactly as when considering all available features.

The Top Features obtained for each method are listed in Table 2. The last 4 columns specify which variants of the graph-based features are concerned. Indeed, as explained in section 2.2, most of these topological measures can handle or ignore edge weights and/or edge directions, can be vertex- or graph-focused, and can be computed for each of the three types of networks (Before, After, and Full).

There are three Content-Based TF. The first is the Naive Bayes prediction, which is not surprising as it comes from a fully fledged classifier processing BoWs. The second is the tf-idf score computed over the Abuse class, which shows that considering term frequencies indeed improves the classification performance. The third is the Capital Ratio (proportion of capital letters in the comment), which is likely due to abusive messages tending to be shouted, and therefore written in capitals. The Graph-Based TF are discussed in depth in our previous article (Papegnies et al., 2019). To summarize, the most important features help detect changes in the direct neighborhood of the targeted author (Coreness, Strength), in the average node centrality at the level of the whole graph in terms of distance (Closeness), and in the general reciprocity of exchanges between users (Reciprocity).

We obtain 4 features for the Early Fusion TF. One is the Naive Bayes feature (content-based), and the other three are topological measures (graph-based features). Two of the latter correspond to the Coreness of the targeted author, computed for the Before and After graphs. The third topological measure is his/her Eccentricity. This reflects important changes in the interactions around the targeted author, likely caused by angry users piling up on the abusive user after he has posted some inflammatory remark. For the Hybrid Fusion TF, we also get 4 features, but those include in first place both SVM outputs from the content- and graph-based classifiers. They are completed by 2 graph-based features, including Strength (also found in the Graph-based and Late Fusion TF) and Coreness (also found in the Graph-based, Early Fusion, and Late Fusion TF).

Besides a better understanding of the dataset and classification process, one interesting use of the TF is that they can decrease the computational cost of the classification. In our case, this is true for all methods: we can retain 97% of the performance while using only a handful of features instead of
hundreds. For instance, with the Late Fusion TF, we need only 3% of the total Late Fusion runtime.

4. CONCLUSION AND PERSPECTIVES

In this article, we tackle the problem of automatic abuse detection in online communities. We take advantage of the methods that we previously developed to leverage message content (Papegnies et al., 2017a) and interactions between users (Papegnies et al., 2019), and create a new method using both types of information simultaneously. We show that the features extracted from our content- and graph-based approaches are complementary, and that combining them noticeably improves the results, up to 93.26 (F-measure). One limitation of our method is the computational time required to extract certain features. However, we show that using only a small subset of relevant features dramatically reduces the processing time (down to 3%) while keeping more than 97% of the original performance.

Another limitation of our work is the small size of our dataset. We must find other corpora to test our methods at a much larger scale. However, all the available datasets are composed of isolated messages, whereas we need threads to make the most of our approach. A solution could be to start from datasets such as the Wikipedia-based corpus proposed by Wulczyn et al. (2017), and complete them by reconstructing the original conversations containing the annotated messages. This would also be an opportunity to test our methods on a language other than French. Our content-based method may be impacted by this change, but this should not be the case for the graph-based method, as it is independent from the content (and therefore the language). Besides language, a different online community is likely to behave differently from the one we studied. In particular, its members could react differently to abuse. The Wikipedia dataset would therefore allow assessing how such cultural differences affect our classifiers, and identifying which observations made for SpaceOrigin still apply to Wikipedia.

DATA AVAILABILITY

The datasets for this manuscript are not publicly available because they are proprietary. Requests to access the data should be addressed to the corresponding author, V. Labatut.

AUTHOR CONTRIBUTIONS

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.