Cyberbullying Detection Based on Semantic-Enhanced Marginalized Denoising Auto-Encoder
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2016.2531682, IEEE
Transactions on Affective Computing
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING 1
Abstract—As a side effect of increasingly popular social media, cyberbullying has emerged as a serious problem afflicting children,
adolescents and young adults. Machine learning techniques make automatic detection of bullying messages in social media possible,
and this could help to construct a healthy and safe social media environment. In this meaningful research area, one critical issue is
robust and discriminative numerical representation learning of text messages. In this paper, we propose a new representation learning
method to tackle this problem. Our method named Semantic-Enhanced Marginalized Denoising Auto-Encoder (smSDA) is developed
via semantic extension of the popular deep learning model stacked denoising autoencoder. The semantic extension consists of
semantic dropout noise and sparsity constraints, where the semantic dropout noise is designed based on domain knowledge and the
word embedding technique. Our proposed method is able to exploit the hidden feature structure of bullying information and learn a
robust and discriminative representation of text. Comprehensive experiments on two public cyberbullying corpora (Twitter and
MySpace) are conducted, and the results show that our proposed approaches outperform other baseline text representation learning
methods.
Index Terms—Cyberbullying Detection, Text Mining, Representation Learning, Stacked Denoising Autoencoders, Word Embedding
1 INTRODUCTION
1949-3045 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
data, i.e., data sparsity make the issue more challenging. Firstly, labeling data is labor-intensive and time-consuming. Secondly, cyberbullying is hard to describe and to judge from a third-party view because of its intrinsic ambiguities. Thirdly, owing to the protection of Internet users and privacy issues, only a small portion of messages is left on the Internet, and most bullying posts are deleted. As a result, the trained classifier may not generalize well to test messages that contain nonactivated but discriminative features. The goal of the present study is to develop methods that can learn robust and discriminative representations to tackle the above problems in cyberbullying detection.

Some approaches have been proposed to tackle these problems by incorporating expert knowledge into feature learning. Yin et al. proposed to combine BoW features, sentiment features and contextual features to train a support vector machine for online harassment detection [10]. Dinakar et al. utilized label-specific features to extend the general features, where the label-specific features are learned by Linear Discriminant Analysis [11]. In addition, common sense knowledge was also applied. Nahar et al. presented a weighted TF-IDF scheme that scales bullying-like features by a factor of two [12]. Besides content-based information, Maral et al. proposed to apply users' information, such as gender and message history, and context information as extra features [13], [14]. A major limitation of these approaches is that the learned feature space still relies on the BoW assumption and may not be robust. In addition, the performance of these approaches relies on the quality of hand-crafted features, which require extensive domain knowledge.

In this paper, we investigate one deep learning method named stacked denoising autoencoder (SDA) [15]. SDA stacks several denoising autoencoders and concatenates the output of each layer as the learned representation. Each denoising autoencoder in SDA is trained to recover the input data from a corrupted version of it. The input is corrupted by randomly setting some of the input to zero, which is called dropout noise. This denoising process helps the autoencoders to learn a robust representation. In addition, each autoencoder layer is intended to learn an increasingly abstract representation of the input [16]. In this paper, we develop a new text representation model based on a variant of SDA: marginalized stacked denoising autoencoders (mSDA) [17], which adopts a linear instead of a nonlinear projection to accelerate training and marginalizes over an infinite noise distribution in order to learn more robust representations. We utilize semantic information to expand mSDA and develop Semantic-enhanced Marginalized Stacked Denoising Autoencoders (smSDA). The semantic information consists of bullying words. An automatic extraction of bullying words based on word embeddings is proposed so that the human labor involved can be reduced. During training of smSDA, we attempt to reconstruct bullying features from other normal words by discovering the latent structure, i.e., correlation, between bullying and normal words. The intuition behind this idea is that some bullying messages do not contain bullying words. The correlation information discovered by smSDA helps to reconstruct bullying features from normal words, and this in turn facilitates detection of bullying messages that do not contain bullying words. For example, there is a strong correlation between the bullying word fuck and the normal word off, since they often occur together. If a bullying message does not contain such obvious bullying features, for example because fuck is misspelled as fck, the correlation may help to reconstruct the bullying features from normal ones so that the bullying message can be detected. It should be noted that introducing dropout noise has the effect of enlarging the size of the dataset, including the training data size, which helps alleviate the data sparsity problem. In addition, L1 regularization of the projection matrix is added to the objective function of each autoencoder layer in our model to enforce the sparsity of the projection matrix, and this in turn facilitates the discovery of the most relevant terms for reconstructing bullying terms. The main contributions of our work can be summarized as follows:

* Our proposed Semantic-enhanced Marginalized Stacked Denoising Autoencoder is able to learn robust features from the BoW representation in an efficient and effective way. These robust features are learned by reconstructing the original input from corrupted (i.e., missing) ones. The new feature space can improve the performance of cyberbullying detection even with a small labeled training corpus.
* Semantic information is incorporated into the reconstruction process by designing semantic dropout noise and imposing sparsity constraints on the mapping matrix. In our framework, high-quality semantic information, i.e., bullying words, can be extracted automatically through word embeddings. Finally, these specialized modifications make the new feature space more discriminative, and this in turn facilitates bullying detection.
* Comprehensive experiments on real data sets have verified the performance of our proposed model.

This paper is organized as follows. In Section 2, some related work is introduced. The proposed Semantic-enhanced Marginalized Stacked Denoising Auto-encoder for cyberbullying detection is presented in Section 3. In Section 4, experimental results on several collections of cyberbullying data are illustrated. Finally, concluding remarks are provided in Section 5.

2 RELATED WORK

This work aims to learn a robust and discriminative text representation for cyberbullying detection. Text representation and automatic cyberbullying detection are both related to our work. In the following, we briefly review the previous work in these two areas.

2.1 Text Representation Learning

In text mining, information retrieval and natural language processing, effective numerical representation of linguistic units is a key issue. The Bag-of-words (BoW) model is the most classical text representation and the cornerstone of some state-of-the-art models including Latent Semantic Analysis (LSA) [18] and topic models [19], [20]. The BoW model represents a document in a textual corpus using a vector of real numbers indicating the occurrence of words in the document. Although the BoW model has proven to be efficient
and effective, the representation is often very sparse. To address this problem, LSA applies Singular Value Decomposition (SVD) to the word-document matrix of the BoW model to derive a low-rank approximation. Each new feature is a linear combination of all original features, which alleviates the sparsity problem. Topic models, including Probabilistic Latent Semantic Analysis [21] and Latent Dirichlet Allocation [20], have also been proposed. The basic idea behind topic models is that word choice in a document is influenced probabilistically by the topic of the document. Topic models try to define the generation process of each word occurring in a document.

Similar to the aforementioned approaches, our proposed approach takes the BoW representation as the input. However, our approach has some distinct merits. First, the multiple layers and non-linearity of our model ensure a deep learning architecture for text representation, which has been proven to be effective for learning high-level features [22]. Second, the applied dropout noise can make the learned representation more robust. Third, specific to cyberbullying detection, our method employs semantic information, including bullying words and a sparsity constraint imposed on the mapping matrix in each layer, and this in turn produces a more discriminative representation.

2.2 Cyberbullying Detection

With the increasing popularity of social media in recent years, cyberbullying has emerged as a serious problem afflicting children and young adults. Previous studies of cyberbullying focused on extensive surveys and its psychological effects on victims, and were mainly conducted by social scientists and psychologists [6], [23], [24], [25]. Although these efforts facilitate our understanding of cyberbullying, the psychological science approach based on personal surveys is very time-consuming and may not be suitable for automatic detection of cyberbullying. Since machine learning has gained popularity in recent years, the computational study of cyberbullying has attracted the interest of researchers. Several research areas, including topic detection and affective analysis, are closely related to cyberbullying detection. Owing to their efforts, automatic cyberbullying detection is becoming possible. In machine learning-based cyberbullying detection, there are two issues: 1) text representation learning to transform each post/message into a numerical vector and 2) classifier training. Xu et al. presented several off-the-shelf NLP solutions, including BoW models, LSA and LDA, for representation learning to capture bullying signals in social media [8]. As an introductory work, they did not develop specialized models for cyberbullying detection. Yin et al. proposed to combine BoW features, sentiment features and contextual features to train a classifier for detecting possible harassing posts [10]. The introduction of the sentiment and contextual features has been proven to be effective. Dinakar et al. used Linear Discriminant Analysis to learn label-specific features and combined them with BoW features to train a classifier [11]. The performance of label-specific features largely depends on the size of the training corpus. In addition, they need to construct a bullyspace knowledge base to boost the performance of natural language processing methods. Although the incorporation of a knowledge base can achieve a performance improvement, the construction of a complete and general one is labor-consuming. Nahar et al. proposed to scale bullying words by a factor of two in the original BoW features [12]. The motivation behind this work is quite similar to that of our model, namely to enhance bullying features. However, the scaling operation in [12] is quite arbitrary. Ptaszynski et al. searched sophisticated patterns in a brute-force way [26]. The weights for each extracted pattern need to be calculated based on an annotated training corpus, and thus the performance may not be guaranteed if the training corpus has a limited size. Besides content-based information, Maral et al. also employ users' information, such as gender and message history, and context information as extra features [13], [14]. Huang et al. also considered social network features to learn the features for cyberbullying detection [9]. The shared deficiency among these aforementioned approaches is that the constructed text features still come from the BoW representation, which has been criticized for its inherent over-sparsity and failure to capture semantic structure [18], [19], [20]. Different from these approaches, our proposed model can learn robust features by reconstructing the original data from corrupted data, and it introduces semantic corruption noise and a sparse mapping matrix to explore the feature structures that are predictive of the existence of bullying, so that the learned representation can be discriminative.

3 SEMANTIC-ENHANCED MARGINALIZED STACKED DENOISING AUTO-ENCODER

We first introduce the notation used in our paper. Let $D = \{w_1, \ldots, w_d\}$ be the dictionary covering all the words existing in the text corpus. We represent each message using a BoW vector $x \in \mathbb{R}^d$. Then, the whole corpus can be denoted as a matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $n$ is the number of available posts.

We next briefly review the marginalized stacked denoising auto-encoder and present our proposed Semantic-enhanced Marginalized Stacked Denoising Auto-Encoder.

3.1 Marginalized Stacked Denoising Auto-encoder

Chen et al. proposed a modified version of the Stacked Denoising Auto-encoder that employs a linear instead of a nonlinear projection so as to obtain a closed-form solution [17]. The basic idea behind the denoising auto-encoder is to reconstruct the original input from corrupted versions $\tilde{x}_1, \ldots, \tilde{x}_n$, with the goal of obtaining a robust representation.

Marginalized Denoising Auto-encoder: In this model, the denoising auto-encoder attempts to reconstruct the original data from the corrupted data via a linear projection. The projection matrix can be learned as:

$$ W = \operatorname*{argmin}_{W} \frac{1}{2n} \sum_{i=1}^{n} \left\| x_i - W \tilde{x}_i \right\|^2 \tag{1} $$

where $W \in \mathbb{R}^{d \times d}$. For simplicity, we can write Eq. (1) in matrix form as:

$$ W = \operatorname*{argmin}_{W} \frac{1}{2n} \operatorname{tr}\left[ (X - W\tilde{X})^T (X - W\tilde{X}) \right] \tag{2} $$
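As a brief check of the step from Eq. (1) to Eq. (2), the following worked identity may help. It assumes only that the corrupted inputs are stacked as the columns of $\tilde{X}$, so that column $i$ of $X - W\tilde{X}$ is $x_i - W\tilde{x}_i$; the Frobenius-norm notation $\|\cdot\|_F$ is introduced here for illustration and is not used elsewhere in the text.

```latex
\begin{aligned}
\frac{1}{2n} \sum_{i=1}^{n} \left\| x_i - W \tilde{x}_i \right\|^2
  &= \frac{1}{2n} \left\| X - W \tilde{X} \right\|_F^2 \\
  &= \frac{1}{2n} \operatorname{tr}\left[ (X - W\tilde{X})^T (X - W\tilde{X}) \right]
\end{aligned}
```

The first equality holds because the squared Frobenius norm of a matrix is the sum of the squared norms of its columns; the second is the standard identity $\|M\|_F^2 = \operatorname{tr}(M^T M)$.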
Here, $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]$ is the corrupted version of $X$. It is easily shown that Eq. (2) is an ordinary least squares problem having a closed-form solution:

$$ W = P Q^{-1} \tag{3} $$

where $P = X \tilde{X}^T$ and $Q = \tilde{X} \tilde{X}^T$. In fact, this corruption can be marginalized over the noise distribution [17]. The more corruptions we take in the denoising auto-encoder, the more robust the learned transformation. Therefore, the best choice is to use infinitely many versions of the corrupted data. If the data corpus is corrupted infinitely many times, the matrices $P$ and $Q$ converge to their corresponding expectations, and Eq. (3) can be formulated as:

$$ W = E[P] \, E[Q]^{-1} \tag{4} $$

where $E[P] = \sum_{i=1}^{n} E\left[ x_i \tilde{x}_i^T \right]$ and $E[Q] = \sum_{i=1}^{n} E\left[ \tilde{x}_i \tilde{x}_i^T \right]$. These expected matrices can be computed based on the noise distribution. In [17], dropout noise is adopted to corrupt data samples by setting a feature to zero with a probability $p$. Assuming the scatter matrix of the original data samples is denoted as $S = X X^T$, the expected matrices can be computed as:

$$ E[Q]_{i,j} = \begin{cases} (1 - p)^2 S_{i,j} & \text{if } i \neq j, \\ (1 - p) S_{i,j} & \text{if } i = j, \end{cases} \tag{5} $$

and

$$ E[P]_{i,j} = (1 - p) S_{i,j} \tag{6} $$

where $i$ and $j$ denote the indices of features. It can be seen that it is very efficient to compute $W$ by marginalizing dropout noise in the denoising auto-encoder. After the mapping weights $W$ are computed, a nonlinear squashing function, such as a hyperbolic tangent function, can be applied to derive the output of the marginalized denoising auto-encoder:

$$ H = \tanh(W X) \tag{7} $$

Stacking Structure: Chen et al. [17] also proposed to apply a stacking structure to the marginalized denoising autoencoder, in which the output of the $(k-1)$th layer is fed as the input into the $k$th layer. If we define the output of the $k$th mDA as $H_k$ and the original input as $H_0$, the mapping between two consecutive layers is given as:

$$ H_k = \tanh(W_k H_{k-1}) \tag{8} $$

where $W_k$ denotes the mapping in the $k$th layer. The model training can be done greedily, layer by layer. This means that the mapping weights $W_k$ are learned in closed form to reconstruct the output of the $(k-1)$th mDA layer from its marginalized corruptions, as shown in Eq. (4). If the number of layers is set to $L$, the final representation for input data $X$ is the concatenation of the uncorrupted original input and the outputs of all layers, as follows:

$$ Z = \begin{bmatrix} X \\ H_1 \\ \vdots \\ H_L \end{bmatrix} \tag{9} $$

where $Z \in \mathbb{R}^{d(L+1) \times n}$. Each column of $Z$ represents the final representation of an individual data sample.

3.2 Semantic Enhancement for mSDA

The advantage of corrupting the original input in mSDA can be explained by feature co-occurrence statistics. The co-occurrence information is able to derive a robust feature representation under an unsupervised learning framework, and this also motivates other state-of-the-art text feature learning methods such as Latent Semantic Analysis and topic models [18], [20]. As shown in Figure 1(a), a denoising auto-encoder is trained to reconstruct the removed feature values from the remaining uncorrupted ones. Thus, the learned mapping matrix $W$ is able to capture the correlation between these removed features and other features. It is shown that the learned representation is robust and can be regarded as a high-level concept feature, since the correlation information is invariant to domain-specific vocabularies. We next describe how to extend mSDA for cyberbullying detection. The major modifications include semantic dropout noise and sparse mapping constraints.

3.2.1 Semantic Dropout Noise

The dropout noise adopted in mSDA follows a uniform distribution, where each feature has the same probability of being removed. In cyberbullying detection, most bullying posts contain bullying words such as profanity and foul language. These bullying words are very predictive of the existence of cyberbullying. However, a direct use of these bullying features may not achieve good performance, because these words only account for a small portion of the whole vocabulary and these vulgar words are only one kind of discriminative feature for bullying [10], [26]. Alternatively, we can exploit these cyberbullying words by using a different dropout noise, in which features corresponding to bullying words have a larger probability of corruption than other features. The large probability imposed on bullying words emphasizes the correlation between bullying features and normal ones. This kind of dropout noise is referred to as semantic dropout noise, because semantic information is used to design the dropout structure.

As shown in Figure 1(b), the correlation between features can enable other normal words to predict bullying labels. Consider a simple but intuitive example, "Leave him alone, he is just a chink" (footnote 1), which is obviously a bullying message. [Footnote 1: "Chink (also chinki, chinky, chinkie) is an English ethnic slur usually referring to a person of Chinese or East Asian ethnicity", from Wikipedia.] However, the classifier will set the weight of the discriminative word "chink" to zero if the small training corpus does not cover it. Our proposed smSDA can deal with this problem by learning a robust feature representation, which is a high-level concept representation. In the learned representation, the word "chink" is reconstructed by context words co-occurring with it, and the context words may be shared by other bullying words contained in the training corpus. Therefore, the correlation explored by this auto-encoder structure enables the subsequent classifier to learn the discriminative word and improve the classification performance. In addition, the semantic dropout noise exploits the correlation between
bullying features and normal features better and hence facilitates cyberbullying detection.

Due to the introduced semantic dropout noise, the expected matrices $E[P]$ and $E[Q]$ will be computed slightly differently from Eqs. (5) and (6). Assuming we have an available bullying word list and the corresponding feature set $Z_b$, the semantic dropout noise can be described by the following probability density function (PDF):

$$ \mathrm{PDF} = \begin{cases} p(\tilde{x}_d = 0) = p_n & \text{if } d \notin Z_b, \\ p(\tilde{x}_d = x_d) = 1 - p_n & \text{if } d \notin Z_b, \\ p(\tilde{x}_d = 0) = p_b & \text{if } d \in Z_b, \\ p(\tilde{x}_d = x_d) = 1 - p_b & \text{if } d \in Z_b, \end{cases} \tag{10} $$

3.2.2 Sparsity Constraints

In mSDA, the mapping matrix $W$ is learned to reconstruct removed features from other uncorrupted features and hence is able to capture the feature correlation information. Here, we impose sparsity constraints on the mapping weights $W$ so that each row has a small number of nonzero elements. This sparsity constraint is quite intuitive, because one word is only related to a small portion of the vocabulary instead of the whole vocabulary. In our proposed smSDA, the sparsity constraint is realized by incorporating an L1 regularization term into the objective function, as in the lasso problem [27]. The optimization function for each layer in smSDA is given as follows:
[Figure 1, panels (a) and (b). Labels in the figure: original input x, mapping matrix W, corrupted input x̃; bullying words (sparse, but predictive), context words (mass features, also predictive of bullying labels), correlation explored.]
Fig. 1. Illustration of Motivations behind smSDA. In Figure 1(a), the cross symbol denotes that its corresponding feature is corrupted, i.e., turned
off.
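To make the effect of the semantic dropout noise in Eq. (10) concrete, the following minimal sketch computes the two expected matrices when each feature has its own dropout probability. It is derived here directly from Eq. (10) under the assumption that features are corrupted independently, and it is not claimed to match the exact form of the paper's own expressions for $E[P]$ and $E[Q]$. The function names `expected_matrices` and `semantic_dropout_probs` and the toy scatter matrix are hypothetical. With a single shared probability $p$, the result reduces to Eqs. (5) and (6).

```python
def expected_matrices(S, p):
    """Return (E[P], E[Q]) for scatter matrix S = X X^T (list of lists)
    and per-feature dropout probabilities p (assumed independent)."""
    d = len(S)
    q = [1.0 - p_i for p_i in p]  # survival probability of each feature
    # E[P]_{i,j}: x_i is never corrupted, x~_j survives with probability q_j.
    EP = [[S[i][j] * q[j] for j in range(d)] for i in range(d)]
    # E[Q]_{i,j}: both features must survive; on the diagonal it is one event.
    EQ = [[S[i][j] * (q[i] if i == j else q[i] * q[j]) for j in range(d)]
          for i in range(d)]
    return EP, EQ


def semantic_dropout_probs(d, bullying_idx, p_b, p_n):
    """Assign dropout probability p_b to features in Z_b and p_n to the rest."""
    return [p_b if i in bullying_idx else p_n for i in range(d)]


# Toy example: two features, feature 0 plays the role of a bullying word.
S = [[2.0, 1.0], [1.0, 3.0]]
EP_uniform, EQ_uniform = expected_matrices(S, [0.5, 0.5])  # Eqs. (5), (6)
p = semantic_dropout_probs(2, {0}, p_b=0.8, p_n=0.5)
EP_sem, EQ_sem = expected_matrices(S, p)
```

In this toy case the bullying feature is dropped more often, so the entries of $E[Q]$ that involve it shrink relative to the uniform case, which is one way to see how the larger corruption probability shifts weight toward reconstructing bullying features from context features.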
role and should be chosen properly. In the following, the steps for constructing the bullying feature set $Z_b$ are given, in which the first layer and the other layers are addressed separately. For the first layer, expert knowledge and word embeddings are used. For the other layers, discriminative feature selection is conducted.

Layer One: first, we build a list of words with negative affect, including swear words and dirty words. Then, we compare the word list with the BoW features of our own corpus, and regard the intersection as bullying features. However, it is possible that expert knowledge is limited and does not reflect the current usage and style of cyberlanguage. Therefore, we expand the list of pre-defined insulting words, i.e., insulting seeds, based on word embeddings as follows:

Word embeddings use real-valued, low-dimensional vectors to represent the semantics of words [32], [33]. Well-trained word embeddings lie in a vector space where similar words are placed close to each other. In addition, the cosine similarity between word embeddings is able to quantify the semantic similarity between words. Considering that Internet messages are our corpus of interest, we utilize a well-trained word2vec model trained on a large-scale Twitter corpus containing 400 million tweets [34]. A visualization of some word embeddings after dimensionality reduction (PCA) is shown in Figure 2. It is observed that curse words form distinct clusters, which are also far away from normal words. Even insulting words are located in different regions due to different word usages and insulting expressions. In addition, since the word embeddings adopted here are trained on a large-scale corpus from Twitter, the similarity captured by the word embeddings can represent the specific language pattern. For example, the embedding of the misspelled word fck is close to the embedding of fuck, so that the word fck can be automatically extracted based on word embeddings.

Fig. 2. Two dimensional visualization of our used word embeddings via PCA. Displayed terms include both bullying ones and normal ones. It shows that similar words are nearby vectors.

We extend the pre-defined insulting seeds based on word embeddings. For each insulting seed, similar words are extracted if their cosine similarities with the insulting seed exceed a predefined threshold. For a bigram $w_l w_r$, we simply use an additive model to derive the corresponding embedding as follows:

$$ v(w_l w_r) = v(w_l) + v(w_r) \tag{21} $$

Finally, the constructed bullying features are used to train the first layer in our proposed smSDA. The set includes two parts: one is the original insulting seeds based on domain knowledge, and the other is the extended bullying words obtained via word embeddings. The length of $Z_b$ is $k$.

Subsequent Layers: we perform feature selection using the Fisher score to select "bullying" features. The Fisher score is a univariate metric reflecting the discriminative power of a feature [35], [36]. For the $r$th feature, the corresponding Fisher score can be computed based on training data with labels:

$$ F_r = \frac{\sum_{i=1}^{c} n_i (\mu_i - \mu)^2}{\sum_{i=1}^{c} n_i \sigma_i^2} \tag{22} $$
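As a hedged illustration of Eq. (22), the sketch below computes the Fisher score of a single feature from its values and class labels, and then ranks features to keep the top $k$. Here $n_i$ is the number of samples in class $i$, $\mu_i$ the class mean, $\mu$ the overall mean, and $\sigma_i^2$ is taken to be the population variance of class $i$ (an assumption; whether a population or sample variance is intended is not stated). The names `fisher_score` and `top_k_features`, the toy data, and the handling of a zero denominator are choices made for this sketch only.

```python
from collections import defaultdict


def fisher_score(values, labels):
    """Fisher score of one feature, following the form of Eq. (22)."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    mu = sum(values) / len(values)  # mean over the entire data
    between = 0.0  # numerator: sum_i n_i (mu_i - mu)^2
    within = 0.0   # denominator: sum_i n_i sigma_i^2
    for members in groups.values():
        n_i = len(members)
        mu_i = sum(members) / n_i
        var_i = sum((v - mu_i) ** 2 for v in members) / n_i  # population variance
        between += n_i * (mu_i - mu) ** 2
        within += n_i * var_i
    if within == 0.0:
        # Sketch-specific convention for features that are constant within classes.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def top_k_features(columns, labels, k):
    """Indices of the k features (one value list per feature) with the highest scores."""
    scores = [fisher_score(col, labels) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda r: scores[r], reverse=True)
    return ranked[:k]


# Toy data: feature 0 separates the two classes, feature 1 does not.
labels = [0, 0, 1, 1]
columns = [[0.0, 2.0, 4.0, 6.0], [1.0, 2.0, 1.0, 2.0]]
selected = top_k_features(columns, labels, k=1)
```

In the toy data, the first feature has well-separated class means and small within-class variance, so it receives the higher score and is the one selected.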
In Eq. (22), $c$ denotes the number of classes and $n_i$ represents the number of data samples in class $i$; $\mu$ and $\mu_i$ denote the means of the entire data and of class $i$ for the $r$th feature, and $\sigma_i^2$ is the variance of class $i$ on the $r$th feature. After the Fisher scores are estimated, the features with the top $k$ scores are selected as "bullying" features, where "bullying" is generalized as discriminative.

3.4 smSDA for Cyberbullying Detection

In Section 3.3, we proposed the Semantic-enhanced Marginalized Stacked Denoising Auto-encoder (smSDA). In this subsection, we describe how to leverage it for cyberbullying detection. smSDA provides robust and discriminative representations. The learned numerical representations can then be fed into a Support Vector Machine (SVM). In the new space, due to the captured feature correlation and semantic information, the SVM, even when trained on a small training corpus, is able to achieve good performance on testing documents (this will be verified in the following experiments). The detailed steps of our model are provided below:

Assume that the first $n_l$ posts are labeled and the corresponding vector of binary labels is $y = \{y_1, \ldots, y_{n_l}\}$. The binary label 1 or 0 indicates that the post is or is not a cyberbullying one. Here, $n_l \ll n$, which means that the set of labeled posts is small. The bullying feature set $Z_b$ is constructed in a layer-wise way. Based on prior knowledge, we construct a pre-defined bullying wordlist and compare it with the original vocabulary of the whole corpus $X$. The words appearing in both the vocabulary and the bullying wordlist are selected as insulting seeds. The insulting seeds are then expanded and refined automatically via word embeddings, which defines the bullying features $Z_b$ for layer one. The experiments in Section 4 will show that the construction of the set $Z_b$ is very simple and efficient, with little human labor. For the subsequent layers, after obtaining the output of each layer, the set $Z_b$ is updated using feature ranking with the Fisher score according to Eq. (22).

Based on the predefined dropout probabilities $p_b$ and $p_n$ for bullying features and other normal features, and on the bullying feature set $Z_b$, we compute the two expected matrices $E[P]$ and $E[Q]$ according to Eqs. (12) and (11) if the semantic dropout noise is adopted. For the unbiased semantic dropout noise, Eqs. (14) and (15) are used instead of Eqs. (12) and (11) to compute these two expected matrices. Then, we iteratively perform Eq. (21) $T_{max}$ times, where the initial value for $W$ is calculated based on Eq. (20). When the mapping matrix is learned, the output of each layer is given according to Eq. (8). Due to the stacking structure, the outputs of the $L$ layers and the initial input are concatenated to form the final representation $Z \in \mathbb{R}^{d(L+1) \times n}$ following Eq. (9). It is clear that the new space has a dimension of $(L + 1)d$. A linear SVM [37] is trained on the training corpus, i.e., the first $n_l$ columns of $Z$, and tested on the remaining data samples.

3.5 Merits of smSDA

Some important merits of our proposed approach are summarized as follows:

1) [...] data and features, the classifier may not be trained very well. The stacked denoising autoencoder (SDA), as an unsupervised representation learning method, is able to learn a robust feature space. In SDA, the feature correlation is explored by the reconstruction of corrupted data. The learned robust feature representation can then boost the training of the classifier and finally improve the classification accuracy. In addition, the corruption of data in SDA actually generates artificial data to expand the data size, which alleviates the small-size problem of the training data.
2) For the cyberbullying problem, we design semantic dropout noise to emphasize bullying features in the new feature space, and the resulting representation is thus more discriminative for cyberbullying detection.
3) The sparsity constraint is injected into the solution of the mapping matrix $W$ for each layer, considering that each word is only correlated with a small portion of the whole vocabulary. We formulate the solution for the mapping weights $W$ as an Iterated Ridge Regression problem, in which the semantic dropout noise distribution can be easily marginalized to ensure the efficient training of our proposed smSDA.
4) Based on word embeddings, bullying features can be extracted automatically. In addition, the possible limitation of expert knowledge can be alleviated by the use of word embeddings.

4 EXPERIMENTS

In this section, we evaluate our proposed semantic-enhanced marginalized stacked denoising auto-encoder (smSDA) with two public real-world cyberbullying corpora. We start by describing the adopted corpora and experimental setup. Experimental results are then compared with other baseline methods to test the performance of our approach. Finally, we provide a detailed analysis to explain the good performance of our method.

4.1 Descriptions of Datasets

Two datasets are used here. One is from Twitter and the other is from MySpace groups. The details of these two datasets are described below:

Twitter Dataset: Twitter is "a real-time information network that connects you to the latest stories, ideas, opinions and news about what you find interesting" (https://about.twitter.com/). Registered users can read and post tweets, which are defined as the messages posted on Twitter with a maximum length of 140 characters.

The Twitter dataset is composed of tweets crawled by the public Twitter stream API through two steps. In Step 1, keywords starting with "bull", including "bully", "bullied" and "bullying", are used as queries in Twitter to preselect some tweets that potentially contain bullying content. Retweets are removed by excluding tweets containing the acronym "RT". In Step 2, the selected tweets are manually labeled as bullying trace or non-bullying trace based on the
1) Most cyberbullying detection methods rely on the contents of the tweets. 7321 tweets are randomly sampled
BoW model. Due to the sparsity problems of both from the whole tweets collections from August 6, 2011 to
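The Step 1 preselection described for the Twitter dataset can be illustrated with a minimal sketch. This is not the authors' crawler: the function name `preselect`, the regular expressions, and the sample tweets are all assumptions made for illustration. In particular, "keywords starting with bull" is approximated here by a case-insensitive prefix match, and a retweet is assumed to be any tweet containing the standalone token "RT".

```python
import re

# Assumption: "keywords starting with 'bull'" is approximated by a
# case-insensitive match on any word that begins with "bull".
BULL_KEYWORD = re.compile(r"\bbull\w*", re.IGNORECASE)
# Assumption: a retweet is any tweet containing the standalone token "RT".
RETWEET_MARK = re.compile(r"\bRT\b")


def preselect(tweets):
    """Keep tweets that mention a 'bull*' keyword and are not retweets."""
    return [
        t for t in tweets
        if BULL_KEYWORD.search(t) and not RETWEET_MARK.search(t)
    ]


# Made-up example tweets, not taken from the corpus.
sample = [
    "He was bullied at school again today",
    "RT @USER: stop bullying now",
    "Nice weather this morning",
    "I hate when people bully others",
]
selected = preselect(sample)
```

A prefix match of this kind also admits unrelated words such as "bulldog", which is one reason the manual labeling in Step 2 is still needed.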
4. https://radimrehurek.com/gensim/index.html
5. The code has been kindly provided at http://research.cs.wisc.edu/bullying/data.html
6. A collection of insulting words can be found at the website: http://www.noswearing.com/dictionary

Fig. 7. Word Cloud Visualization of the Bullying Features in MySpace Datasets.
and MySpace datasets. The average classification accuracies and F1 scores on these two datasets are shown in Table 2. Figures 8 and 9 show the results of the seven compared approaches on all sub-datasets constructed from the Twitter and MySpace datasets, respectively. Since BWM does not require training documents, its results over the whole corpus are reported in Table 2. It is clear that our approaches outperform the other approaches on both the Twitter and MySpace corpora.

Fig. 8. Classification Accuracies and F1 Scores of All Compared Methods on Twitter Datasets. (Two panels, Accuracy (%) and F1 Score (%), plotted against Datasets Index 0 to 9 for BoW, sBoW, LSA, LDA, mSDA, smSDAu and smSDA.)

The first observation is that the semantic BoW model (sBoW) performs slightly better than BoW. Based on BoW, sBoW simply scales the bullying features by an arbitrary factor of 2. This means that semantic information can boost the performance of cyberbullying detection. For a fair comparison, the bullying features used in our method and sBoW are unified to be the same. Our approaches, especially smSDA, gain a significant performance improvement compared to sBoW.
This is because bullying features only account for a small portion of all features used. It is difficult to learn robust features from small training data by merely intensifying the amplitude of each bullying feature. Our approach aims to find the correlation between normal features and bullying features by reconstructing corrupted data so as to yield robust features. In addition, Bullying Word Matching (BWM), as a simple and intuitive method of using semantic information, gives the worst performance. In BWM, the existence of bullying words is defined as the rule for classification. This shows that only an elaborate utilization of such bullying words, rather than a simple one, can help cyberbullying detection.

We also compare our methods with two state-of-the-art text representation learning methods, LSA and LDA. These two methods do not produce good performance on all datasets. This may be because both methods belong to dimensionality reduction techniques, which are performed on the document-word occurrence matrix. Although the two methods try to minimize the reconstruction error as our approach does, the optimization in LSA and LDA is conducted after dimensionality reduction. The reduced dimension is a key parameter that determines the quality of the learned feature space. Here, we fix the dimension of the latent space to 100. A deliberate search over this parameter might therefore improve the performance of LSA and LDA, but the selection of this hyperparameter is itself another tough research topic. Another reason may be that the number of data samples is small (fewer than 2000) and the length of each Internet message is short (for Twitter, the maximum length is 140 characters), and thus the constructed document-word occurrence matrix may not represent the true co-occurrence of terms.

Deep learning methods including mSDA and smSDA generally outperform other standard approaches. This trend is particularly prominent in the F1 measure because cyberbullying detection problems are class-imbalanced. The larger improvements on F1 score further verify the performance of our approach. Deep learning models have achieved remarkable performance in various scenarios owing to their robust feature learning ability [22]. mSDA is able to capture the correlation between input features and combine the correlated features by reconstructing masked feature values from uncorrupted feature values. Further, the stacking structure and the nonlinearity contribute to mSDA's ability to discover complex factors behind data. Based on mSDA, our proposed smSDA utilizes semantic dropout noise and sparsity constraints on the mapping matrix, while the efficiency of training is kept. This extension leads to a stable performance improvement on cyberbullying detection, and a detailed analysis is provided in the following section.

We compare the performances of smSDA and smSDAu, which adopt biased semantic dropout noise and unbiased semantic dropout noise, respectively. The results show that smSDAu performs slightly worse than smSDA. This may be explained by the fact that the unbiased semantic dropout noise cancels the enhancement of bullying features. As shown in Eq. (14), the off-diagonal elements in the matrix $x_i \tilde{x}_i^T$ that are used to compute the mapping weights are the same, which cannot contribute to the reinforcement of bullying features.

4.4 Analysis of Semantic Extension

As shown in Section 4.3, the semantic extension can boost the classification performance for cyberbullying detection. In this section, we discuss the advantages of this extension qualitatively. In our proposed smSDA, because of the semantic dropout noise and sparsity constraints, the learned representation is able to discover the correlation between words containing latent bullying semantics. Table 3 shows the reconstructed terms of three example bullying words for mSDA and smSDA, respectively. In this example, a one-hot vector is used as input, which represents a document containing one bullying word. Table 3 lists the reconstructed terms in decreasing order of their feature values, which represent the strength of their correlations with the input word. The results are obtained using a one-layer architecture without non-linear activation, considering that the raw terms directly correspond to
TABLE 2
Accuracies (%) and F1 Scores (%) for Compared Methods on Twitter and MySpace Datasets. The Mean Values are Given, respectively. Bold Face Indicates Best Performance.

Dataset  Measures    BWM   BoW   sBoW  LSA   LDA   mSDA  smSDAu  smSDA
Twitter  Accuracies  69.3  82.6  82.7  81.6  81.1  84.1  82.9    84.9
Twitter  F1 Scores   16.1  68.1  68.3  65.8  66.1  70.4  69.3    71.9
MySpace  Accuracies  34.2  80.1  80.1  77.7  77.8  87.8  88.0    89.7
MySpace  F1 Scores   36.4  41.2  42.5  45.0  43.1  76.1  76.0    77.6
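Table 2 reports both accuracy and F1 score. As a reminder of why the two measures can diverge on class-imbalanced data such as these corpora, the following is a minimal sketch in plain Python. The helper names and the toy label vectors are hypothetical illustrations and are not taken from the experiments.

```python
def accuracy(y_true, y_pred):
    """Fraction of posts whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive (bullying) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: define F1 as 0 to avoid dividing by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels (illustrative only): 2 bullying posts out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10                    # always predicts "non-bullying"
one_hit = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]   # 1 true positive, 1 false positive

acc_neg, f1_neg = accuracy(y_true, all_negative), f1_score(y_true, all_negative)
acc_hit, f1_hit = accuracy(y_true, one_hit), f1_score(y_true, one_hit)
```

Both toy predictors reach the same accuracy, yet only the second one finds any bullying post, and only F1 reflects that difference.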
TABLE 3
Term Reconstruction on Twitter datasets. Each Row Shows Specific Bullying Word, along with Top-4 Reconstructed Words (ranked with their frequency values from top to bottom) via mSDA (left column) and smSDA (right column).

Bullying Words  Reconstructed Words for mSDA  Reconstructed Words for smSDA
bitch           @USER                         @USER
                shut                          HTTPLINK
                friend                        fuck up
                tell                          shut
fucking         because                       off
                friend                        pissed
                off                           shit
                gets                          of
shit            some                          abuse
                big                           this shit
                with                          shit lol
                lol                           big
Fig. 9. Classification Accuracies and F1 Scores of All Compared Methods on MySpace Datasets.

each output dimension under such a setting. It is shown that the reconstructed words discovered by smSDA are more correlated to bullying words than those discovered by mSDA. For example, "fucking" is reconstructed by "because", "friend", "off" and "gets" in mSDA. Except for "off", the other three words seem unreasonable. However, in smSDA, "fucking" is reconstructed by "off", "pissed", "shit" and "of". The occurrence of the term "of" may be due to frequent misspelling in Internet writing. It is obvious that the correlation discovered by smSDA is more meaningful. This indicates that smSDA can learn word correlations that may be signs of bullying semantics, and therefore the learned robust features boost the performance on cyberbullying detection.

5 CONCLUSION

This paper addresses the text-based cyberbullying detection problem, where robust and discriminative representations of messages are critical for an effective detection system. By designing semantic dropout noise and enforcing sparsity, we have developed the semantic-enhanced marginalized denoising autoencoder as a specialized representation learning model for cyberbullying detection. In addition, word embeddings have been used to automatically expand and refine bullying word lists that are initialized by domain knowledge. The performance of our approaches has been experimentally verified on two cyberbullying corpora from social media: Twitter and MySpace. As a next step, we are planning to further improve the robustness of the learned representation by considering word order in messages.

ACKNOWLEDGMENTS

We thank Junming Xu and Prof. Jerry Zhu for their Twitter datasets.

REFERENCES

[1] A. M. Kaplan and M. Haenlein, "Users of the world, unite! the challenges and opportunities of social media," Business horizons, vol. 53, no. 1, pp. 59–68, 2010.
[2] R. M. Kowalski, G. W. Giumetti, A. N. Schroeder, and M. R. Lattanner, "Bullying in the digital age: A critical review and meta-analysis of cyberbullying research among youth." 2014.
[3] M. Ybarra, "Trends in technology-based sexual and non-sexual aggression over time and linkages to nontechnology aggression," National Summit on Interpersonal Violence and Abuse Across the Lifespan: Forging a Shared Agenda, 2010.
[4] B. K. Biggs, J. M. Nelson, and M. L. Sampilo, "Peer relations in the anxiety–depression link: Test of a mediation model," Anxiety, Stress, & Coping, vol. 23, no. 4, pp. 431–447, 2010.
[5] S. R. Jimerson, S. M. Swearer, and D. L. Espelage, Handbook of bullying in schools: An international perspective. Routledge/Taylor & Francis Group, 2010.
[6] G. Gini and T. Pozzoli, "Association between bullying and psychosomatic problems: A meta-analysis," Pediatrics, vol. 123, no. 3, pp. 1059–1065, 2009.
[7] A. Kontostathis, L. Edwards, and A. Leatherman, "Text mining and cybercrime," Text Mining: Applications and Theory. John Wiley & Sons, Ltd, Chichester, UK, 2010.
[8] J.-M. Xu, K.-S. Jun, X. Zhu, and A. Bellmore, "Learning from bullying traces in social media," in Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, 2012, pp. 656–666.
[9] Q. Huang, V. K. Singh, and P. K. Atrey, "Cyber bullying detection using social and textual analysis," in Proceedings of the 3rd International Workshop on Socially-Aware Multimedia. ACM, 2014, pp. 3–6.
[10] D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kontostathis, and L. Edwards, "Detection of harassment on web 2.0," Proceedings of the Content Analysis in the WEB, vol. 2, pp. 1–7, 2009.
[11] K. Dinakar, R. Reichart, and H. Lieberman, "Modeling the detection of textual cyberbullying." in The Social Mobile Web, 2011.
[12] V. Nahar, X. Li, and C. Pang, "An effective approach for cyberbullying detection," Communications in Information Science and Management Engineering, 2012.
[13] M. Dadvar, F. de Jong, R. Ordelman, and R. Trieschnigg, "Improved cyberbullying detection using gender information," in Proceedings of the 12th Dutch-Belgian Information Retrieval Workshop (DIR2012). Ghent, Belgium: ACM, 2012.
[14] M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong, "Improving cyberbullying detection with user context," in Advances in Information Retrieval. Springer, 2013, pp. 693–696.
[15] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[16] P. Baldi, "Autoencoders, unsupervised learning, and deep architectures," Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7, p. 43, 2012.
[17] M. Chen, Z. Xu, K. Weinberger, and F. Sha, "Marginalized denoising autoencoders for domain adaptation," arXiv preprint arXiv:1206.4683, 2012.
[18] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse processes, vol. 25, no. 2-3, pp. 259–284, 1998.
[19] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National academy of Sciences of the United States of America, vol. 101, no. Suppl 1, pp. 5228–5235, 2004.
[20] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," the Journal of machine Learning research, vol. 3, pp. 993–1022, 2003.
[21] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine learning, vol. 42, no. 1-2, pp. 177–196, 2001.
[22] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1798–1828, 2013.
[23] B. L. McLaughlin, A. A. Braga, C. V. Petrie, M. H. Moore et al., Deadly Lessons: Understanding Lethal School Violence. National Academies Press, 2002.
[24] J. Juvonen and E. F. Gross, "Extending the school grounds? Bullying experiences in cyberspace," Journal of School health, vol. 78, no. 9, pp. 496–505, 2008.
[25] M. Fekkes, F. I. Pijpers, A. M. Fredriks, T. Vogels, and S. P. Verloove-Vanhorick, "Do bullied children get ill, or do ill children get bullied? a prospective cohort study on the relationship between bullying and health-related symptoms," Pediatrics, vol. 117, no. 5, pp. 1568–1574, 2006.
[26] M. Ptaszynski, F. Masui, Y. Kimura, R. Rzepka, and K. Araki, "Brute force works best against bullying," in Proceedings of IJCAI 2015 Joint Workshop on Constraints and Preferences for Configuration and Recommendation and Intelligent Techniques for Web Personalization. ACM, 2015.
[27] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
[28] C. C. Paige and M. A. Saunders, "Lsqr: An algorithm for sparse linear equations and sparse least squares," ACM Transactions on Mathematical Software (TOMS), vol. 8, no. 1, pp. 43–71, 1982.
[29] M. A. Saunders et al., "Cholesky-based methods for sparse least squares: The benefits of regularization," Linear and Nonlinear Conjugate Gradient-Related Methods, pp. 92–100, 1996.
[30] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
[31] C. Vogel, Computational Methods for Inverse Problems. Society for Industrial and Applied Mathematics, 2002. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/1.9780898717570
[32] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[33] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems, 2013, pp. 3111–3119.
[34] F. Godin, B. Vandersmissen, W. De Neve, and R. Van de Walle, "Named entity recognition for twitter microposts using distributed word representations," in Proceedings of the Workshop on Noisy User-generated Text. Beijing, China: Association for Computational Linguistics, July 2015, pp. 146–153. [Online]. Available: http://www.aclweb.org/anthology/W15-4322
[35] T. H. Dat and C. Guan, "Feature selection based on fisher ratio and mutual information analyses for robust brain computer interface," in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 1. IEEE, 2007, pp. I–337.
[36] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John Wiley & Sons, 2012.
[37] C.-C. Chang and C.-J. Lin, "Libsvm: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[38] J. Sui, "Understanding and fighting bullying with machine learning," Ph.D. dissertation, The University of Wisconsin-Madison, 2015.
[39] J. Bayzick, A. Kontostathis, and L. Edwards, "Detecting the presence of cyberbullying using computer software," in Proceedings of the ACM WebSci'11. Koblenz, Germany: ACM, June 2011, pp. 1–2.

Rui Zhao received the BEng in Measurement and Control from Southeast University, Nanjing, China, in 2012. He is currently pursuing the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore.
His current research interests include text mining and machine learning.

Kezhi Mao received his BEng, MEng and PhD from Jinan University, Northeastern University and Sheffield University in 1989, 1992 and 1998, respectively. He worked as a Lecturer at Northeastern University from March 1992 to May 1995, a Research Associate at University of Sheffield from April 1998 to September 1998, a Research Fellow at Nanyang Technological University from September 1998 to May 2001, an Assistant Professor at School of Electrical and Electronic Engineering, Nanyang Technological University from June 2001 to Sept 2005. He has been an Associate Professor since October 2005.
His areas of interests include computational intelligence, pattern recognition, text mining and knowledge extraction, cognitive science, and big data and text analytics.