This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TAFFC.2016.2531682, IEEE Transactions on Affective Computing
IEEE TRANSACTIONS ON AFFECTIVE COMPUTING

Cyberbullying Detection based on Semantic-Enhanced Marginalized Denoising Auto-Encoder

Rui Zhao and Kezhi Mao

Abstract—As a side effect of increasingly popular social media, cyberbullying has emerged as a serious problem afflicting children,
adolescents and young adults. Machine learning techniques make automatic detection of bullying messages in social media possible,
and this could help to construct a healthy and safe social media environment. In this meaningful research area, one critical issue is
robust and discriminative numerical representation learning of text messages. In this paper, we propose a new representation learning
method to tackle this problem. Our method named Semantic-Enhanced Marginalized Denoising Auto-Encoder (smSDA) is developed
via semantic extension of the popular deep learning model stacked denoising autoencoder. The semantic extension consists of
semantic dropout noise and sparsity constraints, where the semantic dropout noise is designed based on domain knowledge and the
word embedding technique. Our proposed method is able to exploit the hidden feature structure of bullying information and learn a
robust and discriminative representation of text. Comprehensive experiments on two public cyberbullying corpora (Twitter and
MySpace) are conducted, and the results show that our proposed approaches outperform other baseline text representation learning
methods.

Index Terms—Cyberbullying Detection, Text Mining, Representation Learning, Stacked Denoising Autoencoders, Word Embedding

1 INTRODUCTION

SOCIAL Media, as defined in [1], is “a group of Internet-based applications that build on the ideological and technological foundations of Web 2.0, and that allow the creation and exchange of user-generated content.” Via social media, people can enjoy abundant information, convenient communication and so on. However, social media may have side effects such as cyberbullying, which can have negative impacts on people’s lives, especially those of children and teenagers.

Cyberbullying can be defined as aggressive, intentional actions performed by an individual or a group of people via digital communication methods, such as sending messages and posting comments against a victim. Unlike traditional bullying, which usually occurs at school during face-to-face communication, cyberbullying on social media can take place anywhere at any time. Bullies are free to hurt their peers’ feelings because they do not need to face anyone and can hide behind the Internet. Victims are easily exposed to harassment since all of us, especially youth, are constantly connected to the Internet or social media. As reported in [2], the cyberbullying victimization rate ranges from 10% to 40%. In the United States, approximately 43% of teenagers have been bullied on social media [3]. Like traditional bullying, cyberbullying has negative, insidious and sweeping impacts on children [4], [5], [6]. The outcomes for victims of cyberbullying may even be tragic, such as self-injurious behaviour or suicide.

One way to address the cyberbullying problem is to automatically detect and promptly report bullying messages so that proper measures can be taken to prevent possible tragedies. Previous computational studies of bullying have shown that natural language processing and machine learning are powerful tools for studying bullying [7], [8]. Cyberbullying detection can be formulated as a supervised learning problem. A classifier is first trained on a cyberbullying corpus labeled by humans, and the learned classifier is then used to recognize bullying messages. Three kinds of information, namely text, user demography and social network features, are often used in cyberbullying detection [9]. Since the text content is the most reliable, our work here focuses on text-based cyberbullying detection.

In text-based cyberbullying detection, the first and most critical step is learning a numerical representation for text messages. Representation learning of text has been extensively studied in text mining, information retrieval and natural language processing (NLP). The Bag-of-words (BoW) model is one commonly used model, in which each dimension corresponds to a term. Latent Semantic Analysis (LSA) and topic models are other popular text representation models, both of which are based on BoW models. By mapping text units into fixed-length vectors, the learned representation can be further processed for numerous language processing tasks. Therefore, a useful representation should discover the meaning behind text units. In cyberbullying detection, the numerical representation for Internet messages should be robust and discriminative. Since messages on social media are often very short and contain a lot of informal language and misspellings, robust representations for these messages are required to reduce their ambiguity. Even worse, the lack of sufficient high-quality training

• R. Zhao and K. Mao are with the School of Electrical and Electronic Engineering, Nanyang Technological University, Nanyang Avenue, Singapore 639798.
E-mail: rzhao001,[email protected]

1949-3045 (c) 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

data, i.e., data sparsity, makes the issue more challenging. Firstly, labeling data is labor intensive and time consuming. Secondly, cyberbullying is hard to describe and judge from a third-party view due to its intrinsic ambiguities. Thirdly, due to the protection of Internet users and privacy issues, only a small portion of messages are left on the Internet, and most bullying posts are deleted. As a result, the trained classifier may not generalize well to testing messages that contain nonactivated but discriminative features. The goal of the present study is to develop methods that can learn robust and discriminative representations to tackle the above problems in cyberbullying detection.

Some approaches have been proposed to tackle these problems by incorporating expert knowledge into feature learning. Yin et al. proposed to combine BoW features, sentiment features and contextual features to train a support vector machine for online harassment detection [10]. Dinakar et al. utilized label-specific features to extend the general features, where the label-specific features are learned by Linear Discriminant Analysis [11]. In addition, common sense knowledge was also applied. Nahar et al. presented a weighted TF-IDF scheme that scales bullying-like features by a factor of two [12]. Besides content-based information, Maral et al. proposed to apply users’ information, such as gender and message history, and context information as extra features [13], [14]. A major limitation of these approaches is that the learned feature space still relies on the BoW assumption and may not be robust. In addition, the performance of these approaches relies on the quality of hand-crafted features, which require extensive domain knowledge.

In this paper, we investigate a deep learning method named stacked denoising autoencoder (SDA) [15]. SDA stacks several denoising autoencoders and concatenates the output of each layer as the learned representation. Each denoising autoencoder in SDA is trained to recover the input data from a corrupted version of it. The input is corrupted by randomly setting some of the input to zero, which is called dropout noise. This denoising process helps the autoencoders to learn a robust representation. In addition, each autoencoder layer is intended to learn an increasingly abstract representation of the input [16]. We develop a new text representation model based on a variant of SDA, marginalized stacked denoising autoencoders (mSDA) [17], which adopts linear instead of nonlinear projection to accelerate training and marginalizes the noise distribution over infinitely many corruptions in order to learn more robust representations. We utilize semantic information to extend mSDA and develop Semantic-enhanced Marginalized Stacked Denoising Autoencoders (smSDA). The semantic information consists of bullying words. An automatic extraction of bullying words based on word embeddings is proposed so that the human labor involved can be reduced. During training of smSDA, we attempt to reconstruct bullying features from other normal words by discovering the latent structure, i.e. correlation, between bullying and normal words. The intuition behind this idea is that some bullying messages do not contain bullying words. The correlation information discovered by smSDA helps to reconstruct bullying features from normal words, and this in turn facilitates detection of bullying messages that do not contain bullying words. For example, there is a strong correlation between the bullying word fuck and the normal word off since they often occur together. If a bullying message does not contain such obvious bullying features, for instance when fuck is misspelled as fck, the correlation may help to reconstruct the bullying features from normal ones so that the bullying message can be detected. It should be noted that introducing dropout noise has the effect of enlarging the size of the dataset, including the training data size, which helps alleviate the data sparsity problem. In addition, L1 regularization of the projection matrix is added to the objective function of each autoencoder layer in our model to enforce the sparsity of the projection matrix, and this in turn facilitates the discovery of the most relevant terms for reconstructing bullying terms. The main contributions of our work can be summarized as follows:

* Our proposed Semantic-enhanced Marginalized Stacked Denoising Autoencoder is able to learn robust features from the BoW representation in an efficient and effective way. These robust features are learned by reconstructing the original input from corrupted (i.e., missing) ones. The new feature space can improve the performance of cyberbullying detection even with a small labeled training corpus.
* Semantic information is incorporated into the reconstruction process via the design of semantic dropout noise and the imposition of sparsity constraints on the mapping matrix. In our framework, high-quality semantic information, i.e., bullying words, can be extracted automatically through word embeddings. These specialized modifications make the new feature space more discriminative, and this in turn facilitates bullying detection.
* Comprehensive experiments on real data sets have verified the performance of our proposed model.

This paper is organized as follows. In Section 2, some related work is introduced. The proposed Semantic-enhanced Marginalized Stacked Denoising Auto-encoder for cyberbullying detection is presented in Section 3. In Section 4, experimental results on several collections of cyberbullying data are illustrated. Finally, concluding remarks are provided in Section 5.

2 RELATED WORK

This work aims to learn a robust and discriminative text representation for cyberbullying detection. Text representation and automatic cyberbullying detection are both related to our work. In the following, we briefly review the previous work in these two areas.

2.1 Text Representation Learning

In text mining, information retrieval and natural language processing, effective numerical representation of linguistic units is a key issue. The Bag-of-words (BoW) model is the most classical text representation and the cornerstone of some state-of-the-art models including Latent Semantic Analysis (LSA) [18] and topic models [19], [20]. The BoW model represents a document in a textual corpus using a vector of real numbers indicating the occurrence of words in the document. Although the BoW model has proven to be efficient
and effective, the representation is often very sparse. To address this problem, LSA applies Singular Value Decomposition (SVD) to the word-document matrix of the BoW model to derive a low-rank approximation. Each new feature is a linear combination of all original features, which alleviates the sparsity problem. Topic models, including Probabilistic Latent Semantic Analysis [21] and Latent Dirichlet Allocation [20], have also been proposed. The basic idea behind topic models is that word choice in a document is influenced probabilistically by the topic of the document. Topic models try to define the generation process of each word occurring in a document.

Similar to the aforementioned approaches, our proposed approach takes the BoW representation as the input. However, our approach has some distinct merits. First, the multiple layers and non-linearity of our model ensure a deep learning architecture for text representation, which has been proven to be effective for learning high-level features [22]. Second, the applied dropout noise makes the learned representation more robust. Third, specific to cyberbullying detection, our method employs semantic information, including bullying words and a sparsity constraint imposed on the mapping matrix in each layer, and this in turn produces a more discriminative representation.

2.2 Cyberbullying Detection

With the increasing popularity of social media in recent years, cyberbullying has emerged as a serious problem afflicting children and young adults. Previous studies of cyberbullying focused on extensive surveys and its psychological effects on victims, and were mainly conducted by social scientists and psychologists [6], [23], [24], [25]. Although these efforts improve our understanding of cyberbullying, the psychological science approach based on personal surveys is very time-consuming and may not be suitable for automatic detection of cyberbullying. Since machine learning has gained popularity in recent years, the computational study of cyberbullying has attracted the interest of researchers. Several research areas, including topic detection and affective analysis, are closely related to cyberbullying detection. Owing to these efforts, automatic cyberbullying detection is becoming possible. In machine learning-based cyberbullying detection, there are two issues: 1) text representation learning to transform each post/message into a numerical vector and 2) classifier training. Xu et al. presented several off-the-shelf NLP solutions including BoW models, LSA and LDA for representation learning to capture bullying signals in social media [8]. As an introductory work, they did not develop specialized models for cyberbullying detection. Yin et al. proposed to combine BoW features, sentiment features and contextual features to train a classifier for detecting possible harassing posts [10]. The introduction of the sentiment and contextual features has been proven to be effective. Dinakar et al. used Linear Discriminant Analysis to learn label-specific features and combined them with BoW features to train a classifier [11]. The performance of label-specific features largely depends on the size of the training corpus. In addition, they need to construct a bullyspace knowledge base to boost the performance of natural language processing methods. Although the incorporation of a knowledge base can achieve a performance improvement, the construction of a complete and general one is labor-consuming. Nahar et al. proposed to scale bullying words by a factor of two in the original BoW features [12]. The motivation behind this work is quite similar to that of our model, namely to enhance bullying features. However, the scaling operation in [12] is quite arbitrary. Ptaszynski et al. searched for sophisticated patterns in a brute-force way [26]. The weights for each extracted pattern need to be calculated based on an annotated training corpus, and thus the performance may not be guaranteed if the training corpus has a limited size. Besides content-based information, Maral et al. also employed users’ information, such as gender and message history, and context information as extra features [13], [14]. Huang et al. also considered social network features to learn the features for cyberbullying detection [9]. The shared deficiency among these aforementioned approaches is that the constructed text features are still derived from the BoW representation, which has been criticized for its inherent over-sparsity and failure to capture semantic structure [18], [19], [20]. Different from these approaches, our proposed model learns robust features by reconstructing the original data from corrupted data, and introduces semantic corruption noise and a sparse mapping matrix to explore the feature structure that is predictive of the existence of bullying, so that the learned representation can be discriminative.

3 SEMANTIC-ENHANCED MARGINALIZED STACKED DENOISING AUTO-ENCODER

We first introduce the notation used in our paper. Let $D = \{w_1, \ldots, w_d\}$ be the dictionary covering all the words existing in the text corpus. We represent each message using a BoW vector $x \in \mathbb{R}^d$. Then, the whole corpus can be denoted as a matrix $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$, where $n$ is the number of available posts.

We next briefly review the marginalized stacked denoising auto-encoder and present our proposed Semantic-enhanced Marginalized Stacked Denoising Auto-Encoder.

3.1 Marginalized Stacked Denoising Auto-encoder

Chen et al. proposed a modified version of the Stacked Denoising Auto-encoder that employs a linear instead of a nonlinear projection so as to obtain a closed-form solution [17]. The basic idea behind the denoising auto-encoder is to reconstruct the original input from corrupted versions $\tilde{x}_1, \ldots, \tilde{x}_n$ with the goal of obtaining a robust representation.

Marginalized Denoising Auto-encoder: In this model, the denoising auto-encoder attempts to reconstruct the original data from the corrupted data via a linear projection. The projection matrix can be learned as:

$$W = \operatorname*{argmin}_{W} \frac{1}{2n} \sum_{i=1}^{n} \| x_i - W \tilde{x}_i \|^2 \tag{1}$$

where $W \in \mathbb{R}^{d \times d}$. For simplicity, we can write Eq. (1) in matrix form as:

$$W = \operatorname*{argmin}_{W} \frac{1}{2n} \operatorname{tr}\left[ (X - W\tilde{X})^T (X - W\tilde{X}) \right] \tag{2}$$

where $\tilde{X} = [\tilde{x}_1, \ldots, \tilde{x}_n]$ is the corrupted version of $X$. It is easily shown that Eq. (2) is an ordinary least squares problem having a closed-form solution:

$$W = P Q^{-1} \tag{3}$$

where $P = X\tilde{X}^T$ and $Q = \tilde{X}\tilde{X}^T$. In fact, this corruption can be marginalized over the noise distribution [17]. The more corruptions we take in the denoising auto-encoder, the more robust the learned transformation. Therefore, the best choice is to use infinitely many versions of the corrupted data. If the data corpus is corrupted infinitely many times, the matrices $P$ and $Q$ converge to their corresponding expectations, and Eq. (3) can be formulated as:

$$W = E[P] \, E[Q]^{-1} \tag{4}$$

where $E[P] = \sum_{i=1}^{n} E\left[x_i \tilde{x}_i^T\right]$ and $E[Q] = \sum_{i=1}^{n} E\left[\tilde{x}_i \tilde{x}_i^T\right]$. These expected matrices can be computed based on the noise distribution. In [17], dropout noise is adopted to corrupt data samples by setting a feature to zero with a probability $p$. Assuming the scatter matrix of the original data samples is denoted as $S = XX^T$, the expected matrices can be computed as:

$$E[Q]_{i,j} = \begin{cases} (1 - p)^2 S_{i,j} & \text{if } i \neq j, \\ (1 - p) S_{i,j} & \text{if } i = j, \end{cases} \tag{5}$$

and

$$E[P]_{i,j} = (1 - p) S_{i,j} \tag{6}$$

where $i$ and $j$ denote the indices of features. It can be seen that it is very efficient to compute $W$ by marginalizing dropout noise in the denoising auto-encoder. After the mapping weights $W$ are computed, a nonlinear squashing function, such as a hyperbolic tangent function, can be applied to derive the output of the marginalized denoising auto-encoder:

$$H = \tanh(WX) \tag{7}$$

Stacking Structure: Chen et al. [17] also proposed to apply stacking structures on the marginalized denoising autoencoder, in which the output of the $(k-1)$th layer is fed as the input into the $k$th layer. If we define the output of the $k$th mDA as $H_k$ and the original input as $H_0$, the mapping between two consecutive layers is given as:

$$H_k = \tanh(W_k H_{k-1}) \tag{8}$$

where $W_k$ denotes the mapping in the $k$th layer. The model training can be done greedily layer by layer. This means that the mapping weights $W_k$ are learned in closed form to reconstruct the output of the $(k-1)$th mDA layer from its marginalized corruptions, as shown in Eq. (4). If the number of layers is set to $L$, the final representation for input data $X$ is the concatenation of the uncorrupted original input and the outputs of all layers as follows:

$$Z = \begin{bmatrix} X \\ H_1 \\ \vdots \\ H_L \end{bmatrix} \tag{9}$$

where $Z \in \mathbb{R}^{d(L+1) \times n}$. Each column of $Z$ represents the final representation of an individual data sample.

3.2 Semantic Enhancement for mSDA

The advantage of corrupting the original input in mSDA can be explained by feature co-occurrence statistics. The co-occurrence information is able to derive a robust feature representation under an unsupervised learning framework, and this also motivates other state-of-the-art text feature learning methods such as Latent Semantic Analysis and topic models [18], [20]. As shown in Figure 1(a), a denoising auto-encoder is trained to reconstruct the removed feature values from the remaining uncorrupted ones. Thus, the learned mapping matrix $W$ is able to capture the correlation between these removed features and other features. It is shown that the learned representation is robust and can be regarded as a high-level concept feature since the correlation information is invariant to domain-specific vocabularies. We next describe how to extend mSDA for cyberbullying detection. The major modifications include semantic dropout noise and sparse mapping constraints.

3.2.1 Semantic Dropout Noise

The dropout noise adopted in mSDA follows a uniform distribution, where each feature has the same probability of being removed. In cyberbullying detection, most bullying posts contain bullying words such as profanity and foul language. These bullying words are very predictive of the existence of cyberbullying. However, a direct use of these bullying features may not achieve good performance because these words only account for a small portion of the whole vocabulary and these vulgar words are only one kind of discriminative feature for bullying [10], [26]. Alternatively, we can exploit these cyberbullying words by using a different dropout noise in which features corresponding to bullying words have a larger probability of corruption than other features. The large probability imposed on bullying words emphasizes the correlation between bullying features and normal ones. This kind of dropout noise can be called semantic dropout noise, because semantic information is used to design the dropout structure.

As shown in Figure 1(b), the correlation between features can enable other normal words to predict bullying labels. Consider a simple but intuitive example, “Leave him alone, he is just a chink”1, which is obviously a bullying message. However, the classifier will set the weight of the discriminative word “chink” to zero if the small-sized training corpus does not cover it. Our proposed smSDA can deal with the problem by learning a robust feature representation, which is a high-level concept representation. In the learned representation, the word “chink” is reconstructed from the context words co-occurring with it, and these context words may be shared by other bullying words contained in the training corpus. Therefore, the correlation explored by this auto-encoder structure enables the subsequent classifier to learn the discriminative word and improve the classification performance. In addition, the semantic dropout noise exploits the correlation between bullying features and normal features better and hence facilitates cyberbullying detection.

1. “Chink (also chinki, chinky, chinkie) is an English ethnic slur usually referring to a person of Chinese or East Asian ethnicity” from Wikipedia

Due to the introduced semantic dropout noise, the expected matrices $E[P]$ and $E[Q]$ are computed slightly differently from Eqs. (5) and (6). Assuming we have an available bullying word list and the corresponding feature set $Z_b$, the semantic dropout noise can be described by the following probability density function (PDF):

$$\mathrm{PDF} = \begin{cases} p(\tilde{x}_d = 0) = p_n & \text{if } d \notin Z_b, \\ p(\tilde{x}_d = x_d) = 1 - p_n & \text{if } d \notin Z_b, \\ p(\tilde{x}_d = 0) = p_b & \text{if } d \in Z_b, \\ p(\tilde{x}_d = x_d) = 1 - p_b & \text{if } d \in Z_b, \end{cases} \tag{10}$$

where $d$ indexes the features. Then the two marginalized matrices can be computed as:

$$E[Q]_{i,j} = \begin{cases} (1 - p_n) S_{i,j} & \text{if } i = j \ \&\ i \notin Z_b, \\ (1 - p_n)^2 S_{i,j} & \text{if } i \neq j \ \&\ \{i, j\} \cap Z_b = \emptyset, \\ (1 - p_b)(1 - p_n) S_{i,j} & \text{if } \{i, j\} \not\subseteq Z_b \ \&\ \{i, j\} \cap Z_b \neq \emptyset, \\ (1 - p_b)^2 S_{i,j} & \text{if } i \neq j \ \&\ \{i, j\} \subseteq Z_b, \\ (1 - p_b) S_{i,j} & \text{if } i = j \ \&\ i \in Z_b, \end{cases} \tag{11}$$

and

$$E[P]_{i,j} = \begin{cases} (1 - p_n) S_{i,j} & \text{if } j \notin Z_b, \\ (1 - p_b) S_{i,j} & \text{if } j \in Z_b, \end{cases} \tag{12}$$

where $p_b$ and $p_n$ are the probabilities of bullying features and normal features being set to zero respectively, and $p_b > p_n$. Here, $p_b$ and $p_n$ are both tunable hyperparameters of our proposed smSDA.

Unbiased Semantic Dropout Noise: Under the noise in Eq. (10), the corrupted data is biased, i.e., $E[X] \neq E[\tilde{X}]$. Here, we modify Eq. (10) to achieve an unbiased noise as follows:

$$\mathrm{PDF}^{\text{unbiased}} = \begin{cases} p(\tilde{x}_d = 0) = p_n & \text{if } d \notin Z_b, \\ p\left(\tilde{x}_d = \frac{x_d}{1 - p_n}\right) = 1 - p_n & \text{if } d \notin Z_b, \\ p(\tilde{x}_d = 0) = p_b & \text{if } d \in Z_b, \\ p\left(\tilde{x}_d = \frac{x_d}{1 - p_b}\right) = 1 - p_b & \text{if } d \in Z_b, \end{cases} \tag{13}$$

It can be easily shown that under such a noise distribution, the corrupted data is unbiased. The two marginalized matrices are re-formulated as:

$$E[Q]^{\text{unbiased}}_{i,j} = \begin{cases} \frac{1}{1 - p_n} S_{i,j} & \text{if } i = j \ \&\ i \notin Z_b, \\ \frac{1}{1 - p_b} S_{i,j} & \text{if } i = j \ \&\ i \in Z_b, \\ S_{i,j} & \text{if } i \neq j, \end{cases} \tag{14}$$

and

$$E[P]^{\text{unbiased}}_{i,j} = S_{i,j} \tag{15}$$

These two computed matrices are then used to learn the mapping in each layer of our proposed smSDA.

3.2.2 Sparsity Constraints

In mSDA, the mapping matrix $W$ is learned to reconstruct removed features from other uncorrupted features and hence is able to capture the feature correlation information. Here, we impose sparsity constraints on the mapping weights $W$ so that each row has a small number of nonzero elements. This sparsity constraint is quite intuitive because one word is only related to a small portion of the vocabulary instead of the whole vocabulary. In our proposed smSDA, the sparsity constraint is realized by incorporating an L1 regularization term into the objective function, as in the lasso problem [27]. The optimization function for each layer in smSDA is given as follows:

$$W = \operatorname*{argmin}_{W} \frac{1}{2n} \operatorname{tr}\left[ (X - W\tilde{X})^T (X - W\tilde{X}) \right] + \lambda \|W\|_1 \tag{16}$$

where $\lambda$ is a regularization parameter that controls the sparsity of $W$. The larger $\lambda$ is, the sparser the mapping matrix $W$ is. Solving Eq. (16) is a well-studied problem, sparse least squares optimization, which has several effective and efficient computation methods [28], [29], [30]. Here, we adopt a method called Iterated Ridge Regression, which has been proven to be very efficient [30]. The method first introduces an approximation:

$$\|w_i\|_1 \approx \frac{w_i^T w_i}{\|w_i\|_1} \tag{17}$$

where $w_i$ denotes the $i$-th row of the matrix $W$. By substituting the approximation in Eq. (17) into the objective function in Eq. (16), we obtain a formulation similar to a Ridge Regression problem [31], and the iteration step to solve for $W$ is given as:

$$W_k = X\tilde{X}^T \left[ \tilde{X}\tilde{X}^T + \lambda \, \mathrm{diag}(|W_{k-1}|)^{-1} \right]^{-1} \tag{18}$$

where diag denotes the diagonal elements of a matrix, and $W_k$ and $W_{k-1}$ denote the current-step and previous-step estimates of the mapping matrix $W$, respectively. It is clear that Eq. (18) can be easily formulated when the noise distribution is marginalized. Similar to Eq. (4), Eq. (18) can be written as:

$$W_k = E[P] \left[ E[Q] + \lambda \, \mathrm{diag}(|W_{k-1}|)^{-1} \right]^{-1} \tag{19}$$

To speed up convergence, the initialization for $W$ can be set to the L2-penalized solution of Eq. (2) as follows:

$$W_0 = E[P] \left[ E[Q] + \lambda I \right]^{-1} \tag{20}$$

where $I$ is an identity matrix. It can be shown that this iteration procedure can also marginalize the noise distribution easily, which ensures efficient and stable learning of the mapping.
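The sparse mapping step of Section 3.2.2 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it reads Eq. (19) row by row, so that row $w_i$ is reweighted by $\mathrm{diag}(|w_i|)^{-1}$ from the previous iterate, and it absorbs constant factors such as $1/2n$ into $\lambda$. The function names, the `eps` guard against division by zero, the fixed iteration count and the random toy data are all hypothetical.

```python
import numpy as np


def unbiased_semantic_expectations(X, bully_idx, p_n, p_b):
    """Return E[P], E[Q] of Eqs. (14)-(15) for a d x n BoW matrix X."""
    d = X.shape[0]
    S = X @ X.T                        # scatter matrix S = X X^T
    p = np.full(d, p_n, dtype=float)   # dropout probability of normal features
    p[bully_idx] = p_b                 # bullying features are dropped more often
    EQ = S.astype(float).copy()
    EQ[np.diag_indices(d)] = np.diag(S) / (1.0 - p)  # Eq. (14), diagonal terms
    EP = S.astype(float).copy()        # Eq. (15): E[P] = S
    return EP, EQ


def sparse_mapping(EP, EQ, lam=1.0, n_iter=20, eps=1e-8):
    """Iterated ridge regression for the L1-penalised mapping W."""
    d = EP.shape[0]
    # Eq. (20): the L2-penalised solution is used as the starting point.
    W = EP @ np.linalg.inv(EQ + lam * np.eye(d))
    for _ in range(n_iter):
        W_new = np.empty_like(W)
        for i in range(d):
            # Row-wise reading of Eq. (19) (assumption): reweight the ridge
            # penalty of row i by 1 / |w_i| taken from the previous iterate.
            D = np.diag(1.0 / (np.abs(W[i]) + eps))
            # EQ + lam * D is symmetric, so solving against row i of E[P]
            # yields w_i = E[P]_i (E[Q] + lam * D)^{-1}.
            W_new[i] = np.linalg.solve(EQ + lam * D, EP[i])
        W = W_new
    return W


rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 40)).astype(float)  # hypothetical toy BoW counts
EP, EQ = unbiased_semantic_expectations(X, bully_idx=[0, 1], p_n=0.3, p_b=0.7)
W_weak = sparse_mapping(EP, EQ, lam=1e-6)    # almost unregularised
W_strong = sparse_mapping(EP, EQ, lam=1e3)   # heavily regularised
print(np.abs(W_weak).sum(), np.abs(W_strong).sum())
```

Each row needs its own linear solve in this reading, which is affordable only for small vocabularies; the point of the sketch is that a larger `lam` shrinks the entries of `W` towards zero while the marginalized matrices stay fixed across iterations.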

[Figure 1: panel (a) shows the original input x, the mapping matrix W and the corrupted input x̃; panel (b) shows bullying words (sparse, but predictive) and context words (mass features) linked by the explored correlation, with the note that context words are also predictive of bullying labels.]

Fig. 1. Illustration of Motivations behind smSDA. In Figure 1(a), the cross symbol denotes that its corresponding feature is corrupted, i.e., turned off.
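The motivation illustrated in Figure 1(b) can be checked on a toy corpus. The sketch below, written under stated assumptions, computes $E[P]$ and $E[Q]$ for the semantic dropout noise of Eqs. (11) and (12), solves Eq. (4) in closed form, and applies the linear mapping (without the tanh of Eq. (7)) to a message that contains only context words. The vocabulary, the documents, the function name and the values of $p_n$ and $p_b$ are hypothetical choices for illustration.

```python
import numpy as np


def semantic_dropout_expectations(X, bully_idx, p_n, p_b):
    """E[P] and E[Q] under the semantic dropout noise of Eqs. (11)-(12)."""
    d = X.shape[0]
    S = X @ X.T                            # scatter matrix S = X X^T
    q = np.full(d, 1.0 - p_n)              # survival probability, normal features
    q[bully_idx] = 1.0 - p_b               # bullying features survive less often
    EQ = S * np.outer(q, q)                # off-diagonal terms: q_i q_j S_ij
    np.fill_diagonal(EQ, q * np.diag(S))   # diagonal terms: q_i S_ii
    EP = S * q[np.newaxis, :]              # Eq. (12): q_j S_ij
    return EP, EQ


# Hypothetical vocabulary; feature 0 is the only bullying word.
vocab = ["idiot", "shut", "up", "nice", "day"]
docs = [[1, 1, 1, 0, 0]] * 3 + [[0, 1, 1, 0, 0]] + [[0, 0, 0, 1, 1]] * 3
X = np.array(docs, dtype=float).T          # d x n BoW matrix

EP, EQ = semantic_dropout_expectations(X, bully_idx=[0], p_n=0.3, p_b=0.7)
W = EP @ np.linalg.inv(EQ)                 # Eq. (4): W = E[P] E[Q]^{-1}

hostile = np.array([0, 1, 1, 0, 0], dtype=float)   # "shut up", no bullying word
friendly = np.array([0, 0, 0, 1, 1], dtype=float)  # "nice day"
print("bullying feature, hostile :", (W @ hostile)[0])
print("bullying feature, friendly:", (W @ friendly)[0])
```

In this toy setting the bullying feature is reconstructed with a clearly positive value for the hostile message, even though the bullying word itself is absent, and stays at zero for the friendly one, because the context words "shut" and "up" co-occur with the bullying word in the training documents while "nice" and "day" never do.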

3.3 Construction of Bullying Feature Set

As analyzed above, the bullying features play an important role and should be chosen properly. In the following, the steps for constructing the bullying feature set $Z_b$ are given, in which the first layer and the other layers are addressed separately. For the first layer, expert knowledge and word embeddings are used. For the other layers, discriminative feature selection is conducted.

Layer One: First, we build a list of words with negative affect, including swear words and dirty words. Then, we compare the word list with the BoW features of our own corpus, and regard the intersection as bullying features. However, it is possible that expert knowledge is limited and does not reflect the current usage and style of cyberlanguage. Therefore, we expand the list of pre-defined insulting words, i.e. insulting seeds, based on word embeddings as follows.

Word embeddings use real-valued, low-dimensional vectors to represent the semantics of words [32], [33]. Well-trained word embeddings lie in a vector space where similar words are placed close to each other. In addition, the cosine similarity between word embeddings is able to quantify the semantic similarity between words. Since Internet messages are our corpus of interest, we utilize a word2vec model well trained on a large-scale Twitter corpus containing 400 million tweets [34]. A visualization of some word embeddings after dimensionality reduction (PCA) is shown in Figure 2. It is observed that curse words form distinct clusters, which are also far away from normal words. Even insulting words are located in different regions due to different word usages and insulting expressions. In addition, since the word embeddings adopted here are trained on a large-scale corpus from Twitter, the similarity captured by the word embeddings can represent the specific language pattern. For example, the embedding of the misspelled word fck is close to the embedding of fuck, so that the word fck can be automatically extracted based on word embeddings.

[Figure 2: scatter plot of word embeddings projected to two dimensions; displayed terms include book, read, good, nice, shit, fuck, bitch, stupid, slut, nerd and whore.]

Fig. 2. Two dimensional visualization of our used word embeddings via PCA. Displayed terms include both bullying ones and normal ones. It shows that similar words are nearby vectors.

We extend the pre-defined insulting seeds based on word embeddings. For each insulting seed, similar words are extracted if their cosine similarities with the insulting seed exceed a predefined threshold. For a bigram $w_l w_r$, we simply use an additive model to derive the corresponding embedding as follows:

$$v(w_l w_r) = v(w_l) + v(w_r) \tag{21}$$

Finally, the constructed bullying features are used to train the first layer in our proposed smSDA. The feature set includes two parts: one is the original insulting seeds based on domain knowledge and the other is the extended bullying words obtained via word embeddings. The length of $Z_b$ is $k$.

Subsequent Layers: We perform feature selection using the Fisher score to select “bullying” features. The Fisher score is a univariate metric reflecting the discriminative power of a feature [35], [36]. For the $r$th feature, the corresponding Fisher score can be computed based on training data with labels:

$$F_r = \frac{\sum_{i=1}^{c} n_i (\mu_i - \mu)^2}{\sum_{i=1}^{c} n_i \sigma_i^2} \tag{22}$$

where $c$ denotes the number of classes and $n_i$ represents the number of data samples in class $i$. $\mu$ and $\mu_i$ denote the mean of the entire data and of class $i$ for the $r$th feature, and $\sigma_i^2$ is the variance of class $i$ on the $r$th feature. After the Fisher scores are estimated, the features with the top $k$ scores are selected as "bullying" features, where "bullying" is generalized as discriminative.

3.4 smSDA for Cyberbullying Detection

In section 3.3, we propose the Semantic-enhanced Marginalized Stacked Denoising Auto-encoder (smSDA). In this subsection, we describe how to leverage it for cyberbullying detection. smSDA provides robust and discriminative representations. The learned numerical representations can then be fed into a Support Vector Machine (SVM). In the new space, due to the captured feature correlation and semantic information, the SVM, even when trained on a small training corpus, is able to achieve good performance on testing documents (this will be verified in the following experiments). The detailed steps of our model are provided below.

Assume the first $n_l$ posts are labeled and the corresponding vector of binary labels is $y = \{y_1, \ldots, y_{n_l}\}$. The binary label 1 or 0 indicates that the post is or is not a cyberbullying one. Here, $n_l \ll n$, which means that only a small number of posts are labeled. The bullying feature set $Z_b$ is constructed in a layer-wise way. Based on prior knowledge, we construct a pre-defined bullying wordlist and compare it with the original vocabulary of the whole corpus $X$. The words appearing in both the vocabulary and the bullying wordlist are selected as insulting seeds. The insulting seeds are then expanded and refined automatically via word embeddings, which defines the bullying features $Z_b$ for layer one. The experiments in Section 4 will show that the construction of the set $Z_b$ is very simple and efficient, requiring little human labor. For the subsequent layers, after obtaining the output of each layer, the set $Z_b$ is updated using feature ranking with the Fisher score according to Eq. (22).

Based on the predefined dropout probabilities $p_b$ and $p_n$ for bullying features and other normal features, respectively, and the bullying feature set $Z_b$, we compute the two expected matrices $E[P]$ and $E[Q]$ according to Eqs. (12) and (11) if the semantic dropout noise is adopted. For the unbiased semantic dropout noise, Eqs. (14) and (15) are used instead of Eqs. (12) and (11) to compute these two expected matrices. Then, we iteratively perform Eq. (21) for $T_{max}$ times, where the initial value for $W$ is calculated based on Eq. (20). When the mapping matrix is learned, the output of each layer is given according to Eq. (8). Due to the stacking structure, the outputs of the $L$ layers and the initial input are concatenated together to form the final representation $Z \in \mathbb{R}^{d(L+1) \times n}$ following Eq. (9). It is clear that the new space has a dimension of $(L + 1)d$. A linear SVM [37] is trained on the training corpus, i.e., the first $n_l$ columns in $Z$, and tested on the rest of the data samples.

3.5 Merits of smSDA

Some important merits of our proposed approach are summarized as follows:
1) Most cyberbullying detection methods rely on the BoW model. Due to the sparsity problems of both data and features, the classifier may not be trained very well. Stacked denoising autoencoder (SDA), as an unsupervised representation learning method, is able to learn a robust feature space. In SDA, the feature correlation is explored by the reconstruction of corrupted data. The learned robust feature representation can then boost the training of the classifier and finally improve the classification accuracy. In addition, the corruption of data in SDA actually generates artificial data to expand the data size, which alleviates the small-size problem of training data.
2) For the cyberbullying problem, we design semantic dropout noise to emphasize bullying features in the new feature space, and the resulting representation is thus more discriminative for cyberbullying detection.
3) The sparsity constraint is injected into the solution of the mapping matrix $W$ for each layer, considering that each word is only correlated to a small portion of the whole vocabulary. We formulate the solution for the mapping weights $W$ as an Iterated Ridge Regression problem, in which the semantic dropout noise distribution can be easily marginalized to ensure the efficient training of our proposed smSDA.
4) Based on word embeddings, bullying features can be extracted automatically. In addition, the possible limitation of expert knowledge can be alleviated by the use of word embeddings.

4 EXPERIMENTS

In this section, we evaluate our proposed semantic-enhanced marginalized stacked denoising auto-encoder (smSDA) with two public real-world cyberbullying corpora. We start by describing the adopted corpora and experimental setup. Experimental results are then compared with other baseline methods to test the performance of our approach. Finally, we provide a detailed analysis to explain the good performance of our method.

4.1 Descriptions of Datasets

Two datasets are used here. One is from Twitter and the other is from MySpace groups. The details of these two datasets are described below.

Twitter Dataset: Twitter is "a real-time information network that connects you to the latest stories, ideas, opinions and news about what you find interesting" (https://about.twitter.com/). Registered users can read and post tweets, which are defined as the messages posted on Twitter with a maximum length of 140 characters.

The Twitter dataset is composed of tweets crawled by the public Twitter stream API through two steps. In Step 1, keywords starting with "bull", including "bully", "bullied" and "bullying", are used as queries in Twitter to preselect some tweets that potentially contain bullying contents. Retweets are removed by excluding tweets containing the acronym "RT". In Step 2, the selected tweets are manually labeled as bullying trace or non-bullying trace based on the contents of the tweets. 7321 tweets are randomly sampled from the whole tweets collections from August 6, 2011 to
August 31, 2011 and manually labeled². It should be pointed out here that labeling is based on bullying traces. A bullying trace is defined as the response of participants to their bullying experience. Bullying traces include not only messages about direct bullying attacks, but also messages reporting a bullying experience, revealing oneself as a victim, etc. Therefore, bullying traces far exceed the incidents of cyberbullying. Automatic detection of bullying traces is valuable for cyberbullying research [38]. Some examples of bullying traces are shown in Figure 3. To preprocess these tweets, a tokenizer is applied without any stemming or stopword removal operations. In addition, some special tokens, including user mentions, URLs and so on, are each replaced by predefined characters. The features are composed of unigrams and bigrams that appear at least twice, and the details of preprocessing can be found in [8]. The statistics of this dataset can be found in Table 1.

MySpace Dataset: MySpace is another web2.0 social networking website. Registered accounts are allowed to view pictures, read chat and check other people's profile information.

The MySpace dataset is crawled from MySpace groups. Each group consists of several posts by different users, which can be regarded as a conversation about one topic. Due to the interactive nature of cyberbullying, each data sample is defined as a window of 10 consecutive posts, and the window is moved forward one post at a time so that we obtain multiple windows [39]. Then, three people independently labeled the data for the existence of bullying content. To be objective, an instance is labeled as cyberbullying only if at least 2 out of 3 coders identify bullying content in the window of posts. The raw text for these data, as XML files, has been kindly provided by Kontostathis et al.³ The XML files contain information about the posts, such as post text, post data, and users' information, which are put into 11 packets. Some posts in MySpace are shown in Figure 4. Here, we focus on content-based mining, and hence we only extract and preprocess the posts' text. The preprocessing steps for the MySpace raw text include tokenization and deletion of punctuation and special characters. Unigram and bigram features are adopted here. The threshold for negligible low-frequency terms is set to 20, considering that one post occurring in a long conversation will occur in at least ten windows. The details of this dataset are shown in Table 1.

Since there are no standard splits of training vs. test datasets in our adopted Twitter and MySpace corpora, we need to define the training and testing datasets. As analyzed above, the lack of a labeled training corpus hinders the development of automatic cyberbullying detection, so the sizes of the training corpora are all controlled to be very small in our experiments. For the Twitter dataset, we randomly select 800 instances, which accounts for 12% of the whole corpus, as the training data, and the rest of the data samples are used as testing data. To reduce variance, the process is repeated ten times so that we have ten sub-datasets from the Twitter data. For the MySpace dataset, we also randomly pick 400 data samples as the training corpus and use the rest of the data for testing. The process is repeated ten times to generate ten sub-datasets constructed from the MySpace data. Finally, we have twenty sub-datasets, of which ten are from the Twitter corpus and the other ten are from the MySpace corpus.

TABLE 1
Statistical Properties of the two datasets.

Statistics           Twitter   MySpace
Feature No.          4413      4240
Sample No.           7321      1539
Bullying Instances   2102      398

Non-Bullying Trace
1 Don't let your mind bully your body into believing it must carry the burden of its worries. #TeamFollowBack
2 Whether life's disabilities, left you outcast, bullied or teased,rejoice and love yourself today, 'Cause baby, you were born this way
3 @USERNAME haha hopefully! Beliebers just bring a new meaning to cyber bullying

Bullying Trace
1 @RodFindlay been sent a few of them. Thought they could bully me about. Put them right and they won't represent the client anymore!
2 He a bully on his block, in his heart he a clown
3 I was bullied #wheniwas13 but now I am the OFFICE bully!!

Fig. 3. Some Examples from Twitter Datasets. Three of them are non-bullying traces and the other three are bullying traces.

P: He lasted 30 seconds then acted like he couldn't get up.........UUUU yea
B_P: And a girly man like you wouldn't last 10 seconds.

P: Heath was ok... I thought Jack Nicholson was a really good Joker though.
B_P: I don't know what the big deal was about the Dark Knight, batman's voice was stupid and over done and heath ledger did a horrible job. Im glad he died. Nothing beats Jack Nickolson's performance of the Joker

Fig. 4. Some Examples from MySpace Datasets. Two Conversations are Displayed and each one includes a normal post (P) and a bullying post (B_P).

2. The dataset: bullyingV3.0, has been kindly provided at http://research.cs.wisc.edu/bullying/data.html
3. The dataset: MySpace Group, has been kindly provided at http://www.chatcoder.com/DataDownload

4.2 Experimental Setup

Here, we experimentally evaluate our smSDA on two cyberbullying detection corpora. The following methods will be compared.

* BWM: Bullying word matching. If the message contains at least one of our defined bullying words, it will be classified as bullying.
* BoW Model: the raw BoW features are directly fed into the classifier.
* Semantic-enhanced BoW Model: This approach is described in [12]. Following the original setting, we scale the bullying features by a factor of 2.
* LSA: Latent Semantic Analysis [18].
* LDA: Latent Dirichlet Allocation [20]. Our implementation of LDA is based on Gensim⁴.
* mSDA: marginalized stacked denoising autoencoder [17].
* smSDA and smSDAu: semantic-enhanced marginalized denoising autoencoder that utilizes semantic dropout noise and the unbiased one, respectively.
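The two simplest baselines in the list above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `bwm_classify` and `sbow_scale`, the toy word list and the toy message are all hypothetical, and only the matching rule and the scaling factor of 2 come from the text.

```python
def bwm_classify(tokens, bullying_words):
    """Bullying word matching: 1 if any token is a bullying word, else 0."""
    return int(any(tok in bullying_words for tok in tokens))


def sbow_scale(bow_counts, bullying_words, factor=2.0):
    """Return a copy of a BoW count dict with bullying features scaled."""
    return {
        term: count * factor if term in bullying_words else float(count)
        for term, count in bow_counts.items()
    }


# Hypothetical toy word list and message, for illustration only.
bullying_words = {"stupid", "loser"}
message = "you are a stupid stupid person".split()

# Build raw BoW counts for the message.
bow = {}
for tok in message:
    bow[tok] = bow.get(tok, 0) + 1

print(bwm_classify(message, bullying_words))
print(sbow_scale(bow, bullying_words))
```

BWM needs no training data, which is why it can be evaluated on the whole corpus, while the scaled BoW vector still has to be passed to a classifier.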
Fig. 5. Word Cloud Visualization of the List of Words with Negative Affect.

For LSA and LDA, the number of latent topics is set to 100 in both cases. In LDA, we set the hyperparameter α for the document-topic multinomial and the hyperparameter η for the word-topic multinomial to 1 and 0.01, respectively. For mSDA⁵, the noise intensity is set to 0.5 and the number of layers is set to 2 for both the Twitter and MySpace datasets. Here, the number of layers is set to a moderate number instead of a large one, considering that a large final dimension will impose a computational burden on the subsequent classifier training.
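As a rough worked example of this dimensionality concern: Section 3.4 states that the final representation has dimension $(L+1)d$. Assuming that $d$ is the feature count reported in Table 1 (an assumption, since the text does not say so explicitly), two layers give

$$
(L+1)\,d = 3 \times 4413 = 13239 \ \text{(Twitter)}, \qquad (L+1)\,d = 3 \times 4240 = 12720 \ \text{(MySpace)}.
$$

Each additional layer would add another $d$ dimensions to the input of the classifier.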
Fig. 6. Word Cloud Visualization of the Bullying Features in Twitter Datasets.

For our proposed methods, smSDA and smSDAu, the noise intensity and the number of layers are set to the same values as in mSDA to give a fair comparison. The bullying noise intensity is set to 0.8, which is larger than 0.5. The hyperparameter λ that controls the sparsity of the transformation matrix is set to 1 for all layers. The number of iteration steps for solving the lasso problems is set to 20. To construct the bullying features $Z_b$ for the first layer, a negative word list containing 350 words is crawled⁶, whose word cloud visualization is shown in Figure 5. The intersection between the BoW features of our own corpus and the predefined bullying word list is obtained first. Then, as described in Section 3.3, the resulting seeds are extended and refined based on word embeddings to form the final bullying features. The threshold for cosine similarity is set to 0.8. The word cloud visualizations of the final bullying features in the Twitter and MySpace datasets are shown in Figures 6 and 7, respectively. The bullying features used in the Semantic-enhanced BoW Model are the same as those in smSDA.
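The seed expansion step described in Section 3.3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the helper names (`cosine`, `embed`, `expand_seeds`) and the two-dimensional toy vectors are hypothetical, and a real run would look the vectors up in the pretrained word2vec model. Only the cosine threshold of 0.8 and the additive bigram embedding of Eq. (21) are taken from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed(term, vectors):
    """Embedding of a unigram, or the sum of both word vectors for a bigram."""
    words = term.split()
    if len(words) == 1:
        return vectors[words[0]]
    left, right = vectors[words[0]], vectors[words[1]]
    return [a + b for a, b in zip(left, right)]


def expand_seeds(seeds, vocabulary, vectors, threshold=0.8):
    """Add vocabulary terms whose similarity to any seed exceeds threshold."""
    expanded = set(seeds)
    for term in vocabulary:
        term_vec = embed(term, vectors)
        if any(cosine(term_vec, embed(s, vectors)) > threshold for s in seeds):
            expanded.add(term)
    return expanded


# Hypothetical 2-d embeddings; real ones would come from a trained word2vec model.
vectors = {
    "fuck": [1.0, 0.1],
    "fck": [0.9, 0.2],
    "book": [0.0, 1.0],
    "read": [0.1, 0.9],
}
features = expand_seeds({"fuck"}, ["fck", "book", "read"], vectors)
print(sorted(features))
```

With these toy vectors, the misspelled variant joins the seed while the two normal words stay out, mirroring the fck example given in Section 3.3.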
A linear SVM [37] is then applied to the new feature space generated by each of the above-mentioned approaches. In the linear SVM, we search for the best regularization parameter C in
{0.0001, 0.001, 0.01, 0.1, 1}. To evaluate the performance of these methods on binary classification, classification accuracy is employed. Considering that both datasets have a class imbalance problem, we also use the F1 score, the harmonic mean of precision and recall, to evaluate the performance of all compared approaches.
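To make the role of the two metrics concrete, the following sketch computes accuracy and F1 for binary labels. It is an illustration only: the function name `accuracy_and_f1` and the toy labels are hypothetical, and the bullying class (label 1) is assumed to be the positive class.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels, with F1 on the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs)
    # Guard against empty denominators when no positives are predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, f1


# Hypothetical imbalanced labels: 2 bullying posts out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_negative = [0] * 10
print(accuracy_and_f1(y_true, all_negative))  # prints (0.8, 0.0)
```

A classifier that never predicts bullying still reaches 80% accuracy on this toy set while its F1 is zero, which is the reason a second metric is reported under class imbalance.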

4. https://radimrehurek.com/gensim/index.html
5. The code has been kindly provided at http://research.cs.wisc.edu/bullying/data.html
6. A collection of insulting words can be found on the website: http://www.noswearing.com/dictionary

Fig. 7. Word Cloud Visualization of the Bullying Features in MySpace Datasets.

4.3 Experimental Results

In this section, we show a comparison of our proposed smSDA method with six benchmark approaches on the Twitter and MySpace datasets. The average results for these two datasets, in terms of classification accuracy and F1 score, are shown in Table 2. Figures 8 and 9 show the results of seven compared approaches on all sub-datasets constructed from the Twitter and MySpace datasets, respectively. Since BWM does not require training documents, its results over the whole corpus are reported in Table 2. It is clear that our approaches outperform the other approaches on both the Twitter and MySpace corpora.

Fig. 8. Classification Accuracies and F1 Scores of All Compared Methods on Twitter Datasets. (Two panels, accuracy and F1 score versus dataset index 0 to 9, for BoW, sBoW, LSA, LDA, mSDA, smSDAu and smSDA.)

The first observation is that the semantic BoW model (sBoW) performs slightly better than BoW. Starting from BoW, sBoW simply scales the bullying features by an arbitrary factor of 2. This means that semantic information can boost the performance of cyberbullying detection. For a fair comparison, the bullying features used in our method and in sBoW are the same. Our approaches, especially smSDA, gain a significant performance improvement compared to sBoW. This is because bullying features only account for a small portion of all features used. It is difficult to learn robust features from small training data by merely intensifying the amplitude of each bullying feature. Our approach aims to find the correlation between normal features and bullying features by reconstructing corrupted data so as to yield robust features. In addition, Bullying Word Matching (BWM), as a simple and intuitive method of using semantic information, gives the worst performance. In BWM, the existence of bullying words is defined as the rule for classification. This shows that only an elaborate use of such bullying words, rather than a simple one, can help cyberbullying detection.

We also compare our methods with two state-of-the-art text representation learning methods, LSA and LDA. These two methods do not produce good performance on all datasets. This may be because both methods are dimensionality reduction techniques, which are performed on the document-word occurrence matrix. Although the two methods try to minimize the reconstruction error as our approach does, the optimization in LSA and LDA is conducted after dimensionality reduction. The reduced dimension is a key parameter that determines the quality of the learned feature space. Here, we fix the dimension of the latent space to 100. A deliberate search over this parameter might therefore improve the performance of LSA and LDA, but the selection of this hyperparameter is itself another tough research topic. Another reason may be that the number of data samples is small (less than 2000) and the length of each Internet message is short (for Twitter, the maximum length is 140 characters), and thus the constructed document-word occurrence matrix may not represent the true co-occurrence of terms.

Deep learning methods, including mSDA and smSDA, generally outperform the other standard approaches. This trend is particularly prominent in the F1 measure because cyberbullying detection problems are class-imbalanced. The larger improvements in F1 score further verify the performance of our approach. Deep learning models have achieved remarkable performance in various scenarios owing to their robust feature learning ability [22]. mSDA is able to capture the correlation between input features and combine the correlated features by reconstructing masked feature values from uncorrupted feature values. Further, the stacking structure and the nonlinearity contribute to mSDA's ability to discover complex factors behind data. Based on mSDA, our proposed smSDA utilizes semantic dropout noise and sparsity constraints on the mapping matrix, while the efficiency of training is kept. This extension leads to a stable performance improvement on cyberbullying detection, and the detailed analysis is provided in the following section.

We compare the performances of smSDA and smSDAu, which adopt biased semantic dropout noise and unbiased semantic dropout noise, respectively. The results show that smSDAu performs slightly worse than smSDA. This may be explained by the fact that the unbiased semantic dropout noise cancels the enhancement of bullying features. As shown in Eq. (14), the off-diagonal elements in the matrix $x_i \tilde{x}_i^T$ that are used to compute the mapping weights are the same, which cannot contribute to the reinforcement of bullying features.

4.4 Analysis of Semantic Extension

As shown in Section 4.3, the semantic extension can boost the classification performance for cyberbullying detection. In this section, we discuss the advantages of this extension qualitatively. In our proposed smSDA, because of the semantic dropout noise and sparsity constraints, the learned representation is able to discover the correlation between words containing latent bullying semantics. Table 3 shows the reconstructed terms of three example bullying words for mSDA and smSDA, respectively. In this example, a one-hot vector is used as input, which represents a document containing one bullying word. Table 3 lists the reconstructed terms in decreasing order of their feature values, which represent the strength of their correlations with the input word. The results are obtained using a one-layer architecture without non-linear activation, considering that the raw terms directly correspond to
each output dimension under such a setting. It is shown that the reconstructed words discovered by smSDA are more correlated to bullying words than those discovered by mSDA. For example, fucking is reconstructed by because, friend, off, gets in mSDA. Except for off, the other three words seem to be unreasonable. However, in smSDA, fucking is reconstructed by off, pissed, shit and of. The occurrence of the term of may be due to the frequent misspelling in Internet writing. It is obvious that the correlation discovered by smSDA is more meaningful. This indicates that smSDA can learn the word correlations which may be signs of bullying semantics, and therefore the learned robust features boost the performance on cyberbullying detection.

TABLE 2
Accuracies (%) and F1 Scores (%) for Compared Methods on Twitter and MySpace Datasets. The Mean Values are Given. The Best Performance in Each Row is Obtained by smSDA.

Dataset   Measures     BWM    BoW    sBoW   LSA    LDA    mSDA   smSDAu  smSDA
Twitter   Accuracies   69.3   82.6   82.7   81.6   81.1   84.1   82.9    84.9
Twitter   F1 Scores    16.1   68.1   68.3   65.8   66.1   70.4   69.3    71.9
MySpace   Accuracies   34.2   80.1   80.1   77.7   77.8   87.8   88.0    89.7
MySpace   F1 Scores    36.4   41.2   42.5   45.0   43.1   76.1   76.0    77.6

Fig. 9. Classification Accuracies and F1 Scores of All Compared Methods on MySpace Datasets. (Two panels, accuracy and F1 score versus dataset index 0 to 9, for BoW, sBoW, LSA, LDA, mSDA, smSDAu and smSDA.)

TABLE 3
Term Reconstruction on Twitter datasets. Each Row Shows a Specific Bullying Word, along with the Top-4 Reconstructed Words (listed in decreasing order of their values) via mSDA and smSDA.

Bullying Word   Reconstructed Words (mSDA)    Reconstructed Words (smSDA)
bitch           @USER, shut, friend, tell     @USER, HTTPLINK, fuck up, shut
fucking         because, friend, off, gets    off, pissed, shit, of
shit            some, big, with, lol          abuse, this shit, shit lol, big

5 CONCLUSION

This paper addresses the text-based cyberbullying detection problem, where robust and discriminative representations of messages are critical for an effective detection system. By designing semantic dropout noise and enforcing sparsity, we have developed the semantic-enhanced marginalized denoising autoencoder as a specialized representation learning model for cyberbullying detection. In addition, word embeddings have been used to automatically expand and refine bullying word lists that are initialized by domain knowledge. The performance of our approaches has been experimentally verified on two cyberbullying corpora from social media: Twitter and MySpace. As a next step, we are planning to further improve the robustness of the learned representation by considering word order in messages.

ACKNOWLEDGMENTS

We thank Junming Xu and Prof. Jerry Zhu for their Twitter datasets.

REFERENCES

[1] A. M. Kaplan and M. Haenlein, "Users of the world, unite! the challenges and opportunities of social media," Business horizons, vol. 53, no. 1, pp. 59–68, 2010.
[2] R. M. Kowalski, G. W. Giumetti, A. N. Schroeder, and M. R. Lattanner, "Bullying in the digital age: A critical review and meta-analysis of cyberbullying research among youth," 2014.
[3] M. Ybarra, "Trends in technology-based sexual and non-sexual aggression over time and linkages to nontechnology aggression," National Summit on Interpersonal Violence and Abuse Across the Lifespan: Forging a Shared Agenda, 2010.
[4] B. K. Biggs, J. M. Nelson, and M. L. Sampilo, "Peer relations in the anxiety–depression link: Test of a mediation model," Anxiety, Stress, & Coping, vol. 23, no. 4, pp. 431–447, 2010.
[5] S. R. Jimerson, S. M. Swearer, and D. L. Espelage, Handbook of bullying in schools: An international perspective. Routledge/Taylor & Francis Group, 2010.
[6] G. Gini and T. Pozzoli, "Association between bullying and psychosomatic problems: A meta-analysis," Pediatrics, vol. 123, no. 3, pp. 1059–1065, 2009.
[7] A. Kontostathis, L. Edwards, and A. Leatherman, "Text mining and cybercrime," Text Mining: Applications and Theory. John Wiley & Sons, Ltd, Chichester, UK, 2010.
[8] J.-M. Xu, K.-S. Jun, X. Zhu, and A. Bellmore, "Learning from bullying traces in social media," in Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: Human language technologies. Association for Computational Linguistics, 2012, pp. 656–666.
[9] Q. Huang, V. K. Singh, and P. K. Atrey, "Cyber bullying detection using social and textual analysis," in Proceedings of the 3rd International Workshop on Socially-Aware Multimedia. ACM, 2014, pp. 3–6.

[10] D. Yin, Z. Xue, L. Hong, B. D. Davison, A. Kontostathis, and L. Edwards, "Detection of harassment on web 2.0," Proceedings of the Content Analysis in the WEB, vol. 2, pp. 1–7, 2009.
[11] K. Dinakar, R. Reichart, and H. Lieberman, "Modeling the detection of textual cyberbullying," in The Social Mobile Web, 2011.
[12] V. Nahar, X. Li, and C. Pang, "An effective approach for cyberbullying detection," Communications in Information Science and Management Engineering, 2012.
[13] M. Dadvar, F. de Jong, R. Ordelman, and R. Trieschnigg, "Improved cyberbullying detection using gender information," in Proceedings of the 12th Dutch-Belgian Information Retrieval Workshop (DIR2012). Ghent, Belgium: ACM, 2012.
[14] M. Dadvar, D. Trieschnigg, R. Ordelman, and F. de Jong, "Improving cyberbullying detection with user context," in Advances in Information Retrieval. Springer, 2013, pp. 693–696.
[15] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
[16] P. Baldi, "Autoencoders, unsupervised learning, and deep architectures," Unsupervised and Transfer Learning Challenges in Machine Learning, Volume 7, p. 43, 2012.
[17] M. Chen, Z. Xu, K. Weinberger, and F. Sha, "Marginalized denoising autoencoders for domain adaptation," arXiv preprint arXiv:1206.4683, 2012.
[18] T. K. Landauer, P. W. Foltz, and D. Laham, "An introduction to latent semantic analysis," Discourse processes, vol. 25, no. 2-3, pp. 259–284, 1998.
[19] T. L. Griffiths and M. Steyvers, "Finding scientific topics," Proceedings of the National academy of Sciences of the United States of America, vol. 101, no. Suppl 1, pp. 5228–5235, 2004.
[20] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent dirichlet allocation," the Journal of machine Learning research, vol. 3, pp. 993–1022, 2003.
[21] T. Hofmann, "Unsupervised learning by probabilistic latent semantic analysis," Machine learning, vol. 42, no. 1-2, pp. 177–196, 2001.
[22] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 35, no. 8, pp. 1798–1828, 2013.
[23] B. L. McLaughlin, A. A. Braga, C. V. Petrie, M. H. Moore et al., Deadly Lessons: Understanding Lethal School Violence. National Academies Press, 2002.
[24] J. Juvonen and E. F. Gross, "Extending the school grounds? Bullying experiences in cyberspace," Journal of School health, vol. 78, no. 9, pp. 496–505, 2008.
[25] M. Fekkes, F. I. Pijpers, A. M. Fredriks, T. Vogels, and S. P. Verloove-Vanhorick, "Do bullied children get ill, or do ill children get bullied? a prospective cohort study on the relationship between bullying and health-related symptoms," Pediatrics, vol. 117, no. 5, pp. 1568–1574, 2006.
[26] M. Ptaszynski, F. Masui, Y. Kimura, R. Rzepka, and K. Araki, "Brute force works best against bullying," in Proceedings of IJCAI 2015 Joint Workshop on Constraints and Preferences for Configuration and Recommendation and Intelligent Techniques for Web Personalization. ACM, 2015.
[27] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society. Series B (Methodological), pp. 267–288, 1996.
[28] C. C. Paige and M. A. Saunders, "Lsqr: An algorithm for sparse linear equations and sparse least squares," ACM Transactions on Mathematical Software (TOMS), vol. 8, no. 1, pp. 43–71, 1982.
[29] M. A. Saunders et al., "Cholesky-based methods for sparse least squares: The benefits of regularization," Linear and Nonlinear Conjugate Gradient-Related Methods, pp. 92–100, 1996.
[30] J. Fan and R. Li, "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American statistical Association, vol. 96, no. 456, pp. 1348–1360, 2001.
[31] C. Vogel, Computational Methods for Inverse Problems. Society for Industrial and Applied Mathematics, 2002. [Online]. Available: http://epubs.siam.org/doi/abs/10.1137/1.9780898717570
[32] T. Mikolov, K. Chen, G. Corrado, and J. Dean, "Efficient estimation of word representations in vector space," arXiv preprint arXiv:1301.3781, 2013.
[33] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, "Distributed representations of words and phrases and their compositionality," in Advances in neural information processing systems, 2013, pp. 3111–3119.
[34] F. Godin, B. Vandersmissen, W. De Neve, and R. Van de Walle, "Named entity recognition for twitter microposts using distributed word representations," in Proceedings of the Workshop on Noisy User-generated Text. Beijing, China: Association for Computational Linguistics, July 2015, pp. 146–153. [Online]. Available: http://www.aclweb.org/anthology/W15-4322
[35] T. H. Dat and C. Guan, "Feature selection based on fisher ratio and mutual information analyses for robust brain computer interface," in Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, vol. 1. IEEE, 2007, pp. I–337.
[36] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern classification. John Wiley & Sons, 2012.
[37] C.-C. Chang and C.-J. Lin, "Libsvm: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
[38] J. Sui, "Understanding and fighting bullying with machine learning," Ph.D. dissertation, THE UNIVERSITY OF WISCONSIN-MADISON, 2015.
[39] J. Bayzick, A. Kontostathis, and L. Edwards, "Detecting the presence of cyberbullying using computer software," in Proceedings of the ACM WebSci'11. Koblenz, Germany: ACM, June 2011, pp. 1–2.

Rui Zhao received the BEng in Measurement and Control from Southeast University, Nanjing, China, in 2012. He is currently pursuing the Ph.D. degree from the School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore. His current research interests include text mining and machine learning.

Kezhi Mao received his BEng, MEng and PhD from Jinan University, Northeastern University and Sheffield University in 1989, 1992 and 1998, respectively. He worked as a Lecturer at Northeastern University from March 1992 to May 1995, a Research Associate at University of Sheffield from April 1998 to September 1998, a Research Fellow at Nanyang Technological University from September 1998 to May 2001, and an Assistant Professor at School of Electrical and Electronic Engineering, Nanyang Technological University from June 2001 to Sept 2005. He has been an Associate Professor since October 2005. His areas of interests include computational intelligence, pattern recognition, text mining and knowledge extraction, cognitive science, and big data and text analytics.
