Document Representation and Feature Combination for Deceptive Spam Review Detection

Luyang Li, Bing Qin, Wenjing Ren, Ting Liu
Research Center for Social Computing and Information Retrieval,
Harbin Institute of Technology, Harbin, China
Abstract
Deceptive spam reviews of products or services are harmful to customers in decision making. Existing approaches to detecting deceptive spam reviews concentrate on feature design. Hand-crafted features can capture some linguistic phenomena, but can hardly reveal the latent semantic meaning of a review. We present a neural network based model to learn the representation of reviews. The model applies a hard attention in the composition from sentence representations into the document representation: specifically, we compute an importance weight for each sentence and incorporate it into the composition process. In the mixed-domain detection experiment, the results verify the effectiveness of our model in comparison with other neural network based methods. As feature selection is very important in this direction, we use a feature combination to enhance performance, reaching an F1 value of 86.1%, which outperforms the state-of-the-art method. In the cross-domain detection experiment, our method shows better robustness.
Keywords: Spam review detection, Opinion spam, Representation learning
1. Introduction
Deceptive opinion spam detection is an urgent and meaningful task in the field of opinion mining. Spammers post undeserved positive reviews to promote products or services, or unjust negative reviews to damage the reputations of the objects [6]. It is very difficult for people to distinguish deceptive spam: in the test of Ott et al. [5], the average accuracy of three human judges is only 57.33%. Hence, research on detecting deceptive opinion spam is necessary and meaningful.
Reviews are commonly short documents. The objective of the task is to distinguish whether a document is spam or truthful, so the task can be cast as a two-category classification problem. The majority of existing approaches follow Ott et al. [5] and utilize machine learning algorithms to build classifiers. In this direction, most studies focus on designing effective features to enhance classification performance. Feature engineering is important; however, it can hardly capture the inherent regularities of the data from a semantic perspective. Given the recent success of neural network based models in natural language processing tasks, the document-level representation can be learnt by neural network based models and used as features of the review.
In this work, we compare and analyze representation learning algorithms and conventional features on this problem. We present a novel method, the sentence-weighted neural network (SWNN), to learn the document-level representation of a review and detect spam reviews. Learning the representation of the document captures global features and takes word order and sentence order into consideration. We also combine hand-crafted features with SWNN; to our knowledge, these features are used jointly for the first time in spam review detection. We verify the effectiveness of SWNN and the feature combination in two types of experiments: one verifies the capability of domain migration on a cross-domain dataset, and the other verifies domain-independent spam review detection on a mixed-domain dataset. The experiments run on public data sets [7]. The domain migration experiment shows that the feature combination with SWNN has the best robustness. The domain-independent experiment shows that the feature combination with unigrams performs better than the feature combination with SWNN, and the final result outperforms other strong baselines.
It should be noted that this work extends our previous work on learning document representation for deceptive opinion spam detection [8], in which we presented the SWNN model. In this work, we make two improvements. First, we introduce some new syntax features and use the feature combination to address the problem. Second, we incorporate the syntax features with SWNN to jointly detect spam reviews. The experimental results outperform the original ones.
2. Related work
We present a brief review of related work from two perspectives: one is deceptive opinion spam detection, and the other is neural networks for task-specific representation learning.
2.1. Deceptive Opinion Spam Detection
On the Internet, various kinds of spam cause trouble for people, and many studies over the years have focused on spam detection. Web spam has been extensively studied [9, 10, 11, 12, 13, 14, 15]; its objective is to gain high page rank and attract clicks by fooling search engines. Email spam is another related research area, which pushes unsolicited advertisements to users [16, 17]. Social media spam spreads rumors on social media [18]. Web spam and email spam share a common characteristic: they contain irrelevant words. Opinion spam is quite different and more crafty, since it contains opinions of users about products and services. With the explosive growth of user-generated content, the amount of opinion spam in reviews increases continuously, which attracts the attention of researchers. Opinion spam was first investigated by Jindal and Liu [6], who also categorized opinion spam into different types. In terms of damage to users, opinion spam can be further divided into two types: deceptive opinion spam and product-irrelevant spam. In the former, spammers give undeserved positive reviews or unjust negative reviews to the object to mislead customers; the latter contains no comments about the object. Obviously, deceptive opinion spam is more difficult to detect.
The approaches to detecting deceptive opinion spam can be divided into unsupervised methods and supervised methods. Liu et al. [19] take a Bayesian approach and formulate opinion spam detection as a clustering problem. There are also many unsupervised methods for spammer detection [20, 21, 22, 23] and reviewing pattern mining [24]. Due to the lack of gold-standard data, most methods conduct research on pseudo-labeled data. Jindal and Liu [6] assume duplicate and near-duplicate reviews to be deceptive spam; they also apply features of review texts, reviewers and products. Yoo and Gretzel [25] first collect a small amount of deceptive spam and truthful reviews and perform a linguistic analysis on them. Using Amazon Mechanical Turk, Ott et al. [5, 26, 27] gather gold-standard labeled data, and a few follow-up spam detection methods have been evaluated on this data set: Ott et al. estimate the prevalence of deceptive opinion spam in reviews [26] and identify negative spam [27]; Li et al. [28] identify manipulated offerings on review portals; Feng et al. [29] extract syntactic features from context-free grammar parse trees to improve model performance; and Feng and Hirst [30] take into account a group of reference reviews on the same product. Although Ott's data sets contain deceptive opinion spam, they cannot reflect realistic conditions, since cross-domain data are lacking and the Turkers lack professional knowledge. Li et al. [7] create a cross-domain data set (i.e. hotel, restaurant, and doctor) with part of the reviews written by domain experts. On this labeled data set, they use n-gram features as well as POS and LIWC features for classification and show that POS features perform more robustly on cross-domain data.
2.2. Neural Networks for Specific Task Representation Learning
Neural network based models have been used to learn features in a variety of natural language processing tasks [32, 33, 34], such as POS tagging, chunking, named entity recognition [32, 35], semantic role labeling, parsing [36], language modeling [37, 38], sentiment analysis [39, 40] and text classification [41]. Learning text representation generally contains two processing stages. First, word embeddings are learnt from massive text corpora. Some work utilizes the global context of documents and multiple word prototypes [42], or global word-word co-occurrence statistics [43], to improve word embeddings; there is also work on task-specific word embeddings [40]. After obtaining
word representation, many studies focus on researching the semantic composition of phrases, sentences and documents.
[Figure 1: The architecture of the basic convolutional neural network, with lookup, convolutional, pooling and non-linear layers, illustrated on the sentence "The Chicago Hilton is very great".]
Socher et al. present the Recursive Neural Network (RNN) [46], the matrix-vector RNN [47] and the Recursive Neural Tensor Network (RNTN) [39] to learn the semantics of variable-length phrases. Hermann et al. [48] learn the semantics of sentences through Combinatory Categorial Grammar.
3. Methodology
In this section, we present the details of the neural network based models we use to learn document representation for deceptive spam review detection. We develop two convolutional neural network models to learn document representation. In the following subsections, we first introduce the conventional model and then present the details of our proposed models.
[Figure 2: The architecture of the SCNN model. Sentence-level convolution layers produce sentence vectors s1 ... sn, a document-level convolution composes them into a document vector, followed by tanh and linear layers.]
Basic CNN model. As shown in Fig. 1, the basic model is a convolutional neural network which consists of four layers. Given a sentence "The Chicago Hilton is very great", the model applies the lookup layer to map the words into corresponding word embeddings, which are continuous real-valued vectors. The convolutional layer then extracts local feature vectors over windows of words, and the pooling layer produces a global feature vector by combining the local feature vectors from the previous layers. Common pooling operations are averaging or taking the maximum over the corresponding vectors: the average operation captures the influence of all words on the task at hand, while the max operation captures the most useful local features produced by the convolutional layer. Finally, the non-linear layer is necessary to extract high-level features.
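To make the two pooling operations concrete, here is a minimal numpy sketch; the array shapes and values are illustrative and not taken from the paper.

```python
import numpy as np

def average_pooling(local_features):
    # Average over window positions: every word's local feature
    # contributes equally to the global feature vector.
    return np.mean(local_features, axis=0)

def max_pooling(local_features):
    # Keep the strongest activation per dimension: only the most
    # salient local features from the convolutional layer survive.
    return np.max(local_features, axis=0)

# Toy output of a convolutional layer: 4 window positions, 3 feature maps.
local = np.array([[0.1, 0.9, 0.2],
                  [0.4, 0.1, 0.8],
                  [0.3, 0.5, 0.5],
                  [0.2, 0.3, 0.1]])
print(average_pooling(local))  # [0.25 0.45 0.4 ]
print(max_pooling(local))      # [0.4  0.9  0.8 ]
```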
[Figure 3: The architecture of the SWNN model. Convolution layers over the lookup output produce sentence vectors s1 ... sn; a sentence weight generation step yields weights α1 ... αn used in weighted pooling, followed by tanh and linear layers.]
SCNN model. As the architecture in Fig. 2 shows, the SCNN model replaces the pooling operation with a document-level convolutional layer that composes the sentence vectors into the document representation. For training we choose a hinge loss function that is more suitable for our objective. It is defined as shown in Eq. 1, where t is the gold label of the review r, t∗ stands for the other label, and mδ is the margin used in the experiment:

$loss(r) = \max(0,\ m_\delta - score(r, t) + score(r, t^*))$  (1)
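As a minimal sketch of this criterion, assuming the standard pairwise margin form of Eq. 1 above (the original manuscript may use a different notation), the loss can be computed from the two category scores produced by the linear layer:

```python
def hinge_loss(scores, gold, margin=1.0):
    # scores: linear-layer score for each of the two categories.
    # gold: index of the gold label t; 1 - gold is the other label t*.
    other = 1 - gold
    return max(0.0, margin - scores[gold] + scores[other])

# Toy scores for one review: [truthful, spam].
print(hinge_loss([0.2, 0.9], gold=1))  # 0.3: correct ranking, inside margin
print(hinge_loss([0.2, 0.9], gold=0))  # 1.7: wrong ranking is penalized
```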
SWNN model. U = {U1, ..., Ui, ..., Un} is the universal set of words in the review, where Ui is the word set of the ith sentence, and Wj stands for the weight of the jth word. The sentence weight is a normalized value computed as in the following formula:

$\alpha_i = \frac{\sum_{j \in U_i} W_j}{\sum_{k \in U} W_k}$  (2)
As shown in Fig. 3, the input review is transformed into fixed-length sentence vectors through the convolutional layer, and the sentence weight generation process produces the normalized weight αi for the ith sentence. Through the pooling layer, the sentence vectors are composed into a document vector by a weighted-average operation, so that more important sentences have more influence on the document vector. The vector then passes through a tanh layer to extract high-level features, and the linear layer produces the scores of the categories.
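The following numpy sketch illustrates Eq. 2 and the weighted-average composition; the word weights, sentence count and vector dimension are invented for illustration.

```python
import numpy as np

def sentence_weights(word_weights):
    # Eq. 2: alpha_i = (sum of word weights in sentence i) /
    #                  (sum of word weights over the whole review U).
    sums = np.array([w.sum() for w in word_weights])
    return sums / sums.sum()

def weighted_pooling(sentence_vectors, alphas):
    # Weighted average: more important sentences have more influence
    # on the resulting document vector.
    return (alphas[:, None] * sentence_vectors).sum(axis=0)

# Toy review with three sentences (word weights per sentence U_1..U_3).
word_weights = [np.array([0.2, 0.1]), np.array([0.5, 0.4, 0.3]), np.array([0.1])]
alphas = sentence_weights(word_weights)        # [0.1875, 0.75, 0.0625]
sent_vecs = np.random.randn(3, 50)             # sentence vectors from the conv layer
doc_vec = weighted_pooling(sent_vecs, alphas)  # 50-dim document vector
```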
3.3. Features
We add two types of features to the proposed model: POS tags, which capture syntactic phenomena, and first-person pronouns. By incorporating these features, the SWNN model can capture both semantic and syntactic information.
POS. In Li's analysis [7] of spam reviews versus truthful reviews, the observations on the POS distribution agree with earlier findings in the literature [51, 52]: truthful reviews contain more nouns (N), adjectives (JJ), prepositions (IN) and determiners (DT), while spam reviews contain more verbs (V), adverbs (RB), pronouns (PRP) and pre-determiners (PDT). Thus, POS tag frequencies are used as features.
First-person pronouns. The usage of first-person pronouns in deception has been studied in the literature [53, 54, 55, 52], where spam reviews are found to contain fewer first-person pronouns.
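As an illustration of how such features can be extracted, here is a hedged sketch using NLTK's off-the-shelf tagger; the paper does not name its tagging tool, and the tag grouping and pronoun list below are one plausible reading of the classes listed above.

```python
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk
from collections import Counter

FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours",
                "myself", "ourselves"}
# Map Penn Treebank tag prefixes to the coarse classes discussed above.
CLASSES = {"NN": "N", "JJ": "JJ", "IN": "IN", "DT": "DT",
           "VB": "V", "RB": "RB", "PR": "PRP", "PD": "PDT"}

def syntax_features(review):
    tokens = nltk.word_tokenize(review)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    counts = Counter(CLASSES[t[:2]] for t in tags if t[:2] in CLASSES)
    n = max(len(tokens), 1)
    feats = {c: counts[c] / n for c in set(CLASSES.values())}
    feats["first_person"] = sum(w.lower() in FIRST_PERSON for w in tokens) / n
    return feats

print(syntax_features("We loved the hotel. I would definitely stay again."))
```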
3.4. Complexity
We make a theoretical analysis of the time complexity of the proposed methods. When training neural network based models, the time is mainly spent on the matrix-vector multiplications in the convolutional and linear layers. With n denoting the length of a review and d the dimension of the hidden vectors, each layer costs O(d2) per position, so the time complexity of the proposed methods is O(n ∗ d2).
4. Experiments
We conduct experiments to empirically evaluate our document representation learning model by applying it to spam review detection. We run two types of experiments, cross-domain classification and mixed-domain classification, and also verify the effectiveness of the feature combination.
The data sets cover three domains (hotel, restaurant and doctor) and come from different data sources. The spam reviews are written by Turkers and experts: Li [7] and Ott [5, 27] use Amazon Mechanical Turk to collect deceptive reviews from online workers (Turkers), while experts are employees in each domain who have expert-level domain knowledge. The truthful reviews are from customers who really have consumption experience.
In the cross-domain classification, we want to make a comparison with Li's method. According to Li's paper, he uses only 200 of the 356 spam reviews in the doctor domain and does not use the "Expert" data in his experiments. Hence, we do our best to use data with the same distribution in the cross-domain experiments.
In the mixed-domain spam review classification, all the spam review samples from Turkers and experts in Table 1 and the truthful reviews from customers are used, giving 1,636 spam reviews and 1,200 truthful reviews. We use five-fold cross validation: the data is split into five equal folds, four folds are used as training data, and the remaining fold is used as test data.
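A minimal sketch of this protocol with scikit-learn; the feature matrix is a random stand-in for any of the document representations discussed in this paper.

```python
import numpy as np
from sklearn.model_selection import KFold

# Stand-in data: 1,636 spam (label 1) and 1,200 truthful (label 0) reviews.
labels = np.array([1] * 1636 + [0] * 1200)
features = np.random.randn(len(labels), 100)  # any document representation

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kf.split(features)):
    X_train, y_train = features[train_idx], labels[train_idx]  # four folds
    X_test, y_test = features[test_idx], labels[test_idx]      # held-out fold
    # ... train the classifier on X_train and evaluate on X_test ...
```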
Table 1: Statistics of the three-domain dataset.
We use accuracy (A), precision (P), recall (R) and F1 score to evaluate the effectiveness of the methods. Accuracy reflects the prediction capability on both spam and non-spam samples. Precision reflects the correctness of the samples predicted as spam. Recall reflects the coverage of correctly predicted spam samples among the true spam samples. F1 reflects a trade-off between precision and recall.
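For concreteness, the four scores can be computed from raw predictions as follows, with spam as the positive class; this is a generic sketch rather than the paper's evaluation script.

```python
def evaluate(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam predicted as spam
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # truth predicted as spam
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam predicted as truth
    acc = sum(t == p for t, p in pairs) / len(pairs)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1

print(evaluate([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))  # (0.6, 0.667, 0.667, 0.667)
```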
Li et al. [7] apply SAGE, a combination of topic models and generalized additive models; however, SAGE does not outperform SVM. We therefore apply SVM 1 as the classifier in the comparison experiment. In Li's experiment, the method obtains its best results using unigram and POS features on the test data sets (restaurant and doctor domains) when training on hotel domain data. Hence, we list the best results from his paper.
Results and Analysis. Table 2 shows the results of the baseline method as well as our methods. Unigram gets the best result on the restaurant domain, but it is less robust on the doctor domain, while the neural network based methods perform more robustly across domains. We can see that SWNN with features does not perform as well as Basic CNN with features; the reason is that the sentence weights from SWNN are domain-specific.

1 We use LIBSVM as the software tool to run the SVM classifier.
                   Restaurant                  Doctor
Features           A     P     R     F1        A     P     R     F1
Unigram            0.785 0.813 0.742 0.778     0.550 0.573 0.725 0.617
POS                0.735 0.697 0.815 0.751     0.540 0.521 0.975 0.679
Paragraph-average  0.733 0.684 0.865 0.764     0.588 0.555 0.885 0.682
Basic CNN+POS+I    0.725 0.679 0.855 0.757     0.583 0.548 0.950 0.695
SWNN               0.690 0.644 0.850 0.733     0.610 0.573 0.860 0.688
SWNN+POS+I         0.668 0.612 0.915 0.733     0.615 0.576 0.870 0.693

Table 2: Classifier performance on cross-domain test data.
4.3. Mixed-domain Classification
We gather all the domain data into a mixed-domain dataset, on which we verify the effectiveness of the proposed neural network method as well as SVM with the feature combination. Comparison experiments are also made among different neural network methods. All results are based on five-fold cross validation.
[Table 3: classification results on the mixed-domain dataset (Model, A, P, R, F1).]
We adopt previous methods as baselines [5, 7], which utilize SVM as the classifier with unigram, bigram and POS as traditional features.
Results and Analysis. The results are shown in Table 3, in which unigram with POS and first-person features gets the best results in accuracy and F1. In the spam review detection task, unigram is a strong feature: even used alone, unigram has a higher F1 value than SWNN. However, SWNN with features gains the highest precision, which is useful in practical applications of spam review detection.
4.3.1. Comparisons among Neural Network based Methods
We apply various neural network based methods to learn the document representation and perform spam review classification. The experiments are conducted on the mixed-domain dataset with five-fold cross validation.
Model              A     P     R     F1
Paragraph-average  0.729 0.704 0.915 0.795
Weight-average     0.680 0.652 0.955 0.775
Basic LSTM         0.550 0.590 0.720 0.720
Hier-LSTM          0.618 0.608 0.949 0.741
Basic CNN          0.708 0.694 0.883 0.776
SCNN               0.702 0.698 0.851 0.766
SWNN               0.801 0.800 0.873 0.834

Table 4: Comparison among neural network based methods on the mixed-domain dataset.
Paragraph-average, Weight-average, Basic LSTM and Hier-LSTM are baseline models for learning document representation. Basic CNN is the basic convolutional neural network model: the sentences are represented through the convolutional layer and transformed into a document vector by an average-pooling operation. SCNN applies a document-level convolutional layer to replace the average operation. SWNN is the modification of the Basic CNN model that uses sentence weights.
Results and Analysis. We compare the various document representations. Table 4 shows that our SWNN model gains the best result in deceptive spam classification: its accuracy and F1 scores are both well above the other neural network based methods. The results show the effectiveness of incorporating sentence weights in representing documents. We also find that more complex models like SCNN and Hier-LSTM do not perform as well as simpler models like the Paragraph-average and Basic CNN models. Overfitting is a primary reason: Hier-LSTM gains an F1 value of 97% on the training data, but a low result on the test data. For a small dataset, neural network based models with many parameters are not necessarily a good choice.
Meanwhile, we analyze the spam review detection capacity of SWNN on each domain. From Table 5, we find that the proposed method performs best on the restaurant domain and worst on the doctor domain. The reviews in the hotel and restaurant domains share more linguistic phenomena [7]; hence, the model generalizes better on restaurant reviews than on doctor reviews.
Domain      P     R     F1
Hotel       0.841 0.833 0.837
Restaurant  0.870 0.882 0.876
Doctor      0.850 0.810 0.829

Table 5: The performance of SWNN on each domain.
From the hyper-parameter tuning curves, we can see that the averaged accuracy and F1 value both peak when the window size is set to 2, the hidden layer length to 50, and the learning rate to 0.3. Thus, we use these settings in our experiments.
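For reference, the selected settings can be summarized as a configuration sketch (the key names are illustrative, not from the paper):

```python
SWNN_CONFIG = {
    "window_size": 2,     # convolution window over words
    "hidden_size": 50,    # hidden layer length
    "learning_rate": 0.3,
}
```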
CE
justing method. There are four kernel functions to be chosen in the LIBSVM,
which are linear kernel function, polynomial kernel function, gaussian kernel
function (also called radial basis function) and sigmoid kernel function. Different
kernel functions are used with different parameters. We tune parameter c with
linear kernel function; d, g, r and c with polynomial kernel function; g and c with
gaussian kernel function; g, r and c with sigmoid kernel function. We tune each
[Figure: averaged accuracy and F1 of SWNN under different window sizes, hidden layer lengths and learning rates.]
We find that only some of the parameters affect the classification results for the corresponding kernel function. Specifically, fine tuning c can enhance the detection capacity of the models with the linear, gaussian or sigmoid kernel functions, while g and r can enhance the models with the polynomial kernel function. The effect of these parameters for each kernel function is tested by five-fold cross validation; the detailed results are shown in the appendix. When we adopt the gaussian kernel function and set c to 400, the model obtains the best results. The results of each kernel function with the most suitable parameters are shown in Table 6.
Kernel function  A     P     R     F1    Parameter setting
Linear           0.827 0.838 0.870 0.853  c=1
Polynomial       0.828 0.826 0.891 0.857  c=1, g=0.5, r=10, d=3
Gaussian         0.835 0.839 0.885 0.861  c=400, g=0.5
Sigmoid          0.832 0.836 0.883 0.859  c=400, g=0.5, r=0

Table 6: The classification results of SVM with different kernel functions and suitable parameters.
Finally, we adopt the gaussian kernel function and set c to 400 and g to 0.5 in our experiments.
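The experiments use LIBSVM directly; in scikit-learn, whose SVC wraps LIBSVM, the equivalent setting would look roughly like this (LIBSVM's c and g correspond to C and gamma; the feature matrix is a stand-in):

```python
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(2836, 100)            # stand-in combined feature vectors
y = np.array([1] * 1636 + [0] * 1200)     # spam / truthful labels

clf = SVC(kernel="rbf", C=400, gamma=0.5) # gaussian kernel, Table 6 settings
clf.fit(X, y)
```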
5. Conclusion
We find that the proposed sentence-weighted neural network is more effective than other neural network based models for deceptive spam review detection. We also find that neural network based methods are more robust than hand-crafted features on cross-domain data. However, the hard attention in our model works in a fixed mode, which could be improved by a soft alignment and a more flexible mode. Computing the importance weight of each sentence is intuitively useful for representing the document; thus, a memory network based model may be effective for this problem. We will verify these ideas in future work.
6. Acknowledgments
This work was supported by the National High Technology Research and Development Program of China (863 Program) via grant 2015AA015407, and the National Natural Science Foundation of China (NSFC) via grants 61133012 and 61273321.
References
[1] C. Miller, Company settles case of reviews it faked, New York Times.
[4] A. Topping, Historian Orlando Figes agrees to pay damages for fake reviews, The Guardian.
[5] M. Ott, Y. Choi, C. Cardie, J. T. Hancock, Finding deceptive opinion spam by any stretch of the imagination, in: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics, 2011, pp. 309–319.
[6] N. Jindal, B. Liu, Opinion spam and analysis, in: Proceedings of the 2008 International Conference on Web Search and Data Mining, ACM, 2008, pp. 219–230.
[7] J. Li, M. Ott, C. Cardie, E. Hovy, Towards a general rule for identifying deceptive opinion spam, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, pp. 1566–1576.
[8] L. Li, W. Ren, B. Qin, T. Liu, Learning Document Representation for De-
ceptive Opinion Spam Detection, Springer International Publishing, 2015.
[10] A. Ntoulas, M. Najork, M. Manasse, D. Fetterly, Detecting spam web pages through content analysis, in: Proceedings of the 15th International Conference on World Wide Web, ACM, 2006, pp. 83–92.
[11] Z. Gyöngyi, H. Garcia-Molina, Link spam alliances, in: Proceedings of the 31st International Conference on Very Large Data Bases, VLDB Endowment, 2005, pp. 517–528.
[12] P. T. Metaxas, J. DeStefano, Web spam, propaganda and trust, in: AIRWeb, 2005, pp. 70–78.
[13] B. Wu, B. D. Davison, Identifying link farm spam pages, in: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, ACM, 2005, pp. 820–829.
[16] P.-A. Chirita, J. Diederich, W. Nejdl, MailRank: Using ranking for spam detection, in: Proceedings of the 14th ACM International Conference on Information and Knowledge Management, ACM, 2005.
[17] H. Drucker, D. Wu, V. N. Vapnik, Support vector machines for spam categorization, IEEE Transactions on Neural Networks 10 (5) (1999) 1048–1054.
[18] F. Wu, J. Shu, Y. Huang, Z. Yuan, Co-detecting social spammers and spam messages in microblogging via exploiting social contexts, Neurocomputing 201 (2016) 51–65.
[20] G. Wang, S. Xie, B. Liu, P. S. Yu, Review graph based online store review spammer detection, in: Data Mining (ICDM), 2011 IEEE 11th International Conference on, IEEE, 2011, pp. 1242–1247.
[21] A. Mukherjee, B. Liu, J. Wang, N. Glance, N. Jindal, Detecting group review spam, in: Proceedings of the 20th International Conference Companion on World Wide Web, ACM, 2011, pp. 93–94.
[22] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, H. W. Lauw, Detecting product
review spammers using rating behaviors, in: Proceedings of the 19th ACM
international conference on Information and knowledge management, ACM,
2010, pp. 939–948.
[23] A. Mukherjee, B. Liu, N. Glance, Spotting fake reviewer groups in consumer
reviews, in: Proceedings of the 21st international conference on World Wide
Web, ACM, 2012, pp. 191–200.
[24] N. Jindal, B. Liu, E.-P. Lim, Finding unusual review patterns using unexpected rules, in: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, ACM, 2010, pp. 1549–1552.
[25] K.-H. Yoo, U. Gretzel, Comparison of deceptive and truthful travel reviews,
Information and communication technologies in tourism 2009 (2009) 37–47.
[30] V. W. Feng, G. Hirst, Detecting deceptive opinions with profile compatibility, in: Proceedings of the Sixth International Joint Conference on Natural Language Processing, Nagoya, Japan, 2013.
[31] A. Prieto, B. Prieto, E. M. Ortigosa, E. Ros, F. Pelayo, J. Ortega, I. Rojas, Neural networks: An overview of early research, current frameworks and new challenges, Neurocomputing 214 (2016) 242–268.
[32] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa,
Natural language processing (almost) from scratch, The Journal of Machine
Learning Research 12 (2011) 2493–2537.
[33] N. Kalchbrenner, E. Grefenstette, P. Blunsom, A convolutional neural network for modelling sentences, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014, pp. 655–665.
[34] J. Li, T. Luong, D. Jurafsky, E. Hovy, When are tree structures necessary for deep learning of representations?, in: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 2304–2314.
[40] D. Tang, F. Wei, N. Yang, M. Zhou, T. Liu, B. Qin, Learning sentiment-specific word embedding for twitter sentiment classification, in: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Vol. 1, 2014, pp. 1555–1565.
[41] P. Wang, B. Xu, J. Xu, G. Tian, C. L. Liu, H. Hao, Semantic expansion using
word embedding clustering and convolutional neural network for improving
short text classification, Neurocomputing 174 (PB) (2016) 806–814.
[42] E. H. Huang, R. Socher, C. D. Manning, A. Y. Ng, Improving word representations via global context and multiple word prototypes, in: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1, Association for Computational Linguistics, 2012, pp. 873–882.
[43] J. Pennington, R. Socher, C. D. Manning, GloVe: Global vectors for word representation, in: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
[46] R. Socher, C. C. Lin, A. Y. Ng, C. D. Manning, Parsing natural scenes and natural language with recursive neural networks, in: Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 129–136.
[47] R. Socher, B. Huval, C. D. Manning, A. Y. Ng, Semantic compositionality through recursive matrix-vector spaces, in: Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012, pp. 1201–1211.
[48] K. M. Hermann, P. Blunsom, The role of syntax in vector space models of compositional semantics, in: ACL (1), 2013, pp. 894–904.
[49] J. Li, Feature weight tuning for recursive neural networks, arXiv preprint.
[50] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: Proceedings of the 31st International Conference on Machine Learning (ICML), 2014, pp. 1188–1196.
[52] P. Rayson, A. Wilson, G. Leech, Grammatical word class variation within the British National Corpus sampler, Language & Computers (2001) 295–306.
[53] M. L. Newman, J. W. Pennebaker, D. S. Berry, J. M. Richards, Lying words:
Predicting deception from linguistic style, Personality & Social Psychology
Bulletin 29 (5) (2003) 665–675.
Appendix A

The effect of c on spam detection classification is shown in Fig. A.5, Fig. A.6 and Fig. A.7.

[Figures A.5 and A.6: accuracy and F1 versus parameter c.]
We tune d, g, r and c with the polynomial kernel function. We find that only g and r affect the results; d and c have no influence. When we tune one parameter, the other parameters are set to their defaults. We also test some combinations of the values of g and r. The results show that SVM with the polynomial kernel acquires good results when r is set to 10 and the other parameters are set to their defaults. The effect of g and r on spam detection classification is shown in Fig. A.8.
[Figure A.7: accuracy and F1 versus parameter c.]
[Figure A.8: The effect of g (a) and r (b) in SVM with the polynomial kernel function.]
Luyang Li received the Master's degree in July 2011 from the Department of Computer Science, Harbin Institute of Technology, Harbin, China. Since 2011, she has been a Ph.D. candidate at the Department of Computer Science, Harbin Institute of Technology. Her current research interests include natural language processing, contradiction detection, deceptive spam review detection and representation learning.
Wenjing Ren received the Bachelor's degree in July 2015 from the Department of Computer Science, Harbin Institute of Technology, Harbin, China. Since 2015, she has been a Master's candidate at the Department of Computer Science, Harbin Institute of Technology. Her current research interests include natural language processing, contradiction detection, deceptive spam review detection and representation learning.
Bing Qin received her Ph.D. degree in 2005 from the Department of Computer Science, Harbin Institute of Technology, Harbin, China. She is a Full Professor in the Department of Computer Science and the Deputy Director of the Research Center for Social Computing and Information Retrieval (HIT-SCIR) at Harbin Institute of Technology. Her research interests include natural language processing, information extraction, document-level discourse analysis, and sentiment analysis.
Ting Liu received his Ph.D. degree in 1998 from the Department of Computer Science, Harbin Institute of Technology, Harbin, China. He is a Full Professor in the Department of Computer Science and the Director of the Research Center for Social Computing and Information Retrieval (HIT-SCIR) at Harbin Institute of Technology. His research interests include information retrieval, natural language processing, and social media analysis.