
IITP at SemEval-2017 Task 8: A Supervised Approach for Rumour Evaluation

Vikram Singh, Sunny Narayan, Md Shad Akhtar, Asif Ekbal, Pushpak Bhattacharyya
Indian Institute of Technology Patna, India
{vikram.mtcs15,sunny.cs13,shad.pcs15,asif,pb}@iitp.ac.in

Abstract

This paper describes our participation in SemEval-2017 Task 8, 'RumourEval: Determining rumour veracity and support for rumours'. The objective of the task was to predict the stance and veracity of an underlying rumour. We propose a supervised classification approach employing several lexical, content-based and Twitter-specific features for learning. Evaluation shows promising results for both problems.

1 Introduction

Twitter, along with Facebook, is a widely used social networking site that generates large amounts of both authentic and unauthentic information. The purpose of Twitter varies from person to person. It has been used extensively as a communication channel and also as an information source (Zhao and Rosson, 2009). However, Twitter, like any other social media platform, does not always carry authentic information. It also brings a negative by-product called rumour (Castillo et al., 2011; Derczynski and Bontcheva, 2014; Qazvinian et al., 2011). Rumours are statements that cannot be verified for their correctness. They may confuse people with unverified information and drive them towards poor decision making. In many organizations (political, administrative etc.), the detection of rumours and of the support expressed for them invites great interest from the concerned authorities.

Recently, researchers across the globe have started addressing the challenges related to rumours. A time sequence classification technique has been proposed for detecting the stance against a rumour (Lukasik et al., 2016). Zubiaga et al. (2016) used sequences of label transitions in tree-structured conversations for classifying stance. A study on a speech act classifier for veracity prediction is proposed in (Vosoughi, 2015). One of the earlier works on rumour detection and classification used Twitter-specific and content-based features for prediction (Qazvinian et al., 2011).

In this paper we present our proposed system submitted as part of the SemEval-2017 shared task on 'RumourEval: Determining rumour veracity and support for rumours'. Our system is supervised in nature and uses a diverse set of features (cf. Section 2.3) for training. The task involves Twitter conversation threads where, for every source tweet, a number of direct and nested reply tweets are present. An example thread is depicted in Table 1. The task defines two separate sub-problems: A) Support, Deny, Query & Comment (SDQC) classification and B) veracity prediction. The first subtask determines the stance of any tweet (source or reply) w.r.t. the underlying rumour; a reply tweet can be direct or nested. The second subtask predicts the veracity of a rumour, i.e. true (rumour), false (not a rumour) or unverified (its veracity cannot be established). Further, there were two variants of the veracity task: closed and open. In the closed variant, the veracity prediction has to be made solely from the tweet text. For the open variant, the usage of extra data (Wikipedia articles, news articles etc.) was allowed.

The rest of the paper is organized as follows: Section 2 presents a brief description of the proposed approach. Experimental results and discussion are furnished in Section 3. Finally, we conclude in Section 4.
Tweet conversation thread | Stance
Src: Very good on #Putin coup by @CoalsonR: Three Scenarios For A Succession In Russia http://t.co/fotdqxDfEV | Support
Rep1: @andersostlund @CoalsonR @RFERL And how Europe will behave in such a case? | Deny
Rep2: @andersostlund @RFERL Putin’ll be made a tsar (and the newborn an heir). Back 2 serfdom as Zorkin suggested. | Comment
Rep3: @andersostlund @CoalsonR @RFERL uhmmm botox sesions far more likely anyway | Comment
Rep4: @andersostlund What are your thoughts on #WhereIsPutin? | Query
Rep5: @tulipgrrl Either a simple flue, more serious illness or serious domestic political problems. | Comment
Rep6: @andersostlund @tulipgrrl :mask: | Deny
Table 1: Twitter conversational thread. Src: Source tweet; Rep#: Replies.
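
For concreteness, a conversation thread such as the one in Table 1 can be viewed as a source tweet with a set of direct and nested replies, and each reply paired with its source forms one classification instance for subtask A. The following is a minimal illustrative sketch only; the field names and structure are our assumptions, not the official task data format.

    # Illustrative representation of a Twitter conversation thread
    # (assumed field names, not the official RumourEval schema).
    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class Tweet:
        tweet_id: str
        text: str
        replies: List["Tweet"] = field(default_factory=list)  # direct or nested replies

    def source_reply_pairs(source: Tweet) -> List[Tuple[Tweet, Tweet]]:
        """Flatten a thread into (source, reply) instances for stance classification."""
        pairs = []
        stack = list(source.replies)
        while stack:
            reply = stack.pop()
            pairs.append((source, reply))
            stack.extend(reply.replies)  # include nested replies as well
        return pairs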

2 System Overview

We adopted a supervised classification approach for both tasks. We use Decision Tree (DT), Naive Bayes (NB) and Support Vector Machine (SVM) as base classifiers for veracity prediction. For stance detection, every instance consists of a source-reply tweet pair. We extract features for both tweets and feed them to the system for classification. In the following subsections we describe the dataset, the preprocessing and the list of features used in this work.

2.1 Dataset

The training dataset consists of 272 source tweets, for which 3,966 reply tweets are present. For tuning the system, the validation set contains 256 replies across 25 source tweets. For stance detection, each source and reply tweet carries one of four labels, namely support, deny, query and comment. For veracity prediction, each source tweet belongs to one of three classes, i.e. true, false or unverified. The gold-standard test dataset has 28 source and 1,021 reply tweets. Detailed statistics are given in Table 2.

2.2 Preprocessing

The distribution of classes in the dataset is very skewed, so the first step we perform is to oversample the under-represented classes. The classes support, deny and comment were sampled by factors of 4, 7 and 7, respectively. Afterwards, we normalize URLs and usernames: all URLs and usernames are replaced by the keywords someurl and @someuser, respectively (an illustrative sketch of this step is given after Table 2).

2.3 Features

In this section we describe the features employed to build the system. We use the following features for both Subtask A and Subtask B.

• Word embedding: Word vectors have proven to be an effective technique for capturing the semantic properties of a word. We use the 200-dimension pretrained GloVe model (http://nlp.stanford.edu/data/glove.6B.zip) for computing the word embeddings. The sentence embedding is computed by concatenating the embeddings of all the words in a tweet; we fix the length of each tweet by padding it to the maximum number of tokens (a sketch of this construction appears after this list).

• Vulgar words: Conversations on Twitter are usually very informal, and the use of vulgar words is common. The presence of vulgar words in a sentence reduces the likelihood of it stating a fact, and hence the chances of it being a rumour. We use a list of vulgar words (http://fffff.at/googles-official-list-of-bad-words/, http://www.noswearing.com/dictionary) and define a binary feature that takes the value '1' if a token from the list is present, and '0' otherwise.

• Twitter-specific features: We use the presence or absence of the following Twitter-specific elements.
  – URL and media: The presence of such metadata indicates that the user is providing more authentic information, and hence there is less chance of the tweet being a rumour. For subtask A, a reply containing such metadata suggests a support or deny stance.
  – Punctuation, emoticons and abbreviations.

• Word count: Rumour sentences tend to be more elaborate and hence longer, while factual statements are generally short and precise. Also, users tend to deny a claim in shorter sentences. We therefore use the number of words in a sentence (excluding stop words and punctuation) as a feature.

• POS tags: We use unigram and bigram POS tags extracted with CMU's ARK tool (http://www.cs.cmu.edu/~ark/TweetNLP/).
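
The shared features above can be assembled into a single vector per tweet. The sketch below illustrates one possible construction of the concatenated GloVe sentence embedding with padding, together with the binary vulgar-word feature; the helper names, the padding length and the loading code are our assumptions, not the authors' actual implementation.

    # Illustrative sketch of the shared feature construction
    # (assumed details, not the authors' exact implementation).
    import numpy as np

    EMB_DIM = 200      # 200-dimensional GloVe vectors, as stated in the paper
    MAX_TOKENS = 30    # assumed maximum tweet length used for padding

    def load_glove(path):
        """Load GloVe vectors from a whitespace-separated text file."""
        vectors = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip().split(" ")
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
        return vectors

    def sentence_embedding(tokens, glove):
        """Concatenate word vectors and pad/truncate the tweet to a fixed length."""
        vecs = [glove.get(t.lower(), np.zeros(EMB_DIM, dtype=np.float32))
                for t in tokens[:MAX_TOKENS]]
        while len(vecs) < MAX_TOKENS:
            vecs.append(np.zeros(EMB_DIM, dtype=np.float32))   # padding
        return np.concatenate(vecs)

    def has_vulgar_word(tokens, vulgar_list):
        """Binary feature: 1 if any token appears in the vulgar-word list."""
        return 1.0 if any(t.lower() in vulgar_list for t in tokens) else 0.0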
           Overall          Subtask A: SDQC                    Subtask B: Veracity
Dataset    Source  Reply    Support  Deny  Query  Comment      True  False  Unverified
Train      272     3966     841      333   330    2734         127   50     95
Dev        25      256      69       11    28     173          10    12     3
Test       28      1021     94       71    106    778          8     12     8

Table 2: Distribution of source and reply tweets with their labels in the dataset
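
Returning to the preprocessing of Section 2.2, a minimal sketch of the oversampling and URL/username normalization might look as follows. The regular expressions and the oversampling helper are our own assumptions rather than the authors' code.

    # Illustrative preprocessing sketch (assumed regexes and helper,
    # not the authors' exact code).
    import re
    import random

    URL_RE  = re.compile(r"https?://\S+")
    USER_RE = re.compile(r"@\w+")

    def normalize(text):
        """Replace every URL with 'someurl' and every username with '@someuser'."""
        text = URL_RE.sub("someurl", text)
        text = USER_RE.sub("@someuser", text)
        return text

    def oversample(instances, labels, factors):
        """Repeat instances of a class by a per-class factor,
        e.g. factors = {'support': 4, 'deny': 7, 'comment': 7}."""
        out_x, out_y = [], []
        for x, y in zip(instances, labels):
            n = factors.get(y, 1)
            out_x.extend([x] * n)
            out_y.extend([y] * n)
        combined = list(zip(out_x, out_y))
        random.shuffle(combined)          # spread the repeated copies out
        return [x for x, _ in combined], [y for _, y in combined]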

In addition, we also implement a few task-specific features, listed below.

Subtask A: SDQC

• Negation words: The presence of a negation word in a tweet signals a denial. We therefore use a binary feature indicating the presence of negation words in the tweet. There were 27 negation words taken into account: no, not, nobody, nothing, none, never, neither, nor, nowhere, hardly, scarcely, barely, don't, isn't, wasn't, shouldn't, wouldn't, couldn't, doesn't, hasn't, haven't, didn't, ain't, can't and won't.

• Wh- words: Queries usually contain wh-words (what, where, which, when, who, why, whom, whose). We define a binary feature that fires when a tweet contains one of these words.

Subtask B: Veracity prediction

• Presence of opinion words: An opinion-carrying sentence cannot be a fact, which makes it a probable candidate for a rumour. We define two features based on the MPQA subjectivity lexicon (Wilson et al., 2005): the first counts opinion words, whereas the second checks for the presence of at least one strongly subjective token in a tweet.

• Number of adjectives: An interesting relation between the presence of adjectives in a sentence and its subjectivity has been explored by Hatzivassiloglou and Wiebe (2000). According to Wiebe (2000), the probability of a sentence being subjective, given that it contains at least one adjective, is 0.545. If a sentence is objective then its chances of being a rumour are very low. We therefore use a binary feature denoting the presence or absence of adjectives in a tweet.

Since prediction in the closed variant is limited to the tweet text itself, we extract 'presence of media' as a binary feature for the open variant only.

3 Experiments and Results

We use the scikit-learn machine learning package (http://scikit-learn.org) for the implementation. As defined by the shared task, we use classification accuracy and micro-averaged accuracy as the evaluation metrics for SDQC and veracity prediction, respectively. For subtask A, we try various feature combinations to train an SVM classifier. Table 3 reports the validation accuracy for the SDQC subtask. We select the feature combination that performs best during the validation phase and submit it for the final prediction on the test dataset.

Features                       Accuracy
A. Unigram                     54.2969%
B. Unigram + POS               62.1093%
C. W.E. (word embedding)       61.3281%
D. (C + POS)                   63.2813%
E. (D + URL and Media)         62.8906%
F. (E + Twitter Specific)      63.2813%
G. (F + Negation words)        63.2813%
H. (G + Wh-Word)               63.6719%
I. (H + Vulgar words)          64.0625%
J. (I + Punctuation)           63.6719%
K. (J + Word count)            64.0625%

Table 3: SDQC: Accuracy on Development Set

In the veracity prediction task, we employ three classifiers, i.e. Decision Tree, SVM and Naive Bayes, to evaluate our system. We observe that among the three classifiers, the performance of Naive Bayes is comparatively better than the others, as shown in Table 4. For the evaluation on the test dataset we use our best classifier, i.e. Naive Bayes. Our system reports an accuracy of 64.1% for the SDQC classification. For subtask B, we also compute a confidence score for each prediction. We obtain micro-averaged accuracies of 39.28% and 28.57% for the open and closed variants, respectively. The reported root mean squared errors (RMSE) for the two variants are 0.746 and 0.807. It should be noted that we were the only team that submitted a system in the open variant category.
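
The classification setup described above can be reproduced in outline with scikit-learn. The snippet below is a minimal sketch under our own assumptions (feature matrices already built as in Section 2.3, default hyperparameters); it is not the authors' released code.

    # Minimal sketch of training and comparing the three base classifiers
    # with scikit-learn (assumed setup, not the authors' exact configuration).
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.svm import SVC
    from sklearn.metrics import accuracy_score

    def evaluate_classifiers(X_train, y_train, X_dev, y_dev):
        """Train DT, SVM and NB on the feature vectors and report dev accuracy."""
        classifiers = {
            "Decision Tree": DecisionTreeClassifier(),
            "SVM": SVC(probability=True),   # probability=True yields a confidence score
            "Naive Bayes": GaussianNB(),
        }
        scores = {}
        for name, clf in classifiers.items():
            clf.fit(X_train, y_train)
            scores[name] = accuracy_score(y_dev, clf.predict(X_dev))
        return scores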
Table 5 depicts the evaluation results on the test dataset.

Classifiers       Micro-averaged Accuracy
                  Open       Closed
Decision Tree     58.23%     54.54%
SVM               58.75%     59.09%
Naive Bayes       59.09%     63.0%

Table 4: Veracity: Accuracy on Development Set

Task                  Accuracy   RMSE
Subtask A             64.1%      -
Subtask B (Open)      39.28%     0.746
Subtask B (Closed)    28.57%     0.807

Table 5: Evaluation results on test set

Further, we perform an error analysis on the results. The confusion matrix for the SDQC classification is depicted in Table 6. We observe that most of the classes are confused with the comment class. A possible reason is the relatively high number of instances of the comment class. Similarly, Tables 7 and 8 show the confusion matrices for the open and closed variants of subtask B. The recall for 'true' is encouraging, i.e. 75%, but the problem lies with the precision, which is merely 28% and 25% for the open and closed variants, respectively.

           Support   Deny   Query   Comment
Support    42        2      2       48
Deny       11        9      2       49
Query      9         7      35      55
Comment    125       35     32      586

Table 6: SDQC: Confusion Matrix on test set

             True   False   Unverified
True         6      2       0
False        7      5       0
Unverified   8      0       0

Table 7: Veracity (Open): Confusion Matrix on test set

             True   False   Unverified
True         6      2       0
False        10     2       0
Unverified   8      0       0

Table 8: Veracity (Closed): Confusion Matrix on test set

4 Conclusion

In this paper we proposed a supervised approach for determining the support and veracity of a rumour as part of the SemEval-2017 shared task on rumour evaluation. As base classification algorithms we used Naive Bayes, Support Vector Machine and Decision Tree to build the models. In future work, we would like to explore deep learning techniques and other relevant features to further improve the performance of the system.

References

Carlos Castillo, Marcelo Mendoza, and Barbara Poblete. 2011. Information credibility on Twitter. In Proceedings of the 20th International Conference on World Wide Web. ACM, pages 675–684.

Leon Derczynski and Kalina Bontcheva. 2014. Pheme: Veracity in digital social networks. In UMAP Workshops.

Vasileios Hatzivassiloglou and Janyce M. Wiebe. 2000. Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the 18th Conference on Computational Linguistics - Volume 1. Association for Computational Linguistics, pages 299–305.

Michal Lukasik, P. K. Srijith, Duy Vu, Kalina Bontcheva, Arkaitz Zubiaga, and Trevor Cohn. 2016. Hawkes processes for continuous time sequence classification: an application to rumour stance classification in Twitter. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, pages 393–398.

Vahed Qazvinian, Emily Rosengren, Dragomir R. Radev, and Qiaozhu Mei. 2011. Rumor has it: Identifying misinformation in microblogs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 1589–1599.

Soroush Vosoughi. 2015. Automatic detection and verification of rumors on Twitter. Ph.D. thesis, Massachusetts Institute of Technology.

Janyce Wiebe. 2000. Learning subjective adjectives from corpora. In AAAI/IAAI, pages 735–740.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing. Association for Computational Linguistics, pages 347–354.
Dejin Zhao and Mary Beth Rosson. 2009. How and why people twitter: the role that micro-blogging plays in informal communication at work. In Proceedings of the ACM 2009 International Conference on Supporting Group Work. ACM, pages 243–252.

Arkaitz Zubiaga, Elena Kochkina, Maria Liakata, Rob Procter, and Michal Lukasik. 2016. Stance classification in rumours as a sequential task exploiting the tree structure of social media conversations. arXiv preprint arXiv:1609.09028.
