Character-Based Neural Embeddings For Tweet Clustering

Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 36–44, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics
sion (Section 6) summarizes our findings and directions for future work.

2 Related Work

2.1 Breaking news detection

There has been a continuous effort over recent years to design effective and efficient algorithms capable of detecting newsworthy topics in the Twitter stream (Hayashi et al., 2015; Ifrim et al., 2014; Vosecky et al., 2013; Wurzer et al., 2015). These current state-of-the-art approaches build upon the bag-of-words document model, which results in high-dimensional, sparse representations that do not scale well and are not aware of semantic similarities, such as paraphrases. The problem becomes evident in the case of tweets, which contain short texts with a long tail of infrequent slang and misspelled words. The performance of such approaches over Twitter datasets is very low, with F-measure up to 0.2 against annotated Wikipedia articles as reference topics (Wurzer et al., 2015) and 0.4 against a curated topic pool (Papadopoulos et al., 2014).

2.2 Neural embeddings

Artificial neural networks (ANNs) allow generating dense vector representations (embeddings). Word2vec (Mikolov et al., 2013) is by far the most popular approach. It accumulates the co-occurrence statistics of words, which efficiently summarizes their semantics.

Brigadir et al. (2014) demonstrated encouraging results using the word2vec Skip-gram model to generate event timelines from tweets. Moran et al. (2016) improved over state-of-the-art first story detection (FSD) results by expanding tweets with their semantically related terms using word2vec.

Neural embeddings can be efficiently generated at the character level as well. They have repeatedly outperformed word-level baselines on the tasks of language modeling (Kim et al., 2016), part-of-speech tagging (dos Santos and Zadrozny, 2014), and text classification (Zhang et al., 2015). The main advantage of the character-based approach is its language independence, since it does not depend on any language-specific preprocessing.

Dhingra et al. (2016) proposed training a recurrent neural network on the task of hashtag prediction. Vosoughi et al. (2016) demonstrated the improved performance of a character-based neural autoencoder on the task of paraphrase and semantic similarity detection in tweets.

Our work extends the evaluation of the Tweet2Vec model (Dhingra et al., 2016) to the tweet clustering task, comparing it against the traditional document-term matrix representation. To the best of our knowledge, this work is the first attempt to evaluate the performance of character-based neural embeddings on the tweet clustering task.

3 Experimental Evaluation

3.1 Dataset

Description and preprocessing. We use the SNOW 2014 test dataset (Papadopoulos et al., 2014) in our evaluation. It contains the IDs of about 1 million tweets produced within 24 hours. We retrieved 845,626 tweets from the Twitter API, since the other tweets had already been deleted from the platform. The preprocessing procedure: remove RT prefixes, URLs and user mentions, bring all characters to lower case, and separate punctuation with spaces (the latter is necessary only for the word-level baseline).

The dataset is further separated into 5 subsets corresponding to the 1-hour time intervals (18:00, 22:00, 23:15, 01:00 and 01:30) that are annotated with the list of breaking news topics. In total, we have 48,399 tweets for clustering evaluation; the majority of them (42,758 tweets) are in English.

The dataset comes with a list of breaking news topics. These topics were manually selected by independent evaluators from the topic pool collected from all challenge participants (external topics). The list contains 70 breaking news headlines extracted from tweets (e.g., “The new, full Godzilla trailer has roared online”). Each topic is annotated with a few (at most 4) tweet IDs, which is not sufficient for an adequate evaluation of a tweet clustering algorithm.

Dataset extension. We enrich the topic annotations by collecting larger tweet clusters using fuzzy string matching (https://github.com/seatgeek/fuzzywuzzy) for each of the topic labels. Fuzzy string matching uses the Levenshtein (edit) distance (Levenshtein, 1966) between the two input strings as the measure of similarity. Levenshtein distance corresponds to the minimum number of character edits (insertions, deletions, or substitutions) required to transform one string into the
other. We choose only the tweets for which the similarity ratio with the topic string is greater than a 0.9 threshold.

A sample tweet cluster produced with the fuzzy string matching for the topic “Justin Trudeau apologizes for Ukraine joke”:

• Justin Trudeau apologizes for Ukraine joke: Justin Trudeau said he’s spoken the head...
• Justin Trudeau apologizes for Ukraine comments http://t.co/7ImWTRONXt
• Justin Trudeau apologizes for Ukraine hockey joke #cdnpoli

In total, we matched 2,585 tweets to 132 clusters using this approach. The resulting tweet clusters represent the ground-truth topics within different time intervals. The cluster size varies from 1 to 361 tweets, with an average of 20 tweets per cluster (median: 6.5).

This simple procedure allows us to automatically generate a high-quality partial labeling. We further use this topic assignment as the ground-truth class labels to automatically evaluate different flat clustering partitions.

3.2 Tweet representation approaches

TweetTerm. Our baseline is the tweet representation approach used in the winning system of the SNOW 2014 Data Challenge (https://github.com/heerme/twitter-topics) (Ifrim et al., 2014). This approach represents a collection of tweets as a tweet-term matrix, keeping the bigrams and trigrams that occur in at least 10 tweets.

Tweet2Vec. This approach includes two stages: (1) training a neural network to predict hashtags using the subset of tweets that contain hashtags (88,148 tweets in our case); (2) encoding: using the trained model to produce tweet embeddings for all the tweets, regardless of whether they contain hashtags or not. We use the Tweet2Vec implementation (https://github.com/bdhingra/tweet2vec) to produce tweet embeddings.

Tweet2Vec is a bi-directional recurrent neural network that consumes textual input as a sequence of characters. The network architecture includes two Gated Recurrent Units (GRUs) (Cho et al., 2014): a forward and a backward GRU. The GRU is an optimized version of the Long Short-Term Memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997). It includes 2 gates that control the information flow. The gates (reset and update gate) regulate how much the previous output state (h_{t-1}) influences the current state (h_t).

The two GRUs are identical, but the backward GRU receives the same sequence of tweet characters in reverse order. Each GRU computes its own vector representation for every substring (h_t) using the current character vector (x_t) and the vector representation it computed a step before (h_{t-1}). These two representations of the same tweet are combined in the next layer of the neural network to produce the final tweet embedding (see Dhingra et al. (2016) for more details).

The network is trained in minibatches with an objective function to predict the previously removed hashtags. A hashtag can be considered the ground-truth cluster label for tweets. Therefore, the network is trained to optimize for the correct tweet classification, which corresponds to a supervised version of the tweet clustering task annotated with the cluster assignment, i.e., hashtags.

In order to predict the hashtags, the tweet embeddings are passed through a linear layer, which produces an output of the size of the number of hashtags observed in the training dataset. A softmax layer on top normalizes the scores from the linear layer to generate the hashtag probabilities for every input tweet.

Tweet embeddings are produced by passing the tweets through the trained Tweet2Vec model (encoder). In this way we can obtain vector representations for all the tweets, including the ones that do not contain any hashtags. The result is a matrix of size n × h, where n is the number of tweets and h is the number of hidden states (500).

3.3 Clustering

To cluster the tweet vectors (the character-based tweet embeddings produced by the neural network for the Tweet2Vec evaluation, or the document-term matrix for TweetTerm) we employ the hierarchical clustering implementation from the fastcluster library (Müllner, 2013).

Hierarchical clustering involves computing pairwise distances between the tweet vectors, followed by their linkage into a single dendrogram. There are several distance metrics (Euclidean, Manhattan, cosine, etc.) and linkage methods to compare distances (single, average, complete, weighted, etc.). We evaluated the performance of different methods using the cophenetic correlation
coefficient (CPCC) (Sokal and Rohlf, 1962) and found the best performing combination: Euclidean distance and the average linkage method.

The hierarchical clustering dendrogram can produce n different flat clusterings of the same dataset: from n single-member clusters with one document per cluster to a single cluster that contains all n documents. The distance threshold defines the granularity (number and size) of the produced clusters.

3.4 Distance threshold selection

Grid search helps us determine the optimal distance threshold for the dendrogram cut-off. We generated a list of values in the range from 0.1 to 1.5 with a 0.1 increment step and examined their performance with respect to the ground-truth cluster assignment. We produce flat clusterings for each value of the distance threshold from the grid and compare them with respect to the quality metrics.

Since we also want to be able to select the optimal distance threshold in the absence of the true labels, we examine the scores provided by the mean Silhouette coefficient (Rousseeuw, 1987). Silhouette is a cluster validity index that measures the quality of the produced clusters and can be used for unsupervised intrinsic evaluation (i.e., without the ground-truth labels). It was reported to outperform alternative methods in a comparative study of 30 validity indices (Arbelaitz et al., 2013).

3.5 Clustering Evaluation Metrics

We evaluate the clustering results using the standard metrics for extrinsic clustering evaluation: homogeneity, completeness, V-Measure (Rosenberg and Hirschberg, 2007), Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) and Adjusted Mutual Information (AMI) (Nguyen et al., 2010). All metrics take the ground-truth and cluster label assignments as input and return a score in the range [0, 1]. The higher the score, the more similar the two clusterings are.

The Homogeneity score measures the purity of the produced clusters. It penalizes clusterings where members of different classes are clustered together. Thus, the best homogeneity scores are always at the bottom of the dendrogram, i.e., at the level of the leaves, where each document belongs to its own cluster. Completeness, on the contrary, favors larger clusters and reduces the score if the members of the same class are split into different clusters. Therefore, the top of the dendrogram, where all the documents reside in a single cluster, always achieves the maximum completeness score.

V-Measure is designed to balance out the two extremes of homogeneity and completeness. It is the harmonic mean of the two and corresponds to the Normalized Mutual Information (NMI) score. The AMI score is an extension of NMI adjusted for chance: the more clusters are considered, the more likely the labelings correlate by chance. AMI allows us to compare the clustering performance across different time intervals, since it normalizes the score by the number of labeled clusters in each interval.

Finally, ARI is an alternative way to assess the agreement between two clusterings. It counts all pairs clustered together or separated into different clusters. ARI also accounts for the chance of an overlap in a random label assignment.

3.6 Manual Cluster Evaluation

Our partial labeling covers a small subset of the data and by design yields clusters with a high degree of string overlap with the annotated topics. Therefore, we extend the clustering evaluation to the rest of the dataset to assess whether the models can uncover less straightforward semantic similarities in tweets. We select the results for manual evaluation motivated by the cluster label (headline) selection task.

The next step in the breaking news detection pipeline after the clustering task is headline selection (the cluster labeling task). The most common approach to label a cluster of tweets is to select a single tweet as a representative member for the whole cluster (Papadopoulos et al., 2014). We decided to test this assumption and manually check how many clusters lose their semantics when represented with a single tweet.

Headline selection motivates the coherence assessment of the produced clusters, since the clusters discarded at this stage will never make it into the final results. To explore the coherence of the produced clusters, we pick several tweets in each cluster and check whether they are semantically similar.

The tweet selected as a headline (cluster label) can be the first published tweet, as in the First Story Detection (FSD) task, also used in Ifrim et al. (2014). Alternative approaches include selecting the most recent tweet published on the topic, or the tweet that is semantically most similar
Interval | Tweets | Model     | Dimensions | Distance threshold | Clusters | Homogeneity | Completeness | V-Measure | ARI    | AMI
18:00    | 10,344 | Tweet2Vec | 500        | 1                  | 3026     | 0.9958      | 0.9453       | 0.9699    | 0.9804 | 0.9376
18:00    | 10,344 | TweetTerm | 433        | 1-1.3              | 66-79    | 0.9277      | 1            | 0.9625    | 0.949  | 0.9216
22:00    | 14,471 | Tweet2Vec | 500        | 0.9                | 5292     | 1           | 0.9601       | 0.9796    | 0.9922 | 0.9571
22:00    | 14,471 | TweetTerm | 589        | 0.7-1.3            | 93-118   | 0.9385      | 0.9969       | 0.9668    | 0.9859 | 0.9359
23:15    | 8,231  | Tweet2Vec | 500        | 0.8                | 3986     | 1           | 0.98         | 0.9899    | 0.9948 | 0.9743
23:15    | 8,231  | TweetTerm | 565        | 0.01-1.3           | 67-142   | 0.8062      | 0.9978       | 0.8918    | 0.7344 | 0.7763
01:00    | 5,123  | Tweet2Vec | 500        | 0.9                | 2242     | 1           | 0.8877       | 0.9405    | 0.8668 | 0.8327
01:00    | 5,123  | TweetTerm | 721        | 0.8-1.3            | 71-111   | 0.8104      | 1            | 0.8953    | 0.8188 | 0.7666
01:30    | 4,589  | Tweet2Vec | 500        | 0.9                | 2091     | 1           | 0.8762       | 0.934     | 0.8089 | 0.8129
01:30    | 4,589  | TweetTerm | 635        | 1.2-1.3            | 64-78    | 0.8024      | 1            | 0.8903    | 0.7809 | 0.754

Table 1: Results of clustering evaluation against the ground-truth partial labeling.
to all other tweets in the cluster, i.e., the tweet closest to the centroid of the cluster (medoid-tweets). Therefore, we sample 5 tweets from each cluster: the first published tweet, the most recent tweet, and three medoid-tweets.

We set up the manual evaluation task as follows:

1. Take the top 20 largest clusters, sorted by the number of tweets that belong to the cluster.
2. For each cluster:
   (a) Take the first and the last published tweet (tweets are previously sorted by publication date).
   (b) Take three medoid-tweets, i.e., the tweets that appear closest to the centroid of the cluster.
   (c) Add the 5 tweets to the set associated with the cluster (removing exact duplicate tweets).
3. For all clusters where the set of selected tweets contains at least two unique tweets: 4 human evaluators independently assess the coherence of each cluster.

According to this setup, each model produced 20 top clusters for each of the 5 intervals, i.e., 20 × 5 = 100 clusters per model. We manually evaluate only the clusters that contain more than 1 distinct representative tweet (Clusters>1). All other clusters, i.e., the ones for which all 5 selected tweets are identical (Clusters=1), are considered correct by default.

Results for all 5 intervals were evaluated together in a single pool, and the models were anonymized to avoid biases. Each evaluator independently assigned a single score to each cluster:

• Correct – all tweets report the same news;
• Partial – some tweets are not related;
• Incorrect – all tweets are not related.

The Partial and Incorrect labels reflect different types of clustering errors. A Partial error is less severe, indicating that the tweets of the cluster are semantically similar but report different news (events) and should be split into several clusters. Incorrect clusters indicate a random collection of tweets with no semantic similarities.

4 Results

4.1 Results of Clustering Evaluation

Table 1 summarizes the results of our evaluation using the ground-truth partial labeling. The scores highlighted in bold font indicate the best result among the two competing approaches for the same subset of tweets, corresponding to the respective time interval.

Tweet2Vec exhibits better clustering performance than the baseline according to the majority of the evaluation metrics in all the intervals. In all cases the Tweet2Vec model wins in terms of the Homogeneity score, while TweetTerm wins in Completeness. This result shows that Tweet2Vec is better than the baseline model at separating tweets that are not similar enough. Tweet2Vec fails only once to perfectly separate the ground-truth clusters (the 18:00 interval). This result shows that Tweet2Vec is able to replicate the results of the fuzzy string matching algorithm that was used to generate the ground-truth labeling.

4.2 Results of Distance Threshold Selection

The rise in V-Measure correlates with the decline of the Silhouette coefficient and the steep drop in the number of produced clusters (see Figure 1). We observed that the optimal distance threshold for Tweet2Vec clustering according to V-Measure lies in the interval [0.8, 1] (see Table 1: Distance threshold), which is also consistent with the findings reported in Ifrim et al. (2014).
Model     | Dataset      | Clusters | Correct: Clusters=1 (%) | Correct: Clusters>1 (%) | Correct: Total (%) | Errors: Partial (%) | Errors: Incorrect (%)
Tweet2Vec | English      | 100      | 80                      | 8.3                     | 88.3               | 10                  | 1.8
TweetTerm | English      | 95       | 71                      | 17.4                    | 87.9               | 8.9                 | 3.2
Tweet2Vec | Multilingual | 100      | 67                      | 12.5                    | 79.5               | 13                  | 7.5

Table 2: Results of manual cluster evaluation. Note: the last row shows results on a different dataset and cannot be directly compared with the other models.
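The extrinsic scores reported in Table 1 (Section 3.5) were produced with standard implementations; as a pure-Python sketch of how homogeneity, completeness, and V-Measure follow from the conditional-entropy definitions of Rosenberg and Hirschberg (2007), assuming only a flat list of class and cluster labels:

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given): entropy of labels within each conditioning group,
    weighted by group size."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        subset = [l for l, gv in zip(labels, given) if gv == g]
        h += (len(subset) / n) * entropy(subset)
    return h

def homogeneity_completeness_v(classes, clusters):
    """homogeneity = 1 - H(C|K)/H(C); completeness = 1 - H(K|C)/H(K);
    V-Measure is their harmonic mean."""
    h_c, h_k = entropy(classes), entropy(clusters)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v
```

Note how a single all-encompassing cluster gets completeness 1.0 but homogeneity 0.0, mirroring the dendrogram extremes discussed in Section 3.5.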
Sample Cluster (each line is a separate tweet) | Evaluation

[Correct]
video : bitcoin : mtgox exchange goes offline - bitcoin , a virtual currency ...
the slow-motion collapse of mt . gox is bitcoin’s first financial crisis : now bitcoin users ...
Disastro bitcoin : mt . gox cessa ogni attività ... : mt . gox , il più grande cambiavalute bitco ...

[Correct]
california couple finds time capsules worth $10 million
californian couple finds $10 million worth of gold coins in tin can

[Partial]
ukraine puts off vote on new government despite eu pleas for quick action - washington post ...
ukraine truce shattered , death toll hits 67 - kiev (reuters) - ukraine suffered its bloodiest day ...
ukraine fighting leaves at least 18 dead as kiev barricades burn - clashes in ukraine ...

[Incorrect]
are you going to come on his network and get poor ratings too ?
are you sold on the waffle taco ?

[Incorrect]
the chromecast app flood has started by
the importance of emotion in design by

Table 3: Tweet2Vec sample results. Rows of the table show sample tweet clusters. Each line within a row corresponds to a separate tweet (after preprocessing, i.e., usernames and URLs removed).
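The ground-truth clusters used in the automatic evaluation (Section 3.1) were built by fuzzy string matching of tweets against topic labels with the fuzzywuzzy library. A pure-Python stand-in for that step might look like the following; note that fuzzywuzzy's `fuzz.ratio` normalizes slightly differently, so this sketch only approximates it.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions
    turning string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity_ratio(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match_tweets(topic, tweets, threshold=0.9):
    """Keep tweets whose similarity ratio with the topic exceeds the threshold."""
    return [t for t in tweets if similarity_ratio(topic.lower(), t.lower()) > threshold]
```

With the 0.9 threshold, only near-verbatim restatements of a topic label are matched, which is why the resulting labeling is partial but high-quality.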
Sample Cluster (each line is a separate tweet) | Evaluation

[Correct]
obama : michelle and i were saddened to hear of the passing of harold ramis...
touching tribute to ghostbusters star harold ramis from comic artist
on the joyful comedy of harold ramis

[Correct]
major tokyo-based bitcoin exchange mt . gox goes dark
”bitcoin exchange giant mt . gox goes dark — popular science ”

[Correct]
obesity rate for young children plummets 43 % in a decade
the national obesity rate for young children dropped 43 % over the past decade

[Partial]
diplomatic pressure is unlikely to reverse uganda’s cruel anti-gay law
provisions of arizona proposed anti-gay law
even mitt romney wants arizona’s governor to veto the state’s anti-gay bill
icymi : arizona pizzeria response to state anti-gay bill

[Incorrect]
amazing debate nic ! well done !
well done 4 -0
well done ! i find running so difficult . feel proud !
well done him :-)
well done nicola my money is on you you done it well tonight ??

Table 4: TweetTerm sample results. Rows of the table show sample tweet clusters.
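Returning to the encoder behind the Tweet2Vec samples (Section 3.2): one GRU step, with its reset and update gates controlling how much of h_{t-1} carries into h_t, plus the forward and backward reading of the character sequence, can be sketched in a few lines. Dimensions, initialization, and the final combination (here a simple concatenation) are illustrative assumptions; the actual model is the authors' Tweet2Vec implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: the reset gate r decides how much of h_prev feeds the
    candidate state; the update gate z blends h_prev with the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def encode(chars, embed, params, hidden=4):
    """Read the character sequence forward and backward (the backward GRU
    sees the same characters in reverse order) and combine both states."""
    h_fwd = h_bwd = np.zeros(hidden)
    for c in chars:
        h_fwd = gru_step(embed[c], h_fwd, params)
    for c in reversed(chars):
        h_bwd = gru_step(embed[c], h_bwd, params)
    return np.concatenate([h_fwd, h_bwd])  # combined tweet representation
```

In the real model the hidden size is 500 per direction and the two states are combined by a learned layer rather than concatenation, but the gating logic is the same.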
tions and are able to mirror the fuzzy string matching performance beyond simple n-gram matching.

It becomes apparent from the sample clustering results (Tables 3 and 4) that both models perform essentially the same task of unveiling patterns shared between a group of strings. While TweetTerm operates only on patterns of identical n-grams, Tweet2Vec goes beyond this limitation by allowing variation within the n-gram substring, similar to fuzzy string matching. This effect allows it to capture subtle variations in strings, e.g., misspellings, which word-based approaches are incapable of.

Our error analysis also revealed the limitation of the neural embeddings in distinguishing between semantic and syntactic similarity in strings (see the Incorrect samples in Table 3). Tweet2Vec, as a recurrent neural network approach, represents not only the characters but also their order in the string, which may be a false similarity signal. It is evident that the neural representations in our example would benefit from stop-word removal or a TF/IDF-like weighting scheme to avoid capturing punctuation and other merely syntactic patterns.

Limitations. Neural networks gain performance when more data is available. We could use only 88,148 tweets from the dataset to train the neural network, which may be insufficient to unfold the potential of the model to recognize more complex patterns. Also, due to the scarce annotation available, we could use only a small subset of the original dataset for our clustering evaluation. Since most of the SNOW tweets are in English, another dataset is needed for a comprehensive multilingual clustering evaluation.

6 Conclusion

We showed that character-based neural embeddings enable accurate tweet clustering with minimum supervision. They provide fine-grained representations that can help to uncover fuzzy similarities in strings beyond simple n-gram matching. We also demonstrated the limitation of the current
approach, which is unable to distinguish semantic from syntactic patterns in strings; this provides a clear direction for future work.

7 Acknowledgments

The presented work was supported by the InVID Project (http://www.invid-project.eu/), funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 687786. Mihai Lupu was supported by Self-Optimizer (FFG 852624) in the EUROSTARS programme, funded by EUREKA, the BMWFW and the European Union, and by ADMIRE (P25905-N23), funded by FWF. We thank Bhuwan Dhingra for the support in using Tweet2Vec and Linda Andersson for the review and helpful comments.

References

Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona. 2013. An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256.

Igor Brigadir, Derek Greene, and Padraig Cunningham. 2014. Adaptive Representations for Tracking Breaking News on Twitter. In NewsKDD - Workshop on Data Science for News Publishing at The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, August 24-27, 2014, New York, NY, USA.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1724–1734.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 21-26 June 2014, Beijing, China, pages 1818–1826.

Kohei Hayashi, Takanori Maehara, Masashi Toyoda, and Ken-ichi Kawarabayashi. 2015. Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 10-13, 2015, Sydney, Australia, pages 417–426.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2(1):193–218.

Georgiana Ifrim, Bichen Shi, and Igor Brigadir. 2014. Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering. In Symeon Papadopoulos, David Corney, and Luca Maria Aiello, editors, Proceedings of the SNOW 2014 Data Challenge co-located with the 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 33–40.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2741–2749.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Sean Moran, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2016. Enhancing First Story Detection using Word Embeddings. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, July 17-21, 2016, Pisa, Italy, pages 821–824.

Daniel Müllner. 2013. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(1):1–18.

Xuan Vinh Nguyen, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854.

Symeon Papadopoulos, David Corney, and Luca Maria Aiello. 2014. SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media. In Symeon Papadopoulos, David Corney, and Luca Maria Aiello, editors,
Proceedings of the SNOW 2014 Data Challenge co-located with the 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 1–8.

Sasa Petrovic, Miles Osborne, Richard McCreadie, Craig Macdonald, Iadh Ounis, and Luke Shrimpton. 2013. Can twitter replace newswire for breaking news? In Emre Kiciman, Nicole B. Ellison, Bernie Hogan, Paul Resnick, and Ian Soboroff, editors, Proceedings of the Seventh International Conference on Weblogs and Social Media, ICWSM 2013, July 8-11, 2013, Cambridge, Massachusetts, USA.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 410–420.

Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Robert R. Sokal and F. James Rohlf. 1962. The comparison of dendrograms by objective methods. Taxon, 11(2):33–40.

Jan Vosecky, Di Jiang, Kenneth Wai-Ting Leung, and Wilfred Ng. 2013. Dynamic multi-faceted topic discovery in twitter. In 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, October 27 - November 1, 2013, San Francisco, CA, USA, pages 879–884.

Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, July 17-21, 2016, Pisa, Italy, pages 1041–1044.

Dominik Wurzer, Victor Lavrenko, and Miles Osborne. 2015. Tracking unbounded Topic Streams. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, pages 1765–1773.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.