Character-Based Neural Embeddings For Tweet Clustering

Proceedings of the Fifth International Workshop on Natural Language Processing for Social Media, pages 36–44, Valencia, Spain, April 3-7, 2017. © 2017 Association for Computational Linguistics
sion (Section 6) summarizes our findings and directions for future work.

2 Related Work

2.1 Breaking news detection

There has been a continuous effort over recent years to design effective and efficient algorithms capable of detecting newsworthy topics in the Twitter stream (Hayashi et al., 2015; Ifrim et al., 2014; Vosecky et al., 2013; Wurzer et al., 2015). These current state-of-the-art approaches build upon the bag-of-words document model, which results in high-dimensional, sparse representations that do not scale well and are not aware of semantic similarities, such as paraphrases. The problem becomes evident in the case of tweets, which contain short texts with a long tail of infrequent slang and misspelled words. The performance of such approaches over Twitter datasets is very low, with F-measure up to 0.2 against annotated Wikipedia articles as reference topics (Wurzer et al., 2015) and 0.4 against a curated topic pool (Papadopoulos et al., 2014).

2.2 Neural embeddings

Artificial neural networks (ANNs) allow generating dense vector representations (embeddings). Word2vec (Mikolov et al., 2013) is by far the most popular approach. It accumulates the co-occurrence statistics of words, which efficiently summarizes their semantics.

Brigadir et al. (2014) demonstrated encouraging results using the word2vec Skip-gram model to generate event timelines from tweets. Moran et al. (2016) improved over state-of-the-art first story detection (FSD) results by expanding tweets with their semantically related terms using word2vec.

Neural embeddings can be efficiently generated at the character level as well. They have repeatedly outperformed word-level baselines on the tasks of language modeling (Kim et al., 2016), part-of-speech tagging (dos Santos and Zadrozny, 2014), and text classification (Zhang et al., 2015). The main advantage of the character-based approach is its language independence, since it does not depend on any language-specific preprocessing.

Dhingra et al. (2016) proposed training a recurrent neural network on the task of hashtag prediction. Vosoughi et al. (2016) demonstrated the improved performance of a character-based neural autoencoder on the task of paraphrase and semantic similarity detection in tweets.

Our work extends the evaluation of the Tweet2Vec model (Dhingra et al., 2016) to the tweet clustering task, comparing it against the traditional document-term matrix representation. To the best of our knowledge, this work is the first attempt to evaluate the performance of character-based neural embeddings on the tweet clustering task.

3 Experimental Evaluation

3.1 Dataset

Description and preprocessing. We use the SNOW 2014 test dataset (Papadopoulos et al., 2014) in our evaluation. It contains the IDs of about 1 million tweets produced within 24 hours. We retrieved 845,626 tweets from the Twitter API, since the other tweets had already been deleted from the platform. The preprocessing procedure: remove RT prefixes, URLs and user mentions, bring all characters to lower case, and separate punctuation with spaces (the latter is necessary only for the word-level baseline).

The dataset is further separated into 5 subsets corresponding to the 1-hour time intervals (18:00, 22:00, 23:15, 01:00 and 01:30) that are annotated with the list of breaking news topics. In total, we have 48,399 tweets for clustering evaluation; the majority of them (42,758 tweets) are in English.

The dataset comes with a list of breaking news topics. These topics were manually selected by independent evaluators from the topic pool collected from all challenge participants (external topics). The list contains 70 breaking news headlines extracted from tweets (e.g., “The new, full Godzilla trailer has roared online”). Each topic is annotated with a few (at most 4) tweet IDs, which is not sufficient for an adequate evaluation of a tweet clustering algorithm.

Dataset extension. We enrich the topic annotations by collecting larger tweet clusters using fuzzy string matching (https://github.com/seatgeek/fuzzywuzzy) for each of the topic labels. Fuzzy string matching uses the Levenshtein (edit) distance (Levenshtein, 1966) between the two input strings as the measure of similarity. Levenshtein distance corresponds to the minimum number of character edits (insertions, deletions, or substitutions) required to transform one string into the
other. We choose only the tweets for which the similarity ratio with the topic string is greater than a 0.9 threshold.

A sample tweet cluster produced with the fuzzy string matching for the topic “Justin Trudeau apologizes for Ukraine joke”:

• Justin Trudeau apologizes for Ukraine joke: Justin Trudeau said he’s spoken the head...
• Justin Trudeau apologizes for Ukraine comments http://t.co/7ImWTRONXt
• Justin Trudeau apologizes for Ukraine hockey joke #cdnpoli

In total, we matched 2,585 tweets to 132 clusters using this approach. The resulting tweet clusters represent the ground-truth topics within different time intervals. The cluster size varies from 1 to 361 tweets, with an average of 20 tweets per cluster (median: 6.5).

This simple procedure allows us to automatically generate a high-quality partial labeling. We further use this topic assignment as the ground-truth class labels to automatically evaluate different flat clustering partitions.

3.2 Tweet representation approaches

TweetTerm. Our baseline is the tweet representation approach used in the winning system of the SNOW 2014 Data Challenge (https://github.com/heerme/twitter-topics) (Ifrim et al., 2014). This approach represents a collection of tweets as a tweet-term matrix, keeping the bigrams and trigrams that occur in at least 10 tweets.

Tweet2Vec. This approach includes two stages: (1) training a neural network to predict hashtags using the subset of tweets that contain hashtags (88,148 tweets in our case); (2) encoding: using the trained model to produce tweet embeddings for all the tweets, regardless of whether they contain hashtags or not. We use the Tweet2Vec implementation (https://github.com/bdhingra/tweet2vec) to produce tweet embeddings.

Tweet2Vec is a bi-directional recurrent neural network that consumes textual input as a sequence of characters. The network architecture includes two Gated Recurrent Units (GRUs) (Cho et al., 2014): a forward and a backward GRU. The GRU is an optimized version of the Long Short-Term Memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997). It includes 2 gates that control the information flow. The gates (reset and update gate) regulate how much the previous output state (h_{t-1}) influences the current state (h_t).

The two GRUs are identical, but the backward GRU receives the same sequence of tweet characters in reverse order. Each GRU computes its own vector representation for every substring (h_t) using the current character vector (x_t) and the vector representation it computed a step before (h_{t-1}). These two representations of the same tweet are combined in the next layer of the neural network to produce the final tweet embedding (see Dhingra et al. (2016) for more details).

The network is trained in minibatches with an objective function to predict the previously removed hashtags. A hashtag can be considered the ground-truth cluster label for tweets. Therefore, the network is trained to optimize for the correct tweet classification, which corresponds to a supervised version of the tweet clustering task annotated with the cluster assignment, i.e., hashtags.

In order to predict the hashtags, the tweet embeddings are passed through a linear layer, which produces an output of the size of the number of hashtags observed in the training dataset. A softmax layer on top normalizes the scores from the linear layer to generate the hashtag probabilities for every input tweet.

Tweet embeddings are produced by passing the tweets through the trained Tweet2Vec model (encoder). In this way we can obtain vector representations for all the tweets, including the ones that do not contain any hashtags. The result is a matrix of size n × h, where n is the number of tweets and h is the number of hidden states (500).

3.3 Clustering

To cluster the tweet vectors (the character-based tweet embeddings produced by the neural network for the Tweet2Vec evaluation, or the document-term matrix for TweetTerm) we employ the hierarchical clustering implementation from the fastcluster library (Müllner, 2013).

Hierarchical clustering involves computing pairwise distances between the tweet vectors, followed by their linkage into a single dendrogram. There are several distance metrics (Euclidean, Manhattan, cosine, etc.) and linkage methods to compare distances (single, average, complete, weighted, etc.). We evaluated the performance of different methods using the cophenetic correlation
coefficient (CPCC) (Sokal and Rohlf, 1962) and found the best performing combination: Euclidean distance and the average linkage method.

The hierarchical clustering dendrogram can produce n different flat clusterings of the same dataset: from n single-member clusters with one document per cluster to a single cluster that contains all n documents. The distance threshold defines the granularity (number and size) of the produced clusters.

3.4 Distance threshold selection

Grid search helps us determine the optimal distance threshold for the dendrogram cut-off. We generated a list of values in the range from 0.1 to 1.5 with a 0.1 increment step and examined their performance with respect to the ground-truth cluster assignment. We produce flat clusterings for each value of the distance threshold from the grid and compare them with respect to the quality metrics.

Since we also want to be able to select the optimal distance threshold in the absence of the true labels, we examine the scores provided by the mean Silhouette coefficient (Rousseeuw, 1987). Silhouette is a cluster validity index that measures the quality of the produced clusters and can be used for unsupervised intrinsic evaluation (i.e., without the ground-truth labels). It was reported to outperform alternative methods in a comparative study of 30 validity indices (Arbelaitz et al., 2013).

3.5 Clustering Evaluation Metrics

We evaluate the clustering results using the standard metrics for extrinsic clustering evaluation: homogeneity, completeness, V-Measure (Rosenberg and Hirschberg, 2007), Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) and Adjusted Mutual Information (AMI) (Nguyen et al., 2010). All metrics take the ground-truth and cluster label assignments as input and return a score in the range [0, 1]. The higher the score, the more similar the two clusterings are.

The Homogeneity score measures the purity of the produced clusters. It penalizes clusterings where members of different classes are clustered together. Thus, the best homogeneity scores are always at the bottom of the dendrogram, i.e., at the level of the leaves, where each document belongs to its own cluster. Completeness, on the contrary, favors larger clusters and reduces the score if the members of the same class are split into different clusters. Therefore, the top of the dendrogram, where all the documents reside in a single cluster, always achieves the maximum completeness score.

V-Measure is designed to balance out the two extremes of homogeneity and completeness. It is the harmonic mean of the two and corresponds to the Normalized Mutual Information (NMI) score. The AMI score is an extension of NMI adjusted for chance: the more clusters are considered, the more likely the labelings correlate by chance. AMI allows us to compare the clustering performance across different time intervals, since it normalizes the score by the number of labeled clusters in each interval.

Finally, ARI is an alternative way to assess the agreement between two clusterings. It counts all pairs clustered together or separated into different clusters. ARI also accounts for the chance of an overlap in a random label assignment.

3.6 Manual Cluster Evaluation

Our partial labeling covers a small subset of the data and by design yields clusters with a high degree of string overlap with the annotated topics. Therefore, we extend the clustering evaluation to the rest of the dataset to assess whether the models can uncover less straightforward semantic similarities in tweets. We select the results for manual evaluation motivated by the cluster label (headline) selection task.

The next step in the breaking news detection pipeline after the clustering task is headline selection (the cluster labeling task). The most common approach to label a cluster of tweets is to select a single tweet as a representative member for the whole cluster (Papadopoulos et al., 2014). We decided to test this assumption and manually check how many clusters lose their semantics when represented with a single tweet.

Headline selection motivates the coherence assessment of the produced clusters, since the clusters discarded at this stage will never make it into the final results. To explore the coherence of the produced clusters, we pick several tweets in each cluster and check whether they are semantically similar.

The tweet selected as a headline (cluster label) can be the first published tweet, as in the First Story Detection (FSD) task, also used in Ifrim et al. (2014). Alternative approaches include selecting the most recent tweet published on the topic, or the tweet that is semantically most similar
Interval | Tweets | Model     | Dimensions | Distance threshold | Clusters | Homogeneity | Completeness | V-Measure | ARI    | AMI
18:00    | 10,344 | Tweet2Vec | 500        | 1                  | 3026     | 0.9958      | 0.9453       | 0.9699    | 0.9804 | 0.9376
18:00    | 10,344 | TweetTerm | 433        | 1-1.3              | 66-79    | 0.9277      | 1            | 0.9625    | 0.949  | 0.9216
22:00    | 14,471 | Tweet2Vec | 500        | 0.9                | 5292     | 1           | 0.9601       | 0.9796    | 0.9922 | 0.9571
22:00    | 14,471 | TweetTerm | 589        | 0.7-1.3            | 93-118   | 0.9385      | 0.9969       | 0.9668    | 0.9859 | 0.9359
23:15    | 8,231  | Tweet2Vec | 500        | 0.8                | 3986     | 1           | 0.98         | 0.9899    | 0.9948 | 0.9743
23:15    | 8,231  | TweetTerm | 565        | 0.01-1.3           | 67-142   | 0.8062      | 0.9978       | 0.8918    | 0.7344 | 0.7763
01:00    | 5,123  | Tweet2Vec | 500        | 0.9                | 2242     | 1           | 0.8877       | 0.9405    | 0.8668 | 0.8327
01:00    | 5,123  | TweetTerm | 721        | 0.8-1.3            | 71-111   | 0.8104      | 1            | 0.8953    | 0.8188 | 0.7666
01:30    | 4,589  | Tweet2Vec | 500        | 0.9                | 2091     | 1           | 0.8762       | 0.934     | 0.8089 | 0.8129
01:30    | 4,589  | TweetTerm | 635        | 1.2-1.3            | 64-78    | 0.8024      | 1            | 0.8903    | 0.7809 | 0.754

Table 1: Results of clustering evaluation against the ground-truth partial labeling.
to all other tweets in the cluster, i.e., the tweet closest to the centroid of the cluster (medoid-tweets). Therefore, we sample 5 tweets from each cluster: the first published tweet, the most recent tweet, and three medoid-tweets.

We set up the manual evaluation task as follows:

1. Take the top 20 largest clusters, sorted by the number of tweets that belong to the cluster.
2. For each cluster:
   (a) Take the first and the last published tweet (tweets are previously sorted by publication date).
   (b) Take three medoid-tweets, i.e., the tweets that appear closest to the centroid of the cluster.
   (c) Add the 5 tweets to the set associated with the cluster (removing exact duplicate tweets).
3. For all clusters where the set of selected tweets contains at least two unique tweets: 4 human evaluators independently assess the coherence of each cluster.

According to this setup, each model produced 20 top clusters for each of the 5 intervals, i.e., 20 × 5 = 100 clusters per model. We manually evaluate only the clusters that contain more than 1 distinct representative tweet (Clusters>1). All other clusters, i.e., the ones for which all 5 selected tweets are identical (Clusters=1), are considered correct by default.

Results for all 5 intervals were evaluated together in a single pool, and the models were anonymized to avoid biases. Each evaluator independently assigned a single score to each cluster:

• Correct – all tweets report the same news;
• Partial – some tweets are not related;
• Incorrect – all tweets are not related.

The Partial and Incorrect labels reflect different types of clustering errors. A Partial error is less severe, indicating that the tweets of the cluster are semantically similar but report different news (events) and should be split into several clusters. Incorrect clusters indicate a random collection of tweets with no semantic similarities.

4 Results

4.1 Results of Clustering Evaluation

Table 1 summarizes the results of our evaluation using the ground-truth partial labeling. The scores highlighted in bold font indicate the best result among the two competing approaches for the same subset of tweets, corresponding to the respective time interval.

Tweet2Vec exhibits better clustering performance than the baseline according to the majority of the evaluation metrics in all the intervals. In all cases the Tweet2Vec model wins in terms of the Homogeneity score, while TweetTerm wins in Completeness. This result shows that Tweet2Vec is better than the baseline model at separating tweets that are not similar enough. Tweet2Vec fails only once to perfectly separate the ground-truth clusters (the 18:00 interval). This result shows that Tweet2Vec is able to replicate the results of the fuzzy string matching algorithm that was used to generate the ground-truth labeling.

4.2 Results of Distance Threshold Selection

The rise in V-Measure correlates with the decline of the Silhouette coefficient and the steep drop in the number of produced clusters (see Figure 1). We observed that the optimal distance threshold for Tweet2Vec clustering according to V-Measure lies in the interval [0.8, 1] (see Table 1: Distance threshold), which is also consistent with the findings reported in Ifrim et al. (2014).
Model     | Dataset      | Clusters | Correct: Clusters=1 (%) | Correct: Clusters>1 (%) | Correct: Total (%) | Errors: Partial (%) | Errors: Incorrect (%)
Tweet2Vec | English      | 100      | 80                      | 8.3                     | 88.3               | 10                  | 1.8
TweetTerm | English      | 95       | 71                      | 17.4                    | 87.9               | 8.9                 | 3.2
Tweet2Vec | Multilingual | 100      | 67                      | 12.5                    | 79.5               | 13                  | 7.5

Table 2: Results of manual cluster evaluation. Note: the last row shows results on a different dataset and cannot be directly compared with the other models.
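The extrinsic scores reported in Table 1 (Section 3.5) were produced with standard implementations; as a pure-Python sketch of how homogeneity, completeness, and V-Measure follow from the conditional-entropy definitions of Rosenberg and Hirschberg (2007), assuming only a flat list of class and cluster labels:

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def conditional_entropy(labels, given):
    """H(labels | given): entropy of labels within each conditioning group,
    weighted by group size."""
    n = len(labels)
    h = 0.0
    for g in set(given):
        subset = [l for l, gv in zip(labels, given) if gv == g]
        h += (len(subset) / n) * entropy(subset)
    return h

def homogeneity_completeness_v(classes, clusters):
    """homogeneity = 1 - H(C|K)/H(C); completeness = 1 - H(K|C)/H(K);
    V-Measure is their harmonic mean."""
    h_c, h_k = entropy(classes), entropy(clusters)
    hom = 1.0 if h_c == 0 else 1.0 - conditional_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - conditional_entropy(clusters, classes) / h_k
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v
```

Note how a single all-encompassing cluster gets completeness 1.0 but homogeneity 0.0, mirroring the dendrogram extremes discussed in Section 3.5.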
Sample Cluster (each line is a separate tweet) | Evaluation

[Correct]
video : bitcoin : mtgox exchange goes offline - bitcoin , a virtual currency ...
the slow-motion collapse of mt . gox is bitcoin’s first financial crisis : now bitcoin users ...
Disastro bitcoin : mt . gox cessa ogni attività ... : mt . gox , il più grande cambiavalute bitco ...

[Correct]
california couple finds time capsules worth $10 million
californian couple finds $10 million worth of gold coins in tin can

[Partial]
ukraine puts off vote on new government despite eu pleas for quick action - washington post ...
ukraine truce shattered , death toll hits 67 - kiev (reuters) - ukraine suffered its bloodiest day ...
ukraine fighting leaves at least 18 dead as kiev barricades burn - clashes in ukraine ...

[Incorrect]
are you going to come on his network and get poor ratings too ?
are you sold on the waffle taco ?

[Incorrect]
the chromecast app flood has started by
the importance of emotion in design by

Table 3: Tweet2Vec sample results. Rows of the table show sample tweet clusters. Each line within a row corresponds to a separate tweet (after preprocessing, i.e., usernames and URLs removed).
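The ground-truth clusters used in the automatic evaluation (Section 3.1) were built by fuzzy string matching of tweets against topic labels with the fuzzywuzzy library. A pure-Python stand-in for that step might look like the following; note that fuzzywuzzy's `fuzz.ratio` normalizes slightly differently, so this sketch only approximates it.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, or substitutions
    turning string a into string b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity_ratio(a, b):
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def match_tweets(topic, tweets, threshold=0.9):
    """Keep tweets whose similarity ratio with the topic exceeds the threshold."""
    return [t for t in tweets if similarity_ratio(topic.lower(), t.lower()) > threshold]
```

With the 0.9 threshold, only near-verbatim restatements of a topic label are matched, which is why the resulting labeling is partial but high-quality.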
Sample Cluster (each line is a separate tweet) | Evaluation

[Correct]
obama : michelle and i were saddened to hear of the passing of harold ramis...
touching tribute to ghostbusters star harold ramis from comic artist
on the joyful comedy of harold ramis

[Correct]
major tokyo-based bitcoin exchange mt . gox goes dark
”bitcoin exchange giant mt . gox goes dark — popular science ”

[Correct]
obesity rate for young children plummets 43 % in a decade
the national obesity rate for young children dropped 43 % over the past decade

[Partial]
diplomatic pressure is unlikely to reverse uganda’s cruel anti-gay law
provisions of arizona proposed anti-gay law
even mitt romney wants arizona’s governor to veto the state’s anti-gay bill
icymi : arizona pizzeria response to state anti-gay bill

[Incorrect]
amazing debate nic ! well done !
well done 4 -0
well done ! i find running so difficult . feel proud !
well done him :-)
well done nicola my money is on you you done it well tonight ??

Table 4: TweetTerm sample results. Rows of the table show sample tweet clusters.
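Returning to the encoder behind the Tweet2Vec samples (Section 3.2): one GRU step, with its reset and update gates controlling how much of h_{t-1} carries into h_t, plus the forward and backward reading of the character sequence, can be sketched in a few lines. Dimensions, initialization, and the final combination (here a simple concatenation) are illustrative assumptions; the actual model is the authors' Tweet2Vec implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU step: the reset gate r decides how much of h_prev feeds the
    candidate state; the update gate z blends h_prev with the candidate."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(Wz @ x_t + Uz @ h_prev)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde

def encode(chars, embed, params, hidden=4):
    """Read the character sequence forward and backward (the backward GRU
    sees the same characters in reverse order) and combine both states."""
    h_fwd = h_bwd = np.zeros(hidden)
    for c in chars:
        h_fwd = gru_step(embed[c], h_fwd, params)
    for c in reversed(chars):
        h_bwd = gru_step(embed[c], h_bwd, params)
    return np.concatenate([h_fwd, h_bwd])  # combined tweet representation
```

In the real model the hidden size is 500 per direction and the two states are combined by a learned layer rather than concatenation, but the gating logic is the same.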
tions and are able to mirror the fuzzy string matching performance beyond simple n-gram matching.

It becomes apparent from the sample clustering results (Tables 3 and 4) that both models perform essentially the same task of unveiling patterns shared between a group of strings. While TweetTerm operates only on patterns of identical n-grams, Tweet2Vec goes beyond this limitation by allowing variation within the n-gram substring, similar to fuzzy string matching. This effect allows it to capture subtle variations in strings, e.g., misspellings, which word-based approaches are incapable of.

Our error analysis also revealed the limitation of the neural embeddings in distinguishing between semantic and syntactic similarity in strings (see the Incorrect samples in Table 3). Tweet2Vec, as a recurrent neural network approach, represents not only the characters but also their order in the string, which may be a false similarity signal. It is evident that the neural representations in our example would benefit from stop-word removal or a TF/IDF-like weighting scheme to avoid capturing punctuation and other merely syntactic patterns.

Limitations. Neural networks gain performance when more data is available. We could use only 88,148 tweets from the dataset to train the neural network, which may be insufficient to unfold the potential of the model to recognize more complex patterns. Also, due to the scarce annotation available, we could use only a small subset of the original dataset for our clustering evaluation. Since most of the SNOW tweets are in English, another dataset is needed for a comprehensive multilingual clustering evaluation.

6 Conclusion

We showed that character-based neural embeddings enable accurate tweet clustering with minimum supervision. They provide fine-grained representations that can help to uncover fuzzy similarities in strings beyond simple n-gram matching. We also demonstrated the limitation of the current
approach, which is unable to distinguish semantic from syntactic patterns in strings; this provides a clear direction for future work.

7 Acknowledgments

The presented work was supported by the InVID Project (http://www.invid-project.eu/), funded by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 687786. Mihai Lupu was supported by Self-Optimizer (FFG 852624) in the EUROSTARS programme, funded by EUREKA, the BMWFW and the European Union, and by ADMIRE (P25905-N23), funded by FWF. We thank Bhuwan Dhingra for the support in using Tweet2Vec and Linda Andersson for the review and helpful comments.

References

Olatz Arbelaitz, Ibai Gurrutxaga, Javier Muguerza, Jesús M. Pérez, and Iñigo Perona. 2013. An extensive comparative study of cluster validity indices. Pattern Recognition, 46(1):243–256.

Igor Brigadir, Derek Greene, and Padraig Cunningham. 2014. Adaptive Representations for Tracking Breaking News on Twitter. In NewsKDD - Workshop on Data Science for News Publishing at The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, August 24-27, 2014, New York, NY, USA.

Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, pages 1724–1734.

Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W. Cohen. 2016. Tweet2vec: Character-based distributed representations for social media. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, 21-26 June 2014, Beijing, China, pages 1818–1826.

Kohei Hayashi, Takanori Maehara, Masashi Toyoda, and Ken-ichi Kawarabayashi. 2015. Real-Time Top-R Topic Detection on Twitter with Topic Hijack Filtering. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 10-13, 2015, Sydney, Australia, pages 417–426.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Lawrence Hubert and Phipps Arabie. 1985. Comparing partitions. Journal of Classification, 2(1):193–218.

Georgiana Ifrim, Bichen Shi, and Igor Brigadir. 2014. Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering. In Symeon Papadopoulos, David Corney, and Luca Maria Aiello, editors, Proceedings of the SNOW 2014 Data Challenge co-located with the 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 33–40.

Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2016. Character-aware neural language models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, pages 2741–2749.

Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. In Soviet Physics Doklady, volume 10, page 707.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Christopher J. C. Burges, Léon Bottou, Zoubin Ghahramani, and Kilian Q. Weinberger, editors, Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 3111–3119.

Sean Moran, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2016. Enhancing First Story Detection using Word Embeddings. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, July 17-21, 2016, Pisa, Italy, pages 821–824.

Daniel Müllner. 2013. fastcluster: Fast hierarchical, agglomerative clustering routines for R and Python. Journal of Statistical Software, 53(1):1–18.

Xuan Vinh Nguyen, Julien Epps, and James Bailey. 2010. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. Journal of Machine Learning Research, 11:2837–2854.

Symeon Papadopoulos, David Corney, and Luca Maria Aiello. 2014. SNOW 2014 Data Challenge: Assessing the Performance of News Topic Detection Methods in Social Media. In Symeon Papadopoulos, David Corney, and Luca Maria Aiello, editors,
Proceedings of the SNOW 2014 Data Challenge co-located with the 23rd International World Wide Web Conference (WWW 2014), April 8, 2014, Seoul, Korea, pages 1–8.

Sasa Petrovic, Miles Osborne, Richard McCreadie, Craig Macdonald, Iadh Ounis, and Luke Shrimpton. 2013. Can twitter replace newswire for breaking news? In Emre Kiciman, Nicole B. Ellison, Bernie Hogan, Paul Resnick, and Ian Soboroff, editors, Proceedings of the Seventh International Conference on Weblogs and Social Media, ICWSM 2013, July 8-11, 2013, Cambridge, Massachusetts, USA.

Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL 2007, Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, June 28-30, 2007, Prague, Czech Republic, pages 410–420.

Peter J. Rousseeuw. 1987. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.

Robert R. Sokal and F. James Rohlf. 1962. The comparison of dendrograms by objective methods. Taxon, 11(2):33–40.

Jan Vosecky, Di Jiang, Kenneth Wai-Ting Leung, and Wilfred Ng. 2013. Dynamic multi-faceted topic discovery in twitter. In 22nd ACM International Conference on Information and Knowledge Management, CIKM’13, October 27 - November 1, 2013, San Francisco, CA, USA, pages 879–884.

Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. 2016. Tweet2vec: Learning tweet embeddings using character-level CNN-LSTM encoder-decoder. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2016, July 17-21, 2016, Pisa, Italy, pages 1041–1044.

Dominik Wurzer, Victor Lavrenko, and Miles Osborne. 2015. Tracking unbounded Topic Streams. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, pages 1765–1773.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 649–657.