Dissertation
By
Renhao Cui
2020
Dissertation Committee:
Rajiv Ramnath, Advisor
Gagan Agrawal, Advisor
Eric Fosler-Lussier
Ping Zhang
© Copyright by
Renhao Cui
2020
Abstract
In the past decade, social media has become the dominant platform for advertising.
The broad accessibility, rich message types, large audience size, and accurate customer
targeting allow for efficient propagation of commercial posts. In addition to helping com-
panies disseminate their advertisements more effectively, social media also provides the
opportunity for rapid receipt of customer feedback. However, many companies still rely
In this work, we demonstrate multiple methods to help companies utilize social media
for a better advertising and marketing experience by drawing from and extending machine learning techniques.
We apply ensemble models to classify user feedback based on a mixed set of label
requirements. Then we build a set of linguistic features to predict the potential of a commercial post to draw attention from its readers. Next, we develop a hybrid Long Short-Term Memory (LSTM) network to recognize the activities of users
when posting tweets. As the last step, we propose a constrained generation framework to
help rephrase commercial posts that are more diverse in terms of text and that preserve
the key information. Our work covers multiple areas of advertising on social media, from understanding user feedback to composing the commercial posts themselves.
This is dedicated to my beloved mom.
Acknowledgments
I would like to express the sincerest appreciation to my advisors, Prof. Rajiv Ramnath
and Prof. Gagan Agrawal, for their insightful, encouraging, and constant help in both my
research and my life. I also wish to say thank you to my committee members, Prof. Eric
Fosler-Lussier and Prof. Ping Zhang, for their time, guidance, and goodwill.
I would like to express my gratitude to Astute Global for their support, assistance, and
recognition of my research.
Most importantly, I am eternally grateful to the greatest family of mine for their uncon-
ditional love. They are the ones who keep me going forward.
Although I have had a tough and unpredictable time in the past few years, I want to
thank all the people I ever talked and listened to. You made my days brighter and better.
Vita
Publications
Cui, Renhao, Gagan Agrawal, and Rajiv Ramnath. “Tweets can tell: activity recognition
using hybrid gated recurrent neural networks.” Social Network Analysis and Mining, 10.1
(2020): 1-15.
Cui, Renhao, Gagan Agrawal, and Rajiv Ramnath. “Tweets can tell: Activity recogni-
tion using hybrid long short-term memory model.” Proceedings of the 2019 IEEE/ACM
international conference on advances in social networks analysis and mining, 2019.
Cui, Renhao, Gagan Agrawal, and Rajiv Ramnath. “Towards Successful Social Media
Advertising: Predicting the Influence of Commercial Tweets.” arXiv preprint arXiv:1910.12446 (2019).
Cui, Renhao, et al. “Ensemble of heterogeneous classifiers for improving automated tweet
classification.” 2016 IEEE 16th International Conference on Data Mining Workshops
(ICDMW). IEEE, 2016.
Das, Manirupa, Renhao Cui, David R. Campbell, Gagan Agrawal, and Rajiv Ramnath.
“Towards methods for systematic research on big data.” 2015 IEEE International Confer-
ence on Big Data (Big Data). IEEE, 2015.
Fields of Study
Table of Contents

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures
1. Introduction
2. Mixed Domain Tweet Classification
2.1 Introduction
2.2 Related Work
2.3 Domain Conversion of Probabilistic Output
2.3.1 Mapping labels
2.3.2 Mapping probabilities
2.4 Ensemble of Probabilistic Models
2.4.1 Dynamic Weighting
2.4.2 Stacking+ Ensemble
2.5 Experiments
2.5.1 Data preparation
2.5.2 Baselines
2.5.3 Experiment design
2.5.4 Experiment results and analysis
2.6 Summary
3. Commercial Tweet Influence Prediction
3.1 Introduction
3.2 Related Work
3.3 Influence Prediction
3.3.1 Data labeling
3.3.2 Classification model
3.3.3 Features
3.4 Experiments
3.4.1 Data preparation
3.4.2 Experiment design
3.4.3 Group label analysis
3.4.4 Experiment results and analysis
3.5 Demonstrating Use of the Framework: A Case Study
3.6 Summary
4.1 Introduction
4.2 Related Work
4.3 Working with Contextual Features using LSTM
4.3.1 Activity labeling
4.3.2 Contextual learning with LSTM
4.4 Our Proposed Hybrid-LSTM Model
4.4.1 Including historical tweets
4.4.2 Including direct contextual features
4.4.3 Hybrid-LSTM
4.4.4 Illustrative examples
4.5 Experiments
4.5.1 Data preparation
4.5.2 Experiment design
4.5.3 Experiment results and analysis
4.6 Demonstrating Use of the Approach: A Case Study
4.7 Summary
5.1 Introduction
5.2 Related Work
5.3 Constraint-Embedded Language Modeling (CELM) for Paraphrase Generation
5.3.1 Constraint identification
5.3.2 Constraint embedding
5.3.3 Causal language modeling
5.3.4 Decoding and generation
5.4 Experiments
5.4.1 Data preparation
5.4.2 Experiment design
5.4.3 Experiment results and analysis
5.5 Summary
Appendices
Bibliography
List of Tables

3.5 Prediction samples from different models where the true labels are positive
3.6 Commercial tweets about a raffle event for winning console controllers
5.3 Model performance on MULTI set
List of Figures

2.3 Performance comparison between three individual models and four ensemble models (five datasets)
2.4 Comparison between two models: using the probability input and adding tweet vector to the input (Stacking Classifier)
5.1 Commercial tweets that are posted for the same product
5.2 Overview of the constrained generation process
Chapter 1: Introduction
Social media has grown dramatically in the past decade, in terms of both its user base
and influence [68]. The various types of resources, regions, platforms, languages, and
domains enrich the ability of social media to generate abundant information about innu-
merable topics. The user base spans an array of people, from everyday users to official
organizations, from famous artists to the heads of countries [35]. The ease of use and wide
range of users have standardized the world such that it is possible for nearly everyone to
obtain useful information. The speed with which information is delivered further enlarges
the impact on all platforms. However, one thing has not changed much – the central role of advertising.
Advertising has a long history and has proved to be necessary and inevitable for all
types of businesses. A recent survey shows that 73% of marketers think social media
marketing is effective and 89% of marketers believe social media is important to their
marketing strategy (https://buffer.com/state-of-social-2019). With the growing population and expanding information networks,
advertising must be more effective and efficient. Facing these challenges, social media
platforms have become the best place to post commercial information and advertisements
[152].
Various forms of advertisements are used on social media platforms, including
text, audio, images, and videos. Rich messages such as audio and video can provide more
comprehensive, useful and interesting information to social media users [110, 173]. Rich
messages also have a higher information bandwidth, that is, they express more ideas more
fully than simple text messages. Nevertheless, rich messages have disadvantages. They
consume more resources, and not all devices have the ability to compose and send rich
messages. Readers look mostly at the text content for detailed and critical information. Fi-
nally, it is easier to generate an effective advertisement using text than using audio or video.
For all these reasons, text messages are still the most widely used means of delivering ad-
vertisements. Given the above, this dissertation focuses on text as the only advertising
medium. We also focus on Twitter data in all the projects described. We believe that the
way text data are used in most social media platforms is similar; therefore, our research and findings should generalize to other platforms as well.
A successful social media advertising system requires several capabilities, which this dissertation examines in turn.
The core of an advertisement is the information about the product or promotion that
needs to be disseminated. Given that the primary requirement of any advertisement system
is to help improve the effectiveness of social media advertising of the product being promoted, this dissertation examines several parts of the social media advertising process, in order to improve the utilization and impact of advertising through social
in order to fill the gaps in improved utilization and impact of advertising through social
media.
Prior analyses of Twitter data operate at different granularities: single tweets, tweet streams, or authors. Sentiment analysis [27, 140, 111, 45] is the most
popular task. Such analysis is relatively accurate and efficient because of the large user base
and their intention of expressing personal opinions. As a standard classification task, the
work on sentiment analysis of tweets helps establish the basic knowledge for many subsequent
tasks. Exploring the political views of people [26, 25, 9, 151] is another practical use of
tweet analysis. In addition, tweet analysis has become an important source of feedback for
companies to monitor the opinions of their users. In addition to Twitter's own analytics system (analytics.twitter.com), a wide range of tools such as Hootsuite (hootsuite.com), Klear (klear.com), and SocialBakers (socialbakers.com) can help companies track and analyze this feedback.
Analysis done on the large population of Twitter users, also known as user profiling,
is another area that can provide insights and information. User interest [175, 10, 64], as
the most popular attribute, has been explored widely based on the content of tweets as
well as the relationship with other users. Some work [1, 93] has focused on modeling
Twitter users for different recommendation tasks. Other work [123, 61] has been built
to profile Twitter users for a variety of purposes. Profiling users requires analyzing a large volume of tweets as the base, but it also draws on other information such as user metadata and the relationships across users. In our work, however, we focus on user-independent features, so that the models can be applied without knowing any of the user's background information.
Information extraction is another useful application utilizing tweets. The real-time as-
pect and large user base of Twitter allow for the extraction of information accurately and
efficiently. Event detection and extraction [138, 126, 84, 166] aim to extract key infor-
mation regarding events such as the time, place, and description. In addition to regular
social events, the detection of earthquakes [36, 141], crimes [160], and festivals [80] has practical real-life value as well. Events and news serve as common pivots to locate potential paraphrases in Twitter. However, it still requires manual work to create a high-quality
paraphrase dataset for tweets. Therefore, we use a large sentential paraphrase dataset con-
structed from a general domain and transfer the knowledge to a tweet paraphrase task.
Unlike analysis and information extraction through Twitter, the work on generating
tweets falls behind because of the lack of appropriate models, data, and use cases. News tweets were the first to utilize automatic tweet-generation models. This requires the
support of traditional news articles: the tweet is generated as a summarization of the news article, typically with a link in the tweet content [144]. In summary, the generation of tweets usually requires
a strong support resource or a clear purpose of the generation. On the other hand, the
paraphrase generation of tweets relies on existing tweets and must preserve certain properties of those tweets.
1.3 Summary of Contributions
In order to improve the utilization of advertising through social media, our work focuses
on multiple aspects to fulfill certain tasks automatically or assist human agents for better
decision-making. To this end, we make the following efforts.
First, we study the classification of user tweets under a mixed-domain label requirement. The designated labels span different domains, whereas typical classification models work well only on labels from a single domain. We use an ensemble of heterogeneous classifiers to improve the performance of the mixed-classification task, and we propose a way to map the probabilistic output of a third-party classifier into the target label domain.
Second, to predict the influence of commercial tweets, we build a set of linguistic features and generate the prediction using standard classifiers. These features do not contain
the inherent meaning of the tweet; therefore, the model generalizes to most commercial
tweets. To complete the study, we collect a dataset of commercial tweets that contains orig-
inal tweets posted by multiple official accounts of popular companies in different fields. We
conduct an ablation analysis on the features and reveal their importance for a successful commercial post. Relying on the use of linguistic features, we demon-
strate that a commercial post can be modified accordingly to draw more attention from its
audiences.
Third, we introduce a system to profile users based on their offline activities. Based on the
reported location, we create a tweet dataset labeled with the related user offline activities for
the experiment. We research several existing methods that include contextual information
in LSTM-based models, and propose a hybrid-LSTM model that can take different types of contextual information as inputs. With this model, we justify the relation between the characteristics of a company and the major offline activities of its audience.
Finally, we propose a constrained paraphrase generation framework for commercial tweets. The constraints cover key elements
in the content which should be kept in the generated paraphrase. We utilize a sentential
paraphrase dataset constructed on a general domain and apply the trained model to a col-
lection of commercial tweets. The hard constraints are identified and embedded directly
into the content data. Language models are used to learn from the constraint-embedded
data, and the framework provides solid improvement by exploiting the constraint information.
Our proposed models and datasets demonstrate a meaningful attempt at systems that
can improve the effectiveness of social media marketing and advertising. The models and
datasets cover different steps of successful social media advertising; therefore, our contributions span the overall advertising workflow.
Chapter 2: Mixed Domain Tweet Classification
2.1 Introduction
The initial step we consider in this chapter is better understanding user feedback on
social media. Twitter has become an invaluable resource for businesses because it can
provide high-volume opinion streams in real-time from real users. Extremely useful infor-
mation can be derived from tweets, such as complaints or compliments about a service, or
interest in purchasing a product. For example, a tweet, “I just had one of your grass-fed
ribeyes. I have only one thing to say: blah,” expresses disappointment about the steak,
whereas, “When does the Wild Madagascar Vanilla come out? Can’t wait,” shows interest in an upcoming product. Such tweets can serve as an effective replacement for marketing surveys, with the added benefit that this feedback is unsolicited and arrives in real time.
However, Twitter streams are uncategorized and noisy and, practically speaking, useless
in their raw form. In order to be useful, tweet data should be cleansed and categorized.
Businesses seek to have the tweets classified with a set of specific, predetermined topics.
Manual processing is not possible because of the complexities of the classification task and the sheer volume of the data.
For the task of assigning labels to a text, many automated classification and topic mod-
eling systems have been proposed and applied. Latent Dirichlet Allocation (LDA) [13]
along with its variations Labeled-LDA (LLDA) [133] and Linked-LDA [12], are popular
topic-modeling approaches that generate a probability distribution over topics for each doc-
ument. Classical classifiers such as Naive Bayes [97] and Maximum Entropy [114] have
also been applied, and these classifiers generate a probabilistic output over a list of labels.
However, these probabilistic outputs are not directly usable in real business cases;
what is needed is a discriminating output that simply presents the most appropriate label –
not necessarily the label with the highest probability. Moreover, another challenge is that
the classification categories or purposes are not defined over the same domain (e.g., when
the need is to classify messages that are about new promotions as well as customer feedback about existing products). A single classifier does not work well for the mixed-classification problem because of the limitation
that most classifiers are designed and trained to work in a specific setting or for a specific
purpose. For example, features and models selected to classify product feedback often
do not work well for the classification of new promotions. Instead, building separate classifiers and combining their outputs is a more promising direction.
This chapter addresses the problems mentioned above. First, we build a mapping
method that converts the probabilistic output of a third-party application programming in-
terface (API) with its own predefined (but hidden) universal label corpus into the domain
of the dataset being investigated. Along with mapping the labels, the method also gener-
ates a new probability distribution that integrates the probability distribution of the original
labels with the confidence of the mapping function. Second, to improve the classification performance, we develop ensemble methods that combine the probabilistic output from multiple heterogeneous classifiers.
do this by considering the probabilities associated with the outputs as the representation
of the document or the reliability of the classification from each individual model. The
stacking ensemble model used can further benefit from the addition of tweet vectors in its
learning step.
The specific approaches we combine include the LLDA model, the Naive Bayes classifier, and the third-party text classification API. The proposed Stacking+ Ensemble method improves the accuracy over the best individual model in the ensemble as well as the baseline
ensemble models. As an average across all datasets, we noted a 29.1% reduction in the
number of inaccurate predictions compared to using the best baseline ensemble method for these datasets.
2.2 Related Work
We first compare our work with other efforts on classifying social media postings and
ensemble methods. Many efforts have attempted to apply traditional models to social media
data. However, difficulties arise in applying these models directly to this special domain
and results largely have not been satisfactory [188]. Therefore, many research efforts have adapted these models specifically for social media data.
Latent Dirichlet Allocation (LDA) is a well-established model built for structured data.
To solve the problem of the limited size of any tweet post, one popular solution is to ag-
gregate tweets together into a macro-document [104], whereas the Author-Topic model
merges author meta data into the model [139]. Moreover, a Temporal-LDA (EM-LDA)
model brings temporal influence to tracking the transition of topics in social media [163].
Multiple variations also exist that incorporate supervision into the LDA model, such as
LLDA and supervised LDA (sLDA) [100]. LLDA allows multiple labels for each docu-
ment by a one-to-one mapping between labels and the latent topics, whereas sLDA only
gives a single label to each document based on the mixture of the latent topics. LLDA has
been used as a supervised classifier, for example, Ramage et al. [132] used it to characterize
microblogs.
Bayesian classifiers have been explored for many years, and their simplicity and effi-
ciency make them popular in many classification tasks. A multivariate Bernoulli model
using binary word features [76, 69], and a multinomial model with unigram word count
features [82, 108, 101] are early approaches to classify text documents. For the case of
unstructured data, many works utilize the Naive Bayes model for sentiment classification.
Ensemble methods have long been studied as a mechanism to improve the performance of classification systems. Prior work [31] categorizes ensemble methods, covering manipulation of the output hypotheses, the training data, the input fea-
tures, the output targets, and adding randomness to the problem. Randomness is the basis
for certain ensemble models such as RandomForest [89], whereas Adaboost [40] constructs
ensembles by manipulating the training data. However, these methods do not benefit from
combining the capabilities of different individual models, which can be critical when fac-
ing the problem of mixed classification. Thus, combining heterogeneous classifiers can be
beneficial, and has been explored in the past, for example, by Tsoumakas et al. [150] and
Klein et al. [67]. In addition, Product of Experts (PoE) has been developed to combine multiple probabilistic models, producing sharper distributions than using individual ones [51, 52]. Raftery et al. [131] combine different probabilistic models
by extending the Bayesian Model Averaging (BMA) method and learning the weight from
the data, with the assumption that the outputs of involved models follow a predefined prob-
ability distribution. This approach has been applied to tasks like the prediction of political
events [109], stock prices [11], weather [44], and exchange rates [171]. Moreover, Kolter et
al. use an ensemble method, Dynamic Weighted Majority, to track concept drift [71, 70].
2.3 Domain Conversion of Probabilistic Output
Our ensemble of classifiers involves using a third-party API and mapping its output to a certain domain. Specifically, we have chosen the taxonomy function from the Alchemy API (http://www.alchemyapi.io/) suggested by Quercia et al. [129] to categorize documents into different topic classes
with associated probabilities. The Alchemy taxonomy function is built on a deep neural
network model, with its predefined universal topic corpus. However, the predetermined
nature of its domain introduces a challenge in using it for our goals, which is to classify
tweets over a set of domain-specific topics. To address this mismatch, we have developed
a mapping methodology that maps the probabilistic output of Alchemy from the universal label domain to the domain-specific topics.
2.3.1 Mapping labels
Our goal is to find the pattern of relationships between domains based on the idea of co-occurrence of different labels given the same document. The overall process of the domain conversion is shown in Figure 2.1. Assume that the API uses a label corpus of size $n$ (forming domain $X$). For a document $d$ in the dataset $D$, assume that the API outputs a list of $k$ labels $x_{d,1}, x_{d,2}, \ldots, x_{d,k}$ (in domain $X$) with the
associated normalized probabilities $p^x_{d,1}, p^x_{d,2}, \ldots, p^x_{d,k}$. Let the actual $h$ labels (assigned by a domain expert for training purposes) for the document be $y_{d,1}, y_{d,2}, \ldots, y_{d,h}$ (in domain $Y$).

Figure 2.1: Label mapping across different domains
Using the above information, we build a mapping relation between all possible pairs
of labels from the two domains. A mapping confidence score is given to each mapping:

$$\mathrm{confidence}(x_i, y_j) = \frac{C(x_i, y_j)}{C(x_i)} \qquad (2.1)$$

Here, $C(x_i)$ is defined as the partial document count for label $x_i$, and $C(x_i, y_j)$ is the partial document count where $x_i$ and $y_j$ are assigned as labels for the same document in their domains:

$$C(x_i) = \sum_{d \in D} C(x_{d,i}) \qquad (2.2)$$

$$C(x_i, y_j) = \sum_{d \in D} \min\big(C(x_{d,i}),\, C(y_{d,j})\big) \qquad (2.3)$$
where the partial count $C$ is calculated as:

$$C(x_{d,i}) = p^x_{d,i}, \qquad C(y_{d,j}) = \frac{1}{h} \qquad (2.4)$$

To clarify, in the above calculations, we use the normalized probability associated with each assigned label for each document as the partial count. However, for domain $Y$, because the labels do not have a probability distribution associated with them, we assume they are equally important, so we give a uniform count ($1/h$) to all the assigned labels.
After building all possible relations seen in the training data, we construct the mapping function from label $x_i$ to label $y_j$, where the mapping for $x_i$ is chosen with the highest mapping confidence score between $x_i$ and the target label $y_j$. For an unknown label $x_i$ in domain $X$ during inference, we use the candidate label $y_c$ that has the largest total partial count as the mapped label in domain $Y$. Thus, this label mapping process generates a many-to-one mapping function from domain $X$ to domain $Y$.
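To make the construction concrete, the following Python sketch (an illustration only; the data structures and names such as api_output and gold_labels are ours, not from the released code) accumulates the partial counts of Equations 2.2 and 2.3 and derives the many-to-one mapping with its confidence scores (Equation 2.1):

from collections import defaultdict

def build_label_mapping(api_output, gold_labels):
    # api_output: doc id -> list of (x_label, normalized probability) pairs
    # gold_labels: doc id -> list of true labels in domain Y
    c_x = defaultdict(float)     # C(x_i), Eq. 2.2
    c_xy = defaultdict(float)    # C(x_i, y_j), Eq. 2.3
    for doc, x_pairs in api_output.items():
        y_labels = gold_labels[doc]
        y_count = 1.0 / len(y_labels)        # uniform partial count in Y
        for x_label, p in x_pairs:
            c_x[x_label] += p                # partial count = probability
            for y_label in y_labels:
                c_xy[(x_label, y_label)] += min(p, y_count)
    mapping, score = {}, {}
    for (x_label, y_label), joint in c_xy.items():
        conf = joint / c_x[x_label]          # confidence(x_i, y_j), Eq. 2.1
        if conf > score.get(x_label, 0.0):   # keep the best target label
            mapping[x_label], score[x_label] = y_label, conf
    return mapping, score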
2.3.2 Mapping probabilities
Let $score(x_i, y_i)$ denote the highest confidence score associated with the mapping from $x_i$ to its mapped label $y_i$.
The mapped probability $p^y_{d,i}$ for the mapped label $y_i$ from document $d$ is calculated as:

$$p^y_{d,i} = \frac{1}{Z}\left(p^x_{d,i} \times score(x_i, y_i)\right) \qquad (2.7)$$

$$Z = \sum_{i=1}^{k} p^x_{d,i} \times score(x_i, y_i) \qquad (2.8)$$
The new mapped probability combines the original distribution in domain X with the
score of the applied mapping function as the confidence of the mapping. In this case, the
mapping function transfers the probabilistic output from one domain to the other.
Considering the rare cases where an unknown label $x_i$ is encountered, we assign a very small new mapped probability to the candidate label $y_c$ in domain $Y$ before normalization. However, if no mapping function can be found for any of the labels of a document, the small probability is assigned to the candidate without normalization, which indicates the unreliability of this mapped label. For the cases in which the mapping function converts different labels from domain $X$ to the same label in domain $Y$, the new probability score for that label $y_i$ is the summation of the scores $p^y_{d,i}$ over the multiple mapped labels.
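Continuing the sketch above, the probability conversion of Equations 2.7 and 2.8, including the merging of different source labels mapped to the same target label, can be written as follows (the special handling of unknown labels is omitted for brevity):

def map_probabilities(x_pairs, mapping, score):
    # x_pairs: list of (x_label, probability) pairs for one document
    mapped = {}
    for x_label, p in x_pairs:
        if x_label in mapping:               # skip unmapped labels here
            y_label = mapping[x_label]
            # sum over X labels that map to the same Y label
            mapped[y_label] = mapped.get(y_label, 0.0) + p * score[x_label]
    z = sum(mapped.values())                 # normalizer Z, Eq. 2.8
    return {y: v / z for y, v in mapped.items()} if z > 0 else {}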
2.4 Ensemble of Probabilistic Models
We first describe an initial study where we apply LLDA and the Naive Bayes classifier
to the dataset. Table 2.1 shows an example of the results obtained – the dataset contains
five labels and the numbers shown are the probability distribution outputs for each model
given the same tweet. If we use these models individually, the natural deterministic outputs
will be Hardware for LLDA and Craft for NaiveBayes. However, the actual label is
XACTO, and it can be derived by the highest sum of probabilities from the two models. In
general, it is intuitive that different models could be suitable for different task domains.
Therefore, we can improve the classification process by combining multiple models; this observation motivates the ensemble approaches below.
We propose a set of practical approaches to create ensemble models. Our starting point
is the ideal Bayesian voting model, which has been shown to be theoretically optimal [31].
$$y = \operatorname*{argmax}_{c_j \in C} \sum_{m_i \in M} P(c_j \mid m_i)\, P(m_i \mid T) \qquad (2.9)$$

$$\;\; = \operatorname*{argmax}_{c_j \in C} \sum_{m_i \in M} P(c_j \mid m_i)\, P(T \mid m_i)\, P(m_i) \qquad (2.10)$$
where $C$ is the collection of all classes, $M$ is the collection of all involved models, and $T$ is the training data.
The problem with the above voting model is that it is not feasible in practice because
the weights $P(T \mid m_i)P(m_i)$ associated with each model output related to a certain class are impossible to determine. More precisely, $P(T \mid m_i)$, as the local performance, shows how an individual model fits the training data, whereas the prior $P(m_i)$, as the global performance, represents the general fit of the individual model. $P(T \mid m_i)$ can be estimated by the performance of the model on a certain dataset; however, $P(m_i)$ cannot be calculated or estimated.
To address this problem, we develop a weighting method that learns the weights from the provided data. In addition, we propose another method, Stacking+ Ensemble, which builds on the well-known stacking method [170] but improves it further by enriching the input representation.
2.4.1 Dynamic Weighting
Instead of using model-specific weights like BMA [65, 54], we borrow the idea of
dynamic weights from Kolter et al. [71, 70], which learns the weight directly from the data.
However, Kolter's method has certain limitations for this task: it jointly trains the weights and the involved models, and it also drops a model that does not contribute to the correct final output while adding a new model to the ensemble to optimize the overall performance. In contrast, the individual models are fixed in our case, and we rely on this fixed set of individual models to create the ensemble system.
Therefore, our approach is as follows. We assume that a real-valued weight $w_i$ exists for each model $m_i$, which works as a multiplier to the probabilities $P(c_j \mid m_i)$ for $c_j \in C$, and the final output class is the one with the highest weighted sum. The overall
process for Dynamic Weighting is listed as Algorithm 1. Unlike Kolter’s method, which
triggers the update based on the local prediction, the weight update happens only when
the ensemble model makes an incorrect prediction on a training case. The update is then
applied to the weights whose corresponding models give an incorrect local prediction.
We treat the update as a multiplication with a single constant learning rate – note that a
subtraction will not ensure that the weights are always positive. Because the update only
decreases some weights, we renormalize all the weights to promote the remaining weights and
ensure consistency during training. Finally, the output weights are the average weights seen across all updates during training.
Algorithm 1 Dynamic weighting algorithm
  $m_i, w_i$: individual model $i$ and its weight
  $y_d$: true class for case $d$
  $c_j$: class $j$
  $u$: update count
  $s_i$: weight summation for model $m_i$
  $\Lambda^d$: ensemble prediction for case $d$
  $\lambda_i^d$: local prediction of model $i$ for case $d$
  $\beta$: learning rate, $0 \le \beta < 1$
  $n, e$: number of models, number of epochs

  initialize $w_i \leftarrow 1/n$, $i = 1, \ldots, n$
  for count $= 1, \ldots, e$ do
    for all cases $d$ in the training set do
      $\Lambda^d \leftarrow \operatorname{argmax}_{c_j \in C} \sum_{i=1}^{n} P(c_j \mid m_i)\, w_i$
      if $\Lambda^d \neq y_d$ then
        for $i = 1, \ldots, n$ do
          $\lambda_i^d \leftarrow \operatorname{argmax}_{c_j \in C} P(c_j \mid m_i)$
          if $\lambda_i^d \neq y_d$ then
            $w_i \leftarrow w_i \beta$
          end if
        end for
        $W \leftarrow$ re-normalize$(W)$
        for $i = 1, \ldots, n$ do
          $s_i \leftarrow s_i + w_i$
        end for
        $u \leftarrow u + 1$
      end if
    end for
  end for
  return weights $w_i = s_i / u$, $i = 1, \ldots, n$
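As a compact illustration, Algorithm 1 can be rendered in NumPy as below; this is a sketch under the assumption that the per-model probability distributions have been precomputed into a dense array, which is not necessarily how the original experiments were organized:

import numpy as np

def dynamic_weighting(probs, truth, beta=0.9, epochs=30):
    # probs: array of shape (n_models, n_cases, n_classes)
    # truth: array of shape (n_cases,) with true class indices
    n_models, n_cases, _ = probs.shape
    w = np.full(n_models, 1.0 / n_models)     # initialize w_i = 1/n
    s, u = np.zeros(n_models), 0
    for _ in range(epochs):
        for d in range(n_cases):
            ensemble = np.argmax(probs[:, d, :].T @ w)   # weighted vote
            if ensemble != truth[d]:
                local = np.argmax(probs[:, d, :], axis=1)
                w[local != truth[d]] *= beta  # demote locally wrong models
                w /= w.sum()                  # re-normalize all weights
                s += w                        # accumulate for averaging
                u += 1                        # count this update
    return s / u if u > 0 else w              # average weights over updates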
Because of the limited size of the training data, we use a single weight instead of a
vector weight as a general case, which demotes or promotes all the classes at the same
time. Our test also shows that adding a prior $P(T \mid m_i)$ to the corresponding model does not improve the performance.
2.4.2 Stacking+ Ensemble
A weighting-based ensemble treats the output distributions as the likelihood of the labels provided by each individ-
ual model. However, results from our initial evaluation show that a simple arithmetical
combination of the probability distributions cannot generate the correct output. The arith-
metical combination is sensitive to the type of probability distribution from the individual
models, and one irregular distribution could have a huge impact on the overall system. In
other words, balanced distributions and sparse distributions require very different ensemble
methods.
To make the probabilistic-based ensemble model more general and adaptive to different
kinds of distributions, we introduce the stacking classifier to the ensemble process. This
classifier is used to combine the output from the individual models and generate the final
decision. Unlike the previous ensemble methods that treat the probability from the indi-
vidual models as an indicator for reliability, the Stacking Ensemble method takes it as a
representation of the document provided by the individual model and maps it to the output
label.
Figure 2.2 shows the overall design for the proposed Stacking+ Ensemble model, which
builds on the idea of stacked generalization [170]. There are two layers of classifiers in
the model. In the first layer, the classifiers are the involved individual ones that generate
different kinds of probability distributions. In the second layer, the classifier takes the
concatenation of the probability distributions from the first layer as the input and generates the final output. We first train the individual classifiers independently. Then the stacking
classifier is trained using the output distributions from the first layer and the correct label
for each case. This ensures the generality of the ensemble method and keeps the training of the stacking classifier independent of the individual classifiers.

Figure 2.2: Process of Stacking+ ensemble model

In order to further improve the differentiation and mapping ability of the traditional
stacking model, we add an n-gram vector of the document to the input of the stacking clas-
sifier. This forms the Stacking+ Ensemble model, in which the stacking classifier takes the
appended feature of the probability distribution and the document vector. On top of the tra-
ditional stacking model, Stacking+ Ensemble enriches the representation of the document beyond the probability distributions alone.
The stacking classifier can be chosen from a wide range of machine learning models,
and the input probabilities can be handled in different ways inside the models. This so-
phisticated process ensures the ability of this model to deal with more complicated input
situations. Thus, the Stacking+ Ensemble method is less sensitive to the individual models
than most existing probability-based ensemble models. In addition to thinking of the two-
layer system as an ensemble model, the classifiers along with their output in the first layer
can also be viewed as a special process of feature extraction for the document, which then feeds the second-layer classifier.
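As an illustration, the Stacking+ input can be assembled with scikit-learn as below; the function and variable names are ours, and LogisticRegression stands in for the Maximum Entropy stacking classifier used in the experiments:

import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def train_stacking_plus(tweets, first_layer_probs, labels):
    # first_layer_probs: list of (n_samples, n_classes_i) probability
    # arrays, one per individual model
    vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=2)
    ngram_vectors = vectorizer.fit_transform(tweets)     # document vectors
    prob_features = csr_matrix(np.hstack(first_layer_probs))
    features = hstack([prob_features, ngram_vectors])    # Stacking+ input
    stacker = LogisticRegression(max_iter=1000).fit(features, labels)
    return vectorizer, stacker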
2.5 Experiments
2.5.1 Data preparation
The experiments use a real-world tweet dataset that contains tweets
collected from normal consumers mentioning certain brands over a seven-month period.
The brands, the number of labeled tweets, and the size of their topic corpus are shown
in Table 2.2. Because the data are obtained from a social media platform, the content of
the tweet is normalized by removing the mentioned usernames, tokenizing the links, and
removing redundant punctuation. We keep the stop words because we find that removing them hurts classification performance.
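A minimal sketch of this normalization (the exact rules in our pipeline may differ slightly) is:

import re

def normalize_tweet(text):
    text = re.sub(r"@\w+", "", text)                 # remove mentioned usernames
    text = re.sub(r"https?://\S+", "<link>", text)   # tokenize links
    text = re.sub(r"([!?.,])\1+", r"\1", text)       # drop redundant punctuation
    return re.sub(r"\s+", " ", text).strip()         # note: stop words are kept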
Each of the vendors has provided a predetermined topic corpus used for classification
on its tweet stream. The topic corpus represents the interests of the brand and classifying a
tweet into topics helps in processing the feedback received from the customers. The ven-
dors also provided a set of keywords and logic rules to perform the labeling/classification
process. However, the limitation of such strict labeling is that only a small fraction of
tweets can be assigned a label – in fact, this is what motivates the need for the automated
tweet classification. At the same time, the tweets that are labeled using the client- or brand-
specific rules can now be used for training and evaluating the quality of the classifiers. One
complication with the labeled tweets is that a single tweet can be assigned more than one label.
2.5.2 Baselines
Several existing and popular ensemble methods take probability distributions from in-
dividual models as the input and generate the result. Some of the well-known examples are
BMA and Mixture of Gaussians. These methods are similar in that they combine the prob-
ability distributions using summation but differ in the specifics of how they achieve this.
Another approach multiplies the distributions and renormalizes; it is called Product of
Experts (PoE) [52]. Each individual model is considered an expert, and the distributions are combined as:

$$p(d \mid \theta_1, \ldots, \theta_n) = \frac{\prod_m p_m(d \mid \theta_m)}{\sum_c \prod_m p_m(c \mid \theta_m)}$$

where $d$ is a data vector, $c$ covers all possible vectors in the data space, $\theta_m$ is all the parameters of individual model $m$, and $p_m(d \mid \theta_m)$ stands for the probability of $d$ given by model $m$.
Because the probabilities are combined by multiplication, PoE can generate much
“sharper” distributions than the individual expert models. Therefore, the correct output is
easier to extract from such a combination of individual distributions. Fitting a PoE model
requires optimizing the likelihood of the data, which involves tuning the settings of the
individual models. To compare the performance of different ensemble methods, we fix the
setting of all involved individual models, i.e., all the ensemble methods share the same set of individual model outputs.
In addition to the PoE model, we also implemented a simple ensemble model that gen-
erates the final output based on the weighted summation of the input distributions (Weighted
Sum). In this method, the weights are (statically) determined by normalizing the individual models' accuracies.
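For reference, both baseline combiners can be sketched as follows for a single document, where P holds one probability distribution per model and acc holds the individual models' accuracies (our assumed source for the static weights):

import numpy as np

def weighted_sum_predict(P, acc):
    # P: array (n_models, n_classes); acc: per-model accuracy scores
    w = np.asarray(acc, dtype=float)
    w /= w.sum()                             # statically normalized weights
    return int(np.argmax(w @ P))

def poe_predict(P, eps=1e-8):
    combined = np.prod(P + eps, axis=0)      # multiply expert distributions
    return int(np.argmax(combined / combined.sum()))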
2.5.3 Experiment design
For experimenting with different ensemble approaches, we use three individual models:
Labeled LDA (LLDA), Naive Bayes classifier (Naive Bayes/NB), and Alchemy API with
the mapping function for output conversion (Alchemy/Alc) that we have introduced. The
individual models and the ensemble methods are evaluated with a five-fold cross validation.
LLDA is typically considered a topic modeling method, but because it generates a prob-
ability distribution across topics for each document, it can also be used as an individual
model for this task. We use the Stanford Topic Modeling Toolkit (http://nlp.stanford.edu/software/tmt/tmt-0.4/) for both the training
and inference steps of the LLDA model, with Gibbs Sampling set to run for 1000 itera-
tions. The Naive Bayes classifier is shown to work well on small datasets [63] or short
documents [157], and it fits the situation of our experiments. Alchemy API returns three
topics for each document; thus, the output conversion generates exactly three labels with the mapped probabilities.
Given the casual writing style of tweets, we trim the word vector for the Naive Bayes
classifier to eliminate the most and least frequent words in the corpus. With this extra step,
we can reduce the influence from nondictionary words, typos, or words that are too com-
mon or too rare to carry much differentiation information. The tweets are then featurized
using a combination of unigram and bigram models for the NaiveBayes classifier. We set
the learning rate β for the Dynamic Weighting model to be 0.9 and stop the learning pro-
cess after 30 epochs. These classifiers are implemented using the scikit-learn package [17], and all the code and data for the experiment are publicly accessible (https://goo.gl/RfjjBH).
In order to make an ensemble method effective, the involved individual models need
to be diverse [31]. In other words, there should be certain cases where different individual
models do not give the same output (and among such cases, no single model is always
correct). We validate this requirement for the experiment by building an ideal ensemble
system that marks a case correct whenever at least one involved model is correct. The
accuracy of the ideal system is considerably better than that of any individual model, which confirms that the involved models are sufficiently diverse.
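This diversity check amounts to computing an upper bound, sketched below, where a case counts as correct if any individual model predicts one of its true labels:

def ideal_ensemble_accuracy(local_predictions, true_labels):
    # local_predictions[d]: per-model predicted labels for case d
    # true_labels[d]: set of acceptable labels for case d
    correct = sum(
        any(pred in truth for pred in preds)
        for preds, truth in zip(local_predictions, true_labels)
    )
    return correct / len(true_labels)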
Because the choice of the stacking classifier is flexible, we use the Maximum Entropy model in this role. The evaluation metric is accuracy, which is simply the fraction of the test data records where the model predicts the correct class.
In our evaluation, all individual and ensemble models output one deterministic label for
each tweet. Because all the true labels assigned to the same tweet are considered equally
possible, we simplify the evaluation process by considering our system to be correct if the
predicted label is among the actual true labels. We found that 71.2% of the tweets in the
dataset have only one label, so the simplification does not have a significant impact on the
results we report here. We first build each of the three individual models and report the
accuracy of the predictions. Next, we apply the two baseline ensemble models – Weighted
Sum and PoE, along with Stacking Ensemble, the proposed Dynamic Weighting and the
Stacking+ Ensemble model, for all possible combinations of the three individual models.
2.5.4 Experiment results and analysis
Figure 2.3 shows the performance for the three individual models and four ensemble
methods across five datasets. For the ensemble methods, we report results for the three
possible pairwise combinations of the individual methods and the combination of all three
methods (all).
Overall, the Stacking Ensemble method generates the best performance, which im-
proves over the individual models and outperforms the other ensemble methods. Among
all three individual models, LLDA and NaiveBayes have comparable performance, while
Alchemy with the mapping function has lower performance (we hypothesize that this is because of the information lost in the domain conversion). For PoE, we assign a small smoothing probability to those labels that have probabilities of 0 and renormalize the distribution.
If Alchemy is involved in the pairwise combination, PoE performs worse than the better
individual model. Because Alchemy generates only three labels for each document, the
probabilities for the other labels are smoothed to a very small value. Thus, the multipli-
cation of the distributions indicates that the probability of most labels is still a very small
number, except for the three labels generated by Alchemy. As a consequence, if the other
involved models do not have relatively high probabilities for the three labels generated
by Alchemy, the final distribution of probabilities is not helpful in labeling. On the other
hand, LLDA and NaiveBayes provide better distributions, and thus PoE can improve the performance over the individual models.

Figure 2.3: Performance comparison between three individual models and four ensemble models (five datasets)
For most cases, the baseline of Weighted Sum and the proposed Dynamic Weighting
method both improve the performance of individual models by combining their outputs.
However, the combination of Naive Bayes and Alchemy using the Weighted Sum model
is the only case where the performance is worse than NaiveBayes alone. The special dis-
tribution generated by Alchemy is the main reason. In general, the Dynamic Weighting
method outperforms the Weighted Sum method, especially when an unreliable model such
as Alchemy is involved. The weights of the Weighted Sum method are limited to the per-
formance of the individual models – the difference of the weights would not be too large in
most cases. On the other hand, as a data-adapted model, Dynamic Weighting can learn a
more precise and adaptive weight for each model, leading to a wider range for the weights.
Unlike the previous three ensemble methods, Stacking Ensemble does not rely on a
direct combination of the probability distributions. This ensemble method utilizes another
classifier, which can bypass or reflect the true effect of incorporating a special distribution. It has the best overall performance among all compared
ensemble methods, and it also has a reasonable improvement over any involved individual
model. In addition, the improvement is stable across all combinations, which shows that
this method is less sensitive to the individual models and their distributions. Especially for
the cases where Alchemy is involved, it was still able to generate reasonable improvement.
For all models and their combinations, lower accuracy is seen for two brands: BathAndBodyWorks and Triclosan; we think the large number of topics/classes and the small number of cases for each topic class are the main reasons. It is natural that the prediction is harder
when there are more classes to choose from and, even more so, when sufficient training data are not available for each class.
In addition to the probability input feature, we further append the tweet vector to the
input of the stacking classifier and report a performance comparison between the Stacking
and Stacking+ Ensemble model in Figure 2.4. The additional tweet vector is the same as
the one featurized for the training of the Naive Bayes classifier.

Figure 2.4: Comparison between two models: using the probability input and adding tweet vector to the input (Stacking Classifier)
The result shows that the addition of the tweet vector can help the Stacking+ Ensem-
ble model further improve performance. The improvement by adding the tweet vector is
solid across all ensemble combinations and datasets, and it reaches as much as 20.9% over
the Stacking Ensemble. It is clear that this improvement is obtained at the cost of an enlarged input feature space and additional computation.
Overall, compared against the baseline Weighted Sum ensemble method applied to
three individual models, the Stacking+ Ensemble model reduced the number of inaccurate predictions by 33.9%, 49.4%, 22.1%, 37.5%, and 2.7% for Elmer's, Chili's, Bath and Body Works, Domino's, and Triclosan, respectively. The Stacking+ Ensemble model results in significant improvements for four of the five datasets in the experi-
semble model results in significant improvements for four of the five datasets in the experi-
ment. Taken as an average across Elmer’s, Chili’s, and Domino’s (three datasets where we
had sufficient training data for each label), we have an average accuracy of 0.8339 when
combining all three models with weighted sum, whereas the Stacking Ensemble model im-
proved the accuracy to 0.8702, and Stacking+ Ensemble increased the accuracy to 0.8995.
Thus, on the average, there was a 39.5% reduction in the number of inaccurate predictions.
2.6 Summary
This work introduces new ensemble classifiers – a Dynamic Weighting system and a Stacking+ Ensemble model with additional tweet vectors. These ensemble systems can combine heterogeneous classifiers with probabilistic outputs. We demonstrate the effectiveness of this approach on a real-world tweet classification task, which is a
mixed-classification problem that benefits from combining classifiers that are designed for
different domains or purposes. The unique characteristic of the stacking ensemble method
eases the requirement that the distributions generated by the involved models need to be
compatible, and thus it is more adaptive and flexible. Through detailed evaluations with five real-world datasets, we show that the proposed methods deliver consistent accuracy improvements.
Chapter 3: Commercial Tweet Influence Prediction
3.1 Introduction
Our next work focuses on the success of commercial tweets, specifically, the influence
of commercial tweets. The rapid growth of social media is driving the increased use of
social platforms for advertising. Many companies have official accounts on social media
platforms to maintain customer relationships, spread news, and attract more attention. In
fact, companies have used their official accounts on Twitter to post commercial tweets
that are primarily visible to their followers. For example, “Is your New Year’s resolution ...” is a typical opening of such a commercial tweet.
Analysis of tweets has attracted significant attention from the data-mining community
in recent years. The massive volume, real-time nature, large geographical coverage, and
public availability of Twitter data have led to this heightened interest. Mining Twitter data
has been demonstrated to be useful for tasks such as earthquake detection [141], stock
market prediction [15], public health applications [121], and open-domain event extrac-
tion [138].
In using social media to advertise their products and promotions, companies are seek-
ing more engagement from their readers as part of maintaining an effective online strategy.
This need has led to a new class of services being offered to help companies build their social presence. Social customer relationship management (CRM), compared with traditional CRM, aims to provide a closer and more direct communication between the company and its customers through different social platforms. Moreover, how to make individual posts more effective remains an intriguing and open question for various corporations. Many approaches exist to measure
the influence of an individual account on a social platform, such as the widely used Klout
Score [134]. For a single company, the focus is more often on the effectiveness of its mes-
sages propagation. Thus, there is now a need to understand how to raise the influence of a
particular post.
This chapter shows that the influence of a particular commercial post can be measured,
predicted, and made more effective. The techniques presented here may be used within a
system to help companies craft commercial messages for social platforms, with the goal of
maximizing the influence of the posts on specific audiences. In order to improve the writing
of a commercial post, predicting the potential influence and effectiveness of a given text is
The primary contribution is to answer the following question: “Can we learn what
makes an effective advertising post on Twitter?” In doing so, we address the following
challenges:
• How can we quantify whether a given commercial post on Twitter has been successful?
• What features best model the composition of a tweet with respect to its influence?
• What is the specific effect of each feature in improving the influence of the tweet?
To measure influence, we use the direct reactions that a tweet gets from its readers – such
as retweeting and marking as favorite. We focus on commercial posts from the official
accounts of various companies (brands) in different fields. Although pictures and/or videos
can be included to enrich a post, we believe that the text content provides the most important
and straightforward information. Note that the product or promotion information is usually
determined before crafting the post, so we focus on the other influencing elements of the
post. First, we label the commercial tweet based on the influence it generated. Then, we
extract a small set of features to capture the structure and comprehensive representation of a
commercial post. More specifically, the feature set is designed to show the construction of a
post and it does not include the core information that is related to the promotion or product.
Next, we address the problem of predicting whether a commercial post would be successful
through a binary classification model given the feature set. In addition, we conduct a feature
analysis to determine which features have the most impact on the influence prediction. We
provide a case study that shows the potential usage of the prediction system.
To the best of our knowledge, this is the first model that seeks to analyze and predict
the performance of commercial social media posts. We believe that this work will serve as a foundation for further research on social media advertising.
3.2 Related Work
Advertising through social media is growing rapidly and is drawing more attention.
Yin et al. [183] used the concept of a propagation tree to reveal patterns of advertisement dissemination and measured the success of an advertisement in terms of the extent of its propagation. In contrast, our focus is on tweet content and measuring
the advertisement by its influence on the readers. Li et al. [85] proposed a diffusion mecha-
nism for advertisement delivery through microblog platforms, based on a set of user-related
features. Our goal is to model the influence of a commercial tweet based on static textual
In the last decade, researchers have examined the influence of specific users and their
posts through social media and attempted to understand how to quantify such influence.
Anger et al. [3] looked into the indicators of influence on Twitter for an individual user.
Bakshy et al. [7] conducted a study quantifying the influence of Twitter users by tracking
the diffusion of their posts through reposts. Cha et al. [19] also focused on user influence
and proposed a link-based model to measure influence. Ye and Wu [181] proposed a model
to measure message propagation and its social influence through Twitter, as well as the
longitudinal influence over time and across users. Unlike these network-based models or
diffusion models that track the spread of tweets, we construct a simple metric that checks
only direct reactions to tweets by their readers (such as favorites and retweets). Moreover,
in order to improve a given post, we focus on the specific tweet, instead of the identity of
the author.
Popularity prediction also has attracted much interest and among many models, pre-
dicting retweets has been the most common one. Most efforts (such as [125] and [112])
utilized simple surface features of tweets to predict retweeting. Peng et al. [122] also
included relationship features, and Yang et al. [179] added trace and temporal features
to build a factor graph model. Zaman et al. [184] and Gao et al. [41] predicted future
popularity by observing the dynamics of retweeting, while Xu et al. [176] and Lee et al.
[79] focused on retweet activities on certain users. In addition to retweet prediction, Artzi
et al. [4] predicted whether a tweet will receive replies from its readers, and Suh et al.
[148] showed the relation between certain features and the retweet rate. In contrast to these
efforts, which focused on a single reaction (such as retweeting) as the measurement, our influence measure combines multiple direct reactions to tweets. Furthermore, our model captures only the structural and style elements of the post, independent of the identity of the author.
Mikolov et al. [107] initiated the work on representing words with lower dimension vectors, which are
trained to predict context words given the current word. Le et al. [77] leveraged [107] to
represent a paragraph using a dense vector, which is trained to predict words in the para-
graph given the paragraph itself. The idea of dense representation also has been brought
to tweet-related tasks. Tang et al. [149] built a word embedding for Twitter sentiment
classification. Given the informal use of words in tweets, two character-based tweet2vec
models have been introduced: Dhingra et al. [29] constructed a tweet vector representation
to predict hashtags, and Vosoughi et al. [156] use a CNN-LSTM encoder-decoder model to learn tweet representations.
3.3 Influence Prediction
This section describes the process of labeling the influence of commercial tweets, extracting the features, and building the classification model.
3.3.1 Data labeling
The first challenge is to quantify the influence of a tweet. The influence of a commercial tweet can be represented by
the level of engagement from the readers. Retweets and marking tweets as favorites are the
most widely used functions that allow a reader to express his or her interest or excitement
about the tweet. Thus, counts of retweets and favorites can be used as direct measurements
of reader engagement, and they have been used in some models as the indicator for tweet popularity.
Influence Score
Our work combines the count of retweets and the count of favorites in order to measure
tweet influence. Both reactions reflect the interest of the user after reading the tweet. We
want the counts of each reaction to have equal impact in determining the influence. How-
ever, users retweet and mark as favorite with different frequencies, which leads to different
scales for the two counts. To balance the scale of the two counts, we compute the ratio of
favorite-to-retweet counts across all tweets in the dataset, as shown in Figure 3.1.
Most tweets have favorite-to-retweet ratios around 2, and the mean of the ratios across
the dataset is 2.5. Thus, we empirically weight retweet count by 2 to ensure that retweets
and favorites have equal influence in the final measurement. In fact, we think this multiplier
may reflect the fact that marking a favorite requires one click, while retweeting requires two
clicks.

Figure 3.1: Favorite-to-retweet ratio
However, retweet and favorite counts are highly influenced by the popularity of the
author account, which can be identified as the number of followers. Note that we are
not interested in the absolute influence created by a post, but the relative influence that
Figure 3.1: Favorite-to-retweet ratio
the author account is able to generate. To create a normalized influence for all general
commercial posts, we eliminate the impact of account popularity by normalizing the score
by the number of followers of the account. Therefore, the influence score in our work is
calculated as:
$$\mathrm{InfluenceScore} = \frac{2 \times \mathrm{RetweetCount} + \mathrm{FavoriteCount}}{\mathrm{FollowerCount}} \qquad (3.1)$$
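As a minimal illustration, the score of Eq. (3.1) can be computed directly from the three counts (the function and argument names below are our own):

    def influence_score(retweet_count: int, favorite_count: int, follower_count: int) -> float:
        """Influence score of Eq. (3.1): retweets are weighted by 2 so that the
        two reaction types contribute comparably, and the result is normalized
        by the author's follower count."""
        if follower_count <= 0:
            raise ValueError("follower count must be positive")
        return (2 * retweet_count + favorite_count) / follower_count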
Because the influence score is normalized by the number of followers, we include only
direct reactions of a tweet from its readers for influence measurement, rather than tracking the indirect diffusion of the tweet through follower networks.
Retweet, favorite, and follower counts are all dynamic attributes in the sense that, for a
given tweet or account, they change with time. Further, tweets are time-sensitive, and the
attention they receive only lasts for a short period of time. Willis et al. [168] have shown
that most retweets happen in the first 20 hours after the original post. Thus, to provide
stable data, we record account information at the time of tweet posting and collect tweet reaction counts only after this initial window of attention has passed.
Separation from Inherent Meanings
A basic property of commercial posts is that they are used to spread certain information.
In general, companies have decided the content of the promotion or products in advance
of constructing the tweet, and such information should be considered as fixed. Therefore,
the inherent meaning of the post should not be included in order to predict the influence of
spreading such information. Thus, we design a process to distinguish the core meaning of
the post from other style features. Although both elements could affect the successfulness
of the commercial post, we want to focus the study on only the style features. This is done in two steps.
First, we conduct a part-of-speech (POS) tagging [43] on the tweet, and extract nouns,
verbs, adverbs, and adjectives. These words or phrases are considered as key words. In
most cases, these key words are the carrier of the inherent meaning of the post. Next,
we group the tweets given these key words, using certain clustering methods. The goal is
to group posts that are writing about similar products or promotions. Given that the core
meaning of the posts is similar in some ways, the model can study the relation between the style features and the influence within each group.
We also note that the overall distribution of the influence scores is biased toward smaller
scores. Thus, if we use the influence score directly and define the task as a regression
problem, the result may also be skewed toward very small scores. To remove this bias,
we treat the labeling process as a binary classification problem, where the top 50% of the
tweets (higher scores) are labeled as positive, and the bottom 50% are labeled as negative.
Because the labeling process is performed independently for each group generated from
the previous step, it is able to reduce the impact of the influence caused by the inherent meaning of the posts.
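A minimal sketch of this group-wise labeling, assuming the tweets sit in a pandas DataFrame with hypothetical columns group and influence_score:

    import pandas as pd

    def label_by_group(df: pd.DataFrame) -> pd.DataFrame:
        """Assign binary labels independently within each meaning group:
        tweets above their group's median influence score are positive (1),
        the rest negative (0), approximating the 50/50 split."""
        medians = df.groupby("group")["influence_score"].transform("median")
        df = df.copy()
        df["label"] = (df["influence_score"] > medians).astype(int)
        return df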
3.3.2 Classification model
As mentioned above, and given the binary labeling of tweets, we use a binary classifier
for the prediction model. To explore differences in impact of the proposed features from
general contextual features, we use the typical n-gram model as the baseline. More specifi-
cally, the baseline comprises n-gram features up to length 5, including all tokens that appear
in the dataset more than once. This approach is the de-facto standard for text classification
tasks such as sentiment classification and topic categorization [2]. Unlike previous work
[117, 2], we find that Maximum Entropy (MaxEnt) [97] works better than Support Vector
Machine (SVM) [62] with the baseline model for classifying the dataset. We also apply the tweet embedding model [29], in which the embedding model is trained to predict the hashtags. Hashtags in commercial tweets generally contain information about products or promotions, which makes the embedding model
a good fit for commercial posts. Because SVM performs better generally, we apply it to the
tweet embedding model as well as to the proposed comprehensive set of features that we describe in the next section.
3.3.3 Features
Although our ultimate goal is to improve a tweet to be more influential, we first focus on
predicting the influence of a commercial tweet. Unlike predicting the cascade of retweeting
[22], we do not include any observation of the diffusion of the post, but extract features that
are available instantly. More specifically, given the broad message that is being captured
by the tweet, we want to model how the high-level structural and meta information impact
its influence.
We propose a set of features that works as a high-level representation of the post. Be-
cause the features do not include the inherent meaning of the post, we name them style
features. The proposed features are built to capture the structural and syntactic character-
istics of a post, and also include certain pertinent information about the posting account.
Nasir et al. [112] have looked into similar features in order to predict retweeting for gen-
eral posts. Building on their work, we construct the feature set shown in Table 3.1 for
Element Features
Usernames, hashtags, and links (URLs) are often used to deliver important information, and their presence is captured by the element features.
Usernames mentioned in the tweet are usually used to refer to specific users, or al-
ternatively, used to send the tweet to the users. It is a common way to attract readers by mentioning popular accounts.
Hashtags serve to identify a certain topic and are often treated as symbols across tweets
that share the same idea. For commercial tweets, it is also common to use hashtags as
representations of certain products or events. Thus, the information carried by the hashtag often signals the topic of the post.
URLs work as an extension to the tweet in order to include detailed and richer informa-
tion. For commercial posts, they play a critical role in pointing readers to additional infor-
mation and details. Therefore, they can potentially increase the chance of the post being retweeted or marked as a favorite.
Note that the intention of the system is not to alter the inherent meaning of a post, but to look for general features that affect the influence of commercial tweets, so we do not
look into the actual content or the semantics of these elements. Instead, these features are
represented as binary indicators, and these elements are tokenized for other processes.
Punctuation Features
Rhetorical questions are popular hooks for commercial posts, and question marks serve
to demarcate such hooks. Exclamation marks are often used to express strong emotions.
Commercial tweets are written more formally than general tweets; thus, the use of such punctuation marks is deliberate. We use a binary feature for each punctuation mark to indicate its presence in the tweet.
Complexity Features
The complexity of a tweet indicates the ease (or difficulty) of reading, understanding,
and interpreting the content. We measure such complexity using four features.
Tweet length is the first feature. A platform-imposed character limit is applied to a single tweet, but the tense of a verb, a proper name, and even a URL can skew the count of characters. Therefore, instead of counting
characters, we use the number of tokens to represent the length of a tweet. It is used both
as a feature and the normalization factor of other features. The analysis of our commercial
tweet dataset showed that the average number of tokens is 15.2, with a standard deviation of
5.1. This shows a significant level of variation in tweet length, and thus it has the potential
to become an indicator.
Readability reflects the ease with which a reader can process the text, and it has been used as an indicator for the quality of social media content [2]. In this work, we use the Coleman-Liau Index [24] as the readability feature. The score is calculated as:

$$\mathrm{CLI} = 0.0588L - 0.296S - 15.8$$
where L is the average number of letters per 100 words, and S is the average number of
sentences per 100 words. The resulting score is an approximation of the U.S. grade level
needed to understand the text. Similar to tweet length, the readability score captures the surface complexity of the tweet.
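A small sketch of the readability computation; the regex-based word and sentence counting here is our simplification, not necessarily the preprocessing used in the experiments:

    import re

    def coleman_liau(text: str) -> float:
        """Coleman-Liau Index: 0.0588 * L - 0.296 * S - 15.8, where L and S are
        the average numbers of letters and sentences per 100 words."""
        words = re.findall(r"[A-Za-z]+", text)
        n_words = max(1, len(words))
        n_letters = sum(len(w) for w in words)
        # Rough sentence count; a real pipeline would use a proper tokenizer.
        n_sentences = max(1, len(re.findall(r"[.!?]+", text)))
        L = n_letters / n_words * 100
        S = n_sentences / n_words * 100
        return 0.0588 * L - 0.296 * S - 15.8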
The dependency parse tree of a tweet shows the structure of the text, with both word-
and phrase-level relations. We build the dependency parse tree for each tweet using the
Twitter-specific model proposed by Kong et al. [72]. The parse tree is able to capture the
intrinsic and structural property of the tweet. Given such a parse tree, the depth of a tweet
is the number of levels starting from the root node to the bottom of the tweet parse tree.
Thus, parse tree depth can be used as a feature to measure the dependency complexity of
the tweet. Parse tree head count is the number of syntactic roots contained in the tweet
parse tree. Each root leads to an individual fragment of the tweet, which is considered to be
a complete and meaningful portion. It is not necessarily equal to the number of sentences in the tweet. In most cases, a tweet contains a single topic. Therefore, a tweet with more heads in the parse tree tends to
have a higher density of information about the topic. We use the head count to serve as a
feature to measure the density complexity of the tweet. Note that both parse tree depth and head count are normalized by the tweet length.
Mentions Features
As described above, the usernames mentioned in commercial tweets could help attract
readers. In most cases, influence is driven by the popularity of the usernames and their
linked accounts. Therefore, in addition to the existence of usernames, we also use the pop-
ularity of the usernames’ accounts as a feature. We use two attributes of the mentioned
username to measure its popularity: whether it is a verified account, and its follower count.
Verified usernames belong to persons whose accounts have been certified as genuine, which
often are associated with “famous people” [98]. The verification of an account indicates
its popularity, and the username verification feature is set to have a binary value. However,
only a small portion of Twitter accounts are verified and thus the applicability of this indi-
cator is limited. Username follower count, on the other hand, is a quantifiable estimator of
the popularity available for all accounts. The username follower count feature is calculated
as the average number of followers across all usernames mentioned in the post.
Meta Features of the Post
Previous work has shown that the posting time of a tweet influences the retweet or
response potential [4]. Therefore, we include both the day of week and the time of day as
meta features for the tweet. For both features, we use the local time, and further map the time of day into discrete periods.
The author of a tweet has been shown to have a significant impact on the influence
of general tweets [7, 19]. We want to extend such impact to the relation between official
accounts and commercial tweets. To prevent over-fitting and to make the model more
general, we chose attributes that do not reveal the identity of the author account. Post
count is the number of tweets that this official account has posted, while favorite count is
the number of tweets that this account has marked as favorites. Both counts represent the
vitality of the account. Listed count is the number of users that include this account in their
interest lists and it can indicate the popularity of this official account on the social platform.
To eliminate the impact associated with the history of the account on these attributes, we
normalize the post count by the number of days between the registration of the account and
the posting date, normalize favorite count by the post count, and normalize listed count by the number of followers of the account.
Other Features
Sentiment classification has been studied comprehensively [117] and has been used in
previous tweet influence analysis efforts [112]. The sentiment of a tweet is a potential
factor that induces the attention of the readers. Because commercial tweets mostly contain non-negative text, sentiment may provide less differentiation ability, but the numeric value of the
score can still be used as a measurement of the strength of the corresponding sentiment.
We use the Affective Norms for English Words (ANEW), a microblog-based sentiment
word list [113], to generate the sentiment score. Because the output sentiment score is a
summation of all the scores assigned to each word, we normalize the output score by the number of tokens in the tweet.
A POS tagger labels each word with a certain usage type, given the context of the word.
The POS tag feature has been shown to be useful for many types of social text mining tasks
[117, 73]. To capture the critical information of a commercial post, we use 5 from the
list of 25 Twitter-specific POS tags [43]: common noun, proper noun, verb, adjective, and
adverb. These five tags are then clustered into three POS categories: 1.) common noun and
proper noun as the noun category; 2.) adjective and adverb as the descriptor category; and
3.) verb as the verb category. Similar to extracting meaningful content for labeling, we use
the Gimpel & Owoputi Twitter POS tagger to generate the sequence of POS tags for each
tweet [43]. To represent the writing style of the post, POS distribution features are then
calculated as the normalized POS category counts across all three categories.
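A sketch of the POS distribution features, assuming tag sequences in the ARK Twitter tagset (where, by convention, N and ^ mark common and proper nouns, A and R mark adjectives and adverbs, and V marks verbs):

    from collections import Counter

    # Mapping of (assumed) ARK Twitter POS tags to the three categories.
    CATEGORY = {"N": "noun", "^": "noun", "A": "descriptor", "R": "descriptor", "V": "verb"}

    def pos_distribution(tags):
        """Normalized counts over the noun, descriptor, and verb categories."""
        counts = Counter(CATEGORY[t] for t in tags if t in CATEGORY)
        total = sum(counts.values()) or 1
        return {c: counts.get(c, 0) / total for c in ("noun", "descriptor", "verb")}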
Digits in the commercial post often carry meaningful information – such as useful
statistics or emphasis on key ideas. The binary feature of containing digits captures this information.
3.4 Experiments
In this section, we describe our experiments that show the performance of different
models. All evaluations are performed using five-fold cross validation tests.
Gap Amazon Gilt BlackBerry Google
Nordstrom Best Buy Jeep KraftFoods Disney
AT&T Applebee’s Dell Comcast LEVIS
Macy’s AppStore (Apple) JC Penney Delta H&M
Starbucks Travel Channel FedEx Yahoo Motorola
SamsungMobile Microsoft Target Sears AmericanExpress
Netflix GEICO WholeFoods

Table 3.2: The official accounts of the 33 companies in the commercial tweet dataset
3.4.1 Dataset

We build a commercial tweet dataset that contains originating tweets (i.e., no replies
or retweets) posted by the official accounts of 33 companies (Table 3.2). During a 12-
month period, 63,421 tweets were collected using the public Twitter API. We found that
most official accounts are very active in communicating with customers through retweeting
and replies, but they are cautious in posting original commercial tweets, which generates
a limited amount of useful data for the experiment. The source code and dataset for the experiments are available online.9
Outliers are removed in two steps. Certain announcements, for example, the release
of a new iPhone, have an outsized influence simply because of the information, and thus
the representation may not be the reason for their success. Tweets that are related to such
major announcements or events are found and excluded by keywords. Other attributes may
also cause an unpredictable influence, such as a reference to a song that is currently very
popular. To remove such outliers, for each label group, we compute the z-score of each
post based on the influence score and remove those whose z-scores are larger than 2.
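A sketch of this outlier filter, under the same hypothetical DataFrame layout as before (columns group and influence_score):

    import pandas as pd

    def drop_outliers(df: pd.DataFrame, z_max: float = 2.0) -> pd.DataFrame:
        """Drop posts whose influence score lies more than z_max standard
        deviations above their label group's mean."""
        grp = df.groupby("group")["influence_score"]
        z = (df["influence_score"] - grp.transform("mean")) / grp.transform("std")
        return df[z <= z_max]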
9 https://goo.gl/Y1LFLA
3.4.2 Experiment design
We first show the difference caused by using different grouping methods to separate
the inherent meaning from the decorative elements of the post and choose a specific group-
labeling method for model performance analysis. Then we list the performance of the
proposed model, the n-gram baseline model, and the tweet embedding model given the
commercial posts. In order to show more details on the attributes that affect the influence, we conduct an ablation analysis on the proposed style-feature model. Finally, we set up a case study with a set of real commercial posts and apply the prediction model to them.
Because the model is built to predict the performance of a commercial post, the core
meaning of the post is considered as fixed. Therefore we group the posts based on their
core parts as described before, so that the prediction model can focus on the style parts of
the tweet. In the experiment, we first extract the key words for each tweet, and then apply one of the following three grouping methods:

• simGroup binary featurizes the key words using a binary representation and clusters the tweets based on these binary vectors.

• simGroup emb featurizes the key words using Word2Vec provided by Gensim with pretrained word vectors [136], then averages the word vectors to generate the vector representation of each tweet for clustering (a sketch is given after this list).

• topicGroup applies the Latent Dirichlet Allocation model [13] to get the topic distribution for each tweet, and groups the tweets based on the topic with the highest probability.
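A rough illustration of simGroup emb follows. The specific pretrained vectors (glove-twitter-200) and the use of K-means are our assumptions; the text only states that Gensim-provided pretrained vectors and "certain clustering methods" are used:

    import numpy as np
    import gensim.downloader
    from sklearn.cluster import KMeans

    # Stand-in choice of pretrained vectors; the exact model is not specified.
    wv = gensim.downloader.load("glove-twitter-200")

    def tweet_vector(keywords):
        """Average the word vectors of a tweet's extracted key words."""
        vecs = [wv[w] for w in keywords if w in wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

    def sim_group_emb(keyword_lists, n_groups=5):
        """Cluster tweets by their averaged key-word embeddings (K-means is
        an assumption)."""
        X = np.stack([tweet_vector(kws) for kws in keyword_lists])
        return KMeans(n_clusters=n_groups, n_init=10).fit_predict(X)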
After generating the groups of tweets, labels are assigned to each group individually and
mixed together for the following prediction task. Because the data size is limited, we
test these different group-labeling methods with three, five, and seven groups. In order to
explore the difference caused by various labeling groups, we apply each labeling result to
the n-gram model, proposed style-feature model, and tweet embedding model as mentioned above.
Figure 3.2 shows the F1 score of three grouping methods with different numbers of
groups given each prediction method. Overall, simGroup binary and simGroup emb generate comparable performance, while topicGroup does not fit well for this task. Furthermore, simGroup emb is more stable than the other two grouping methods with different numbers of groups. This suggests that it is more suitable to use the pretrained word embeddings than the simple binary represen-
tation to group tweets. The small portion of isolated key words included in the grouping
process and the limited data size also justify the use of word embedding for tweet clus-
tering. Therefore, to get the best performance, we choose simGroup emb with five groups
as the meaning separation method, and use the labels generated from it for the following
experiment. The performance of different prediction models will be further studied in the
following section.
Table 3.3 shows samples from the group assignment generated by simGroup emb with
5 groups. Group 5 is a particular case where the posts contain only special elements such as hashtags and URLs. Unlike the other groups, the actual meanings of the hashtags may not fall into the same category. However, the limited size of this group keeps the labeling process credible. On the other hand, the other four groups work as expected, so
that the labeling process focuses on the style parts of the posts. For example, the labeling
within Group 1 represents how the construction and style of the posts are related to the influence they generate.

We have also tried to group posts by other attributes, such as their author accounts.
Accounts that belong to the same category are grouped together. For example, Ama-
zon, Google, and Yahoo are grouped together as technology companies. However, results
show that this intuitive grouping method performs much worse than learning the clustering
directly from the data. In most cases, a single official account does not post commercials about only one type of product or promotion. In fact, posting commercials through social
media is far more flexible and easier than other approaches. Therefore, companies tend
to post a more comprehensive set of commercials through social platforms than traditional
ones.
After generating the labels using simGroup emb with five groups as described in the
previous section, we use the full dataset for the performance test. As described in Sec-
tion 3.3.2, we apply our proposed feature model to an SVM classifier with a Radial Basis
Function kernel. We also apply the baseline approach with a MaxEnt Classifier, and the
state-of-the-art tweet embedding model with the same SVM classifier for comparison. To
analyze the importance of the features, we conduct an ablation analysis on the proposed
style features. We are designing the system to help companies identify (or even craft)
commercial tweets that are likely to have a large influence. For this reason, we report the precision, recall, and F1 score for the positive label.
Table 3.4 shows the performance of the baseline method, embedding model, and our
proposed style-feature model, as well as the variation in contribution of the proposed fea-
tures. In general, the style-feature model outperforms both n-gram baseline and the em-
bedding model in terms of F1 score. More specifically, the proposed model tends to have
a much higher recall than the other two models, while a lower precision than the others.
Feature Precision Recall F1
Baseline (n-gram) 0.7597 0.7733 0.7664
Embedding 0.7616 0.8158 0.7878
Style (full) 0.7268 0.8708 0.7923
- Author meta -0.0839 -0.1062 -0.0938
- Elements -0.0097 -0.0032 -0.0071
- Punctuation -0.0010 -0.0062 -0.0032
- Mentions +0.0137 -0.0244 -0.0024
- Contain digit -0.0013 -0.0027 -0.0019
- POS dist -0.0005 -0.0006 -0.0006
- Sentiment -0.0002 -0.0005 -0.0003
- Post meta +0.0010 +0.0014 +0.0012
- Complexity +0.0111 -0.0113 +0.0017

Table 3.4: Performance of the models and ablation analysis of the proposed style features
The proposed model does not include any particular meaning of the commercial post or
the identity of the mentioned usernames and author account. But the result shows that it
has more capability to predict the potential influence of a commercial post than traditional
content models such as n-gram and embedding models. Moreover, without looking into
the actual core content and the identities, it also reduces the risk of over-fitting the model
to a specific dataset. In this case, a more general model would work better on an unknown
commercial post.
Further, our proposed feature set is much more compact than the n-gram features, and it
is also more compact than the embedding model. The style-feature model is not only more
general and adaptive, but also more efficient and effective than the content-based n-gram and embedding models.
Table 3.5 lists the predicted labels from three models and the text of several sample
tweets where the true label is positive (label 1). More specifically, it shows the sample
#  Ngram  Emb  Dec  Tweet
1  1      1    0    Community. Connection. Celebration. Today, and every day. #LGBTHistoryMonth http://soc.att.com/2dAG6sI
2  1      1    0    Any terrain. Any season. Anytime. pic.twitter.com/RnhHJWlvgF
3  1      1    0    Pro tip: Sweet bedding = sweet dreams. http://mcys.co/2cx8pGf
4  1      0    0    Avocados + salt + lime + . What goes in your guacamole? Super Fast Guac: http://bit.ly/1Y9oJOX #CincodeMayo
5  0      1    1    Due to forecasted winter weather in the Pacific Northwest, we've issued a travel waiver for February 3rd. More info: http://bit.ly/2iPzTuS
6  1      0    1    OBAP's dedication to aspiring pilots inspires us, which is why we're proud to support their programs that mold the future of aviation.
7  0      1    1    Oh hey @trollhunters @Stranger Things
8  1      0    1    #18thcenturyproblems #PrideAndPrejudice #NowOnNetflix
Table 3.5: Prediction samples from different models where the true labels are positive
tweets where the style-feature model makes different predictions from the content models.
Most of the posts that the n-gram and embedding models correctly predict as positive, but the proposed model does not, are constructed in an informal way. Many of them are
not constructed as a complete sentence. They are either the combination of several isolated
words and phrases, such as Tweets 1 and 2, or written in a special form, such as Tweets 3 and 4. Although they have been adapted to tweets, both the POS tagger and
dependency parser are not able to work well on such incomplete sentences, which further
affects the performance of the style-feature model. A bag-of-words assumption does not
carry any order or dependency information; therefore it is less sensitive to these special
cases. Thus, the proposed style-feature model generates a lower precision than the n-gram and embedding models.
On the other hand, the proposed model is able to successfully predict more positive
cases than the other models. We note that most tweets the proposed model predicts as
positive while the n-gram or embedding models fail to predict as positive occur in two
situations:
• The construction of the post is complicated, which usually means a complex sentence structure.

• The major body of the post is built of special elements such as hashtags, URLs, or mentioned usernames.
The complexity features and the structure analysis, such as tweet parsing, in the proposed
model help locate and extract the posts that have positive influence. Content-based n-gram
and embedding models do not work well on longer and more complex sentences.
As expected, the ablation analysis shows that the author meta feature has the biggest
impact on the final prediction in terms of F1 score. The special elements used in the post
are other attributes that contribute meaningfully to the final prediction. They are very com-
mon and useful in commercial tweets. Moreover, the mentioned usernames and types of
punctuation have considerable impact as well. The use of these two attributes is also more
popular and effective in commercial posts than regular tweets. Complexity is shown to
have an impact in generating a higher recall; we have the same result from the previous analysis. Sentiment contributes little because, as noted earlier, commercial tweets are written to be non-negative, and most commercial tweets have very limited sentiment difference. Finally, we found the post meta feature working in an
#  Label  Tweet
1  1      exclusive swag! starting tomorrow, you're entered to win a custom gecko-themed console controller every time you post using #GEICOGaming.
2  1      love the #GEICOGaming turnout. remember, every single post this weekend enters you to win an exclusive gecko-themed console controller!
3  0      every post (!) using #GEICOGaming this weekend makes you eligible for a custom console controller. bring it!
4  0      starting tonight at midnight, every social post containing #GEICOGaming enters you to win a custom console controller! get in while you can.
5  1      exclusive swag, limited opportunity! every post (!) using #GEICOGaming this weekend makes you eligible for a custom console controller. bring it!
6  1      check this great opportunity! starting this midnight, every social post containing #GEICOGaming enters you to win a custom console controller!
Table 3.6: Commercial tweets about a raffle event for winning console controllers
unexpected way, such that removing this feature improves the model. This shows that the
posting time of commercial tweets is not as useful as the posting time of regular tweets [4].
To demonstrate a real use of our prediction model, we pick four commercial tweets
posted by GEICO (1 through 4), and two modified tweets (5 and 6). These tweets, about
a raffle event in which one can win a console controller, are shown in Table 3.6. The label
column lists the prediction from the proposed model, and they all agree with the true labels
for the real tweets (1 as positive and 0 as negative). We exclude the post meta feature to obtain the best model, as suggested by the ablation analysis.
The four real tweets deliver the same core information about the raffle. However, they
differ in their success in generating influence. The positive cases include additional phrases
before the core information that serve as hooks to raise readers' interest. Our model is able to capture such differences and predict the correct labels.
The positive real tweets are found to have higher readability scores than the negative
ones, mainly owing to the use of additional phrases and subtler use of words. Although a
higher readability score generally implies that the tweet is more difficult to follow, it can
also mean a more precise and attractive expression of the message. The sample tweets
show a positive impact of such an expression on the influence of the tweets. In addition, we
note that the positive cases contain a greater number of nouns than verbs. Although nouns
such as “swag” and “turnout” do not contain core information, they are useful in drawing
more attention.
Samples Tweets 5 and 6 are created from samples Tweets 3 and 4, with the addition
of certain hook phrases to the beginning of the posts. Minor changes are also made to the
main body to meet the length limitation. These modifications lead to an increase in the
number of parse tree heads, and an increase in readability and sentiment scores as well.
With these modifications, the proposed system predicts that the modified tweets will have a
positive influence. In other words, these changes help the tweets have more influence while preserving the core information.
The above case study shows a successful use of the system to predict the influence of
real commercial posts. Most pertinently, it shows that one can use the system to craft a more influential commercial post.
3.6 Summary
This research describes a comprehensive feature model to predict the potential influence of a commercial tweet on its audience. The proposed model does not include the inherent meaning of the post; it relies only on the construction, style, and meta features of the post. This ensures the generality of the model such that it can be adapted to most com-
mercial posts. Unlike some previous work, the model does not need any observation of the
diffusion of the post, and therefore the prediction can be made instantly before posting a
commercial tweet. The experiments show that our techniques can provide a useful and sta-
ble performance in predicting the tweets with successful influence while using only a small
set of features. The proposed style-feature model outperforms the content-based n-gram
and embedding models in terms of F1 score. We also show that among all the features, au-
thor meta data has the largest contribution, while the special elements, punctuation marks,
and username mentions contained in the post have comparable contribution as well.
Chapter 4: Offline Activity Recognition
4.1 Introduction
Precise real-time user targeting is another critical step to the success of social media
advertising. Social media platforms are able to build rich profiles from the online presence
of users by tracking activities such as participation, messaging and website visits. The
important question we seek to address in this work is, “Can we tell what the user is actually
doing when he/she tweets?” For example, is he/she dining, watching a movie, or studying
in a library? By knowing the activities of a user, such as whether he/she visits restaurants or travels frequently, more precisely targeted advertisements and marketing strategies can be directed to them.
Social media users are primarily driven by their interests to write posts. Extracting
these interests from posts has been quite successful [106, 64]. We now seek to unearth the
offline activities that the user is engaged in when he/she posts. Unlike interests, the offline
activities can provide a close to real-time view into the user. As an example, building
interest profiles may tell us that a user likes watching movies, thus ads related to certain
types of movies may evoke his/her attention. However, being able to detect offline activities
can tell us that a user is watching a movie right now, thus ads related to popcorn and beer
may have immediate appeal. In other words, knowing the activity a user is engaged in can enable more timely and precise targeting.
Detecting a user’s activity from a tweet could be difficult. To illustrate this, Table 4.1
shows a set of sample tweets along with their reported locations and their assigned activity
labels. The keyword “landed” in Tweet 1 is sufficient to identify the correct location of the
user (airport) and his/her activity (traveling). Tweet 2 needs some inference to understand
the situation of its author – being stuck in a major transportation center. This situation can
still be extracted from the content of the tweet. Tweet 3 contains no information at all about its activity – traveling. Further, a naive model may identify the activity of Tweet 4 as dining,
because the tweet talks about a drink. However, the author is actually entertaining at a
nightclub. We have observed that it is quite common to post tweets with content that may
clearly indicate one type of activity, while the author is actually engaged in a different type
of activity.
These examples show that the semantic content of a social media post does not, by
itself, always provide meaningful information related to the activity that the author is en-
gaged in while posting. Additionally, user-reported locations are very useful in determining
such activities. For example, [177, 88, 87] have shown correlation between activities and
the check-in locations of the posts. However, very few tweets contain such location infor-
mation.
Our goal is, therefore, to build a model that is able to recognize user activities not only
for cases where a clear indicator exists in the content, but also for cases where activity
information is latent and not directly usable. Therefore, the model should work without the reported location of the tweet.
Returning to Table 4.1, it is clear that content alone is not sufficient to extract the correct
offline activity for Tweets 3 and 4, and additional context knowledge is needed. For exam-
ple, the additional knowledge of post time of Tweet 4 (midnight) dramatically increases
the possibility that the author is being entertained at a night club rather than eating at a
restaurant. Historical information is also contextual. Knowing that a post prior to Tweet 3
is about heading home allows us to infer that the author sent this post while traveling. Thus,
we posit that, in order to recognize offline activity, a richer contextual model is required, one that goes beyond the content alone.
To show that such inference can be handled effectively, this work focuses on the following questions:
• How can we identify and appropriately label the offline activities of tweets?
• What contextual information (i.e. other than the content) assists in recognizing ac-
tivities?
• How can we effectively recognize user activities using the contextual features?
We address these questions by building on and extending existing techniques. We start by using a Long Short-Term Memory (LSTM) network [53] to
model only the content of tweets. LSTM is designed to handle sequential data, and it has
been shown to provide a reasonable performance on tweet classifications [59, 83, 161]. To
further improve the model, we explore and analyze the inclusion of other contextual fea-
tures with different variations of LSTM model. Based on the analysis and comparison, we
propose a hybrid LSTM model that properly handles the contextual features to improve the
outcome. For evaluation, we create a labeled dataset by collecting tweets where users have
reported their location. For the activity classification task, our proposed model is able to
reduce the error by 12% over the content-only models and 8% over the existing contextual
models.
Finally, we present an orthogonal validation of the proposed hybrid model with a real-
case application. Our model produces an analysis of the activities of the followers of several
well-known Twitter accounts, and the analysis demonstrates strong relationships to the
expected characteristics of these accounts. To the best of our knowledge, this is the first
work that seeks to recognize offline activities using an author-independent model. It is also
the first work that looks into and compares different LSTM-based models with respect to their ability to incorporate contextual features.

4.2 Related Work

User profiling on social media has been a popular area, and it is useful for personal-
ization, recommendation, and advertising. Research has been conducted on user profiling
based on the posts and interactions between the users. Rao et al. [135] used linguistic
features to profile users to extract gender, age, regional origin, and political orientation.
Lee et al. [81] built a user profile model based on certain types of words to improve news recommendations. Certain efforts [8, 5, 96] characterize users based on their online communication and webpage-visiting activities. Detecting life events [182, 30] from tweets has also attracted attention.
The problem of inference and prediction of real-life activities of users has not received
much attention. To date, mainly two types of work have been conducted on this topic: prediction of future activities (activity prediction) and recognition of the current activity (activity recognition). Activity prediction considers
all features as historical data, whereas activity recognition focuses on current activities.
Early works on activity prediction [180, 115, 87] relied on the history of check-in locations
provided by the user. Later work [177, 88] added temporal information to the analysis
of activities given location data. None of the work utilized the post content of the users,
which is the major focus of our models. Weerkamp et al. [165] predicted future activities by
summarizing tweet topics where a future time frame is mentioned. To recognize the current
activities, Song et al. [146] built a framework that incorporates a similarity measurement between friends into the classifiers. It assumes that friends on social platforms are connected through their activities. Shared interests are quite common among friends; however, we think offline activities do not necessarily follow the same assumption. In contrast, our belief is that
contextual information provided by the same author is more relevant in recognizing offline
activities.
For the task of text mining, LSTM [53] has been used widely for modeling sequential
data. Greff et al. [47] performed a comparison across eight content-based LSTM variants,
and demonstrated that these variants have only limited improvements. To improve performance, variants such as the Bidirectional LSTM (BiLSTM) [142] and the Convolutional Neural Network LSTM (CNNLSTM) [189] have been introduced to capture more appropriate information. Re-
cently, attention mechanisms have been added to LSTM [162, 91] to strengthen the ability to focus on the important parts of the input. Ghosh et al. [42] built a contextual LSTM model that adds the contextual feature into the calculation of
each gate function. Yen et al. [182] utilized a multi-task LSTM and included contextual
information by simply concatenating the features. Finally, hierarchical LSTM models have
been built [190, 59] that stack LSTM models with different levels of sequential data. In
general, the effectiveness of each model is highly reliant on the input data and features;
thus, none of the models appear good enough to work with all types of contextual data.
We look into the capabilities of several contextual models with respect to different contex-
tual features and come up with a hybrid model that takes advantage of the strengths of these
models.
In this section, we first describe the process of creating and assigning activity labels to
tweets. Then we show the work exploring several models that are built based on LSTM to incorporate contextual features.

4.3.1 Data labeling

Similar to the labeling approaches of [87] and [146], we design an automatic labeling
process that uses the reported location of the tweets to assign labels. The reported location
is highly predictive in relation to the activities of the tweet. Essentially, we categorize lo-
cations and use predetermined rules to map locations to activities. Note that we also create
additional mapping rules to overcome errors brought by locations that could be involved in
multiple activities.
4.3.2 Contextual learning with LSTM
The content of a tweet alone may not carry sufficient information; contextual data can provide additional and, hopefully, more useful features. We therefore examine several popular LSTM-based models that
used contextual features including static features such as time of post, sequential features
such as POS tags, and historical features such as the most recent tweets from the same
author. The sequence of POS tags allows better understanding of the content, beginning
with the positioning of words. The timing of the post and historical tweets may provide
useful background knowledge of the target tweet. Because the goal of the system is to pro-
vide real-time recognition of activities associated with a given target tweet, we utilize only
tweets posted prior to the target tweet. We do not include the topics of the tweet because such information is largely captured by the content itself.
Original LSTM
Sequential models such as LSTM and Gated Recurrent Unit (GRU) [23] are ideal for
text processing because they consider the order and dependencies of tokens. Given that
LSTM and GRU have comparable performance, we use LSTM as the baseline to improve upon. The LSTM is defined by the following equations:

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i) \qquad (4.1)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f) \qquad (4.2)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o) \qquad (4.3)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \qquad (4.4)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (4.5)$$

where i, f, and o are the input gate, forget gate, and output gate, respectively, x is the input, c is the cell memory, b is the bias, and h is the output.
A simplified architecture of the LSTM model used for a text classification problem is
shown in Figure 4.1. The output of the embedding layers is a sequence of vectors that
represent the input sequence. LSTM outputs a flat vector representation for the entire input
sequence, and it is fed into another layer to generate the classification output. For our
activity recognition task, the tweet content is the input and the activity label is the output.
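A minimal Keras sketch of this content-only baseline, using the hyperparameters reported later in the experiment section (200-dimensional embeddings, 200 LSTM units, dropout 0.2); the function name and the fixed sequence length are our own:

    from tensorflow.keras import layers, models

    def build_lstm(vocab_size, num_classes, seq_len=50):
        """Content-only baseline of Figure 4.1: embedding -> LSTM -> softmax."""
        inp = layers.Input(shape=(seq_len,))
        x = layers.Embedding(vocab_size, 200)(inp)
        x = layers.LSTM(200, dropout=0.2)(x)
        out = layers.Dense(num_classes, activation="softmax")(x)
        model = models.Model(inp, out)
        model.compile(optimizer="adam", loss="categorical_crossentropy",
                      metrics=["accuracy"])
        return model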
Joint-LSTM
Similar to the idea of Yen et al. [182], we design a Joint-LSTM (J-LSTM) model
to concatenate the flat representations of the sequential inputs of the content and the contextual features.
Figure 4.2 shows an example design of Joint-LSTM model. The sequence of POS tags
and the post time of the tweet shown in the figure are the direct contextual features that
have direct relation to the target tweet. The POS tag sequence is generated from the word
sequence and is fed into the model using embedding and LSTM layers. Post time is a
feature that is closely related to offline activities. We treat post time as a sequence of size
1 to be able to use it flexibly in multiple models. We find little difference in terms of the
overall performance between this approach and other approaches, such as feeding the time
Figure 4.2: Joint-LSTM for text classification
directly into a dense layer. In addition, the J-LSTM model in Figure 4.2 includes historical
tweets. They are modeled similar to the target tweet, and they share the same embedding
layer with the target tweet. Because the concatenation happens to the flat representation of
the input sequences, J-LSTM suffers from the weakening of sequential information for the combined features.
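A sketch of the J-LSTM wiring for one target tweet with POS and post-time features (historical tweets would add further parallel branches); the embedding sizes for POS tags and post time are assumptions:

    from tensorflow.keras import layers, models

    def build_j_lstm(vocab_size, pos_size, num_classes, seq_len=50):
        """J-LSTM: each input stream is encoded into a flat vector, and the
        flat vectors are concatenated before the output layer."""
        text_in = layers.Input(shape=(seq_len,), name="content")
        pos_in = layers.Input(shape=(seq_len,), name="pos")
        time_in = layers.Input(shape=(1,), name="post_time")  # sequence of size 1

        text_vec = layers.LSTM(200)(layers.Embedding(vocab_size, 200)(text_in))
        pos_vec = layers.LSTM(50)(layers.Embedding(pos_size, 50)(pos_in))
        # 28 slots = 7 days x 4 six-hour periods (our simplified encoding).
        time_vec = layers.Flatten()(layers.Embedding(28, 8)(time_in))

        merged = layers.concatenate([text_vec, pos_vec, time_vec])
        out = layers.Dense(num_classes, activation="softmax")(merged)
        return models.Model([text_in, pos_in, time_in], out)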
Contextual-LSTM
Ghosh et al. [42] propose a Contextual LSTM (C-LSTM) model to handle contextual
information. They add the contextual feature directly to the decision function of each gate,
as shown in the following equations.

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + V_i E + b_i) \qquad (4.6)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + V_f E + b_f) \qquad (4.7)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + V_o E + b_o) \qquad (4.8)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + V_c E + b_c) \qquad (4.9)$$
$$h_t = o_t \odot \tanh(c_t) \qquad (4.10)$$
where i, f and o are the input, forget, and output gates, respectively, x is the input, c is
the cell memory, b is the bias, h is the output, and E represents the contextual features.
In our implementation, we concatenate the embedded sequences of the contextual features with the embedded sequence of the content, and the concatenation is sent to an LSTM layer. Figure 4.3 shows an example of the C-LSTM model
that takes POS sequence, post time sequence, and historical tweets as contextual features.
To form the concatenation properly with all the input embeddings, static features such as
post time are duplicated and transferred into a sequence of the same value. Using the same
input and embedding settings as the J-LSTM model, the embeddings of the target tweet
content and the contextual features are concatenated before sending to the LSTM layer.
Therefore, C-LSTM requires the contextual features to have a certain relationship with the individual tokens of the content sequence.
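A sketch of this stepwise-concatenation variant (following the description above of concatenating embedded feature sequences before a single LSTM, rather than Ghosh et al.'s gate-level formulation); sizes are assumptions as before:

    from tensorflow.keras import layers, models

    def build_c_lstm(vocab_size, pos_size, num_classes, seq_len=50):
        """C-LSTM-style model: per-token concatenation of content and
        contextual embeddings, fed to one LSTM. The static post-time
        embedding is repeated across all time steps."""
        text_in = layers.Input(shape=(seq_len,), name="content")
        pos_in = layers.Input(shape=(seq_len,), name="pos")
        time_in = layers.Input(shape=(1,), name="post_time")

        text_emb = layers.Embedding(vocab_size, 200)(text_in)   # (seq_len, 200)
        pos_emb = layers.Embedding(pos_size, 50)(pos_in)        # (seq_len, 50)
        time_vec = layers.Flatten()(layers.Embedding(28, 8)(time_in))
        time_seq = layers.RepeatVector(seq_len)(time_vec)       # (seq_len, 8)

        merged = layers.concatenate([text_emb, pos_emb, time_seq])
        vec = layers.LSTM(200, dropout=0.2)(merged)
        out = layers.Dense(num_classes, activation="softmax")(vec)
        return models.Model([text_in, pos_in, time_in], out)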
Hierarchical-LSTM
Existing Hierarchical LSTM (H-LSTM) models such as [190] are used mainly to model
contents at different levels of details. In addition, Huang et al. [59] used the structure to
incorporate social context such as retweets and replies. In contrast, we utilize a similar
H-LSTM structure, but include the historical tweets from the same author in chronological
order.
Figure 4.4 shows the structure of the H-LSTM model. Each LSTM segment on the
individual level handles a single tweet sequence. The input to the sequence level LSTM
is a propagation of historical tweet representations where the first one is the oldest tweet
and the last one is the target tweet. Because the tweet representations in the sequence
level are formed in chronological order, the sequence can be modeled to learn the historical
background of the activity label of the target tweet. To further utilize the historical tweets,
we also add a self-attention mechanism [154] to the LSTM on the sequence level. All tweet
contents share the same embeddings across the model. The hierarchical structure strictly
limits the type of features that can be used; therefore, tests on other contextual features such as POS tags and post time are not conducted with H-LSTM.
In this section, we first analyze the three popular models described in the previous sec-
tion with respect to their ability to incorporate contextual features. Based on the analysis, we then propose a hybrid model. We test each LSTM-based model with features of POS tag sequence, post time, and historical tweets. Details on the
construction of the dataset will be covered in the experiment section. These features are
used to explore a more general conclusion for the capability of the contextual models. The
accuracies shown in Figure 4.5 are weighted-average scores across all labels to handle the
imbalanced dataset. In addition, Table 4.2 lists several sample tweets that will be used in
Figure 4.5: Comparison of the ability of different models to incorporate contextual features
The bottom right chart in Figure 4.5 shows the use of three models in handling the most
recent five historical tweets. We test with different numbers of historical tweets and find
that the relative performances of different models are similar. Tweet 1 in Table 4.2 was
posted while watching a baseball game and the author posts only baseball-related tweets.
It is surprising that H-LSTM has the worst performance as the structure is designed specif-
ically for historical data. H-LSTM also cannot recognize the correct activity for Tweet 1.
The attention mechanism aims to handle historical information more appropriately, but it does
not help generate any improvement. The utilization of chronological order in including
historical tweets may not be applicable to activity recognition on the target tweet. In other
1 Nice day for a game. Less nice was Warren’s first inning.
2 Biggest flag I’ve seen in person. Very cool. #NeverForget #911
3 We made it. #BEmediaday
4 The wait is over! #GreatBarrierReef #Ashes #GoldCoast

Table 4.2: Sample tweets used in the model analysis
words, the habit of posting tweets may not form a chronological dependency chain across
historical tweets.
C-LSTM merges historical tweets with the target tweet at the level of token sequences. We believe that historical tweets have hidden information related to the target tweet that can be captured without modeling their order.
Similar to C-LSTM, J-LSTM does not carry any order information. The merging of the
information for J-LSTM happens at the level of entire tweets, thus it relies on sharing of
the complete information among historical tweets. Because the historical tweets of Tweet
1 also contain a lot of baseball-related words, J-LSTM and C-LSTM are able to recognize
the correct activity of Tweet 1. In addition, the historical tweets of Tweet 2 are very diverse
in terms of the length, topic, and writing style. Therefore, C-LSTM is not able to filter
the noise while J-LSTM still works by combining the complete information. Based on this
analysis, we think that a simple combination of complete recent tweets could better support the recognition of activities for the target tweet.
Because H-LSTM is introduced to include historical tweets, we apply only J-LSTM and
C-LSTM to the contextual features of POS tags and post time (see remaining three charts
in Figure 4.5). In general, C-LSTM performs better in handling both features. Because
C-LSTM is designed to incorporate features at each step of the input sequence, it generates
a larger improvement with stepwise features such as POS tags. When dealing with static
features like post time, C-LSTM adds the same information to the gate decision for each
input step of the content sequence. On the other hand, J-LSTM incorporates this contextual information only once, at the level of the flat representation.

Tweet 3 is relatively short, but the post time of 6:19 a.m. would help to recognize the
activity of traveling. After segmenting the hashtags in Tweet 4, and knowing the tokens
are proper nouns, we understand that the author is traveling to Australia. For both tweets,
C-LSTM performs better by including the contextual information more accurately with
the corresponding words. Therefore, with deeper and more precise incorporation at each input step, C-LSTM handles direct contextual features better.
4.4.3 Hybrid-LSTM
The analyses above show that historical features are better handled by concatenation
at the flat representation level and direct contextual features work better with stepwise
concatenations. In order to handle rich contextual learning that includes different types of
contextual features, we propose a hybrid LSTM model (HD-LSTM) based on the analysis
above. HD-LSTM aims to cover a wide range of contextual features and utilizes different
modeling layers for different contextual features. With the capability of various layers in handling different types of features, HD-LSTM combines the advantages of the models analyzed above.

Figure 4.6 shows a sample design of HD-LSTM that utilizes text input, along with
contextual features of historical information, POS tag sequence, and post time. In particu-
lar, for each tweet component shown in the dashed box, the content sequence and the direct
contextual features are combined with a concatenation of their embeddings. In each dashed
Figure 4.6: Hybrid-LSTM for text classification
box, post time is used to mark the moment when the tweet was written, while POS tag se-
quence helps understand how each word was actually used in the tweet. Then the enriched
sequential representation is fed into an LSTM network that generates a flat vector represen-
tation for the tweet component. At this step, each LSTM module learns the representation
for the semantic, syntactic, and temporal information of the tweet. Next, the enriched flat
representations for all tweets are concatenated to form a larger representation that contains
the information from all inputs. This concatenation further includes the historical informa-
tion of the target tweet to improve the overall understanding of an enriched background.
Finally, the concatenated vector is fed into the output layer and generates the label.
The features that belong to the same category across all tweet components share the
same embedding. In our case, all tweet content sequences, POS tag sequences, and post
times share the same embeddings, respectively. To further boost the proposed hybrid model, we also add a self-attention mechanism, whose effect is reported in the experiments.
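A condensed Keras sketch of HD-LSTM under the assumptions above (assumed embedding sizes; sharing one LSTM encoder across tweet components is our simplification, since the text only states that embeddings are shared):

    from tensorflow.keras import layers, models

    def build_hd_lstm(vocab_size, pos_size, num_classes, seq_len=50, n_hist=5):
        """HD-LSTM: each tweet component is enriched with stepwise direct
        features (as in C-LSTM); the per-tweet flat vectors, including those
        of the historical tweets, are then concatenated (as in J-LSTM)."""
        word_emb = layers.Embedding(vocab_size, 200)   # shared across tweets
        pos_emb = layers.Embedding(pos_size, 50)       # shared across tweets
        time_emb = layers.Embedding(28, 8)             # shared across tweets
        encoder = layers.LSTM(200, dropout=0.2)        # shared (simplification)

        def encode(text_in, pos_in, time_in):
            t = layers.Flatten()(time_emb(time_in))
            t_seq = layers.RepeatVector(seq_len)(t)
            merged = layers.concatenate([word_emb(text_in), pos_emb(pos_in), t_seq])
            return encoder(merged)

        inputs, vecs = [], []
        for i in range(n_hist + 1):  # index 0: target tweet; 1..n_hist: history
            text_in = layers.Input(shape=(seq_len,), name=f"text_{i}")
            pos_in = layers.Input(shape=(seq_len,), name=f"pos_{i}")
            time_in = layers.Input(shape=(1,), name=f"time_{i}")
            inputs += [text_in, pos_in, time_in]
            vecs.append(encode(text_in, pos_in, time_in))

        merged = layers.concatenate(vecs)
        out = layers.Dense(num_classes, activation="softmax")(merged)
        return models.Model(inputs, out)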
Table 4.3 lists several examples from the development set to show the effect of includ-
ing contextual features on recognizing activities, and the success of the proposed hybrid
model. We use LSTM to show the performance when using content only, J-LSTM to ap-
ply historical tweets, C-LSTM to include both POS tags and post time features, and HD-LSTM to combine all of the contextual features.
Tweet 1 shows a strong relation to breakfast, however, the true situation is that the
author took a photo of a sandwich while he was waiting at an airport. It is reasonable that
using the tweet content leads to a decision of “dining” activity, and it holds the same even
if the post time is considered. The most recent two historical tweets from the author talked
about leaving the hotel and arriving at the airport. Thus, including the historical tweets
becomes very useful in recognizing the correct “traveling” activity. Tweet 2 describes a
situation where the author is surrounded by many people. With only this clue, it is possible
that the author was shopping at a mall, having a dinner, or waiting at a train station. The
post time of 12:07 p.m. on a Sunday increases the possibility of having a meal, and the
model makes the correct decision. The true activity of the author is dining in a cafeteria,
and “E” is the name of the place. Because “E” is an unusual name for a cafeteria, it becomes
hard for the content-only model to utilize this information. In addition, recent tweets from
the author discuss having fun with friends, which also helps determine the correct activity.
Tweet 3 has a strong indicator that the author is at a hospital and the content-only model
can generate the correct output. However, including the historical tweets only results in an
incorrect result of “entertaining.” Several historical tweets mention drinking wine, which
could mislead the historical model. Those historical tweets are all posted at night, while
the post time of the target tweet is early in the morning. Considering this, the hybrid model
is able to give the correct decision by distinguishing the different topics between the target tweet and the historical tweets.
4.5 Experiments
In this section, we describe our experiments that explore the performance of different models in incorporating contextual features in a tweet classification task. As we have stated throughout, the contextual features
include the POS tag sequence and post time of a tweet, as well as the most recent historical
tweets from the same author. Although author identity has proven to be helpful in many
tasks [7, 19], we do not include it because it could potentially create a strong bias in the model toward particular authors.
Although normally desirable for supervised learning, manual labeling was problematic
for labeling tweets with activities for the following two reasons. First, humans are good at
recognizing surface meaning, especially when no background and external information are
required. Thus, manual labeling suffers from the same problem shown by the examples
described in the first section. The activities that cannot be inferred from the content itself
are unlikely to be labeled correctly by humans. Second, a labeled dataset of sufficient size
was highly desirable because the size of the training data is related to the quality of the
model. Although there are certain ways to crowdsource the labeling process, obtaining
sufficient labeled tweets with consistent quality seemed infeasible. Therefore, we label the tweets automatically based on their reported locations.
We started the data collection by defining a list of place categories that are strongly
related to certain activities. Then we used Google Maps API to collect specific places for
each category with detailed coordinates. Finally, we used Twitter API to collect tweets
that are posted with a reported location that is also within a range of 10 meters from the
coordinates of a specific place. We removed duplicates and only included the tweets that
have reported location type as Point of Interest (POI). POI indicates that an activity can be
conducted at this location [87]. To further clean the data, we removed tweets that contain fewer than three tokens or tweets where more than 70% of the tokens are mentioned user-
names. Hashtags are useful elements in tweets and sometimes they can be strong indicators
of locations or activities. However, such use of hashtags may also lead to over-fitting the
model, and the unique manner of creating hashtags makes them generalize poorly to unseen ones.
To prevent this problem while preserving the meaning, we removed the hashtag signs and
segmented the hashtag content so that the hashtags are separated into ordinary words.
Table 4.4 shows the relationship between the predetermined place categories and activ-
ities. As mentioned, additional rules are used to improve the labeling quality. For example,
tweets that have the noun keyword “ceremony” at location “stadium” should be labeled as
“enhancement.”
Activity Tweet Count Locations
Enhancement 3848 hospital, library, dentist, doctor, school, university, etc
Traveling 12371 airport, bus station, train station, lodging, etc
Dining 3934 bakery, liquor store, bar, restaurant, meal delivery, cafe
Entertaining 11457 aquarium, movie theater, museum, night club, etc
Shopping 4045 department store, book store, convenience store, etc
Sporting 10028 stadium

Table 4.4: Activity labels, tweet counts, and the corresponding place categories
Although the data collection process is initialized with the same number of requests for
each activity type, the process results in an imbalanced dataset. In our test, down-sampling
or over-sampling the dataset does not reveal any considerable difference in the overall
performance. Therefore, training data are processed with different weights with respect
to different classes, and the metrics are calculated as the weighted average across classes
(one consequence is that the F1-score may not fall between the precision and recall values). The training, development, and test sets are divided randomly with ratios of 0.6, 0.2, and 0.2, respectively.
To show the improvement of using contextual features, we also experiment with other
content-only LSTM-based models, i.e., BiLSTM [142], CNNLSTM [189], and LSTM with
self-attentions (LSTM+Att). Unlike certain previous tasks [156, 29], using a word-level
model results in a better performance than a character-level model in our task. We apply the
idea of transfer learning to initialize tweet content embeddings using GloVe [124] before
training. This creates a more domain-specific word embedding compared with using fixed
pretrained embeddings, and it also generates better performance compared with randomly
initialized embeddings. Additionally, POS embeddings are initialized randomly. Post time
is represented as day of the week and time of the day, and we set four six-hour time periods
per day. Tweet content embeddings have 200 dimensions, while the POS tag, time, and day embeddings are of lower dimension.
When testing with different numbers of historical tweets, we found that including the
five most recent tweets as the contextual feature yields the optimal performance for most
models. We note that H-LSTM is much more sensitive to the number of historical tweets
compared with other models. POS tags are generated using a tweet-specific tagger [116],10 and the models are built mainly using Keras [20]. We use 200 nodes for all the LSTM
networks in the experiment with a dropout rate of 0.2, categorical cross-entropy as the loss
function, apply Adam optimization for training, and set a mini-batch of size 100. Softmax
function is used in all output layers, and all models are tuned with different epochs for
optimal performance.
Table 4.5 lists the performance of different models. For contextual features, “Direct”
refers to the use of POS tag sequence and post time features in addition to the target tweet content, while "All" denotes the use of POS sequence and post time with the content of both the target tweet and the historical tweets.
Models that only use the target tweet content generate results with only limited im-
provement over the original LSTM. In contrast, the use of contextual features boosts the
performance. The post time is more useful than the POS tag sequence, and the benefit of including historical tweets varies across models.

LSTM uses only the content of tweets and reaches a reasonable performance for the
task given that it has six labels.

Table 4.5: Performance (%) of the models with content-only and contextual features

Content-only
           LSTM    BiLSTM  CNNLSTM  LSTM+Att
Recall     65.62   66.62   65.62    66.99
Precision  65.25   66.02   65.01    66.66
F1         64.96   65.71   65.06    66.56

J-LSTM
           Time    POS     Direct   Hist=5   All
Recall     66.65   66.00   66.12    67.30    66.91
Precision  65.76   65.40   65.94    67.03    67.88
F1         65.98   65.54   65.98    67.04    67.19

C-LSTM
           Time    POS     Direct   Hist=5   All
Recall     66.73   66.85   67.01    66.77    66.80
Precision  66.62   66.33   66.53    66.74    67.61
F1         66.29   66.30   66.61    66.33    67.06

H-LSTM
           Hist=5  Hist=5+Att
Recall     65.16   65.56
Precision  66.62   65.53
F1         65.69   65.44

HD-LSTM w/ Hist=5
           Time    POS     Direct   Direct+Att
Recall     67.68   67.70   68.70    69.74
Precision  68.60   67.06   68.13    70.00
F1         68.03   67.22   68.23    69.84

Bi-LSTM adds the ability to understand the content in another order and helps improve the outcome. Adding the convolutional layer does not further improve the model, as the informal use of words in tweets reduces the usefulness of such local pattern information.
J-LSTM works better when including historical tweets and C-LSTM performs better with direct contextual features, while H-LSTM does not handle historical tweets well.
Because C-LSTM incorporates the contextual features into every token of the input se-
quence, C-LSTM shows benefits from adding more direct contextual features. All three
contextual models are able to benefit from including historical tweets. It is surprising that
C-LSTM generates a certain level of improvement with historical tweets. C-LSTM in-
cludes the tokens from historical tweets with the tokens from the target tweet at each time
step, and it is not intuitive that words from different tweets should have direct relationships. We think that some hidden attributes shared across tweets from the same author bring the
improvement, such as the use of certain words while the author is engaged in a particular
activity.
Combining the power of both J-LSTM and C-LSTM, the hybrid model outperforms
both content-only models as well as models that use a fixed method to incorporate con-
textual features. When including all features, the large improvement of HD-LSTM over
J-LSTM and C-LSTM shows the effectiveness of the hybrid model. The reported improve-
ments in performance further strengthen the analysis that was used to build the proposed
model: historical tweets can be handled better by concatenating the complete information
of tweets, and the stepwise concatenation of feature representations works better to include
direct contextual features. It is also obvious that HD-LSTM benefits simply from including
more contextual features. In contrast, using a single method to incorporate more contextual
features does not improve the performance consistently. Finally, HD-LSTM also benefits from the attention mechanism, as shown by the Direct+Att results.

4.6 Case Study

In this section, we exhibit a real case where the activity recognition is utilized on a large
volume of tweets. The results validate the effectiveness of the activity recognition model.
We find seven popular accounts that all have a large number of followers but are distinct
in their fields of focus. For each account, we collect 10,000 followers randomly and, for
each follower, we collect the most recent 200 tweets. For each tweet, we apply the hybrid
model with POS sequence, post time and historical tweet features to generate a probability
distribution over activities. Then we generate a distribution of activities for each follower
by combining the distributions of the tweets posted by that follower. Thus, we are able to compute a distribution of the activity labels over the collection of followers for each popular account. This activity
distribution is used to represent the follower activity profile for this popular account. In
the equation below, p_{f,t,i} is the probability for the i-th activity label given a single tweet t from follower f. The probability P_i for the i-th activity over the collection of followers of a popular account is computed as:

P_i = \frac{1}{Z_0} \sum_{f \in F} \frac{1}{Z_1} \sum_{t \in T} p_{f,t,i} \qquad (4.11)
Because there are duplications and invalid tweets involved in the dataset, the number
of tweets for each follower used for the model may not be the same. Therefore, we have a normalization factor Z_1 to normalize for each follower, and another factor Z_0 to normalize for each popular account. In addition, F is the set of followers for the account, and T is the set of tweets collected from follower f.
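In code, Equation 4.11 amounts to a two-level average; the sketch below assumes profiles maps each follower to the list of per-tweet activity distributions produced by the model (names are illustrative):

import numpy as np

def account_activity_profile(profiles):
    # Inner average (1/Z_1): mean over each follower's tweets
    follower_means = [np.mean(np.stack(dists), axis=0)
                      for dists in profiles.values()]
    # Outer average (1/Z_0): mean over the followers of the account
    return np.mean(np.stack(follower_means), axis=0)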
We train the model using the full dataset from the experiment. Figure 4.7 shows the
results of analysis for these popular accounts. To make the graph more understandable,
we present the probability for each activity label over popular accounts and therefore the
probabilities for each activity label do not sum up to 1. The imbalanced dataset used to train the model creates certain trends in different activity labels, but the comparison within each activity label across accounts remains meaningful.
Figure 4.7: Summary of the activity distributions for followers of popular accounts
It is straightforward to see that espn has a high probability for “Sporting” and Trav-
elEditor holds the peak in “Traveling.” khanacademy and ClevelandClinic represent ed-
ucational and medical needs and lead to an obvious result of the highest probabilities in “Enhancement.” In addition, ClevelandClinic draws increasing attention of its followers for travel. The need of expanding medical services from the team
and the need of heading to medical facilities from the patients could cause such increasing
attention in “Traveling”. WholeFoods and sprinkles, as a food market chain and famous
cupcake bakery, have the highest involvement of both “Dining” and “Shopping” for their followers. It also shows that the followers of WholeFoods care about personal enhancement beyond food. YouTube has a high involvement of “Entertaining” for its followers,
while the peak of sprinkles indicates that the interest in cupcakes could lead to the interest
in entertainment.
These observations and conclusions provide validation of the usefulness and effectiveness of the activity recognition model.

4.7 Summary

In this chapter, we studied the problem of recognizing the offline activities of a user when posting tweets. Our contributions include a location-based method to
label tweets with offline activities, as well as an analysis and exploration of the different
ways of including direct and historical contextual features with LSTM and effectiveness of
each technique. We propose a hybrid LSTM model that combines and takes advantage of
the various methods to include contextual features. Our experiments show that including
contextual information improves performance over the content-only models. Further, the
hybrid model is able to incorporate the contextual features more effectively than existing
methods. The amount of improvement shows the importance of choosing the right method
for including certain types of contextual features. Finally, we validate our activity recog-
nition model by using it to derive an activity analysis of the followers for several popular
Twitter accounts.
Chapter 5: Constrained Paraphrase Generation for Commercial
Tweets
5.1 Introduction
Our last work aims at the core of social media advertising – crafting commercial tweets.
Social media has become an extremely popular platform for corporate marketing and ad-
vertising [153]. Generating attractive yet precise commercial tweets has become a critical
challenge for the companies. In order to maximize their effect, multiple commercial posts
containing the same information are often sent to their target audiences. Figure 5.1 gives
an example of multiple commercial tweets containing the same information about a new
product Spicy Chicken McNuggets, posted on the same day. While capturing the same es-
sential information, these tweets are worded differently, in order not to look repetitive and to remain appealing to the audience.

At present, all the commercial posts are still crafted manually, making social media advertising a labor-intensive process. However, the work can be assisted by systems that help generate new commercial posts with the same meanings yet with different phrasing. Such an approach could assist in automatically producing varied versions of a post at scale.
Our research focuses on paraphrase generation for commercial tweets that preserve
the original meaning while being diverse. Paraphrase generation has been studied widely,
Figure 5.1: Commercial tweets that are posted for the same product
along with other text generation tasks such as Machine Translation [60], Summarization
[99], Text Simplification [143], Question Answering [34], and others. Recently, the use of
Deep Neural Networks (DNNs) has helped models learn and understand more sophisticated
hidden factors in generating text content [130]. It mainly involves the Seq2seq models [6] and, more recently, Transformer-based models [154]. The ability to model the process of text generation, especially the delivery of specific information, makes these models suitable for our task.
Controlled paraphrase generation is similar to the problem of text generation but adds
certain specific requirements. The early work focused on attributes such as sentiment or
writing style with the goal of enriching the generated text [58, 185]. In later efforts, more
specific requirements were added, such as the choice of words [18, 186]. Recent work
in this area has incorporated structural information to Graph-to-text generation tasks [28,
145].
The focus of our work is the paraphrasing of commercial tweets. This problem is
distinct from prior work in paraphrase generation in having hard constraints [55, 127, 186].
Unlike the latent controllable attributes, such as writing style, hard constraints are those that
require certain words or phrases to be kept in the generation. For example, the highlighted
parts in Figure 5.1 are considered hard constraints that must be maintained in the generated
paraphrase.
In order to address the problem described above, this chapter proposes a Constraint-Embedded Language Model (CELM) framework that rephrases commercial tweets in ways that meet hard constraints while encouraging diversity in the generated text. Specific components of our work include utilizing a large paraphrase dataset
and showing its compatibility by applying the learned knowledge to commercial posts on
social media, introducing an automatic process to identify the hard constraints in a text and embed the constraints directly into the text data, and showing that the embedded
constraint information can help learn a causal language model and results in performance
improvements.
To the best of our knowledge, this is the first work that embeds generation constraints in
the learning process of language models. The proposed constrained generation framework
outperforms the existing CopyNet structure [48] across multiple evaluation metrics. At the
same time, we show that the constraint-embedded data can enhance the performance of
CopyNet.
5.2 Related Work

Over the decades, the topic of paraphrase generation has taken a similar research path as
other text generation tasks. Linguistic knowledge was first introduced with hand-crafted
rules to build the system [103]. Statistical models were also used with shallow linguistic features [187], while syntactic and semantic information was explored to help the modeling
of the paraphrase generation [38, 74]. The success of Deep Neural Network in Machine
Translation has been matched by its efficacy in paraphrasing as well. Learning from a large
parallel corpus, standard encoder-decoder structure can model the source text as a hidden
representation and generate the target paraphrase based on that [128, 95, 33]. Word Embed-
ding Attention was added to better model the semantics of the words [94]. An evaluator is introduced through deep reinforcement learning to improve paraphrase generation [86]. Another approach to improve the generative performance is ap-
plying Variational Autoencoder (VAE) to the encoder-decoder structure [49, 16]. Recently,
the powerful Transformer structure [154] was applied to paraphrasing tasks [169, 37]. In
addition to word sequences, Wang et al.[158] also applied Transformer to the correspond-
ing frame and role label sequences to improve the generation performance. In this chapter,
we build on the general approach of using Transformer-based language models for para-
phrase generation tasks, since these models have the most promising performance.
Another line of work adds explicit requirements to the generation, such as keywords that need to be included in the output text. This additional step helps the paraphrasing to be more
task-oriented and improves the generation quality. Attention mechanism is utilized to build
Pointer Net [155] and CopyNet [48] to specifically locate the relation between the words
in the source and target sequences – this can be a promising approach for meeting hard
constraints as generated text can include certain words from the source text. Cao et al.[18]
trained a separate alignment table to limit the vocabulary used in the decoding process.
Hu et al. [58] incorporated discriminators and a latent code to the VAE Encoder-Decoder
model to control the attributes incorporated in the generated text. Chen et al. [21] added
two hidden codes to represent the semantic and syntactic attributes, which they used to
control the semantic similarity and writing style, respectively. To create text content for
adversarial attacks, Wang et al. [159] included a separate controlled attribute in an encoder-
decoder framework. Generative Adversarial Networks (GAN) are combined with Trans-
former to incorporate the writing style that is extracted from a reference text to the output
text [185]. Keskar et al.[66] built an explicit relationship between subsets of training data
and the generative model using control codes. Recent work [105, 186] treated the con-
strained generation in a special way by inserting words based on the pre-defined keywords.
This ensures the persistence of the keywords, but it also fixes the order of the words. In addition, beyond the keywords, the generated text relies only on the learning domain, and it cannot take context information per generation task, such as the source sequence. However, we want a model that can be flexible in terms of the order of these keywords in the output paraphrase. Several methods [55, 127, 57] have been proposed to handle the hard-
constrained generation by modifying the decoding and inference stage. Our work focuses
on the same requirement of hard constraints, but the constraint information can be learned in the training process rather than enforced only at decoding time.
5.3 The CELM Framework
Language models are typically used to understand the writing of natural language text
and to generate natural text based on the learned knowledge. Language models can learn many properties of the text, including grammar rules, word usage, and writing styles. Towards our goal
of paraphrase generation with hard constraints, we believe the models can also learn these constraints. Therefore, we embed the constraint information directly into the text content and let the language model learn such constraints.
Figure 5.2 shows the overall workflow of our proposed CELM framework. Specifi-
cally, given the original text, hard constraints are identified automatically, and then these
constraints are embedded into the text sequence. A causal language model is used to gen-
erate the output given the embedded text sequence, and the extracted hard constraints are
realized in the output. The rest of this section describes these steps in more detail.
Instead of using latent variables to control the constrained generation, we embed the
specific constraint information directly in the content. While it may be feasible to assign
the constraints manually, we want the system to identify automatically the words where the hard constraints occur.
We explore the constraints for commercial tweets starting from certain nouns and nu-
merical representations. We rely on the syntactic dependency parse tree of a given text
sequence to identify the constraint in commercial tweets, as a dependency parse tree can
give more information than part-of-speech (POS) tags. The structure of commercial tweets
is generally simple and the constraints are usually focused on the proper nouns and num-
bers. For example, Figure 5.3 shows the result of dependency parsing of a commercial
tweet. The name Ridley Scott is clearly a constraint in the text. In order to keep the identification precise, we limit the proper nouns to be the root, subject, or object in a dependency relation, while allowing numerical representations that are number modifiers. A preliminary test shows that this approach results in 98% precision and 75% recall in identifying the hard constraints in commercial tweets.
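A sketch of this identification rule with the spaCy dependency parser; the exact dependency labels checked here are our reading of the rule, not the verified implementation:

import spacy

nlp = spacy.load("en_core_web_sm")
CONSTRAINT_DEPS = {"ROOT", "nsubj", "nsubjpass", "dobj", "pobj"}

def find_hard_constraints(text):
    doc = nlp(text)
    constraints = []
    for token in doc:
        # Proper nouns acting as root, subject, or object
        if token.pos_ == "PROPN" and token.dep_ in CONSTRAINT_DEPS:
            constraints.append((token.text, "PROPN"))
        # Numerical representations acting as number modifiers
        elif token.like_num and token.dep_ == "nummod":
            constraints.append((token.text, "NUM"))
    return constraints

# Multi-token names (e.g., "Ridley Scott") would be merged from noun chunks
# in a fuller version.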
Similar to [120], we embed the constraint in the text sequence by replacing the cor-
responding token or phrases with their constraint types. Initially we mark two constraint
types, which are proper nouns and numerical representations. Figure 5.4 shows an example
of replacing a proper noun phrase with a special token to embed the constraint information.
We treat tokens or phrases that have the same value as the same constraint. Therefore, it is possible that the same constraint occurs multiple times in a single text sequence.
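A minimal sketch of the embedding step; the token format is our own, and spans with the same surface value share one constraint token, as stated above:

def embed_constraints(text, constraints):
    mapping = {}
    for value, ctype in constraints:
        # Same value -> same token; replace() covers repeated occurrences
        token = mapping.setdefault(value, f"<{ctype}_{len(mapping)}>")
        text = text.replace(value, token)
    return text, mapping

embedded, m = embed_constraints(
    "Ridley Scott returns. Ridley Scott is back.", [("Ridley Scott", "PROPN")])
# embedded == "<PROPN_0> returns. <PROPN_0> is back."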
Although the dependency relation is used to identify the constraints, we do not directly add such relationship information to the constraint type. We believe these relations can be learned by an effective language model. Therefore, we omit these relations from the constraint information to increase the flexibility of the model. Fewer limitations in constructing the constraints also increase the number of training samples in which matched constraints can be found.
Language models try to generate the current word w_i given the context words w_c, i.e., they model the conditional probability P(w_i | w_c).
In this work, we rely on the popular causal language model that imitates the writing
habits of humans and utilizes only the information generated previously. Therefore, context
words are the words that have been generated previously in the text sequence.
With the growing capability of deep learning models, it becomes possible for language
models to train on extremely large datasets. GPT-2 [130] has been shown to successfully
fulfill many text-generation tasks, such as summarization and translation. Similar to [169,
50], we utilize the pre-trained GPT-2 model as the language model to generate paraphrase
for a given text sequence. GPT-2 is a pre-trained causal language model that focuses on
generating the most appropriate token to form coherent writing. Therefore, we form single sequences from the paraphrase pairs to fine-tune the model, so that the model learns
to perform paraphrase generation. Paraphrase pairs are concatenated with a special token
(such as “>>><<<”) to identify the paraphrase activity as well as the separation of the source and target sequences.
We treat the tokens that represent constraint types like ordinary tokens in language
modeling. Along with the paraphrase separation token, these special tokens become part of
the corpus. Because these special tokens have a much higher occurrence than regular words
in the dataset, the paraphrase activity and corresponding constraints are learned more easily
through fine-tuning.
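As a sketch, a training sequence for fine-tuning might be formed as follows; the separator is the special token mentioned above, and the end-of-text marker is the standard GPT-2 one:

SEP = ">>><<<"

def make_training_sequence(source, target, eos="<|endoftext|>"):
    # Concatenate a (constraint-embedded) paraphrase pair into one sequence
    return f"{source} {SEP} {target} {eos}"

# At inference time, the prompt is the embedded source followed by SEP,
# and the model continues with the paraphrase.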
One of the goals of the CELM framework is to generate paraphrased commercial tweets with enough diversity. The model should avoid reusing too many of the tokens from the source
text. Therefore, instead of a greedy approach or a beam search [147], we sample the tokens
to generate multiple paraphrases for each source sequence. Greedy approaches focus on
the output sequences that have the highest probabilities. However, these approaches often produce repetitive and less diverse text.
In this work, we apply Top-k sampling [39] that generates each token randomly based
on the conditional probability of the most likely k tokens. The probabilities of the top k tokens are renormalized before sampling.
We also introduce the use of the Top-p method, which limits the sampling pool to be the
smallest possible set of tokens whose probability summation exceeds the probability p.
Renormalization is also applied to the limited probability set. Unlike the Top-k approach,
Top-p builds a dynamic sampling pool where fewer tokens are included when the entropy of
the probability distribution is lower.
Combining these, each token is inferred by sampling from a dedicated candidate set that meets both the Top-k and Top-p criteria:

w_i \sim P(w \mid w_{1:i-1})

where the distribution is restricted to, and renormalized over, the candidate set.
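A sketch of the combined filtering, assuming probs is the model's next-token probability vector:

import numpy as np

def sample_token(probs, k=40, p=0.9, rng=np.random.default_rng()):
    order = np.argsort(probs)[::-1]       # tokens by descending probability
    keep = order[:k]                      # Top-k cutoff
    cum = np.cumsum(probs[keep])
    keep = keep[:max(1, np.searchsorted(cum, p) + 1)]  # smallest set with mass > p
    renorm = probs[keep] / probs[keep].sum()           # renormalize the pool
    return rng.choice(keep, p=renorm)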
Because the constraints are embedded in the text sequence, the generated sequences
are expected to have the constraint tokens so that they can be realized to the actual values.
Based on the number of constraints for each constraint type, the final realization is handled
in different ways. Figure 5.5 shows the examples for both cases.
Figure 5.5: Examples of text content with single and multiple constraints
Single-constraint Realization: For cases where only one constraint is identified in each
text content, the model simply replaces the constraint tokens with their actual values.
Multiple-constraint Realization: When multiple constraints are extracted for one constraint type, the model goes through all possible permutations of the actual values to replace the constraint tokens. This ensures that every actual value takes at least one constraint token of its type. Then, the output sequence that has the highest semantic similarity to the source text is selected as the final result.
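A sketch of the multiple-constraint realization, where semantic_similarity stands in for, e.g., a cosine similarity between sentence embeddings:

from itertools import permutations

def realize(generated, tokens, values, source, semantic_similarity):
    candidates = []
    for perm in permutations(values):        # try every value-to-token assignment
        text = generated
        for token, value in zip(tokens, perm):
            text = text.replace(token, value)
        candidates.append(text)
    # Keep the realization closest in meaning to the source tweet
    return max(candidates, key=lambda t: semantic_similarity(t, source))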
5.4 Experiments
In this section, we report results from a set of experiments designed to demonstrate the
capability of learning and incorporating hard constraints through causal language models.
Our proposed CELM framework identifies hard constraints automatically from commercial
tweets, embeds the constraints in the tweet, relies on GPT-2 for training and inference, and realizes the constraints in the generated output.
To form the comparison, we utilize CopyNet [48] as the baseline. CopyNet is a state-of-the-art model designed specifically to ensure that certain tokens from the input sequence are kept in the output sequence. CopyNet can learn the constraint relation through pairs of raw text samples. Additionally, we provide the constraint-embedded data and apply constraint realization to CopyNet, as well as test it under the same settings as CELM.
Unlike other text-generation tasks, sources for sentential paraphrase datasets are lim-
ited. In particular, the sizes of the datasets are often small. Popular datasets such as the
one reported by Dolan et al.[32] are generated by human annotators in the news domain.
For the Twitter domain, two paraphrase datasets [174, 75] are constructed either manually
or by relying on a strong assumption of sharing the same links. They suffer from size and
domain limitations, which make them not ideal to train a generation model for commercial
tweets.
Fortunately, we find that the writing style of commercial tweets is more formal and
closer to day-to-day writing. Therefore, we trained the model using the parallel machine-
translated (PMT) paraphrase dataset [167]. This dataset is automatically constructed using
a neural machine translation model, and we are able to use about 5 million sentential para-
phrase pairs for training. Constructed based on CzEng [14], the PMT dataset covers a wide
range of fields including tweets. Because of the formal writing style seen in commercial
tweets, we find that the cases in PMT are considerably compatible with commercial tweets.
Table 5.1 lists certain examples that demonstrate this compatibility in terms of the use of words and writing style.
Table 5.1: Examples for data compatibility between PMT and CommTweet
We apply the dependency parser from spaCy [56] and embed the constraint tokens
where the same hard constraint words are located in both sequences of a paraphrase pair.
We build two training sets from the result: 1) include only the constraint-embedded para-
phrase pairs where at least one matched constraint pair is located (ONLY); 2) besides the
pairs from ONLY, include the original text from PMT for all other paraphrase pairs (MIX).
The ONLY set focuses on learning the hard constraint, whereas the MIX set also provides
more sources to learn general paraphrase generation. Additionally, to compare the perfor-
mance, we use the entire original dataset as the third training set (ORI). The ORI and MIX
sets both contain about 5,000,000 sentence pairs, and the ONLY set has 800,000 pairs.
We utilize the knowledge learned from the PMT dataset and apply it to our commercial tweet dataset (CommTweet), which contains original tweets (excluding retweets and comments) from 35 verified official accounts of popular brands. The same constraint iden-
tification method is applied, and we further split the dataset into two subsets. One subset
(SINGLE) contains the commercial tweets where only one constraint is identified for each
constraint type (proper nouns and numerical representations). The other subset (MULTI)
contains cases where more than one constraint is found for each type. SINGLE has 31,922 cases.
Two preprocessing steps were added to these datasets. First, links and hashtags are
special elements in commercial tweets, and they are intended to be the same regardless of
the content of the tweet. Therefore, we remove the links from the data, and segment the
hashtags so that they can be part of the content to get involved in paraphrase generation.
Second, from all datasets, we remove the tweets that are extremely short.
Consistent with the goals of paraphrasing, we use several metrics to capture the diversity and the semantic similarity of the generated text compared to the original text.
Measuring the word usage is an effective way to demonstrate the diversity of the writing,
and particularly, Witteveen et al. [169] show that ROUGE-L [90] is useful in determining
the uniqueness of the generation. We also include uni-gram BLEU [118] to measure the di-
versity. Meanwhile, the semantic similarity is measured by computing the cosine similarity between the embeddings of the generated and the original text. We also report coverage, the fraction of the hard constraints that are realized in the output, and perplexity to quantify the coherency of the
writing of the generated paraphrase. The perplexity score is generated by running inference
on the pre-trained GPT2-medium model. Note that we use GPT2-small to generate this part
of the experiment results due to its efficiency. Therefore, the perplexity score can only be
used to compare the performance results that are generated using the same language model
(CopyNet or GPT-2).
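As an illustration, the coverage metric as we interpret it reduces to the fraction of identified hard constraints that appear verbatim in the generation:

def coverage(constraints, generated):
    if not constraints:
        return 1.0
    return sum(c in generated for c in constraints) / len(constraints)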
As suggested in [169], we fine-tune the GPT-2 model for a small number of epochs
to give the model enough exposure to the task while avoiding over-fitting on the new data.
CopyNet is trained from scratch, but we also leave some allowance when training and
validating so that the model is not overfitted and can be applied to the test data in another
domain. The training sets (ORI, ONLY, MIX) and test sets (SINGLE, MULTI) are identical
when used for experiments related to the CopyNet and GPT-2 models. We also set a maximum length for the generated sequences.

Tables 5.2 and 5.3 list the performance of the CopyNet baseline model and the proposed
CELM framework with GPT-2 using original and constraint-embedded data. As stated
earlier, BLEU and ROUGE-L are used to measure the diversity of the generation, and
a lower score represents a more diverse generation. Similarity shows the quality of the
paraphrase in terms of how much semantic information it keeps from the original content,
and coverage checks the ability of the model to meet the hard constraint. Finally, perplexity compares the writing of the generation against the knowledge of the language model, and a lower perplexity indicates more coherent writing.

Table 5.2: Performance of the CopyNet baseline and CELM with the ORI, ONLY, and MIX training sets

             Baseline (CopyNet)       CELM (GPT-2)
             ORI    ONLY   MIX        ORI    ONLY   MIX
BLEU         0.532  0.587  0.637      0.213  0.267  0.262
ROUGE-L      0.750  0.817  0.863      0.234  0.295  0.294
Similarity   0.774  0.843  0.877      0.781  0.828  0.817
Coverage     0.477  0.876  0.845      0.261  0.912  0.865
Perplexity*  1300   1144   863        177    306    309
In general, CopyNet tends to reuse a lot of word sequences directly from the input
content, and it results in much higher BLEU and ROUGE-L scores. We cannot rely on
perplexity scores to compare the quality of generated content between the two language
models, but it seems to us that CopyNet can output text content that is more fluent and
coherent. CopyNet uses word-level embeddings, whereas GPT-2 is built on sub-word-level tokens.

When CopyNet and CELM are handling data without specific constraints embedded,
their generations are comparable in terms of similarity to the original content. Because
CopyNet is designed specifically to handle sequences with hard constraints, it has a better
chance than CELM to correctly keep the designated tokens. On the other hand, the genera-
tions of CopyNet are much less diverse than those of CELM. The fact that CopyNet is more
likely to repeat large portions of sub-sequences from the source in the output improves the similarity scores at the cost of diversity.
When constraint information is embedded into the data, CELM shows a much larger
improvement over CopyNet in terms of coverage. This shows that neither model is well-designed to handle hard constraints directly from the original content, so both improve when the constraints are embedded in the content. Meanwhile, CELM keeps the
major advantage of generation diversity over CopyNet. We believe the pre-trained knowledge in GPT-2 is the main source of this diversity. Although the constraint embedding brings a coverage improvement, it still does not help CopyNet ease the tendency of repeating sub-sequences from the source.
Comparing the two types of data where constraints are embedded reveals that training
on only the pairs (ONLY) where hard constraints are found offers the best coverage. Adding the pairs in which no constraint is embedded (MIX) helps CopyNet improve the similarity score, but further reduces the
diversity in the generation. Equipped with pre-trained knowledge, CELM does not show much difference in similarity and diversity scores when the MIX data is used. The changes in
perplexity scores when using ORI, ONLY and MIX datasets are reversed between CopyNet
and CELM. For CELM, we think the embedded constraint tokens break the pre-trained language patterns and therefore increase the perplexity. CopyNet, on the other hand, is trained from scratch, so the embedded constraints help it understand the task and reduce the perplexity.
Most observations from the test on the SINGLE dataset remain the same when test-
ing on the MULTI dataset. Both models get lower coverage scores because it is harder to
handle more constraints in the content. CopyNet has better diversity but worse similarity
measurement on the MULTI test set, whereas CELM shows the opposite. CopyNet struggles to turn more embedded constraints into the corresponding generated constraints, but this results in a more diverse generation. CELM has more power to learn from the additional constraint information, but it also includes more repeated tokens from the input.
Table 5.4: Sample generations from the CommTweet dataset

Model               Content
Original            No fork, no fire, no problem. smores day
Baseline (CopyNet)  No fork, no fire, no no.
CELM (GPT-2)        No fire and smores, no trouble. sm hours are coming up. it’s like the days of the day.

Original            Looks like most Americans lack savings to cover emergencies
Baseline (CopyNet)  Looks like most Americans are missing savings
CELM (GPT-2)        People like most of the Americans lack the savings to pay for emergency situations.
Table 5.4 lists two samples from the CommTweet dataset to demonstrate the character-
istics of the generation of both models. The first commercial tweet does not contain any
identified hard constraints. It is obvious that CopyNet repeats a lot of words from the origi-
nal text whereas CELM generates more diverse writing. The second example contains one
hard constraint and both models cover it. CopyNet keeps the main structure of the content
and replaces some tokens. CELM rewrites most of the text and still maintains the constraint in its output.
Overall, the experiments show that embedding the hard constraint into the content
can dramatically increase the capability to keep the constraint in the output, regardless
of whether the language model is designed specifically to handle such a requirement. A
pre-trained large language model can generate paraphrases with greater diversity and comparable similarity. With more constraints identified in the original content, the model may find it harder to satisfy all of them.

5.5 Summary

In this chapter, we proposed the CELM framework for the constrained paraphrase generation
of commercial tweets. The framework includes a process to identify hard constraints, em-
bed the constraints in the text content, generate output using a causal language model, and
realize the constraints to form the final paraphrase. We also show that a model trained on a general-domain dataset transfers well to a dataset of commercial tweets. The
experiments demonstrate that the constraint-embedded data can help generation models to
create better paraphrases in terms of semantic similarity and diversity, while meeting con-
straints. The improvement applies to both general language models and models specifically designed for constrained generation.
Chapter 6: Conclusions and Future Work
6.1 Conclusions
Our work focuses on different aspects of utilizing Twitter for better advertising. It
involves analyzing user feedback, predicting the influence of commercial tweets, profiling
users based on their offline activities, and generating paraphrases for commercial tweets.
These contributions help reduce human effort and increase the efficiency in multiple steps of social media advertising.

We first propose an ensemble method that combines heterogeneous classification models to form a result that improves the performance in a mixed-classification task. The model
focuses on the difficulty caused by distinct requirements and definitions for class labels of
different domains, and it utilizes the fact that different models treat the labels differently to
build the ensemble model. The model includes a tweet vector with the probabilistic output
from several classifiers to further improve the performance of the ensemble model.
We define an influence score to measure the level of attention a commercial post draws
from its audiences. To predict whether a commercial tweet will have enough influence, we
create a set of style features and apply them to a classifier. The style features focus on the
creation of tweets and do not include the inherent meaning of the tweets. Therefore the
model is generalized to any commercial tweets. The ablation analysis on these meta and
linguistic-based features discovers the secret of crafting successful commercial tweets and provides practical suggestions for composing them.
The recognition of offline activities provides a unique view for user profiling. We ex-
plore the existing LSTM-based structures that can include features in addition to the target
tweet content. We propose a hybrid-LSTM model to efficiently include contextual and his-
torical information with the target tweet to recognize user activities. A case study using the model reveals that the field of a company is reflected in the major activities of its followers.
We discover the effectiveness of embedding constraint information into text content and
generating paraphrases for commercial tweets using a causal language model such as GPT-
2. The hard constraints are critical in paraphrase generation so that the key information
can be preserved intact in the generated commercial tweets. We show that the knowledge
learned from a general domain can be transferred and applied to the domain of commercial
tweets. The proposed CELM framework can generate paraphrase tweets that are semanti-
cally similar to the original tweet and diverse in terms of the text to provide more choices.
Our work covers the use of statistical classification models such as SVM, neural net-
work classification models like LSTM, and sequence generation models such as GPT-2. We show the benefit of incorporating contextual features into these models. The contextual information offers additional help by incorporating tweet content, part-of-speech tags, post time, or historical data. We also demonstrate the use of embedding
constraints directly into the text content and generating paraphrases with these hard con-
straints. We discover ways to map outputs from one specific domain to a different one and
show that learned knowledge can be transferred to another compatible domain. Finally, our
contribution includes the collection of the datasets of commercial tweets and tweets with activity labels.

6.2 Future Work

The work of mixed-classification models for tweets can be extended in several dimen-
sions. Additional variants of the ensemble classifier can be explored, with the goal of
better utilizing the probability distribution generated by the individual models for a more
effective combination of the models. For example, one can focus on improving the general-
ity of the new ensemble method in handling additional types of probability distributions as
the input or developing methods that learn the characteristics of each classification model
on a given dataset and use this knowledge in combining multiple models. Another direction
can be to incorporate more individual models as well as additional context features to the
ensemble method.
On the basis of the influence prediction system for commercial tweets, a suggestion
system can be built that helps companies generate better commercial tweets in terms of the
influence on their audiences. Similar to the example shown in the case study, the suggestion
system can propose potential modifications, using the prediction system to determine which
modifications lead to a more successful post. Other techniques can be explored to separate
the writing of the tweet from the commercial information it carries. Working on a given piece of commercial information, such a system could then optimize the writing independently.
Based on our work of the activity recognition model, we intend to identify more con-
textual features and explore their abilities using additional models. The current labeling
process relies on the reported location of each tweet. Thus, determining better ways to
improve location accuracy could potentially increase the quality of our work. During our
experiments, we found that some images attached to tweets may be useful in identifying
activities. While it is currently not common to include images as contextual features for tweet analysis, such multimodal information is a promising direction.
To extend the constrained paraphrase generation model, we plan to explore more types
of hard constraints and their impact when embedding into the content. One approach to
distinguish the constraint types is to cluster all the candidates for the hard constraint and
assign type tokens accordingly. We also want to discover ways to include dependency infor-
mation into the constraint itself, but maintain enough flexibility in utilizing the constraints.
Exploring better solutions to handle multiple constraints for each type in a sequence is another direction for future work.
Furthermore, our work relies on the assumption of author independence, which does
not take into consideration any author identity information. We have concerns about model
bias when author identity is included. However, a well-designed author representation can
be used to include characteristic-related information while keeping the model unbiased. In-
fluencers are the usernames mentioned in the posts to raise attention to the advertisements,
and more companies have realized the importance of using influencers in their commer-
cials. Exploring the relation between the characteristics of the influencer and the success of the commercial tweets will become a meaningful topic. With the emerging use of graph
models, our solutions can benefit from incorporating tweet-level or author-level relation-
ships. Some of the proposed features or models can be converted and applied to Graph
Neural Network (GNN) models [172], and the propagation of the feature information can
help improve the performance. In addition, the datasets we created for our work are based
on certain special requirements and focus on unique perspectives. Besides the standard
evaluation methods, some human evaluations can improve the reliability of the experiment results.
Our work provides an initial contribution to the field of social media marketing and advertising. We try to overcome several problems in the field, but more meaningful questions
and challenges are still open to be solved. We hope that our work gives an explicit direction
for the research in this field and leads to further accomplishment in the future.
Appendix A: Implementation and Datasets
The implementation codes for all four steps of the social media loop are published on
my personal GitHub page. They are organized as four separate projects, and each one corresponds to one part of the work.

The two datasets that are created and used in the work are also publicly available.
CommTweet dataset contains the commercial tweets posted by the official accounts of 36
companies listed in Table A.1. Commercial tweets refer to the original tweets that are not
retweets or comments. They are the tweets companies use to post marketing or advertising
information.
ActivityTweet dataset includes ordinary tweets with activity labels listed in Table A.2.
These labels represent the offline activity that the author was engaged in when the tweet
was posted. Tweets in the dataset all have reported locations, which are used to determine the activity labels.
Finally, the links to the implementation codes and datasets are listed in the following
table.
Implementation Codes
https://github.com/renhaocui
CommTweet Dataset
https://1drv.ms/u/s!AhCHbLu6TCc8hMYirS6lFOVRUDSttw?e=WIenmB
ActivityTweet Dataset
https://1drv.ms/u/s!AhCHbLu6TCc8hMYju-IT9PDbt9LIKg?e=EnLhmY
Bibliography
[1] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Analyzing user modeling
on twitter for personalized news recommendations. In International Conference on
User Modeling, Adaptation, and Personalization, pages 1–12. Springer, 2011.
[2] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad
Mishne. Finding high-quality content in social media. In Proceedings of the 2008
International Conference on Web Search and Data Mining, pages 183–194. ACM,
2008.
[3] Isabel Anger and Christian Kittl. Measuring influence on twitter. In Proceedings
of the 11th International Conference on Knowledge Management and Knowledge
Technologies, page 31. ACM, 2011.
[4] Yoav Artzi, Patrick Pantel, and Michael Gamon. Predicting responses to microblog
posts. NAACL HLT 2012, page 602, 2012.
[5] Mohamed Faouzi Atig, Sofia Cassel, Lisa Kaati, and Amendra Shrestha. Activity
profiles in online social media. In Advances in Social Networks Analysis and Mining
(ASONAM), 2014 IEEE/ACM International Conference on, pages 850–855. IEEE,
2014.
[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine trans-
lation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473,
2014.
[7] Eytan Bakshy, Jake M Hofman, Winter A Mason, and Duncan J Watts. Everyone’s
an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM
international conference on Web search and data mining, pages 65–74. ACM, 2011.
[8] Fabrício Benevenuto, Tiago Rodrigues, Meeyoung Cha, and Virgílio Almeida. Char-
acterizing user behavior in online social networks. In Proceedings of the 9th ACM
SIGCOMM conference on Internet measurement conference, pages 49–62. ACM,
2009.
[9] Adam Bermingham and Alan Smeaton. On using twitter to monitor political sen-
timent and predict election results. In Proceedings of the Workshop on Sentiment
Analysis where AI meets Psychology (SAAIP 2011), pages 2–10, 2011.
[10] Parantapa Bhattacharya, Muhammad Bilal Zafar, Niloy Ganguly, Saptarshi Ghosh,
and Krishna P Gummadi. Inferring user interests in the twitter social network. In
Proceedings of the 8th ACM Conference on Recommender systems, pages 357–360.
ACM, 2014.
[11] Monica Billio, Roberto Casarin, Francesco Ravazzolo, and Herman K Van Dijk.
Bayesian combinations of stock price predictions with an application to the amster-
dam exchange index. 2011.
[12] István Bíró, Dávid Siklósi, Jácint Szabó, and András A Benczúr. Linked latent
dirichlet allocation in web spam filtering. In Proceedings of the 5th International
Workshop on Adversarial Information Retrieval on the Web, pages 37–40. ACM,
2009.
[13] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the
Journal of machine Learning research, 3:993–1022, 2003.
[14] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovickỳ, Michal Novák, Mar-
tin Popel, Roman Sudarikov, and Dušan Variš. Czeng 1.6: enlarged czech-english
parallel corpus with processing tools dockered. In International Conference on Text,
Speech, and Dialogue, pages 231–238. Springer, 2016.
[15] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock mar-
ket. Journal of Computational Science, 2(1):1–8, 2011.
[16] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz,
and Samy Bengio. Generating sentences from a continuous space. arXiv preprint
arXiv:1511.06349, 2015.
[17] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller,
Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grob-
ler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux.
API design for machine learning software: experiences from the scikit-learn project.
In ECML PKDD Workshop: Languages for Data Mining and Machine Learning,
pages 108–122, 2013.
[18] Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. Joint copying and restricted
generation for paraphrase. arXiv preprint arXiv:1611.09235, 2016.
[19] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, P Krishna Gummadi, et al.
Measuring user influence in twitter: The million follower fallacy. Icwsm, 10(10-
17):30, 2010.
[20] François Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[21] Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. Controllable
paraphrase generation with a syntactic exemplar. arXiv preprint arXiv:1906.00565,
2019.
[22] Justin Cheng, Lada Adamic, P Alex Dow, Jon Michael Kleinberg, and Jure
Leskovec. Can cascades be predicted? In Proceedings of the 23rd international
conference on World wide web, pages 925–936. ACM, 2014.
[23] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representa-
tions using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.
[24] Meri Coleman and Ta Lin Liau. A computer readability formula designed for ma-
chine scoring. Journal of Applied Psychology, 60(2):283, 1975.
[25] Michael D Conover, Bruno Gonçalves, Jacob Ratkiewicz, Alessandro Flammini, and
Filippo Menczer. Predicting the political alignment of twitter users. In 2011 IEEE
third international conference on privacy, security, risk and trust and 2011 IEEE
third international conference on social computing, pages 192–199. IEEE, 2011.
[26] Michael D Conover, Jacob Ratkiewicz, Matthew Francisco, Bruno Gonçalves, Fil-
ippo Menczer, and Alessandro Flammini. Political polarization on twitter. In Fifth
international AAAI conference on weblogs and social media, 2011.
[27] Nadia FF Da Silva, Eduardo R Hruschka, and Estevam R Hruschka Jr. Tweet sen-
timent analysis with classifier ensembles. Decision Support Systems, 66:170–179,
2014.
[28] Marco Damonte and Shay B Cohen. Structural neural encoders for amr-to-text gen-
eration. arXiv preprint arXiv:1903.11410, 2019.
[29] Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W
Cohen. Tweet2vec: Character-based distributed representations for social media.
arXiv preprint arXiv:1605.03481, 2016.
[30] Thomas Dickinson, Miriam Fernández, Lisa A Thomas, Paul Mulholland, Pam
Briggs, and Harith Alani. Identifying important life events from twitter using se-
mantic and syntactic patterns. 2016.
[32] William B Dolan and Chris Brockett. Automatically constructing a corpus of sen-
tential paraphrases. In Proceedings of the Third International Workshop on Para-
phrasing (IWP2005), 2005.
[33] Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. Learning to para-
phrase for question answering. arXiv preprint arXiv:1708.06022, 2017.
[34] Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. Question generation for ques-
tion answering. In Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing, pages 866–874, 2017.
[35] Yogesh K Dwivedi, Kawaljeet Kaur Kapoor, and Hsin Chen. Social media marketing
and advertising. The Marketing Review, 15(3):289–309, 2015.
[36] Paul S Earle, Daniel C Bowden, and Michelle Guy. Twitter earthquake detection:
earthquake monitoring in a social world. Annals of Geophysics, 54(6), 2012.
[37] Elozino Egonmwan and Yllias Chali. Transformer and seq2seq model for para-
phrase generation. In Proceedings of the 3rd Workshop on Neural Generation and
Translation, pages 249–255, 2019.
[38] Michael Ellsworth and Adam Janin. Mutaphrase: Paraphrasing with framenet. In
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphras-
ing, pages 143–150, 2007.
[39] Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation.
arXiv preprint arXiv:1805.04833, 2018.
[41] Shuai Gao, Jun Ma, and Zhumin Chen. Modeling and predicting retweeting dynam-
ics on microblogging platforms. In Proceedings of the Eighth ACM International
Conference on Web Search and Data Mining, pages 107–116. ACM, 2015.
[42] Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry
Heck. Contextual lstm (clstm) models for large scale nlp tasks. arXiv preprint
arXiv:1602.06291, 2016.
[43] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills,
Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A
Smith. Part-of-speech tagging for twitter: Annotation, features, and experiments.
In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies: short papers-Volume 2, pages 42–47.
Association for Computational Linguistics, 2011.
[44] Tilmann Gneiting and Adrian E Raftery. Weather forecasting with ensemble meth-
ods. Science, 310(5746):248–249, 2005.
[45] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N Project Report, Stanford, 1:12, 2009.
[46] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in neural information processing systems, pages 2672–2680, 2014.
[47] Klaus Greff, Rupesh K Srivastava, Jan Koutník, Bas R Steunebrink, and Jürgen
Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks
and learning systems, 2017.
[48] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mech-
anism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393, 2016.
[49] Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. A deep generative
framework for paraphrase generation. arXiv preprint arXiv:1709.05074, 2017.
[50] Chaitra Hegde and Shrikumar Patil. Unsupervised paraphrase generation using pre-
trained language models. arXiv preprint arXiv:2006.05477, 2020.
[51] Geoffrey E Hinton. Products of experts. In Artificial Neural Networks, 1999. ICANN
99. Ninth International Conference on (Conf. Publ. No. 470), volume 1, pages 1–6.
IET, 1999.
[53] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural com-
putation, 9(8):1735–1780, 1997.
[54] Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volinsky.
Bayesian model averaging: a tutorial. Statistical science, pages 382–401, 1999.
[55] Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation
using grid beam search. arXiv preprint arXiv:1704.07138, 2017.
[56] Matthew Honnibal and Ines Montani. spacy 2: Natural language understanding
with bloom embeddings, convolutional neural networks and incremental parsing. To
appear, 7(1), 2017.
[57] J Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post,
and Benjamin Van Durme. Improved lexically constrained decoding for translation
and monolingual rewriting. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Papers), pages 839–850, 2019.
[58] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing.
Toward controlled generation of text. arXiv preprint arXiv:1703.00955, 2017.
[59] Minlie Huang, Yujie Cao, and Chao Dong. Modeling rich contexts for sentiment
classification with lstm. arXiv preprint arXiv:1605.01478, 2016.
[60] William John Hutchins and Harold L Somers. An introduction to machine transla-
tion, volume 362. Academic Press London, 1992.
[61] Kazushi Ikeda, Gen Hattori, Chihiro Ono, Hideki Asoh, and Teruo Higashino.
Twitter user profiling based on text and community mining for market analysis.
Knowledge-Based Systems, 51:35–47, 2013.
[62] Thorsten Joachims. Text categorization with support vector machines: Learning
with many relevant features. Springer, 1998.
[64] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, and Amit Sheth. User in-
terests identification on twitter using a hierarchical knowledge base. In European
Semantic Web Conference, pages 99–113. Springer, 2014.
[65] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statis-
tical association, 90(430):773–795, 1995.
[66] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard
Socher. Ctrl: A conditional transformer language model for controllable generation.
arXiv preprint arXiv:1909.05858, 2019.
[67] Dan Klein, Kristina Toutanova, H Tolga Ilhan, Sepandar D Kamvar, and Christo-
pher D Manning. Combining heterogeneous classifiers for word-sense disambigua-
tion. In Proceedings of the ACL-02 workshop on Word sense disambiguation: recent
successes and future directions-Volume 8, pages 74–80. Association for Computa-
tional Linguistics, 2002.
[68] Johannes Knoll. Advertising in social media: a review of empirical evidence. Inter-
national journal of Advertising, 35(2):266–300, 2016.
[69] Daphne Koller and Mehran Sahami. Hierarchically classifying documents using
very few words. Technical report, Stanford InfoLab, 1997.
[70] J Zico Kolter and Marcus A Maloof. Dynamic weighted majority: An ensemble
method for drifting concepts. The Journal of Machine Learning Research, 8:2755–
2790, 2007.
[71] Jeremy Z Kolter, Marcus Maloof, et al. Dynamic weighted majority: A new ensem-
ble method for tracking concept drift. In Data Mining, 2003. ICDM 2003. Third
IEEE International Conference on, pages 123–130. IEEE, 2003.
[72] Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris
Dyer, and Noah A Smith. A dependency parser for tweets. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing, Doha, Qatar,
to appear, volume 4, 2014.
[73] Efthymios Kouloumpis, Theresa Wilson, and Johanna D Moore. Twitter sentiment
analysis: The good the bad and the omg! Icwsm, 11:538–541, 2011.
[74] Raymond Kozlowski, Kathleen F McCoy, and K Vijay-Shanker. Generation
of single-sentence paraphrases from predicate/argument structure using lexico-
grammatical resources. In Proceedings of the second international workshop on
Paraphrasing, pages 1–8, 2003.
[75] Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. A continuously growing dataset of
sentential paraphrases. arXiv preprint arXiv:1708.00391, 2017.
[76] Leah S Larkey and W Bruce Croft. Combining classifiers in text categorization. In
Proceedings of the 19th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 289–297. ACM, 1996.
[77] Quoc Le and Tomas Mikolov. Distributed representations of sentences and docu-
ments. In Proceedings of the 31st International Conference on Machine Learning
(ICML-14), pages 1188–1196, 2014.
[78] Kathy Lee, Diana Palsetia, Ramanathan Narayanan, Md Mostofa Ali Patwary, Ankit
Agrawal, and Alok Choudhary. Twitter trending topic classification. In Data Mining
Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 251–258.
IEEE, 2011.
[79] Kyumin Lee, Jalal Mahmud, Jilin Chen, Michelle Zhou, and Jeffrey Nichols. Who
will retweet this?: Automatically identifying and engaging strangers on twitter to
spread information. In Proceedings of the 19th international conference on Intelli-
gent User Interfaces, pages 247–256. ACM, 2014.
[80] Ryong Lee and Kazutoshi Sumiya. Measuring geographical regularities of crowd
behaviors for twitter-based geo-social event detection. In Proceedings of the 2nd
ACM SIGSPATIAL international workshop on location based social networks, pages
1–10. ACM, 2010.
[81] Won-Jo Lee, Kyo-Joong Oh, Chae-Gyun Lim, and Ho-Jin Choi. User profile extrac-
tion from twitter for personalized news recommendation. In Advanced Communica-
tion Technology (ICACT), 2014 16th International Conference on, pages 779–783.
IEEE, 2014.
[82] David D Lewis and William A Gale. A sequential algorithm for training text clas-
sifiers. In Proceedings of the 17th annual international ACM SIGIR conference
on Research and development in information retrieval, pages 3–12. Springer-Verlag
New York, Inc., 1994.
[83] Jia Li, Hua Xu, Xingwei He, Junhui Deng, and Xiaomin Sun. Tweet modeling with
lstm recurrent neural networks for hashtag recommendation. In Neural Networks
(IJCNN), 2016 International Joint Conference on, pages 1570–1577. IEEE, 2016.
[84] Rui Li, Kin Hou Lei, Ravi Khadiwala, and Kevin Chen-Chuan Chang. Tedas: A
twitter-based event detection and analysis system. In 2012 IEEE 28th International
Conference on Data Engineering, pages 1273–1276. IEEE, 2012.
[85] Yung-Ming Li, Ya-Lin Shiu, et al. A diffusion mechanism for social advertising over
microblogs. DECISION SUPPORT SYSTEMS, 54(1):9–22, 2012.
[86] Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep
reinforcement learning. arXiv preprint arXiv:1711.00279, 2017.
[87] Defu Lian and Xing Xie. Collaborative activity recognition via check-in history.
In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-
Based Social Networks, pages 45–48. ACM, 2011.
[88] Dongliang Liao, Weiqing Liu, Yuan Zhong, Jing Li, and Guowei Wang. Predicting
activity and location with multi-task context aware recurrent neural network. In
IJCAI, pages 3435–3441, 2018.
[89] Andy Liaw and Matthew Wiener. Classification and regression by randomForest. R
News, 2(3):18–22, 2002.
[90] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text
summarization branches out, pages 74–81, 2004.
[91] Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. Learning natural lan-
guage inference using bidirectional lstm model and inner-attention. arXiv preprint
arXiv:1605.09090, 2016.
[92] Elena Lloret and Manuel Palomar. Towards automatic tweet generation: A compar-
ative study from the text summarization perspective in the journalism genre. Expert
Systems with Applications, 40(16):6624–6630, 2013.
[93] Chunliang Lu, Wai Lam, and Yingxiao Zhang. Twitter user modeling and tweets
recommendation based on wikipedia concept graph. In Workshops at the Twenty-
Sixth AAAI Conference on Artificial Intelligence, 2012.
[94] Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, and Xuancheng Ren. Query
and output: Generating words by querying distributed word representations for para-
phrase generation. arXiv preprint arXiv:1803.01465, 2018.
[95] Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. Paraphrasing revisited with
neural machine translation. In Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguistics: Volume 1, Long Papers,
pages 881–893, 2017.
[96] R Dean Malmgren, Jake M Hofman, Luis AN Amaral, and Duncan J Watts. Char-
acterizing individual communication patterns. In Proceedings of the 15th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
607–616, 2009.
[98] Alice Marwick and danah boyd. To see and be seen: Celebrity practice on twitter.
Convergence: The International Journal of Research into New Media Technologies,
17(2):139–158, 2011.
[99] Mani Maybury. Advances in automatic text summarization. MIT Press, 1999.
[100] Jon D McAuliffe and David M Blei. Supervised topic models. In Advances in neural
information processing systems, pages 121–128, 2008.
[101] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive
bayes text classification. In AAAI-98 workshop on learning for text categorization,
volume 752, pages 41–48. Citeseer, 1998.
[102] Michael McCord and M Chuah. Spam detection on twitter using traditional classi-
fiers. In Autonomic and trusted computing, pages 175–186. Springer, 2011.
[103] Kathleen McKeown. Paraphrasing questions using given and new information.
American Journal of Computational Linguistics, 9(1):1–10, 1983.
[104] Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. Improving lda
topic models for microblogs via tweet pooling and automatic labeling. In Proceed-
ings of the 36th international ACM SIGIR conference on Research and development
in information retrieval, pages 889–892. ACM, 2013.
[105] Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. Cgmh: Constrained sentence
generation by metropolis-hastings sampling. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 33, pages 6834–6842, 2019.
[106] Matthew Michelson and Sofus A Macskassy. Discovering users’ topics of interest
on twitter: a first look. In Proceedings of the fourth workshop on Analytics for noisy
unstructured text data, pages 73–80. ACM, 2010.
[107] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[108] Tom M Mitchell. Machine Learning. McGraw Hill, Burr Ridge, IL, 1997.
[109] Jacob M Montgomery, Florian M Hollenbach, and Michael D Ward. Improving pre-
dictions using ensemble bayesian model averaging. Political Analysis, 20(3):271–
291, 2012.
[110] Mor Naaman. Social multimedia: highlighting opportunities for search and mining
of multimedia data in social media applications. Multimedia Tools and Applications,
56(1):9–34, 2012.
[111] Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoy-
anov. Semeval-2016 task 4: Sentiment analysis in twitter. In Proceedings of the 10th
international workshop on semantic evaluation (semeval-2016), pages 1–18, 2016.
[112] Nasir Naveed, Thomas Gottron, Jérôme Kunegis, and Arifah Che Alhadi. Bad news
travel fast: A content-based analysis of interestingness on twitter. In Proceedings of
the 3rd International Web Science Conference, page 8. ACM, 2011.
[113] Finn Årup Nielsen. A new anew: Evaluation of a word list for sentiment analysis in
microblogs. arXiv preprint arXiv:1103.2903, 2011.
[114] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy
for text classification. In IJCAI-99 workshop on machine learning for information
filtering, volume 1, pages 61–67, 1999.
[115] Anastasios Noulas, Salvatore Scellato, Cecilia Mascolo, and Massimiliano Pontil.
An empirical study of geographic user activity patterns in foursquare. ICWSM,
11:570–573, 2011.
[116] Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider,
and Noah A Smith. Improved part-of-speech tagging for online conversational text
with word clusters. In Proceedings of the 2013 conference of the North American
chapter of the association for computational linguistics: human language technolo-
gies, pages 380–390, 2013.
[117] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment clas-
sification using machine learning techniques. In Proceedings of the ACL-02 confer-
ence on Empirical methods in natural language processing-Volume 10, pages 79–86.
Association for Computational Linguistics, 2002.
[118] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method
for automatic evaluation of machine translation. In Proceedings of the 40th annual
meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[119] Ravi Parikh and Matin Movassate. Sentiment analysis of user-generated twitter up-
dates using various classification techniques. CS224N Final Report, pages 1–18,
2009.
[120] Md Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Build-
ing language models for text with named entities. arXiv preprint arXiv:1805.04836,
2018.
[121] Michael J Paul and Mark Dredze. You are what you tweet: Analyzing twitter for
public health. ICWSM, 20:265–272, 2011.
[122] Huan-Kai Peng, Jiang Zhu, Dongzhen Piao, Rong Yan, and Ying Zhang. Retweet
modeling using conditional random fields. In 2011 IEEE 11th International Confer-
ence on Data Mining Workshops, pages 336–343. IEEE, 2011.
[123] Marco Pennacchiotti and Ana-Maria Popescu. A machine learning approach to twit-
ter user classification. In Fifth International AAAI Conference on Weblogs and Social
Media, 2011.
[124] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vec-
tors for word representation. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP), pages 1532–1543, 2014.
[125] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Rt to win! predicting message
propagation in twitter. In ICWSM, 2011.
[126] Ana-Maria Popescu, Marco Pennacchiotti, and Deepa Paranjpe. Extracting events
and event descriptions from twitter. In WWW (Companion Volume), pages 105–106,
2011.
[127] Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam
allocation for neural machine translation. arXiv preprint arXiv:1804.06609, 2018.
[128] Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey
Liu, and Oladimeji Farri. Neural paraphrase generation with stacked residual lstm
networks. arXiv preprint arXiv:1610.03098, 2016.
[129] Daniele Quercia, Harry Askham, and Jon Crowcroft. Tweetlda: supervised topic
classification and link prediction in twitter. In Proceedings of the 4th Annual ACM
Web Science Conference, pages 247–250. ACM, 2012.
[130] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. Language models are unsupervised multitask learners. OpenAI blog,
1(8):9, 2019.
[131] Adrian E Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski.
Using bayesian model averaging to calibrate forecast ensembles. Monthly Weather
Review, 133(5):1155–1174, 2005.
[132] Daniel Ramage, Susan T Dumais, and Daniel J Liebling. Characterizing microblogs
with topic models. ICWSM, 5(4):130–137, 2010.
[133] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. La-
beled lda: A supervised topic model for credit attribution in multi-labeled corpora.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing: Volume 1, pages 248–256. Association for Computational Linguistics,
2009.
[134] Adithya Rao, Nemanja Spasojevic, Zhisheng Li, and Trevor Dsouza. Klout score:
Measuring influence across multiple social networks. In 2015 IEEE International
Conference on Big Data (Big Data), pages 2282–2289. IEEE, 2015.
[135] Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. Classifying
latent user attributes in twitter. In Proceedings of the 2nd international workshop on
Search and mining user-generated contents, pages 37–44. ACM, 2010.
[136] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with
Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
http://is.muni.cz/publication/884893/en.
[137] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using
siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.
[138] Alan Ritter, Oren Etzioni, Sam Clark, et al. Open domain event extraction from twit-
ter. In Proceedings of the 18th ACM SIGKDD international conference on Knowl-
edge discovery and data mining, pages 1104–1112. ACM, 2012.
[139] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The
author-topic model for authors and documents. In Proceedings of the 20th con-
ference on Uncertainty in artificial intelligence, pages 487–494. AUAI Press, 2004.
[140] Hassan Saif, Yulan He, and Harith Alani. Semantic sentiment analysis of twitter. In
International semantic web conference, pages 508–524. Springer, 2012.
[141] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter
users: real-time event detection by social sensors. In Proceedings of the 19th inter-
national conference on World wide web, pages 851–860. ACM, 2010.
[142] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11):2673–2681, 1997.
[144] Priya Sidhaye and Jackie Chi Kit Cheung. Indicative tweet generation: An extractive
summarization problem? In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pages 138–147, 2015.
[145] Linfeng Song, Ante Wang, Jinsong Su, Yue Zhang, Kun Xu, Yubin Ge, and Dong
Yu. Structural information preserving for graph-to-text generation. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, pages
7987–7998, 2020.
[146] Yangqiu Song, Zhengdong Lu, Cane Wing-ki Leung, and Qiang Yang. Collaborative
boosting for activity classification in microblogs. In Proceedings of the 19th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
482–490. ACM, 2013.
[147] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15(1):1929–1958, 2014.
[148] Bongwon Suh, Lichan Hong, Peter Pirolli, and Ed H Chi. Want to be retweeted?
large scale analytics on factors impacting retweet in twitter network. In Social Com-
puting (SocialCom), 2010 IEEE Second International Conference on, pages 177–184.
IEEE, 2010.
[149] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning
sentiment-specific word embedding for twitter sentiment classification. In ACL (1),
pages 1555–1565, 2014.
[150] Grigorios Tsoumakas, Lefteris Angelis, and Ioannis Vlahavas. Selective fusion of
heterogeneous classifiers. Intelligent Data Analysis, 9(6):511–525, 2005.
[151] Andranik Tumasjan, Timm O Sprenger, Philipp G Sandner, and Isabell M Welpe.
Predicting elections with twitter: What 140 characters reveal about political senti-
ment. In Fourth international AAAI conference on weblogs and social media, 2010.
[152] Tracy L Tuten. Advertising 2.0: Social media marketing in a Web 2.0 world. ABC-
CLIO, 2008.
[153] Tracy L Tuten. Social media marketing. SAGE Publications Limited, 2020.
[154] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[155] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances
in neural information processing systems, pages 2692–2700, 2015.
[156] Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. Tweet2vec: Learning
tweet embeddings using character-level cnn-lstm encoder-decoder. In Proceedings
of the 39th International ACM SIGIR conference on Research and Development in
Information Retrieval, pages 1041–1044, 2016.
[157] Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sen-
timent and topic classification. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics: Short Papers-Volume 2, pages 90–94.
Association for Computational Linguistics, 2012.
[158] Su Wang, Rahul Gupta, Nancy Chang, and Jason Baldridge. A task in a suit and a
tie: paraphrase generation with semantic augmentation. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 33, pages 7176–7183, 2019.
[159] Tianlu Wang, Xuezhi Wang, Yao Qin, Ben Packer, Kang Li, Jilin Chen, Alex Beutel,
and Ed Chi. Cat-gen: Improving robustness in nlp models via controlled adversarial
text generation. arXiv preprint arXiv:2010.02338, 2020.
[160] Xiaofeng Wang, Matthew S Gerber, and Donald E Brown. Automatic crime predic-
tion using events extracted from twitter posts. In International conference on social
computing, behavioral-cultural modeling, and prediction, pages 231–238. Springer,
2012.
[161] Xin Wang, Yuanchao Liu, Chengjie Sun, Baoxun Wang, and Xiaolong Wang. Pre-
dicting polarities of tweets by composing word embeddings with long short-term
memory. In Proceedings of the 53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), volume 1, pages 1343–1353, 2015.
[162] Yequan Wang, Minlie Huang, Li Zhao, et al. Attention-based lstm for aspect-level
sentiment classification. In Proceedings of the 2016 conference on empirical meth-
ods in natural language processing, pages 606–615, 2016.
[163] Yu Wang, Eugene Agichtein, and Michele Benzi. Tm-lda: efficient online modeling
of latent topic transitions in social media. In Proceedings of the 18th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 123–131.
ACM, 2012.
[164] Yuan Wang and Yiyi Yang. Dialogic communication on social media: How organi-
zations use twitter to build dialogic relationships with their publics. Computers in
Human Behavior, 104:106183, 2020.
[165] Wouter Weerkamp and Maarten de Rijke. Activity prediction: A twitter-based
exploration. In SIGIR Workshop on Time-aware Information Access, 2012.
[166] Jianshu Weng and Bu-Sung Lee. Event detection in twitter. In Fifth international
AAAI conference on weblogs and social media, 2011.
[167] John Wieting and Kevin Gimpel. Paranmt-50m: Pushing the limits of paraphras-
tic sentence embeddings with millions of machine translations. arXiv preprint
arXiv:1711.05732, 2017.
[168] Alistair Willis, Ali Fisher, and Ilia Lvov. Mapping networks of influence: tracking
twitter conversations through time and space. Participations: Journal of Audience
& Reception Studies, 12(1):494–530, 2015.
[169] Sam Witteveen and Martin Andrews. Paraphrasing with large language models.
arXiv preprint arXiv:1911.09661, 2019.
[170] David H Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.
[171] Jonathan H Wright. Bayesian model averaging and exchange rate forecasts. Journal
of Econometrics, 146(2):329–341, 2008.
[172] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and
Philip S Yu. A comprehensive survey on graph neural networks. IEEE Transactions
on Neural Networks and Learning Systems, 2020.
[173] Heng Xu, Lih-Bin Oh, and Hock-Hai Teo. Perceived effectiveness of text vs. multi-
media location-based advertising messaging. International Journal of Mobile Com-
munications, 7(2):154–177, 2009.
[174] Wei Xu, Alan Ritter, Chris Callison-Burch, William B Dolan, and Yangfeng Ji. Ex-
tracting lexically divergent paraphrases from twitter. Transactions of the Association
for Computational Linguistics, 2:435–448, 2014.
[175] Zhiheng Xu, Long Ru, Liang Xiang, and Qing Yang. Discovering user inter-
est on twitter with a modified author-topic model. In Proceedings of the 2011
IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent
Agent Technology-Volume 01, pages 422–429. IEEE Computer Society, 2011.
[176] Zhiheng Xu and Qing Yang. Analyzing user retweet behavior on twitter. In Proceed-
ings of the 2012 International Conference on Advances in Social Networks Analysis
and Mining (ASONAM 2012), pages 46–50. IEEE Computer Society, 2012.
[177] Dingqi Yang, Daqing Zhang, Vincent W Zheng, and Zhiyong Yu. Modeling user
activity preference by leveraging user spatial temporal characteristics in lbsns. IEEE
Transactions on Systems, Man, and Cybernetics: Systems, 45(1):129–142, 2015.
[178] Shuang-Hong Yang, Alek Kolcz, Andy Schlaikjer, and Pankaj Gupta. Large-scale
high-precision topic modeling on twitter. In Proceedings of the 20th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 1907–
1916, 2014.
[179] Zi Yang, Jingyi Guo, Keke Cai, Jie Tang, Juanzi Li, Li Zhang, and Zhong Su. Un-
derstanding retweeting behaviors in social networks. In Proceedings of the 19th
ACM international conference on Information and knowledge management, pages
1633–1636. ACM, 2010.
[180] Jihang Ye, Zhe Zhu, and Hong Cheng. What’s your next move: User activity pre-
diction in location-based social networks. In Proceedings of the 2013 SIAM Inter-
national Conference on Data Mining, pages 171–179. SIAM, 2013.
[181] Shaozhi Ye and S Felix Wu. Measuring message propagation and social influence
on Twitter.com. Springer, 2010.
[182] An-Zi Yen, Hen-Hsen Huang, and Hsin-Hsi Chen. Detecting personal life events
from twitter by multi-task lstm. In Companion Proceedings of The Web Conference
2018, pages 21–22. International World Wide Web Conferences Steering Committee,
2018.
[183] Zibin Yin, Ya Zhang, Weiyuan Chen, and Richard Zong. Discovering patterns of
advertisement propagation in sina-microblog. In Proceedings of the Sixth Inter-
national Workshop on Data Mining for Online Advertising and Internet Economy,
page 1. ACM, 2012.
[184] Tauhid R Zaman, Ralf Herbrich, Jurgen Van Gael, and David Stern. Predicting
information spreading in twitter. In Workshop on Computational Social Science and
the Wisdom of Crowds, NIPS, volume 104, pages 17599–601. Citeseer, 2010.
[185] Kuo-Hao Zeng, Mohammad Shoeybi, and Ming-Yu Liu. Style example-
guided text generation using generative adversarial transformers. arXiv preprint
arXiv:2003.00674, 2020.
[186] Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan.
Pointer: Constrained text generation via insertion-based generative pre-training.
arXiv preprint arXiv:2005.00558, 2020.
[187] Shiqi Zhao, Xiang Lan, Ting Liu, and Sheng Li. Application-driven statistical para-
phrase generation. In Proceedings of the Joint Conference of the 47th Annual Meet-
ing of the ACL and the 4th International Joint Conference on Natural Language
Processing of the AFNLP, pages 834–842, 2009.
[188] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan,
and Xiaoming Li. Comparing twitter and traditional media using topic models. In
Advances in Information Retrieval, pages 338–349. Springer, 2011.
[189] Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. A c-lstm neural net-
work for text classification. arXiv preprint arXiv:1511.08630, 2015.
[190] Qianrong Zhou, Liyun Wen, Xiaojie Wang, Long Ma, and Yue Wang. A hierarchical
lstm model for joint tasks. In China National Conference on Chinese Computational
Linguistics, pages 324–335. Springer, 2016.