
Improved Utilization of Advertising through Social Media

Dissertation

Presented in Partial Fulfillment of the Requirements for the Degree Doctor
of Philosophy in the Graduate School of The Ohio State University

By

Renhao Cui, B.S.

Graduate Program in Computer Science and Engineering

The Ohio State University

2020

Dissertation Committee:

Rajiv Ramnath, Advisor
Gagan Agrawal, Advisor
Eric Fosler-Lussier
Ping Zhang
© Copyright by

Renhao Cui

2020
Abstract

In the past decade, social media has become the dominant platform for advertisement. The broad accessibility, rich message types, large audience size, and accurate customer targeting allow for efficient propagation of commercial posts. In addition to helping companies disseminate their advertisements more effectively, social media also provides the opportunity for rapid receipt of customer feedback. However, many companies still rely heavily on human effort to utilize social media for advertising.

In this work, we demonstrate multiple methods that help companies utilize social media for a better advertising and marketing experience by drawing from and extending machine learning and data mining techniques.

We apply ensemble models to classify user feedback based on a mixed set of label requirements. Then we build a set of linguistic features to predict the potential of commercial posts to draw audience attention. To better identify target customers, we utilize a hybrid Long Short-Term Memory (LSTM) network to recognize the activities of users when posting tweets. As the last step, we propose a constrained generation framework that rephrases commercial posts into versions that are more textually diverse while preserving their key information. Our work covers multiple areas of advertising on social media, from preparation to generation, and from feedback analysis to performance prediction.

This is dedicated to my beloved mom.

You will always be remembered.

Acknowledgments

I would like to express my sincerest appreciation to my advisors, Prof. Rajiv Ramnath and Prof. Gagan Agrawal, for their insightful, encouraging, and constant help in both my research and my life. I also wish to thank my committee members, Prof. Eric Fosler-Lussier and Prof. Ping Zhang, for their time, guidance, and goodwill.

I would like to express my gratitude to Astute Global for their support, assistance, and recognition of my research.

Most importantly, I am eternally grateful to my family for their unconditional love. They are the ones who keep me going forward.

Although I have had a tough and unpredictable time in the past few years, I want to thank all the people I have ever talked and listened to. You made my days brighter and better.

Finally, I owe it all to you, mom.

Vita

2011 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B.S. Computer Science, University of Missouri

2012 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Analyst Developer, Dish Network

2013-present . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Graduate Research Associate, Computer Science and Engineering, The Ohio State University

Publications

Cui, Renhao, Gagan Agrawal, and Rajiv Ramnath. “Tweets can tell: activity recognition using hybrid gated recurrent neural networks.” Social Network Analysis and Mining 10.1 (2020): 1-15.

Cui, Renhao, Gagan Agrawal, and Rajiv Ramnath. “Tweets can tell: Activity recognition using hybrid long short-term memory model.” Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, 2019.

Cui, Renhao, Gagan Agrawal, and Rajiv Ramnath. “Towards Successful Social Media Advertising: Predicting the Influence of Commercial Tweets.” arXiv preprint arXiv:1910.12446 (2019).

Das, Manirupa, and Renhao Cui. “Comparison of Quality Indicators in User-generated Content Using Social Media and Scholarly Text.” arXiv preprint arXiv:1910.11399 (2019).

Cui, Renhao, et al. “Ensemble of heterogeneous classifiers for improving automated tweet classification.” 2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW). IEEE, 2016.

Das, Manirupa, Renhao Cui, David R. Campbell, Gagan Agrawal, and Rajiv Ramnath. “Towards methods for systematic research on big data.” 2015 IEEE International Conference on Big Data (Big Data). IEEE, 2015.

Fields of Study

Major Field: Computer Science and Engineering

Table of Contents

Page

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii

Dedication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

Vita . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . v

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 General Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1


1.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 5

2. Mixed Domain Tweet Classification . . . . . . . . . . . . . . . . . . . . . . . 7

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Domain Conversion of Probabilistic Output . . . . . . . . . . . . . . . . 11
2.3.1 Mapping labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Mapping probabilities . . . . . . . . . . . . . . . . . . . . . . . 13
2.4 Ensemble of Probabilistic Models . . . . . . . . . . . . . . . . . . . . . 14
2.4.1 Dynamic Weighting . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Stacking+ Ensemble . . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2 Baselines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

2.5.3 Experiment design . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.5.4 Experiment results and analysis . . . . . . . . . . . . . . . . . . 23
2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3. Commercial Tweet Influence Prediction . . . . . . . . . . . . . . . . . . . . . 30

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.3 Influence Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.3.1 Data labeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.3.2 Classification model . . . . . . . . . . . . . . . . . . . . . . . . 38
3.3.3 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
3.4.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.2 Experiment design . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.3 Group label analysis . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4.4 Experiment results and analysis . . . . . . . . . . . . . . . . . . 49
3.5 Demonstrating Use of the Framework: A Case Study . . . . . . . . . . . 53
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

4. Offline Activity Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.3 Working with Contextual Features using LSTM . . . . . . . . . . . . . . 61
4.3.1 Activity labeling . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.3.2 Contextual learning with LSTM . . . . . . . . . . . . . . . . . . 62
4.4 Our Proposed Hybrid-LSTM Model . . . . . . . . . . . . . . . . . . . . 67
4.4.1 Including historical tweets . . . . . . . . . . . . . . . . . . . . . 67
4.4.2 Including direct contextual features . . . . . . . . . . . . . . . . 69
4.4.3 Hybrid-LSTM . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4.4 Illustrative examples . . . . . . . . . . . . . . . . . . . . . . . . 72
4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.5.2 Experiment design . . . . . . . . . . . . . . . . . . . . . . . . . 75
4.5.3 Experiment results and analysis . . . . . . . . . . . . . . . . . . 76
4.6 Demonstrating Use of the Approach: A Case Study . . . . . . . . . . . . 78
4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

5. Constrained Paraphrase Generation for Commercial Tweets . . . . . . . . . . . 82

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
5.3 Constraint-Embedded Language Modeling (CELM) for
Paraphrase Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
5.3.1 Constraint identification . . . . . . . . . . . . . . . . . . . . . . 87
5.3.2 Constraint embedding . . . . . . . . . . . . . . . . . . . . . . . 88
5.3.3 Causal language modeling . . . . . . . . . . . . . . . . . . . . . 89
5.3.4 Decoding and generation . . . . . . . . . . . . . . . . . . . . . 90
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4.2 Experiment design . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.4.3 Experiment results and analysis . . . . . . . . . . . . . . . . . . 95
5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

6. Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100


6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

A. Implementation and Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

List of Tables

Table Page

2.1 Motivating ensemble method: output probabilities of different topics . . . . 14

2.2 Tweet dataset - brands and number of labeled tweets . . . . . . . . . . . . 20

3.1 Style features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

3.2 Collected brands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3 simGroup emb group samples . . . . . . . . . . . . . . . . . . . . . . . . 48

3.4 Model performance and ablation analysis . . . . . . . . . . . . . . . . . . 50

3.5 Prediction samples from different models where the true labels are positive 51

3.6 Commercial tweets about a raffle event for winning console controllers . . . 53

4.1 Sample tweets with reported locations . . . . . . . . . . . . . . . . . . . . 57

4.2 Sample Tweets for Model Analysis . . . . . . . . . . . . . . . . . . . . . . 69

4.3 Sample tweets and model predictions . . . . . . . . . . . . . . . . . . . . . 72

4.4 Location - activity label mapping . . . . . . . . . . . . . . . . . . . . . . . 75

4.5 Comparison of model performance . . . . . . . . . . . . . . . . . . . . . . 77

5.1 Examples for data compatibility between PMT and CommTweet . . . . . . 93

5.2 Model performance on SINGLE set . . . . . . . . . . . . . . . . . . . . . 96

5.3 Model performance on MULTI set . . . . . . . . . . . . . . . . . . . . . . 96

5.4 Sample paraphrase generations from the models . . . . . . . . . . . . . . . 98

A.1 Companies that are included in the CommTweet dataset . . . . . . . . . . . 105

A.2 Activity labels in ActivityTweet dataset . . . . . . . . . . . . . . . . . . . 106

A.3 Links to the related resources . . . . . . . . . . . . . . . . . . . . . . . . . 106

List of Figures

Figure Page

2.1 Label mapping across different domains . . . . . . . . . . . . . . . . . . . 12

2.2 Process of Stacking+ ensemble model . . . . . . . . . . . . . . . . . . . . 19

2.3 Performance comparison between three individual models and four ensem-
ble models (five datasets) . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.4 Comparison between two models: using the probability input and adding
tweet vector to the input (Stacking Classifier) . . . . . . . . . . . . . . . . 27

3.1 Favorite-to-retweet ratio . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.2 Model performance given different labeling groups . . . . . . . . . . . . . 47

4.1 LSTM for text classification . . . . . . . . . . . . . . . . . . . . . . . . . 63

4.2 Joint-LSTM for text classification . . . . . . . . . . . . . . . . . . . . . . 64

4.3 Contextual-LSTM for text classification . . . . . . . . . . . . . . . . . . . 65

4.4 Hierarchical-LSTM for text classification . . . . . . . . . . . . . . . . . . 66

4.5 Comparison of ability to incorporate contextual features between different models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.6 Hybrid-LSTM for text classification . . . . . . . . . . . . . . . . . . . . . 71

4.7 Summary of the activity distributions for followers of popular accounts . . 80

5.1 Commercial tweets that are posted for the same product . . . . . . . . . . . 83

5.2 Overview of the constrained generation process . . . . . . . . . . . . . . . 86

5.3 Dependency parsing of a commercial tweet . . . . . . . . . . . . . . . . . 87

5.4 Embed the constraint directly to text sequences . . . . . . . . . . . . . . . 88

5.5 Examples of text content with single and multiple constraints . . . . . . . . 91

Chapter 1: Introduction

1.1 General Introduction

Social media has grown dramatically in the past decade, in terms of both its user base and its influence [68]. The various types of resources, regions, platforms, languages, and domains enrich the ability of social media to generate abundant information about innumerable topics. The user base spans an array of people, from everyday users to official organizations, from famous artists to the heads of countries [35]. The ease of use and the wide range of users have made it possible for nearly everyone to obtain useful information. The speed with which information is delivered further enlarges the impact on all platforms. However, one thing has not changed much: social media platforms still derive their major profits and earnings from advertisement.

Advertising has a long history and has proved to be necessary and inevitable for all types of businesses. A recent survey¹ shows that 73% of marketers think social media marketing is effective and 89% of marketers believe social media is important to their marketing strategy. With the growing population and expanding information networks, advertising must become more effective and efficient. Facing these challenges, social media platforms have become the best place to post commercial information and advertisements [152].

¹ https://buffer.com/state-of-social-2019

Various forms of advertisements are used on social media platforms, including text, audio, images, and videos. Rich messages such as audio and video can provide more comprehensive, useful, and interesting information to social media users [110, 173]. Rich messages also have a higher information bandwidth, that is, they express more ideas more fully than simple text messages. Nevertheless, rich messages have disadvantages. They consume more resources, and not all devices have the ability to compose and send them. Readers look mostly at the text content for detailed and critical information. Finally, it is easier to generate an effective advertisement using text than using audio or video. For all these reasons, text messages are still the most widely used means of delivering advertisements. Given the above, this dissertation focuses on text as the only advertising medium. We also focus on Twitter data in all the projects described. We believe that the way text data are used in most social media platforms is similar, therefore our research and our models can be extended easily to other platforms.

To form a successful social media advertising system, several capabilities are required:

1. It can locate the appropriate target audiences.

2. It can generate good and attractive content.

3. It can predict the performance of the advertisement.

4. It can efficiently analyze the feedback on the advertisements.

The core of an advertisement is the information about the product or promotion that needs to be disseminated. Given that the primary requirement of any advertisement system is to help improve the effectiveness of social media advertising for the product being promoted, this dissertation examines several parts of the social media advertising process, in order to fill the gaps in improved utilization and impact of advertising through social media.

1.2 Related Work

To date, most existing work on tweets has focused on the analysis or summarization of single tweets, tweet streams, or authors. Sentiment analysis [27, 140, 111, 45] is the most popular task. Such analysis is relatively accurate and efficient because of the large user base and users' intention of expressing personal opinions. As a standard classification task, the work on tweet sentiment analysis helps establish the basic knowledge for many subsequent tasks. Exploring the political views of people [26, 25, 9, 151] is another practical use of tweet analysis. In addition, tweet analysis has become an important source of feedback for companies to monitor the opinions of their users. In addition to Twitter's own analytics system,² a wide range of tools such as Hootsuite,³ Klear,⁴ and SocialBakers⁵ can help analyze and understand information posted on Twitter.

Analysis of the large population of Twitter users, also known as user profiling, is another area that can provide insights and information. User interest [175, 10, 64], as the most popular attribute, has been explored widely based on the content of tweets as well as relationships with other users. Some work [1, 93] has focused on modeling Twitter users for different recommendation tasks. Other work [123, 61] has been built to profile Twitter users for a variety of purposes. Analysis of users requires analysis of a large volume of tweets as the base, but it also includes other information such as the metadata of the users and the relationships across users. In our work, however, we focus on user-independent features, so that the models can be applied without knowing any of the network or personal information about the user.

² analytics.twitter.com
³ hootsuite.com
⁴ klear.com
⁵ socialbakers.com

Information extraction is another useful application utilizing tweets. The real-time aspect and large user base of Twitter allow for accurate and efficient extraction of information. Event detection and extraction [138, 126, 84, 166] aim to extract key information regarding events, such as the time, place, and description. In addition to regular social events, the detection of earthquakes [36, 141], crimes [160], and festivals [80] is also practical in real life. Events and news serve as common pivots to locate potential paraphrases on Twitter. However, it still requires manual work to create a high-quality paraphrase dataset for tweets. Therefore, we use a large sentential paraphrase dataset constructed from a general domain and transfer the knowledge to a tweet paraphrase task.

Unlike analysis and information extraction through Twitter, work on generating tweets lags behind because of the lack of appropriate models, data, and use cases. News tweets have been the first to utilize automatic tweet-generation models. This requires the support of traditional news articles, and the tweet is generated as a summarization of the articles [92]. Another summarization-based model generates indicative tweets to introduce a link in the tweet content [144]. In summary, the generation of tweets usually requires a strong supporting resource or a clear purpose for the generation. On the other hand, paraphrase generation for tweets relies on existing tweets. Given some properties of tweets (length, hashtags, etc.), a constrained/controllable paraphrase generation model is desired.

1.3 Summary of Contributions

In order to improve the utilization of advertising through social media, our work focuses on multiple aspects, either fulfilling certain tasks automatically or assisting human agents in better decision-making. To this end, we make the following efforts.

We introduce a mixed-classification model to label tweets given a mixed-domain requirement. The designated labels are drawn from different domains, whereas typical classification models work well only on labels that belong to the same domain. We use an ensemble framework to combine the probabilistic outputs of several classification models to improve the performance of the mixed-classification task. In addition, we propose a way to map output probability distributions from one domain to our target domain.

To predict the potential influence of a commercial tweet, we construct a set of style features and generate the prediction using standard classifiers. These features do not depend on the inherent meaning of the tweet, therefore the model generalizes to most commercial tweets. To support the study, we collect a dataset of commercial tweets that contains original tweets posted by the official accounts of popular companies in different fields. We conduct an ablation analysis on the features and reveal their importance for a successful commercial post. Relying on the use of linguistic features, we demonstrate that a commercial post can be modified accordingly to draw more attention from its audience.

We introduce a system to profile users based on their offline activities. Based on the reported location, we create a tweet dataset labeled with the related offline user activities for the experiment. We survey several existing methods that include contextual information in LSTM-based models, and propose a hybrid-LSTM model that can take different types of contextual features to improve the performance of recognizing user activities. Using this model, we validate the relation between the characteristics of a company and the major offline activities of its followers on Twitter.

In order to generate commercial tweets automatically, we build a constrained paraphrase generation framework for commercial tweets. The constraints cover key elements of the content that should be kept in the generated paraphrase. We utilize a sentential paraphrase dataset constructed from a general domain and apply the trained model to a collection of commercial tweets. The hard constraints are identified and embedded directly into the content data. Language models are used to learn from the constraint-embedded data, and the framework provides solid improvement using the constraint information and transferred knowledge.

Our proposed models and datasets demonstrate a meaningful attempt at systems that can improve the effectiveness of social media marketing and advertising. The models and datasets cover the different steps of successful social media advertising, therefore our contributions are complete and comprehensive.

Chapter 2: Mixed Domain Tweet Classification

2.1 Introduction

The initial step we consider in this chapter is better understanding user feedback on social media. Twitter has become an invaluable resource for businesses because it can provide high-volume opinion streams in real time from real users. Extremely useful information can be derived from tweets, such as complaints or compliments about a service, or interest in purchasing a product. For example, a tweet, “I just had one of your grass-fed ribeyes. I have only one thing to say: blah,” expresses disappointment about the steak, whereas, “When does the Wild Madagascar Vanilla come out? Can't wait,” shows interest in a product. Whether complimentary or critical, Twitter feedback is now seen as an effective replacement for marketing surveys, with the added benefit that this feedback is received rapidly (almost in real time) from active customers.

However, Twitter streams are uncategorized, noisy, and, practically speaking, useless in their raw form. In order to be useful, tweet data should be cleansed and categorized. Businesses seek to have the tweets classified with a set of specific, predetermined topics. Manual processing is not possible because of the complexity of the classification task and the sheer volume of data, hence automated processing is essential.

For the task of assigning labels to a text, many automated classification and topic modeling systems have been proposed and applied. Latent Dirichlet Allocation (LDA) [13], along with its variations Labeled-LDA (LLDA) [133] and Linked-LDA [12], are popular topic-modeling approaches that generate a probability distribution over topics for each document. Classical classifiers such as Naive Bayes [97] and Maximum Entropy [114] have also been applied, and these classifiers generate a probabilistic output over a list of labels. However, these probabilistic outputs are not directly usable in real business cases; what is needed is a discriminating output that simply presents the most appropriate label – not necessarily the label with the highest probability. Moreover, another challenge is that the classification categories or purposes are not defined over the same domain (e.g., when the need is to classify messages that are about new promotions as well as customer feedback on products). We consider this a mixed-classification problem. In most cases, a single classifier does not work well for the mixed-classification problem, because most classifiers are designed and trained to work in a specific setting or for a specific purpose. For example, features and models selected to classify product feedback often do not work well for the classification of new promotions. Instead, building two separate models could potentially improve the overall performance.

This chapter addresses the problems mentioned above. First, we build a mapping method that converts the probabilistic output of a third-party application programming interface (API), with its own predefined (but hidden) universal label corpus, into the domain of the dataset being investigated. Along with mapping the labels, the method also generates a new probability distribution that integrates the probability distribution of the original labels with the confidence of the mapping function. Second, to improve the performance on a mixed-classification problem, we combine several individual models using ensemble methods that merge the probabilistic outputs from multiple heterogeneous classifiers. We do this by considering the probabilities associated with the outputs as the representation of the document or the reliability of the classification from each individual model. The stacking ensemble model used can further benefit from the addition of tweet vectors in its learning step.

We have evaluated our methods using a real-world (industry-supplied) dataset. The specific approaches we combine include the LLDA model, the Naive Bayes classifier, and the third-party text classification API. The proposed Stacking+ Ensemble method improves the accuracy over the best individual model in the ensemble as well as over the baseline ensemble models. As an average across all datasets, we noted a 29.1% reduction in the number of inaccurate predictions compared to using the best baseline ensemble method for combining the three methods mentioned above.

2.2 Related Work

We first compare our work with other efforts on classifying social media postings and on ensemble methods. Many efforts have attempted to apply traditional models to social media data. However, difficulties arise in applying these models directly to this special domain, and the results largely have not been satisfactory [188]. Therefore, many research efforts have focused on improving the performance of traditional models on the new domains.

Latent Dirichlet Allocation (LDA) is a well-established model built for structured data. To address the limited length of any tweet post, one popular solution is to aggregate tweets together into a macro-document [104], whereas the Author-Topic model merges author metadata into the model [139]. Moreover, a Temporal-LDA (TM-LDA) model brings temporal influence to tracking the transition of topics in social media [163]. Multiple variations also exist that incorporate supervision into the LDA model, such as LLDA and supervised LDA (sLDA) [100]. LLDA allows multiple labels for each document through a one-to-one mapping between labels and the latent topics, whereas sLDA gives only a single label to each document based on the mixture of the latent topics. LLDA has been used as a supervised classifier; for example, Ramage et al. [132] used it to characterize microblogs.

Bayesian classifiers have been explored for many years, and their simplicity and efficiency make them popular in many classification tasks. A multivariate Bernoulli model using binary word features [76, 69] and a multinomial model with unigram word-count features [82, 108, 101] are early approaches to classifying text documents. For the case of unstructured data, many works utilize the Naive Bayes model for sentiment classification [45, 119], topic classification [78], and spam detection [102].

Ensemble methods have long been studied as a mechanism to improve the performance of machine learning algorithms. Dietterich [31] gives a detailed introduction to ensemble methods, covering manipulation of the output hypotheses, the training data, the input features, and the output targets, and adding randomness to the problem. Randomness is the basis for certain ensemble models such as RandomForest [89], whereas Adaboost [40] constructs ensembles by manipulating the training data. However, these methods do not benefit from combining the capabilities of different individual models, which can be critical when facing the problem of mixed classification. Thus, combining heterogeneous classifiers can be beneficial, and this has been explored in the past, for example, by Tsoumakas et al. [150] and Klein et al. [67]. In addition, Product of Experts (PoE) was developed to combine multiple probabilistic outputs by multiplication, aiming to produce a sharper distribution than the individual ones [51, 52]. Raftery et al. [131] combine different probabilistic models by extending the Bayesian Model Averaging (BMA) method and learning the weights from the data, with the assumption that the outputs of the involved models follow a predefined probability distribution. This approach has been applied to tasks like the prediction of political events [109], stock prices [11], weather [44], and exchange rates [171]. Moreover, Kolter et al. use an ensemble method, Dynamic Weighted Majority, to track concept drift [71, 70].

2.3 Domain Conversion of Probabilistic Output

The ensemble of classifiers involves using a third-party API and mapping its output to a certain domain. Specifically, we have chosen the taxonomy function from the Alchemy API⁶ suggested by Quercia et al. [129] to categorize documents into different topic classes with associated probabilities. The Alchemy taxonomy function is built on a deep neural network model with its own predefined universal topic corpus. However, the predetermined nature of its domain introduces a challenge in using it for our goal, which is to classify tweets over a set of domain-specific topics. To address this mismatch, we have developed a mapping methodology that maps the probabilistic output of Alchemy from the universal domain to the task-specific domain.

⁶ http://www.alchemyapi.io/

2.3.1 Mapping labels

Our goal is to find the pattern of relationships between domains based on the idea of co-occurrence of different labels given the same document. The overall process of the domain conversion is shown in Figure 2.1. Assume that the API gives a label corpus of size n containing labels x_1, x_2, ..., x_n in domain X, and the problem-specific domain Y has a corpus of m labels y_1, y_2, ..., y_m. For a document d from the training document collection D, assume that the API outputs a list of k labels x_{d,1}, x_{d,2}, ..., x_{d,k} (in domain X) with the associated normalized probabilities p^x_{d,1}, p^x_{d,2}, ..., p^x_{d,k}. Let the actual h labels (assigned by a domain expert for training purposes) for the document be y_{d,1}, y_{d,2}, ..., y_{d,h} (in domain Y).

Figure 2.1: Label mapping across different domains

Using the above information, we build a mapping relation between all possible pairs of labels from the two domains. A mapping confidence score is given to each mapping relation from label x_i in domain X to label y_j in domain Y as follows:

confidence(x_i, y_j) = C(x_i, y_j) / C(x_i)    (2.1)

Here, C(x_i) is defined as the partial document count for label x_i, and C(x_i, y_j) is the partial document count where x_i and y_j are assigned as labels for the same document in their domains:

C(x_i) = \sum_{d \in D} C(x_{d,i})    (2.2)

C(x_i, y_j) = \sum_{d \in D} \min(C(x_{d,i}), C(y_{d,j}))    (2.3)
where the partial count C is calculated as:

C(x_{d,i}) = p^x_{d,i}    for domain X    (2.4)

C(y_{d,j}) = 1/h    for domain Y    (2.5)

where h is the number of actual labels associated with the document d in domain Y.

To clarify, in the above calculations, we use the normalized probability associated with each assigned label for each document as the partial count. However, for domain Y, because the labels do not have a probability distribution associated with them, we assume they are equally important, so we give a uniform count (1/h) to all the assigned labels.

After building all possible relations seen in the training data from label x_i to label y_j, for each label x_i, the mapping function is generated as

mapping(x_i) = argmax_{y_j} confidence(x_i, y_j)    (2.6)

where the mapping for x_i is chosen with the highest mapping confidence score between x_i and the target label y_j. For an unknown label x_i in domain X during inference, we use the candidate label y_c that has the largest total partial count as the mapped label in domain Y. Thus, this label mapping process generates a many-to-one mapping function from domain X to domain Y.

2.3.2 Mapping probabilities

Let score(x_i, y_j) denote the highest confidence score associated with the mapping from x_i to y_j. We next generate a new probability distribution in domain Y after the mapping. For a single document, we have mapped k labels x_1, x_2, ..., x_k from domain X to k labels y_1, y_2, ..., y_k in domain Y with the learned mapping function.

The mapped probability p^y_{d,i} for the mapped label y_i from document d is calculated as:

p^y_{d,i} = (1/Z) \cdot (p^x_{d,i} \times score(x_i, y_i))    (2.7)

Z = \sum_{i=1}^{k} (p^x_{d,i} \times score(x_i, y_i))    (2.8)

The new mapped probability combines the original distribution in domain X with the score of the applied mapping function as the confidence of the mapping. In this way, the mapping function transfers the probabilistic output from one domain to another with a new probability distribution.

Considering the rare cases where an unknown label x_i is encountered, we assign a very small new mapped probability to the candidate label y_c in domain Y before normalization. However, if no mapping function can be found for any of the labels of a document, the small probability is assigned to the candidate without normalization, which indicates the unreliability of this mapped label. For the cases in which the mapping function converts different labels from domain X to the same label in domain Y, the new probability score for that label y_i is the summation of the scores p^y_{d,i} over the multiple mapped labels.
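Continuing the sketch above, the learned mapping and its confidence scores convert a single API output into a domain-Y distribution following Equations 2.7 and 2.8; the name of the fallback candidate label and the small probability constant are assumptions of the sketch.

def map_probabilities(api_labels, mapping, score, candidate_label, eps=1e-6):
    """Convert an API output, a list of (x_label, p_x), into a domain-Y distribution."""
    mapped, any_mapped = {}, False
    for x_label, p_x in api_labels:
        if x_label in mapping:
            y_label = mapping[x_label]
            # Eq. 2.7: combine the original probability with the mapping confidence;
            # different X labels mapped onto the same Y label are summed
            mapped[y_label] = mapped.get(y_label, 0.0) + p_x * score[x_label]
            any_mapped = True
        else:
            # unknown X label: assign a very small probability to the candidate label
            mapped[candidate_label] = mapped.get(candidate_label, 0.0) + eps
    if any_mapped:                           # normalize by Z (Eq. 2.8) only if
        z = sum(mapped.values())             # at least one label was actually mapped
        mapped = {y: p / z for y, p in mapped.items()}
    return mapped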

2.4 Ensemble of Probabilistic Models

             Hardware   Stationery   XACTO    FoamDisplay   Craft
LLDA         0.3056     0.2906       0.2715   0.1315        0.0009
NaiveBayes   0.1216     0.0107       0.2954   0.1375        0.4348

Table 2.1: Motivating ensemble method: output probabilities of different topics

We first describe an initial study in which we apply the LLDA and Naive Bayes classifiers to the dataset. Table 2.1 shows an example of the results obtained – the dataset contains five labels, and the numbers shown are the probability distribution outputs of each model given the same tweet. If we use these models individually, the natural deterministic outputs will be Hardware for LLDA and Craft for NaiveBayes. However, the actual label is XACTO, and it can be derived from the highest sum of probabilities across the two models. In general, it is intuitive that different models could be suitable for different task domains. Therefore, we can improve the classification process by combining multiple models, which forms the idea of ensemble models.

We propose a set of practical approaches to create ensemble models. Our starting point is the ideal Bayesian voting model, which has been shown to be theoretically optimal [31]. Specifically, it computes the label y as

y = argmax_{c_j \in C} \sum_{m_i \in M} P(c_j | m_i) P(m_i | T)    (2.9)

  = argmax_{c_j \in C} \sum_{m_i \in M} P(c_j | m_i) P(T | m_i) P(m_i)    (2.10)

where C is the collection of all classes, M is the collection of all involved models, T is the training data, and the different P terms represent conditional probabilities.

The problem with the above voting model is that it is not feasible in practice, because the weights P(T | m_i) P(m_i) associated with each model output for a certain class are impossible to determine. More precisely, P(T | m_i), the local performance, shows how an individual model fits the training data, whereas the prior P(m_i), the global performance, represents the general fit of the individual model. P(T | m_i) can be estimated by the performance of the model on a certain dataset; however, P(m_i) cannot be calculated or estimated.

To address this problem, we developed a weighting method that learns the weights from the provided data. In addition, we propose another method, Stacking+ Ensemble, which builds on the well-known stacking ensemble method [170] but improves it further by including the document vector for better discriminative ability.

2.4.1 Dynamic Weighting

Instead of using model-specific weights like BMA [65, 54], we borrow the idea of dynamic weights from Kolter et al. [71, 70], which learns the weights directly from the data. However, Kolter's method has certain limitations for this task: it jointly trains the weights and the involved models, and it also drops any model that does not contribute to the correct final output while adding a new model to the ensemble to optimize the performance. In comparison, the probability distributions of the individual models are fixed in our case, and our approach relies on a fixed set of individual models to create the ensemble system.

Therefore, our approach is as follows. We assume that a real-valued weight w_i exists for each model m_i, which works as a multiplier on the probabilities P(c_j | m_i) for c_j ∈ C, and the final output class is the one with the highest weighted sum. The overall process for Dynamic Weighting is listed as Algorithm 1. Unlike Kolter's method, which triggers the update based on the local prediction, the weight update happens only when the ensemble model makes an incorrect prediction on a training case. The update is then applied to each weight whose corresponding model gives an incorrect local prediction. We treat the update as a multiplication with a single constant learning rate – note that a subtraction would not ensure that the weights are always positive. Because the update only decreases some weights, we renormalize all the weights to promote the other weights and ensure consistency during training. Finally, the output weights are the average weights seen over the entire training process, which is an efficient way to avoid over-fitting.
Algorithm 1 Dynamic weighting algorithm
    m_i, w_i: individual model i and its weight
    y_d: true class for case d
    c_j: class j
    u: update count
    s_i: running sum for weight w_i
    Λ^d: ensemble prediction for case d
    λ^d_i: local prediction of model i for case d
    β: learning rate, 0 ≤ β < 1
    n, e: number of models, epochs

    initialize w_i ← 1/n, i = 1, ..., n
    for count = 1, ..., e do
        for all cases d in the training set do
            Λ^d ← argmax_{c_j ∈ C} Σ_{i=1}^{n} P(c_j | m_i) w_i
            if Λ^d ≠ y_d then
                for i = 1, ..., n do
                    λ^d_i ← argmax_{c_j ∈ C} P(c_j | m_i)
                    if λ^d_i ≠ y_d then
                        w_i ← w_i β
                    end if
                end for
                W ← renormalize(W)
                for i = 1, ..., n do
                    s_i ← s_i + w_i
                    u ← u + 1
                end for
            end if
        end for
    end for
    return weights w_i = s_i / u, i = 1, ..., n
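A direct Python transcription of Algorithm 1 is given below, assuming the outputs of the fixed individual models have been precomputed as probability vectors; this is a sketch of the procedure, not the exact experimental code.

import numpy as np

def dynamic_weighting(probs, y_true, beta=0.9, epochs=30):
    """Learn ensemble weights following Algorithm 1.

    probs: array of shape (n_cases, n_models, n_classes) holding each model's
           probability distribution for each training case.
    y_true: array of shape (n_cases,) with the true class indices.
    """
    n_cases, n_models, _ = probs.shape
    w = np.full(n_models, 1.0 / n_models)      # initialize w_i = 1/n
    s = np.zeros(n_models)                     # running sums s_i
    u = 0                                      # update count
    for _ in range(epochs):
        for d in range(n_cases):
            # ensemble prediction: argmax over classes of the weighted sum
            if np.argmax(probs[d].T @ w) != y_true[d]:
                local_preds = np.argmax(probs[d], axis=1)
                w[local_preds != y_true[d]] *= beta   # demote wrong local predictions
                w /= w.sum()                          # renormalize the weights
                s += w
                u += n_models    # u is incremented once per model in Algorithm 1
    return s / u if u else w                   # average weights over all updates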

Because of the limited size of the training data, we use a single scalar weight per model instead of a weight vector over classes, which demotes or promotes all the classes at the same time. Our tests also show that adding a prior P(T | m_i) to the corresponding model does not improve the overall performance.

2.4.2 Stacking+ Ensemble

Similar to most probability-based ensemble methods, the Dynamic Weighting model treats the output distributions as the likelihood of the labels provided by each individual model. However, results from our initial evaluation show that a simple arithmetical combination of the probability distributions cannot always generate the correct output. The arithmetical combination is sensitive to the type of probability distribution produced by the individual models, and one irregular distribution can have a huge impact on the overall system. In other words, balanced distributions and sparse distributions require very different ensemble methods.

To make the probability-based ensemble model more general and adaptive to different kinds of distributions, we introduce a stacking classifier into the ensemble process. This classifier is used to combine the output from the individual models and generate the final decision. Unlike the previous ensemble methods that treat the probability from the individual models as an indicator of reliability, the Stacking Ensemble method takes it as a representation of the document provided by the individual model and maps it to the output label.

Figure 2.2: Process of Stacking+ ensemble model

Figure 2.2 shows the overall design of the proposed Stacking+ Ensemble model, which builds on the idea of stacked generalization [170]. There are two layers of classifiers in the model. In the first layer, the classifiers are the involved individual ones that generate different kinds of probability distributions. In the second layer, the classifier takes the concatenation of the probability distributions from the first layer as the input and generates the final output. We first train the individual classifiers independently. Then the stacking classifier is trained using the output distributions from the first layer and the correct label for each case. This ensures the generality of the ensemble method and that all the classifiers are trained independently.

In order to further improve the differentiation and mapping ability of the traditional stacking model, we add an n-gram vector of the document to the input of the stacking classifier. This forms the Stacking+ Ensemble model, in which the stacking classifier takes the probability distributions appended with the document vector as its input. On top of the traditional stacking model, Stacking+ Ensemble enriches the representation of the document given to the stacking classifier by including this additional feature.

The stacking classifier can be chosen from a wide range of machine learning models, and the input probabilities can be handled in different ways inside the models. This sophisticated process ensures the ability of this model to deal with more complicated input situations. Thus, the Stacking+ Ensemble method is less sensitive to the individual models than most existing probability-based ensemble models. In addition to viewing the two-layer system as an ensemble model, the classifiers along with their outputs in the first layer can also be viewed as a special feature-extraction process for the document, which then serves as part of the input to the classifier on the second layer. A minimal sketch follows.
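As a concrete sketch, the two-layer Stacking+ model can be assembled with scikit-learn. Here the Maximum Entropy stacking classifier is realized as logistic regression, the first-layer probability outputs are assumed to be precomputed, and the n-gram featurizer mirrors the one used for the Naive Bayes model; the details are illustrative rather than the exact experimental setup.

from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def train_stacking_plus(tweets, first_layer_probs, labels):
    """tweets: list of tweet texts; first_layer_probs: (n_cases, n_features)
    array of concatenated probability distributions from the individual models."""
    vectorizer = CountVectorizer(ngram_range=(1, 2))   # unigram + bigram tweet vector
    tweet_vecs = vectorizer.fit_transform(tweets)
    # Stacking+: append the tweet vector to the probability features
    features = hstack([csr_matrix(first_layer_probs), tweet_vecs])
    stacker = LogisticRegression(max_iter=1000)        # Maximum Entropy classifier
    stacker.fit(features, labels)
    return vectorizer, stacker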

2.5 Experiments

2.5.1 Data preparation

The experiment is implemented with a real-world tweet dataset that contains tweets collected from ordinary consumers mentioning certain brands over a seven-month period. The brands, the number of labeled tweets, and the sizes of their topic corpora are shown in Table 2.2. Because the data are obtained from a social media platform, the content of each tweet is normalized by removing the mentioned usernames, tokenizing the links, and removing redundant punctuation, as sketched below. We keep the stop words because we find that removing them does not necessarily help this classification task.
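A minimal normalization pass of this kind might look as follows; the exact rules applied to the industry dataset are not reproduced here, so the regular expressions below are illustrative.

import re

def normalize_tweet(text):
    text = re.sub(r"@\w+", "", text)                 # remove mentioned usernames
    text = re.sub(r"https?://\S+", "<link>", text)   # tokenize links
    text = re.sub(r"([!?.,])\1+", r"\1", text)       # collapse redundant punctuation
    return re.sub(r"\s+", " ", text).strip()         # squeeze leftover whitespace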

Brand                  No. of Tweets   No. of Topics
Elmer's                34918           5
Chili's                54883           19
Bath and Body Works    43529           40
Domino's               29552           22
Triclosan              14673           28

Table 2.2: Tweet dataset - brands and number of labeled tweets
Table 2.2: Tweet dataset - brands and number of labeled tweets

Each of the vendors has provided a predetermined topic corpus used for classification of its tweet stream. The topic corpus represents the interests of the brand, and classifying a tweet into topics helps in processing the feedback received from customers. The vendors also provided a set of keywords and logic rules to perform the labeling/classification process. However, the limitation of such strict labeling is that only a small fraction of tweets can be assigned a label – in fact, this is what motivates the need for automated tweet classification. At the same time, the tweets that are labeled using the client- or brand-specific rules can now be used for training and for evaluating the quality of the classifiers. One complication with the labeled tweets is that a single tweet can be assigned more than one label without any relative ranking, so we treat these labels equally.

2.5.2 Baselines

Several existing and popular ensemble methods take probability distributions from individual models as the input and generate the result. Some well-known examples are BMA and Mixture of Gaussians. These methods are similar in that they combine the probability distributions using summation but differ in the specifics of how they achieve this. Another approach multiplies the distributions and renormalizes; it is called Products of Experts (PoE) [52]. Each individual model is considered an expert, and the distributions of the experts are combined as:

p(d | θ_1, ..., θ_n) = \prod_m p_m(d | θ_m) / \sum_c \prod_m p_m(c | θ_m)    (2.11)

where d is a data vector, c ranges over all possible vectors in the data space, θ_m is the set of parameters of individual model m, and p_m(d | θ_m) stands for the probability of d given by model m.

Because the probabilities are combined by multiplication, PoE can generate much “sharper” distributions than the individual expert models. Therefore, the correct output is easier to extract from such a combination of individual distributions. Fitting a PoE model requires optimizing the likelihood of the data, which involves tuning the settings of the individual models. To compare the performance of different ensemble methods, we fix the settings of all involved individual models, i.e., all the ensemble methods share the same set of probability distributions from the individual models as the input.

In addition to the PoE model, we also implemented a simple ensemble model that generates the final output based on the weighted summation of the input distributions (Weighted Sum). In this method, the weights are (statically) determined by normalizing the individual performance (accuracy) of the involved models. Both baselines are sketched below.
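Both baselines reduce to a few lines over the same probability tensor used for Dynamic Weighting; the smoothing constant used to avoid zero products in PoE is an assumption of this sketch.

import numpy as np

def weighted_sum_predict(probs, accuracies):
    """probs: (n_cases, n_models, n_classes); accuracies: per-model accuracy."""
    w = np.asarray(accuracies) / np.sum(accuracies)   # static normalized weights
    return np.argmax(np.einsum("dmc,m->dc", probs, w), axis=1)

def poe_predict(probs, eps=1e-8):
    """Products of Experts: multiply the distributions and renormalize (Eq. 2.11)."""
    smoothed = np.maximum(probs, eps)                 # smooth zero probabilities
    prod = np.prod(smoothed, axis=1)                  # product over the models
    prod /= prod.sum(axis=1, keepdims=True)           # renormalize per case
    return np.argmax(prod, axis=1)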

2.5.3 Experiment design

For experimenting with different ensemble approaches, we use three individual models: Labeled LDA (LLDA), the Naive Bayes classifier (Naive Bayes/NB), and the Alchemy API with the mapping function for output conversion (Alchemy/Alc) that we have introduced. The individual models and the ensemble methods are evaluated with five-fold cross validation.

LLDA is typically considered a topic-modeling method, but because it generates a probability distribution across topics for each document, it can also be used as an individual model for this task. We use the Stanford Topic Modeling Toolkit⁷ for both the training and inference steps of the LLDA model, with Gibbs Sampling set to run for 1000 iterations. The Naive Bayes classifier has been shown to work well on small datasets [63] and short documents [157], which fits the situation of our experiments. The Alchemy API returns three topics for each document; thus the output conversion generates exactly three labels with corresponding probabilities for each tweet.

⁷ http://nlp.stanford.edu/software/tmt/tmt-0.4/

Given the casual writing style of tweets, we trim the word vector for the Naive Bayes classifier to eliminate the most and least frequent words in the corpus. With this extra step, we reduce the influence of nondictionary words, typos, and words that are too common or too rare to carry much differentiating information. The tweets are then featurized using a combination of unigram and bigram models for the Naive Bayes classifier, as sketched below. We set the learning rate β for the Dynamic Weighting model to 0.9 and stop the learning process after 30 epochs. The existing classifiers are implemented using scikit-learn packages [17], and all the code and data for the experiment are publicly accessible.⁸

⁸ https://goo.gl/RfjjBH
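The trimming and featurization steps correspond to frequency cut-offs in the vectorizer. A scikit-learn sketch follows, where the specific thresholds are illustrative rather than the tuned values.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def fit_naive_bayes(train_tweets, train_labels):
    # Unigram + bigram features; max_df trims the most frequent words and
    # min_df the least frequent ones (both thresholds are assumptions here).
    vectorizer = CountVectorizer(ngram_range=(1, 2), max_df=0.95, min_df=5)
    model = MultinomialNB().fit(vectorizer.fit_transform(train_tweets), train_labels)
    return vectorizer, model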

In order for an ensemble method to be effective, the involved individual models need to be diverse [31]. In other words, there should be certain cases where different individual models do not give the same output (and, among such cases, no single model is always correct). We validate this requirement for the experiment by building an ideal ensemble system that marks a case correct whenever at least one involved model is correct. The accuracy of this ideal system is considerably better than that of any individual model, which shows that these models are sufficiently diverse.

Because the choice of the stacking classifier is flexible, we use the Maximum Entropy classifier for its simplicity and efficiency.

2.5.4 Experiment results and analysis

In our experiments, we use accuracy to represent the performance, where accuracy is simply the fraction of the test data records for which the model predicts the correct class. In our evaluation, all individual and ensemble models output one deterministic label for each tweet. Because all the true labels assigned to the same tweet are considered equally possible, we simplify the evaluation process by considering our system to be correct if the predicted label is among the actual true labels (see the sketch after this paragraph). We found that 71.2% of the tweets in the dataset have only one label, so the simplification does not have a significant impact on the results we report here. We first build each of the three individual models and report the accuracy of their predictions. Next, we apply the two baseline ensemble models – Weighted Sum and PoE – along with Stacking Ensemble, the proposed Dynamic Weighting, and the Stacking+ Ensemble model, for all possible combinations of the three individual models.
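This lenient accuracy counts a prediction as correct when it appears anywhere in the tweet's true label set; a minimal sketch:

def lenient_accuracy(predictions, true_label_sets):
    """A prediction counts as correct if it is among the tweet's true labels."""
    hits = sum(pred in labels for pred, labels in zip(predictions, true_label_sets))
    return hits / len(predictions)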

Results from Ensemble Models

Figure 2.3 shows the performance of the three individual models and four ensemble methods across the five datasets. For the ensemble methods, we report results for the three possible pairwise combinations of the individual methods and for the combination of all three methods (all).

Overall, the Stacking Ensemble method generates the best performance: it improves over the individual models and outperforms the other ensemble methods. Among the three individual models, LLDA and NaiveBayes have comparable performance, while Alchemy with the mapping function has lower performance (we hypothesize that this is because Alchemy is not adapted to these datasets or domains).

Figure 2.3: Performance comparison between three individual models and four ensemble models (five datasets)

As stated previously, PoE combines the distributions by multiplication. To avoid the product of probabilities becoming 0, we smooth each distribution by assigning a very small probability to those labels that have probabilities of 0 and renormalizing the distribution. If Alchemy is involved in a pairwise combination, PoE performs worse than the better individual model. Because Alchemy generates only three labels for each document, the probabilities for the other labels are smoothed to a very small value. Thus, the multiplication of the distributions leaves the probability of most labels at a very small number, except for the three labels generated by Alchemy. As a consequence, if the other involved models do not have relatively high probabilities for the three labels generated by Alchemy, the final distribution of probabilities is not helpful in labeling. On the other hand, LLDA and NaiveBayes provide better distributions, and thus PoE can improve the final performance by a small margin.

In most cases, the baseline Weighted Sum and the proposed Dynamic Weighting method both improve the performance of the individual models by combining their outputs. However, the combination of Naive Bayes and Alchemy using the Weighted Sum model is the only case where the performance is worse than NaiveBayes alone; the special distribution generated by Alchemy is the main reason. In general, the Dynamic Weighting method outperforms the Weighted Sum method, especially when an unreliable model such as Alchemy is involved. The weights of the Weighted Sum method are tied to the performance of the individual models – the difference between the weights would not be too large in most cases. On the other hand, as a data-adapted model, Dynamic Weighting can learn a more precise and adaptive weight for each model, leading to a wider range of weights.

Unlike the previous three ensemble methods, Stacking Ensemble does not rely on a direct combination of the probability distributions. This ensemble method utilizes another classifier, and that classifier can bypass or reflect the true effect of incorporating a special distribution. It has the best overall performance among all compared ensemble methods, and it also yields a reasonable improvement over any involved individual model. In addition, the improvement is stable across all combinations, which shows that this method is less sensitive to the individual models and their distributions. Even in the cases where Alchemy is involved, it was still able to generate reasonable improvement.

For all models and their combinations, lower accuracy is seen for two brands, Bath and Body Works and Triclosan; we think the large number of topic classes and the small number of cases for each topic class are the main reasons. It is natural that prediction is harder when there are more classes to choose from and, even more so, when sufficient training data is not available.

Comparison with Adding Tweet Vector

In addition to the probability input feature, we further append the tweet vector to the

input of the stacking classifier and report a performance comparison between the Stacking

and Stacking+ Ensemble model in Figure 2.4. The additional tweet vector is the same as

the one featurized for the training of the Naive Bayes classifier.
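
A minimal sketch of how the inputs to the stacking classifier can be assembled is shown below; the choice of logistic regression as the meta-classifier is an illustrative assumption, since the meta-classifier itself is not fixed in this discussion:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stacked_features(prob_dists, tweet_vec=None):
    """Stacking: concatenate the per-model probability distributions.
    Stacking+: additionally append the tweet's own feature vector."""
    feats = np.concatenate([np.asarray(d, dtype=float) for d in prob_dists])
    if tweet_vec is not None:  # Stacking+ variant
        feats = np.concatenate([feats, np.asarray(tweet_vec, dtype=float)])
    return feats

# The meta-classifier is then trained on stacked features of training tweets:
# meta_clf = LogisticRegression(max_iter=1000)
# meta_clf.fit(np.vstack([stacked_features(p, v) for p, v in train_data]), y_train)
```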

The result shows that the addition of the tweet vector can help the Stacking+ Ensem-

ble model further improve performance. The improvement by adding the tweet vector is

solid across all ensemble combinations and datasets, and it reaches as much as 20.9% over

the Stacking Ensemble. This improvement, however, comes at the cost of a substantially
larger input feature size for the ensemble classifier.

Overall, compared against the baseline Weighted Sum ensemble method applied to

three individual models, the Stacking+ Ensemble model reduced the number of inaccurate
predictions by 33.9%, 49.4%, 22.1%, 37.5%, and 2.7% for Elmer's, Chili's, Bath and Body
Works, Domino's, and Triclosan, respectively. The Stacking+ En-

semble model results in significant improvements for four of the five datasets in the experi-

ment. Taken as an average across Elmer’s, Chili’s, and Domino’s (three datasets where we

had sufficient training data for each label), we have an average accuracy of 0.8339 when

combining all three models with weighted sum, whereas the Stacking Ensemble model im-

proved the accuracy to 0.8702, and Stacking+ Ensemble increased the accuracy to 0.8995.

Thus, on the average, there was a 39.5% reduction in the number of inaccurate predictions.

2.6 Summary

This work introduces two new ensemble classifiers: a Dynamic Weighting system and a
Stacking+ Ensemble model with additional tweet vectors. These ensemble systems combine
probabilistic models to produce a more accurate deterministic output. We demon-

strate the effectiveness of this approach on a real-world tweet classification task, which is a

mixed-classification problem that benefits from combining classifiers that are designed for

different domains or purposes. The unique characteristic of the stacking ensemble method

eases the requirement that the distributions generated by the involved models need to be

compatible, and thus it is more adaptive and flexible. Through detailed evaluations with

real-world datasets, we demonstrate significant improvements over the use of individual

models or the existing arithmetic-based ensemble methods.

Chapter 3: Commercial Tweet Influence Prediction

3.1 Introduction

Our next work focuses on the success of commercial tweets, specifically, the influence

of commercial tweets. The rapid growth of social media is driving the increased use of

social platforms for advertising. Many companies have official accounts on social media

platforms to maintain customer relationships, spread news, and attract more attention. In

fact, companies have used their official accounts on Twitter to post commercial tweets

that are primarily visible to their followers. For example, “Is your New Year’s resolution

to travel more? Check out these up-and-coming destinations - https://t.co/I36OS2hBnF

https://t.co/rju66wtv5a” is an advertisement tweet posted by the Travel Channel.

Analysis of tweets has attracted significant attention from the data-mining community

in recent years. The massive volume, real-time nature, large geographical coverage, and

public availability of Twitter data have led to this heightened interest. Mining Twitter data

has been demonstrated to be useful for tasks such as earthquake detection [141], stock

market prediction [15], public health applications [121], and open-domain event extrac-

tion [138].

In using social media to advertise their products and promotions, companies are seek-

ing more engagement from their readers as part of maintaining an effective online strategy.

This need has led to a new class of services being offered to help the companies build

stronger relationships with their customers. Social Customer Relationship Management

(CRM), compared with traditional CRM, aims to provide a closer and more direct commu-

nication between the company and its customers through different social platforms. More

broadly, increasing the effectiveness of advertising through social media continues to be an

intriguing and open question for various corporations. Many approaches exist to measure

the influence of an individual account on a social platform, such as the widely used Klout

Score [134]. For a single company, the focus is more often on the effectiveness of its
message propagation. Thus, there is now a need to understand how to raise the influence of a

particular post.

This chapter shows that the influence of a particular commercial post can be measured,

predicted, and made more effective. The techniques presented here may be used within a

system to help companies craft commercial messages for social platforms, with the goal of

maximizing the influence of the posts on specific audiences. In order to improve the writing

of a commercial post, predicting the potential influence and effectiveness of a given text is

an essential and important first step.

The primary contribution is to answer the following question: “Can we learn what

makes an effective advertising post on Twitter?” In doing so, we address the following

challenges:

• How can we quantify whether a given commercial post on Twitter has been success-

ful?

• How can we distinguish the influencing (decorative) elements of a commercial post

from its inherent meaning?

• What features best model the composition of a tweet with respect to its influence?

• What is the specific effect of each of the specific features in improving the influence

of the tweet?

• Can a tweet be engineered to have more influence?

We consider a commercial post to be successful if it can generate sufficient influence.

To measure influence, we use the direct reactions that a tweet gets from its readers – such

as retweeting and marking as favorite. We focus on commercial posts from the official

accounts of various companies (brands) in different fields. Although pictures and/or videos

can be included to enrich a post, we believe that the text content provides the most important

and straightforward information. Note that the product or promotion information is usually

determined before crafting the post, so we focus on the other influencing elements of the

post. First, we label the commercial tweet based on the influence it generated. Then, we

extract a small set of features to capture the structure and comprehensive representation of a

commercial post. More specifically, the feature set is designed to show the construction of a

post and it does not include the core information that is related to the promotion or product.

Next, we address the problem of predicting whether a commercial post would be successful

through a binary classification model given the feature set. In addition, we conduct a feature

analysis to determine which features have the most impact on the influence prediction. We

provide a case study that shows the potential usage of the prediction system.

To the best of our knowledge, this is the first model that seeks to analyze and predict

the performance of commercial social media posts. We believe that this work will serve as

an essential foundation for advertising-related social media analysis.

3.2 Related Work

Advertising through social media is growing rapidly and is drawing more attention.

Yin et al. [183] used the concept of a propagation tree to reveal patterns of advertisement

propagation and present a set of metrics to measure the effectiveness of an advertisement in

terms of extent of its propagation. In contrast, our focus is on tweet content and measuring

the advertisement by its influence on the readers. Li et al. [85] proposed a diffusion mecha-

nism for advertisement delivery through microblog platforms, based on a set of user-related

features. Our goal is to model the influence of a commercial tweet based on static textual

features before it has been posted.

In the last decade, researchers have examined the influence of specific users and their

posts through social media and attempted to understand how to quantify such influence.

Anger et al. [3] looked into the indicators of influence on Twitter for an individual user.

Bakshy et al. [7] conducted a study quantifying the influence of Twitter users by tracking

the diffusion of their posts through reposts. Cha et al. [19] also focused on user influence

and proposed a link-based model to measure influence. Ye and Wu [181] proposed a model

to measure message propagation and its social influence through Twitter, as well as the

longitudinal influence over time and across users. Unlike these network-based models or

diffusion models that track the spread of tweets, we construct a simple matrix that checks

only direct reactions to tweets by their readers (such as favorites and retweets). Moreover,

in order to improve a given post, we focus on the specific tweet, instead of the identity of

the author.

Popularity prediction has also attracted much interest, and among the many models, retweet
prediction has been the most common. Most efforts (such as [125] and [112])

utilized simple surface features of tweets to predict retweeting. Peng et al. [122] also

included relationship features, and Yang et al. [179] added trace and temporal features

to build a factor graph model. Zaman et al. [184] and Gao et al. [41] predicted future

popularity by observing the dynamics of retweeting, while Xu et al. [176] and Lee et al.

[79] focused on retweet activities on certain users. In addition to retweet prediction, Artzi

et al. [4] predicted whether a tweet will receive replies from its readers, and Suh et al.

[148] showed the relation between certain features and the retweet rate. In contrast to these

efforts, which focused on a single reaction (such as retweeting) as the measurement, our

work creates a comprehensive metric to measure the popularity (influence) of commercial

tweets. Furthermore, our model captures only the structural and style elements of the post,

rather than checking every detailed word of it.

Distributed representation has become popular in text-related research. Mikolov et al.

[107] initiated the work on representing words with lower dimension vectors, which are

trained to predict context words given the current word. Le et al. [77] leveraged [107] to

represent a paragraph using a dense vector, which is trained to predict words in the para-

graph given the paragraph itself. The idea of dense representation also has been brought

to tweet-related tasks. Tang et al. [149] built a word embedding for Twitter sentiment

classification. Given the informal use of words in tweets, two character-based tweet2vec

models have been introduced: Dhingra et al. [29] constructed a tweet vector representation

to predict hashtags, and Vosoughi et al. [156] used a CNN-LSTM encoder-decoder model

to generate tweet embeddings.

3.3 Influence Prediction

This section describes the process of labeling the influence of commercial tweets, ex-

tracting features and building classification models.

3.3.1 Data labeling

In order to classify commercial tweets as successful or unsuccessful, we first need to

quantify the influence of a tweet. The influence of a commercial tweet can be represented by

the level of engagement from the readers. Retweets and marking tweets as favorites are the

most widely used functions that allow a reader to express his or her interest or excitement

about the tweet. Thus, counts of retweets and favorites can be used as direct measurements

of reader engagement and they have been used in some models as the indicator for tweet

influence [19, 181].

Influence Score

Our work combines the count of retweets and the count of favorites in order to measure

tweet influence. Both reactions reflect the interest of the user after reading the tweet. We

want the counts of each reaction to have equal impact in determining the influence. How-

ever, users retweet and mark as favorite with different frequencies, which leads to different

scales for the two counts. To balance the scale of the two counts, we compute the ratio of

favorite-to-retweet counts across all tweets in the dataset, as shown in Figure 3.1.

Most tweets have favorite-to-retweet ratios around 2, and the mean of the ratios across

the dataset is 2.5. Thus, we empirically weight retweet count by 2 to ensure that retweets

and favorites have equal influence in the final measurement. In fact, we think this multiplier

may reflect the fact that marking a favorite requires one click, while retweeting requires two

clicks.

Figure 3.1: Favorite-to-retweet ratio

However, retweet and favorite counts are highly influenced by the popularity of the
author account, which can be measured by its number of followers. Note that we are
not interested in the absolute influence created by a post, but in the relative influence that
the author account is able to generate. To create a normalized influence for all general

commercial posts, we eliminate the impact of account popularity by normalizing the score

by the number of followers of the account. Therefore, the influence score in our work is

calculated as:

InfluenceScore = (2 × RetweetCount + FavoriteCount) / FollowerCount        (3.1)

Because the influence score is normalized by the number of followers, we include only

direct reactions of a tweet from its readers for influence measurement, rather than tracking

the propagation of the tweet.
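
For concreteness, Eq. 3.1 can be expressed as a small helper function (the function and argument names are ours):

```python
def influence_score(retweet_count, favorite_count, follower_count):
    """Eq. 3.1: retweets are weighted by 2 so that retweets and favorites
    contribute equally; the score is normalized by the follower count."""
    return (2 * retweet_count + favorite_count) / follower_count
```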

Retweet, favorite, and follower counts are all dynamic attributes in the sense that, for a

given tweet or account, they change with time. Further, tweets are time-sensitive, and the

attention they receive only lasts for a short period of time. Willis et al. [168] have shown

that most retweets happen in the first 20 hours after the original post. Thus, to provide

stable data, we record account information at the time of tweet posting and collect tweet

data three weeks after posting.

Separation from Inherent Meanings

A basic property of commercial posts is that they are used to spread certain information.

In general, companies have decided the content of the promotion or products in advance

of constructing the tweet, and such information should be considered as fixed. Therefore,

the inherent meaning of the post should not be included in order to predict the influence of

spreading such information. Thus, we design a process to distinguish the core meaning of

the post from other style features. Although both elements could affect the success of the
commercial post, we want to focus the study on only the style features. This is done

in the following manner.

First, we conduct a part-of-speech (POS) tagging [43] on the tweet, and extract nouns,

verbs, adverbs, and adjectives. These words or phrases are considered as key words. In

most cases, these key words are the carrier of the inherent meaning of the post. Next,

we group the tweets given these key words, using certain clustering methods. The goal is

to group posts that are writing about similar products or promotions. Given that the core

meaning of the posts is similar in some ways, the model can study the relation between the

remaining style features and the generated influence.

We also note that the overall distribution of the influence scores is biased toward smaller

scores. Thus, if we use the influence score directly and define the task as a regression

problem, the result may also be skewed toward very small scores. To remove this bias,

we treat the labeling process as a binary classification problem, where the top 50% of the

tweets (higher scores) are labeled as positive, and the bottom 50% are labeled as negative.

Because the labeling process is performed independently for each group generated from

the previous step, it is able to reduce the impact of the influence caused by the inherent

meaning of the post from the labels.
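
A minimal sketch of this per-group labeling step, assuming precomputed influence scores and group assignments, is the following:

```python
import numpy as np

def label_by_group(scores, groups):
    """Within each content group, label the top half of influence scores
    as positive (1) and the bottom half as negative (0)."""
    scores = np.asarray(scores, dtype=float)
    groups = np.asarray(groups)
    labels = np.zeros(len(scores), dtype=int)
    for g in np.unique(groups):
        idx = np.where(groups == g)[0]
        median = np.median(scores[idx])
        labels[idx] = (scores[idx] > median).astype(int)
    return labels
```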

3.3.2 Classification model

As mentioned above, and given the binary labeling of tweets, we use a binary classifier

for the prediction model. To explore differences in impact of the proposed features from

general contextual features, we use the typical n-gram model as the baseline. More specifi-

cally, the baseline comprises n-gram features up to length 5, including all tokens that appear

in the dataset more than once. This approach is the de-facto standard for text classification

tasks such as sentiment classification and topic categorization [2]. Unlike previous work

[117, 2], we find that Maximum Entropy (MaxEnt) [97] works better than Support Vector

Machine (SVM) [62] with the baseline model for classifying the dataset. We also apply the

state-of-the-art character-based tweet embedding model [29] as a comparison. The tweet

embedding model is trained to predict the hashtags. Hashtags in commercial tweets gener-

ally contain information about products or promotions, which makes the embedding model

a good fit for commercial posts. Because SVM performs better generally, we apply it to the

tweet embedding model as well as to the proposed comprehensive set of features that we

will describe in the following sections.

3.3.3 Features

Although our ultimate goal is to improve a tweet to be more influential, we first focus on

predicting the influence of a commercial tweet. Unlike predicting the cascade of retweeting

[22], we do not include any observation of the diffusion of the post, but extract features that

are available instantly. More specifically, given the broad message that is being captured

by the tweet, we want to model how the high-level structural and meta information impacts

its influence.

We propose a set of features that works as a high-level representation of the post. Be-

cause the features do not include the inherent meaning of the post, we name them style

features. The proposed features are built to capture the structural and syntactic character-

istics of a post, and also include certain pertinent information about the posting account.

Nasir et al. [112] have looked into similar features in order to predict retweeting for gen-

eral posts. Building on their work, we construct the feature set shown in Table 3.1 for

commercial post influence prediction.

Complexity Features
  Tweet length               Integer
  Readability score          Continuous
  Parse tree depth           Continuous
  Parse tree head count      Continuous
Element Features
  Usernames                  Binary
  URLs                       Binary
  Hashtags                   Binary
Author Meta Features
  Post count                 Continuous
  Favorite count             Continuous
  Listed count               Continuous
Post Meta Features
  Day of week                Categorical
  Time of day                Categorical
Mention Features
  Verified username          Binary
  Username follower count    Continuous
Punctuation Features
  Question marks             Binary
  Exclamation marks          Binary
Other Features
  Contain digits             Binary
  POS dist (3)               [0, 1]
  Sentiment                  Continuous

Table 3.1: Style features

Element Features

Usernames, hashtags, and links (URLs) are often used to deliver important information,

especially considering the length limitation of tweets.

Usernames mentioned in the tweet are usually used to refer to specific users, or al-

ternatively, used to send the tweet to the users. It is a common way to attract readers by

mentioning a popular user in the post.

Hashtags serve to identify a certain topic and are often treated as symbols across tweets

that share the same idea. For commercial tweets, it is also common to use hashtags as

representations of certain products or events. Thus, the information carried by the hashtag

is critical and has a big impact on the influence of the tweet.

URLs work as an extension to the tweet in order to include detailed and richer informa-

tion. For commercial posts, they play a critical role in pointing readers to additional infor-

mation and details. Therefore, they can potentially increase the chance of being retweeted

or marked as a favorite for the tweet.

Note that the intention of the system is not to alter the inherent meaning of a post, but
to look for general features that affect the influence of commercial tweets, so we do not

look into the actual content or the semantics of these elements. Instead, these features are

represented as binary indicators, and these elements are tokenized for other processes.

Punctuation Features

Rhetorical questions are popular hooks for commercial posts, and question marks serve

to demarcate such hooks. Exclamation marks are often used to express strong emotions.

Commercial tweets are written more formally than general tweets; thus, the use of such

punctuation marks is deliberate. We use a binary feature for each punctuation mark to

represent its existence in a tweet.

Complexity Features

The complexity of a tweet indicates the ease (or difficulty) of reading, understanding,

and interpreting the content. We measure such complexity using four features.

Tweet length is a straightforward indicator of complexity. Twitter's 140-character limit
constrains the raw character count of a tweet, but verb tenses, proper names, and even
URLs skew that count. Therefore, instead of counting

characters, we use the number of tokens to represent the length of a tweet. It is used both

as a feature and the normalization factor of other features. The analysis of our commercial

tweet dataset showed that the average number of tokens is 15.2, with a standard deviation of

5.1. This shows a significant level of variation in tweet length, and thus it has the potential

to become an indicator.

Readability is a measure of the difficulty of reading and comprehending a piece of text.

It has been used as an indicator for the quality of social media content [2]. In this work, we

use the Coleman-Liau Index [24] as the readability feature. The score is calculated as:

CLI = 0.0588L − 0.296S − 15.8 (3.2)

where L is the average number of letters per 100 words, and S is the average number of

sentences per 100 words. The resulting score is an approximation of the U.S. grade level

needed to understand the text. Similar to tweet length, readability score captures the surface

complexity of the post.
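
A small sketch of the readability computation follows; the word and sentence tokenization here is deliberately simplified and is our own choice:

```python
import re

def coleman_liau(text):
    """Coleman-Liau Index (Eq. 3.2) from letter and sentence densities."""
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return 0.0
    letters = sum(len(w) for w in words)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    L = letters / len(words) * 100    # letters per 100 words
    S = sentences / len(words) * 100  # sentences per 100 words
    return 0.0588 * L - 0.296 * S - 15.8
```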

The dependency parse tree of a tweet shows the structure of the text, with both word-

and phrase-level relations. We build the dependency parse tree for each tweet using the

Twitter-specific model proposed by Kong et al. [72]. The parse tree is able to capture the

intrinsic and structural property of the tweet. Given such a parse tree, the depth of a tweet

is the number of levels starting from the root node to the bottom of the tweet parse tree.

Thus, parse tree depth can be used as a feature to measure the dependency complexity of

the tweet. Parse tree head count is the number of syntactic roots contained in the tweet

parse tree. Each root leads to an individual fragment of the tweet, which is considered to be

a complete and meaningful portion. It is not necessarily equal to the number of sentences

because complete fragments can be separated in many ways. In general, a commercial

tweet contains a single topic. Therefore, a tweet with more heads in the parse tree tends to

have a higher density of information about the topic. We use the head count to serve as a

feature to measure the density complexity of the tweet. Note that both parse tree depth and

head count are normalized by the length of the tweet.

Mention Features

As described above, the usernames mentioned in commercial tweets could help attract

readers. In most cases, influence is driven by the popularity of the usernames and their

linked accounts. Therefore, in addition to the existence of usernames, we also use the pop-

ularity of the usernames’ accounts as a feature. We use two attributes of the mentioned

username to measure its popularity: whether it is a verified account, and its follower count.

Verified usernames belong to persons whose accounts have been certified as genuine, which

often are associated with “famous people” [98]. The verification of an account indicates

its popularity, and the username verification feature is set to have a binary value. However,

only a small portion of Twitter accounts are verified and thus the applicability of this indi-

cator is limited. Username follower count, on the other hand, is a quantifiable estimator of

the popularity available for all accounts. The username follower count feature is calculated

as the average number of followers across all usernames mentioned in the post.

Meta Features of the Post

Previous work has shown that the posting time of a tweet influences the retweet or

response potential [4]. Therefore, we include both the day of week and the time of day as

meta features for the tweet. For both features, we use the local time, and further map the

time into four periods (consisting of six hours each).

Meta Features of the Author

The author of a tweet has been shown to have a significant impact on the influence

of general tweets [7, 19]. We want to extend such impact to the relation between official

accounts and commercial tweets. To prevent over-fitting and to make the model more

general, we chose attributes that do not reveal the identity of the author account. Post

count is the number of tweets that this official account has posted, while favorite count is

the number of tweets that this account has marked as favorites. Both counts represent the

vitality of the account. Listed count is the number of users that include this account in their

interest lists and it can indicate the popularity of this official account on the social platform.

To eliminate the impact associated with the history of the account on these attributes, we

normalize the post count by the number of days between the registration of the account and

the posting date, normalize favorite count by the post count, and normalize listed count by

the number of followers of this account.
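
These normalizations can be sketched as below; the dictionary field names are assumptions that mirror the Twitter API user object, not identifiers from this work:

```python
def author_meta_features(account, post_date):
    """Author meta features, each normalized to remove account-history effects."""
    days_active = max(1, (post_date - account["created_at"]).days)
    post_rate = account["statuses_count"] / days_active             # posts per day
    favorite_rate = account["favourites_count"] / max(1, account["statuses_count"])
    listed_rate = account["listed_count"] / max(1, account["followers_count"])
    return [post_rate, favorite_rate, listed_rate]
```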

Other Features

Sentiment classification has been studied comprehensively [117] and has been used in

previous tweet influence analysis efforts [112]. The sentiment of a tweet is a potential

factor that induces the attention of the readers. Because commercial tweets mostly contain
product- or event-related information, they usually convey a nonnegative sentiment in their

text. Thus, sentiment may provide less differentiation ability, but the numeric value of the

score can still be used as a measurement of the strength of the corresponding sentiment.

We use the Affective Norms for English Words (ANEW), a microblog-based sentiment

word list [113], to generate the sentiment score. Because the output sentiment score is a

summation of all the scores assigned to each word, we normalize the output score by the

length of the tweet.

A POS tagger labels each word with a certain usage type, given the context of the word.

The POS tag feature has been shown to be useful for many types of social text mining tasks

[117, 73]. To capture the critical information of a commercial post, we use 5 from the

list of 25 Twitter-specific POS tags [43]: common noun, proper noun, verb, adjective, and

adverb. These five tags are then clustered into three POS categories: 1.) common noun and

proper noun as the noun category; 2.) adjective and adverb as the descriptor category; and

3.) verb as the verb category. Similar to extracting meaningful content for labeling, we use

the Gimpel & Owoputi Twitter POS tagger to generate the sequence of POS tags for each

tweet [43]. To represent the writing style of the post, POS distribution features are then

calculated as the normalized POS category counts across all three categories.
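
A minimal sketch of the POS distribution feature follows, assuming the one-letter tag abbreviations of the Gimpel et al. tagset (N and ^ for nouns, A and R for descriptors, V for verbs):

```python
from collections import Counter

# Assumed tag abbreviations: N = common noun, ^ = proper noun,
# A = adjective, R = adverb, V = verb.
CATEGORY = {"N": "noun", "^": "noun", "A": "desc", "R": "desc", "V": "verb"}

def pos_distribution(pos_tags):
    """Normalized counts over the noun / descriptor / verb categories."""
    counts = Counter(CATEGORY[t] for t in pos_tags if t in CATEGORY)
    total = sum(counts.values()) or 1
    return [counts["noun"] / total, counts["desc"] / total, counts["verb"] / total]
```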

Digits in the commercial post often carry meaningful information – such as useful

statistics or emphasis on key ideas. The binary feature of containing digits captures this

role of digits in commercial tweets.

3.4 Experiments

In this section, we describe our experiments that show the performance of different

models. All evaluations are performed using five-fold cross validation tests.

Gap Amazon Gilt BlackBerry Google
Nordstrom Best Buy Jeep KraftFoods Disney
AT&T Applebee’s Dell Comcast LEVIS
Macy’s AppStore (Apple) JC Penney Delta H&M
Starbucks Travel Channel FedEx Yahoo Motorola
SamsungMobile Microsoft Target Sears AmericanExpress
Netflix GEICO WholeFoods

Table 3.2: Collected brands

3.4.1 Data preparation

We build a commercial tweet dataset that contains originating tweets (i.e., no replies

or retweets) posted by the official accounts of 33 companies (Table 3.2). During a 12-

month period, 63,421 tweets were collected using the public Twitter API. We found that

most official accounts are very active in communicating with customers through retweeting

and replies, but they are cautious in posting original commercial tweets, which generates

a limited amount of useful data for the experiment. The source code and dataset for the
experiment have been made available at https://goo.gl/Y1LFLA.

Outliers are removed in two steps. Certain announcements, for example, the release

of a new iPhone, have an outsized influence simply because of the information, and thus

the representation may not be the reason for their success. Tweets that are related to such

major announcements or events are found and excluded by keywords. Other attributes may

also cause an unpredictable influence, such as a reference to a song that is currently very

popular. To remove such outliers, for each label group, we compute the z-score of each

post based on the influence score and remove those whose z-scores are larger than 2.
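
The second filtering step can be sketched as a per-group z-score filter (a minimal illustration with our own function name):

```python
import numpy as np

def keep_non_outliers(scores, z_max=2.0):
    """Return indices of posts in one label group whose influence-score
    z-score does not exceed the threshold (2 in our experiments)."""
    scores = np.asarray(scores, dtype=float)
    std = scores.std() or 1.0
    z = (scores - scores.mean()) / std
    return np.where(z <= z_max)[0]
```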

3.4.2 Experiment design

We first show the difference caused by using different grouping methods to separate

the inherent meaning from the decorative elements of the post and choose a specific group-

labeling method for model performance analysis. Then we list the performance of the

proposed model, the n-gram baseline model, and the tweet embedding model given the

commercial posts. In order to show more details on the attributes that affect the influence

of the commercial tweets, we conduct a feature-importance analysis on the proposed style-

feature model. Finally, we set up a case study with a set of real commercial posts and apply

the prediction model to the cases.

3.4.3 Group label analysis

Because the model is built to predict the performance of a commercial post, the core

meaning of the post is considered as fixed. Therefore we group the posts based on their

core parts as described before, so that the prediction model can focus on the style parts of

the tweet. In the experiment, we first extract the key words for each tweet, and then apply

three grouping methods to the posts:

• simGroup_binary featurizes the key words using a binary representation and clusters
them with k-means++,

• simGroup_emb featurizes the key words using Word2Vec provided by Gensim with
pretrained word vectors [136], then averages the word vectors to generate the vector
for the tweet, and clusters the tweet vectors using k-means++ (see the sketch after
this list),

• topicGroup applies the Latent Dirichlet Allocation model [13] to get the topic distribu-
tion for each tweet, and groups the tweets based on the topic with the highest probability.
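
A minimal sketch of simGroup_emb follows, assuming a Gensim KeyedVectors object for the pretrained word vectors and scikit-learn for k-means++; the hyperparameters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def sim_group_emb(keyword_lists, word_vectors, n_groups=5):
    """Average pretrained word vectors over each tweet's key words,
    then cluster the resulting tweet vectors with k-means++."""
    dim = word_vectors.vector_size
    tweet_vecs = []
    for words in keyword_lists:
        vecs = [word_vectors[w] for w in words if w in word_vectors]
        tweet_vecs.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    km = KMeans(n_clusters=n_groups, init="k-means++", n_init=10)
    return km.fit_predict(np.vstack(tweet_vecs))
```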

After generating the groups of tweets, labels are assigned to each group individually and

mixed together for the following prediction task. Because the data size is limited, we

test these different group-labeling methods with three, five, and seven groups. In order to

explore the difference caused by various labeling groups, we apply each labeling result to

the n-gram model, proposed style-feature model, and tweet embedding model as mentioned

in the explanation of “Separation from Inherent Meanings”.

Figure 3.2: Model performance given different labeling groups

Figure 3.2 shows the F1 score of three grouping methods with different numbers of

groups given each prediction method. Overall, simGroup_binary and simGroup_emb
generate comparable performance, while topicGroup does not fit well for this task. Further-
more, simGroup_emb is more stable than the other two grouping methods with different

numbers of groups. As shown by the performance of the style-feature and embedding models,

it is more suitable to use the pretrained word embeddings than the simple binary represen-

tation to group tweets. The small portion of isolated key words included in the grouping

process and the limited data size also justify the use of word embedding for tweet clus-

tering. Therefore, to get the best performance, we choose simGroup_emb with five groups

as the meaning separation method, and use the labels generated from it for the following

experiment. The performance of different prediction models will be further studied in the

following section.

#  Size   Comment                 Example

1  10360  Food-related            Two simple ingredients, two very unique drinks.
                                  http://sbux.co/1nobwF9
2  22579  Stories & Suggestions   This pup doesn't have time to chase his tail, because
                                  he's too busy traveling the world http://yhoo.it/1WutqAR
3  12651  Entertainment & Events  Happy birthday Emma Stone! We can't wait to celebrate
                                  with you at the #AFIFest premiere of #LALALAND 11/15.
                                  http://soc.att.com/LALALAND
4  16062  Electronics & Tech      Learn how to securely mobilize your biz; join us at
                                  BlackBerry Mobility Summit Benelux http://blck.by/1LJDo2b
5  357    Special cases           #blizzard2016 #Jeep

Table 3.3: simGroup_emb group samples

Table 3.3 shows samples from the group assignment generated by simGroup_emb with
5 groups. Group 5 is a particular case where the posts contain only special elements such as
hashtags and URLs. Unlike the other groups, the actual meanings of the hashtags may not fall
into the same category. However, the limited size of this group means that the labeling
process remains sound. On the other hand, the other four groups work as expected, so

that the labeling process focuses on the style parts of the posts. For example, the labeling

within Group 1 represents how the construction and style of the posts is related to the

influence of posts that carry food-related information.

We have also tried to group posts by other attributes, such as the author account.

Accounts that belong to the same category are grouped together. For example, Ama-

zon, Google, and Yahoo are grouped together as technology companies. However, results

show that this intuitive grouping method performs much worse than learning the clustering

directly from the data. In most cases, a single official account does not post commercials

about only one type of products or promotions. In fact, posting commercials through social

media is far more flexible and easier than other approaches. Therefore, companies tend

to post a more comprehensive set of commercials through social platforms than traditional

ones.

3.4.4 Experiment results and analysis

After generating the labels using simGroup_emb with five groups as described in the
previous section, we use the full dataset for the performance test. As described in Sec-
tion 3.3.2, we apply our proposed feature model to an SVM classifier with a Radial Basis
Function kernel. We also apply the baseline approach with a MaxEnt classifier, and the

state-of-the-art tweet embedding model with the same SVM classifier for comparison. To

analyze the importance of the features, we conduct an ablation analysis on the proposed

style features. We are designing the system to help companies identify (or even craft)

commercial tweets that are likely to have a large influence. For this reason, we report the

precision, recall, and F1 score for the positive cases.

Table 3.4 shows the performance of the baseline method, embedding model, and our

proposed style-feature model, as well as the variation in contribution of the proposed fea-

tures. In general, the style-feature model outperforms both n-gram baseline and the em-

bedding model in terms of F1 score. More specifically, the proposed model tends to have

a much higher recall than the other two models, while a lower precision than the others.

Feature Precision Recall F1
Baseline (n-gram) 0.7597 0.7733 0.7664
Embedding 0.7616 0.8158 0.7878
Style (full) 0.7268 0.8708 0.7923
- Author meta -0.0839 -0.1062 -0.0938
- Elements -0.0097 -0.0032 -0.0071
- Punctuation -0.0010 -0.0062 -0.0032
- Mentions +0.0137 -0.0244 -0.0024
- Contain digit -0.0013 -0.0027 -0.0019
- POS dist -0.0005 -0.0006 -0.0006
- Sentiment -0.0002 -0.0005 -0.0003
- Post meta +0.0010 +0.0014 +0.0012
- Complexity +0.0111 -0.0113 +0.0017

Table 3.4: Model performance and ablation analysis

The proposed model does not include any particular meaning of the commercial post or

the identity of the mentioned usernames and author account. Even so, the results show that
it is more capable of predicting the potential influence of a commercial post than traditional
content models such as the n-gram and embedding models. Moreover, without looking into

the actual core content and the identities, it also reduces the risk of over-fitting the model

to a specific dataset. In this case, a more general model would work better on an unknown

commercial post.

Further, our proposed feature set is much more compact than the n-gram features, and it

is also more compact than the embedding model. The style-feature model is not only more

general and adaptive, but also more efficient and effective than the content-based n-gram

model or embedding model in predicting successful commercial tweets.

Table 3.5 lists the predicted labels from three models and the text of several sample

tweets where the true label is positive (label 1). More specifically, it shows the sample

50
Ngram Emb Dec Tweet
Community. Connection. Celebration. Today, and
1 1 1 0 every day. #LGBTHistoryMonth
http://soc.att.com/2dAG6sI
Any terrain. Any season. Anytime.
2 1 1 0
pic.twitter.com/RnhHJWlvgF
Pro tip: Sweet bedding = sweet dreams.
3 1 1 0
http://mcys.co/2cx8pGf
Avocados + salt + lime + . What goes in your
4 1 0 0 guacamole? Super Fast Guac: http://bit.ly/1Y9oJOX
#CincodeMayo
Due to forecasted winter weather in the Pacific
5 0 1 1 Northwest, we’ve issued a travel waiver for February
3rd. More info: http://bit.ly/2iPzTuS
OBAP’s dedication to aspiring pilots inspires us, which
6 1 0 1 is why we’re proud to support their programs that mold
the future of aviation.
7 0 1 1 Oh hey @trollhunters @Stranger Things
#18thcenturyproblems #PrideAndPrejudice
8 1 0 1
#NowOnNetflix

Table 3.5: Prediction samples from different models where the true labels are positive

tweets where the style-feature model has different predictions from the content models

(n-gram or embedding model).

Most of the posts where the n-gram and embedding models correctly predict as positive

while the proposed model does not are constructed in an informal way. Many of them are

not constructed as a complete sentence. They are either the combination of several isolated

words and phrases such as Tweet 1 and 2, or written in a special form such as Tweet 3

and 4. Although they have been adapted to tweets, neither the POS tagger nor the
dependency parser works well on such incomplete sentences, which further

affects the performance of the style-feature model. A bag-of-words assumption does not

carry any order or dependency information; therefore it is less sensitive to these special

cases. Thus, the proposed style-feature model generates a lower precision than the n-gram

and embedding models.

On the other hand, the proposed model is able to successfully predict more positive

cases than the other models. We note that most tweets the proposed model predicts as

positive while the n-gram or embedding models fail to predict as positive occur in two

situations:

• The construction of the post is complicated, which usually means a complex sentence,
as in Tweets 5 and 6.

• The major body of the post is built of special elements such as hashtags, URLs, or
username mentions, as in Tweets 7 and 8.

The complexity features and the structure analysis, such as tweet parsing, in the proposed

model help locate and extract the posts that have positive influence. Content-based n-gram

and embedding models do not work well on longer and more complex sentences.

As expected, the ablation analysis shows that the author meta feature has the biggest

impact on the final prediction in terms of F1 score. The special elements used in the post

are other attributes that contribute meaningfully to the final prediction. They are very com-

mon and useful in commercial tweets. Moreover, the mentioned usernames and types of

punctuation have considerable impact as well. The use of these two attributes is also more

popular and effective in commercial posts than in regular tweets. Complexity is shown to
contribute to a higher recall, consistent with the previous analysis. The sentiment feature
is shown to be less of a differentiator. As mentioned be-

fore, commercial tweets are written to be non-negative, and most commercial tweets have

very limited sentiment difference. Finally, we found that the post meta feature works in an
unexpected way: removing this feature improves the model. This shows that the posting
time of commercial tweets is not as useful as the posting time of regular tweets [4].

#  Tweet                                                                       Label

1  exclusive swag! starting tomorrow, you're entered to win a custom
   gecko-themed console controller every time you post using #GEICOGaming.     1
2  love the #GEICOGaming turnout. remember, every single post this weekend
   enters you to win an exclusive gecko-themed console controller!             1
3  every post (!) using #GEICOGaming this weekend makes you eligible for a
   custom console controller. bring it!                                        0
4  starting tonight at midnight, every social post containing #GEICOGaming
   enters you to win a custom console controller! get in while you can.       0
5  exclusive swag, limited opportunity! every post (!) using #GEICOGaming
   this weekend makes you eligible for a custom console controller. bring it! 1
6  check this great opportunity! starting this midnight, every social post
   containing #GEICOGaming enters you to win a custom console controller!     1

Table 3.6: Commercial tweets about a raffle event for winning console controllers

3.5 Demonstrating Use of the Framework: A Case Study

To demonstrate a real use of our prediction model, we pick four commercial tweets

posted by GEICO (1 through 4), and two modified tweets (5 and 6). These tweets, about

a raffle event in which one can win a console controller, are shown in Table 3.6. The label
column lists the predictions from the proposed model, which all agree with the true labels
for the real tweets (1 as positive and 0 as negative). We exclude the post meta feature to

ensure the consistency across all cases.

The four real tweets deliver the same core information about the raffle. However, they

differ in their success in generating influence. The positive cases include additional phrases

before the core information that serve as hooks to raise readers’ interest. Our model is able

to correctly capture that such hooks are indeed effective.

The positive real tweets are found to have higher readability scores than the negative

ones, mainly owing to the use of additional phrases and subtler use of words. Although a

higher readability score generally implies that the tweet is more difficult to follow, it can

also mean a more precise and attractive expression of the message. The sample tweets

show a positive impact of such an expression on the influence of the tweets. In addition, we

note that the positive cases contain a greater number of nouns than verbs. Although nouns

such as “swag” and “turnout” do not contain core information, they are useful in drawing

more attention.

Sample Tweets 5 and 6 are created from sample Tweets 3 and 4, with the addition

of certain hook phrases to the beginning of the posts. Minor changes are also made to the

main body to meet the length limitation. These modifications lead to an increase in the

number of parse tree heads, and an increase in readability and sentiment scores as well.

With these modifications, the proposed system predicts that the modified tweets will have a

positive influence. In other words, these changes help the tweets have more influence while

still conveying the same information.

The above case study shows a successful use of the system to predict the influence of

real commercial posts. Most pertinently, it shows that one can use the system to craft a
tweet until it is predicted to be successful.

3.6 Summary

This research describes a comprehensive feature model to predict the potential influence
of a commercial tweet on its audience. The proposed model does not include the inherent
meaning of the post; it relies only on the construction, style, and meta features of the post.
This ensures the generality of the model, such that it can be adapted to most com-

mercial posts. Unlike some previous work, the model does not need any observation of the

diffusion of the post, and therefore the prediction can be made instantly before posting a

commercial tweet. The experiments show that our techniques can provide a useful and sta-

ble performance in predicting the tweets with successful influence while using only a small

set of features. The proposed style-feature model outperforms the content-based n-gram

and embedding models in terms of F1 score. We also show that among all the features,
author metadata has the largest contribution, while the special elements, punctuation marks,
and username mentions contained in the post make meaningful contributions as well.

Chapter 4: Offline Activity Recognition

4.1 Introduction

Precise real-time user targeting is another critical step to the success of social media

advertising. Social media platforms are able to build rich profiles from the online presence

of users by tracking activities such as participation, messaging and website visits. The

important question we seek to address in this work is, “Can we tell what the user is actually

doing when he/she tweets?” For example, is he/she dining, watching a movie, or studying

in a library? By knowing the activities of a user, such as whether he/she visits restaurants

or travels frequently, more precisely targeted advertisements and marketing strategies can be

directed to them.

Social media users are primarily driven by their interests to write posts. Extracting

these interests from posts has been quite successful [106, 64]. We now seek to unearth the

offline activities that the user is engaged in when he/she posts. Unlike interests, the offline

activities can provide a close to real-time view into the user. As an example, building

interest profiles may tell us that a user likes watching movies, thus ads related to certain

types of movies may evoke his/her attention. However, being able to detect offline activities

can tell us that a user is watching a movie right now, thus ads related to popcorn and beer

may have immediate appeal. In other words, knowing the activity a user is engaged in can

enable very effective targeted advertising.

#  Content                                     Location    Activity

1  Just Landed in Looondon                     Airport     Traveling
2  We've been trapped in London for 12 hours   Airport     Traveling
3  Ready @Tomlovestorun1? I'm not so sure      Airport     Traveling
4  Happy national tequila day!                 Night Club  Entertaining

Table 4.1: Sample tweets with reported locations

Detecting a user’s activity from a tweet could be difficult. To illustrate this, Table 4.1

shows a set of sample tweets along with their reported locations and their assigned activity

labels. The keyword “landed” in Tweet 1 is sufficient to identify the correct location of the

user (airport) and his/her activity (traveling). Tweet 2 needs some inference to understand

the situation of its author – being stuck in a major transportation center. This situation can

still be extracted from the content of the tweet. Tweet 3 contains no information at all of its

activity – travelling. Further, a naive model may identify the activity of Tweet 4 as dining,

because the tweet talks about a drink. However, the author is actually entertaining at a

nightclub. We have observed that it is quite common to post tweets with content that may

clearly indicate one type of activity, while the author is actually engaged in a different type

of activity.

These examples show that the semantic content of a social media post does not, by

itself, always provide meaningful information related to the activity that the author is en-

gaged in while posting. Additionally, user-reported locations are very useful in determining

such activities. For example, [177, 88, 87] have shown correlation between activities and

the check-in locations of the posts. However, very few tweets contain such location infor-

mation.

Our goal is, therefore, to build a model that is able to recognize user activities not only

for cases where a clear indicator exists in the content, but also for cases where activity

information is latent and not directly usable. Therefore, the model should work without the

help of author-provided location information.

Returning to Table 4.1, it is clear that content alone is not sufficient to extract the correct

offline activity for Tweets 3 and 4, and additional context knowledge is needed. For exam-

ple, the additional knowledge of post time of Tweet 4 (midnight) dramatically increases

the possibility that the author is being entertained at a night club rather than eating at a

restaurant. Historical information is also contextual. Knowing that a post prior to Tweet 3

is about heading home allows us to infer that the author sent this post while traveling. Thus,

we posit that, in order to recognize offline activity, a richer contextual model is required,

consisting of additional background information.

To show that such inference can be handled effectively, this work focuses on the fol-

lowing research questions:

• How can we identify and appropriately label the offline activities of tweets?

• What contextual information (i.e. other than the content) assists in recognizing ac-

tivities?

• How can we effectively recognize user activities using the contextual features?

We address these questions through novel techniques as well as enhancements to ex-

isting techniques. We start by using a Long Short-Term Memory (LSTM) network [53] to

model only the content of tweets. LSTM is designed to handle sequential data, and it has

been shown to provide reasonable performance on tweet classification [59, 83, 161]. To

further improve the model, we explore and analyze the inclusion of other contextual fea-

tures with different variations of LSTM model. Based on the analysis and comparison, we

propose a hybrid LSTM model that properly handles the contextual features to improve the

outcome. For evaluation, we create a labeled dataset by collecting tweets where users have

reported their location. For the activity classification task, our proposed model is able to

reduce the error by 12% over the content-only models and 8% over the existing contextual

models.

Finally, we present an orthogonal validation of the proposed hybrid model with a
real-world application. Our model produces an analysis of the activities of the followers of
several well-known Twitter accounts, and the analysis demonstrates strong relationships to
the expected characteristics of these accounts. To the best of our knowledge, this is the first

work that seeks to recognize offline activities using an author-independent model. It is also

the first work that looks into and compares different LSTM-based models with respect to

their abilities to work with contextual features.

4.2 Related Work

User profiling on social media has been a popular area, and it is useful for personal-

ization, recommendation, and advertising. Research has been conducted on user profiling

based on the posts and interactions between the users. Rao et al. [135] used linguistic

features to profile users to extract gender, age, regional origin, and political orientation.

Lee et al. [81] built a user profile model based on certain types of words to improve new

recommendations. Certain efforts [8, 5, 96] characterize users based on their online com-

munication and webpage-visiting activities. Detecting life events [182, 30] from tweets has

also been addressed.

The problem of inference and prediction of real-life activities of users has not received

much attention. To date, there are mainly two types of works have been conducted on the

extraction of offline activities of users: prediction of a future activity (activity prediction)

and recognition of the current activity (activity recognition). Activity prediction considers

all features as historical data, whereas activity recognition focuses on current activities.

Early works on activity prediction [180, 115, 87] relied on the history of check-in locations

provided by the user. Later work [177, 88] added temporal information to the analysis

of activities given location data. None of these works utilized the post content of the users,

which is the major focus of our models. Weerkamp et al. [165] predicted future activities by

summarizing tweet topics where a future time frame is mentioned. To recognize the current

activities, Song et al. [146] built a framework that incorporated the similarity measurement

between the bag-of-word-based classifiers of different users by comparing the decisions

of the classifiers. It assumes that friends on social platforms are connected through their

activities. Shared interests are indeed common among friends; however, we think
offline activities do not necessarily follow the same pattern. In contrast, our belief is that

contextual information provided by the same author is more relevant in recognizing offline

activities.

For the task of text mining, LSTM [53] has been used widely for modeling sequential

data. Greff et al. [47] performed a comparison across eight content-based LSTM variants,

and demonstrated that these variants have only limited improvements. To improve per-

formance, Bi-directional LSTM (BiLSTM)[142] and LSTM with a Convolutional Neural

Network (CNNLSTM)[189] are introduced to capture more appropriate information. Re-

cently, attention mechanisms have been added to LSTM [162, 91] to strengthen the ability

of handling long-dependencies. In order to incorporate external information, Ghosh et al.

[42] built a contextual LSTM model that adds the contextual feature into the calculation of

each gate function. Yen et al. [182] utilized a multi-task LSTM and included contextual

information by simply concatenating the features. Finally, hierarchical LSTM models have

been built [190, 59] that stack LSTM models with different levels of sequential data. In

general, the effectiveness of each model is highly reliant on the input data and features;

thus, none of the models appear good enough to work with all types of contextual data.

We look into the capabilities of several contextual models with respect to different contex-

tual features and create with a hybrid model that takes advantage of the success of these

models.

4.3 Working with Contextual Features using LSTM

In this section, we first describe the process of creating and assigning activity labels to

tweets. Then we show the work exploring several models that are built based on LSTM to

include contextual features.

4.3.1 Activity labeling

Similar to the labeling approaches of [87] and [146], we design an automatic labeling

process that uses the reported location of the tweets to assign labels. The reported location
is highly predictive of the activity associated with the tweet. Essentially, we categorize lo-

cations and use predetermined rules to map locations to activities. Note that we also create

additional mapping rules to overcome errors brought by locations that could be involved in

multiple activities.

4.3.2 Contextual learning with LSTM

A typical approach to improving model performance is to include additional, and hope-

fully, more useful features. We therefore examine several popular LSTM-based models that

use contextual features, including static features such as time of post, sequential features

such as POS tags, and historical features such as the most recent tweets from the same

author. The sequence of POS tags allows a better understanding of the content through the positioning of words. The timing of the post and historical tweets may provide

useful background knowledge of the target tweet. Because the goal of the system is to pro-

vide real-time recognition of activities associated with a given target tweet, we utilize only

tweets posted prior to the target tweet. We do not include the topics of the tweet because

prior work [178, 104] showed topics to be ineffective.

Original LSTM

Sequential models such as LSTM and Gated Recurrent Unit (GRU) [23] are ideal for

text processing because they consider the order and dependencies of tokens. Given that

LSTM and GRU have comparable performance, we use LSTM as the baseline to be improved upon by including contextual features.

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \qquad (4.1)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \qquad (4.2)

c_t = f_t \, c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \qquad (4.3)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \qquad (4.4)

h_t = o_t \tanh(c_t) \qquad (4.5)

where i, f , and o are the input gate, forget gate, and output gate, respectively, x is the

input, c is the cell memory, b is the bias, and h is the output.

A simplified architecture of the LSTM model used for a text classification problem is

shown in Figure 4.1. The output of the embedding layers is a sequence of vectors that

represent the input sequence. LSTM outputs a flat vector representation for the entire input

sequence, and it is fed into another layer to generate the classification output. For our

activity recognition task, the tweet content is the input and the activity label is the output.

Figure 4.1: LSTM for text classification
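For concreteness, the following is a minimal tf.keras sketch of this content-only architecture. The vocabulary size and sequence length are illustrative placeholders; the 200-dimensional embeddings, 200-node LSTM, and dropout of 0.2 follow the settings reported in the experiment section.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

VOCAB_SIZE, MAX_LEN, NUM_LABELS = 50000, 40, 6   # illustrative sizes

tokens = layers.Input(shape=(MAX_LEN,), name="tweet_tokens")
x = layers.Embedding(VOCAB_SIZE, 200)(tokens)    # sequence of word vectors
x = layers.LSTM(200, dropout=0.2)(x)             # flat vector for the whole tweet
outputs = layers.Dense(NUM_LABELS, activation="softmax")(x)

model = Model(tokens, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```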

Joint-LSTM

Similar to the idea of Yen et al. [182], we design a Joint-LSTM (J-LSTM) model

to concatenate the flat representations of the sequential input of content and contextual

features before feeding them to the output layer.

Figure 4.2 shows an example design of Joint-LSTM model. The sequence of POS tags

and the post time of the tweet shown in the figure are the direct contextual features, i.e., those that relate directly to the target tweet. The POS tag sequence is generated from the word

sequence and is fed into the model using embedding and LSTM layers. Post time is a

feature that is closely related to offline activities. We treat post time as a sequence of size

1 to be able to use it flexibly in multiple models. We find little difference in terms of the overall performance between this approach and other approaches, such as feeding the time directly into a dense layer.

Figure 4.2: Joint-LSTM for text classification

In addition, the J-LSTM model in Figure 4.2 includes historical

tweets. They are modeled similar to the target tweet, and they share the same embedding

layer with the target tweet. Because the concatenation happens at the level of flat representations of the input sequences, J-LSTM suffers from a weakening of the sequential information in both the contextual and content features.
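A minimal sketch of this flat-level concatenation, again in tf.keras with illustrative sizes, is shown below; a single historical tweet stands in for the full set of recent tweets, and post time is simplified to one combined day/period index.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, NUM_LABELS = 40, 6

content = layers.Input(shape=(MAX_LEN,))
pos_tags = layers.Input(shape=(MAX_LEN,))
post_time = layers.Input(shape=(1,))          # post time treated as a sequence of size 1
history = layers.Input(shape=(MAX_LEN,))      # one historical tweet, for brevity

word_emb = layers.Embedding(50000, 200)       # shared by the target and historical tweets
content_vec = layers.LSTM(200)(word_emb(content))
history_vec = layers.LSTM(200)(word_emb(history))
pos_vec = layers.LSTM(200)(layers.Embedding(25, 20)(pos_tags))
time_vec = layers.Flatten()(layers.Embedding(28, 20)(post_time))  # 7 days x 4 periods

merged = layers.Concatenate()([content_vec, pos_vec, time_vec, history_vec])
outputs = layers.Dense(NUM_LABELS, activation="softmax")(merged)
model = Model([content, pos_tags, post_time, history], outputs)
```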

Contextual-LSTM

Ghosh et al. [42] propose a Contextual LSTM (C-LSTM) model to handle contextual

information. They add the contextual feature directly to the decision function of each gate,

as shown in the following equations.

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i + W_{Ei} E) \qquad (4.6)

f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f + W_{Ef} E) \qquad (4.7)

c_t = f_t \, c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c + W_{Ec} E) \qquad (4.8)

o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o + W_{Eo} E) \qquad (4.9)

h_t = o_t \tanh(c_t) \qquad (4.10)

where i, f and o are the input, forget, and output gates, respectively, x is the input, c is

the cell memory, b is the bias, h is the output, and E represents the contextual features.

Figure 4.3: Contextual-LSTM for text classification

The implementation of C-LSTM is quite simple. It concatenates the embedded se-

quences of the contextual features with the embedded sequence of the content, and the

concatenation is sent to an LSTM layer. Figure 4.3 shows an example of C-LSTM model

that takes POS sequence, post time sequence, and historical tweets as contextual features.

To form the concatenation properly with all the input embeddings, static features such as

post time are duplicated into a sequence that repeats the same value. Using the same

input and embedding settings as the J-LSTM model, the embeddings of the target tweet

content and the contextual features are concatenated before being sent to the LSTM layer.

Therefore, C-LSTM requires the contextual features to have a certain relationship with the

content at every timestep.
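The stepwise concatenation can be sketched as follows (illustrative sizes; the static post-time embedding is repeated along the sequence so it can be joined with the content at every timestep).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, NUM_LABELS = 40, 6

content = layers.Input(shape=(MAX_LEN,))
pos_tags = layers.Input(shape=(MAX_LEN,))
post_time = layers.Input(shape=(1,))

w = layers.Embedding(50000, 200)(content)      # (batch, MAX_LEN, 200)
p = layers.Embedding(25, 20)(pos_tags)         # (batch, MAX_LEN, 20)
t = layers.Embedding(28, 20)(post_time)        # (batch, 1, 20)
t = layers.Lambda(lambda z: tf.repeat(z, MAX_LEN, axis=1))(t)  # duplicate per timestep

x = layers.Concatenate(axis=-1)([w, p, t])     # stepwise concatenation
x = layers.LSTM(200, dropout=0.2)(x)
outputs = layers.Dense(NUM_LABELS, activation="softmax")(x)
model = Model([content, pos_tags, post_time], outputs)
```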

Hierarchical-LSTM

Existing Hierarchical LSTM (H-LSTM) models such as [190] are used mainly to model

content at different levels of detail. In addition, Huang et al. [59] used the structure to

incorporate social context such as retweets and replies. In contrast, we utilize a similar

H-LSTM structure, but include the historical tweets from the same author in chronological

order.

Figure 4.4: Hierarchical-LSTM for text classification

Figure 4.4 shows the structure of the H-LSTM model. Each LSTM segment on the

individual level handles a single tweet sequence. The input to the sequence level LSTM

is a sequence of historical tweet representations, where the first one is the oldest tweet

and the last one is the target tweet. Because the tweet representations in the sequence

level are formed in chronological order, the sequence can be modeled to learn the historical

background of the activity label of the target tweet. To further utilize the historical tweets,

we also add a self-attention mechanism [154] to the LSTM on the sequence level. All tweet

contents share the same embeddings across the model. The hierarchical structure strictly

limits the types of features that can be used; consequently, tests on other contextual features such as post time and the POS tag sequence yield disappointing performance.
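A compact sketch of the two-level structure is given below (illustrative sizes; the self-attention on the sequence-level LSTM is omitted for brevity).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

NUM_TWEETS, MAX_LEN, NUM_LABELS = 6, 40, 6   # five historical tweets plus the target

tweets = layers.Input(shape=(NUM_TWEETS, MAX_LEN))
emb = layers.Embedding(50000, 200)(tweets)                   # (batch, 6, 40, 200)
tweet_vecs = layers.TimeDistributed(layers.LSTM(200))(emb)   # one vector per tweet
seq_vec = layers.LSTM(200)(tweet_vecs)                       # sequence-level LSTM
outputs = layers.Dense(NUM_LABELS, activation="softmax")(seq_vec)
model = Model(tweets, outputs)
```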

4.4 Our Proposed Hybrid-LSTM Model

4.4.1 Including historical tweets

In this section, we first analyze the three popular models described in the previous sec-

tion with respect to their ability to incorporate contextual features. Based on the analysis,

we propose a hybrid LSTM model to better support rich contextual learning.

We conduct a comparison on a development dataset using J-LSTM, C-LSTM, and H-

LSTM with features of POS tag sequence, post time, and historical tweets. Details on the

construction of the dataset will be covered in the experiment section. These features are

used to explore a more general conclusion for the capability of the contextual models. The

accuracies shown in Figure 4.5 are weighted-average scores across all labels to handle the

imbalanced dataset. In addition, Table 4.2 lists several sample tweets that will be used in

the ensuing analysis.

Figure 4.5: Comparison of ability to incorporate contextual features between different models

The bottom right chart in Figure 4.5 shows the use of three models in handling the most

recent five historical tweets. We test with different numbers of historical tweets and find

that the relative performances of different models are similar. Tweet 1 in Table 4.2 was

posted while watching a baseball game and the author posts only baseball-related tweets.

It is surprising that H-LSTM has the worst performance as the structure is designed specif-

ically for historical data. H-LSTM also cannot recognize the correct activity for Tweet 1.

The attention mechanism aims to handle historical information more appropriately, but it does not yield any improvement. The utilization of chronological order in including

historical tweets may not be applicable to activity recognition on the target tweet. In other words, the habit of posting tweets may not form a chronological dependency chain across historical tweets.

1 Nice day for a game. Less nice was Warren’s first inning.
2 Biggest flag I’ve seen in person. Very cool. #NeverForget #911
3 We made it. #BEmediaday
4 The wait is over! #GreatBarrierReef #Ashes #GoldCoast

Table 4.2: Sample Tweets for Model Analysis

C-LSTM incorporates historical information by a stepwise concatenation of the tweet

sequences. We believe that historical tweets have hidden information related to the target

tweet, but such information is unlikely to be effectively captured in a word-to-word style.

Similar to C-LSTM, J-LSTM does not carry any order information. The merging of the

information for J-LSTM happens at the level of entire tweets, thus it relies on sharing of

the complete information among historical tweets. Because the historical tweets of Tweet

1 also contain a lot of baseball-related words, J-LSTM and C-LSTM are able to recognize

the correct activity of Tweet 1. In addition, the historical tweets of Tweet 2 are very diverse

in terms of the length, topic, and writing style. Therefore, C-LSTM is not able to filter

the noise while J-LSTM still works by combining the complete information. Based on this

analysis, we think that a simple combination of complete recent tweets could better support

the classification of the target activity.

4.4.2 Including direct contextual features

Because H-LSTM is introduced to include historical tweets, we apply only J-LSTM and

C-LSTM to the contextual features of POS tags and post time (see remaining three charts

in Figure 4.5). In general, C-LSTM performs better in handling both features. Because

C-LSTM is designed to incorporate features at each step of the input sequence, it generates

a larger improvement with stepwise features such as POS tags. When dealing with static

features like post time, C-LSTM adds the same information to the gate decision for each

input step of the content sequence. On the other hand, J-LSTM incorporates this contextual

information to the representation of the entire target tweet.

Tweet 3 is relatively short, but the post time of 6:19 a.m. would help to recognize the

activity of traveling. After segmenting the hashtags in Tweet 4, and knowing the tokens

are proper nouns, we understand that the author is traveling to Australia. For both tweets,

C-LSTM performs better by including the contextual information more accurately with

the corresponding words. Therefore, with deeper and more precise incorporation at each

timestep, C-LSTM is more suitable in handling direct contextual features.

4.4.3 Hybrid-LSTM

The analyses above show that historical features are better handled by concatenation

at the flat representation level and direct contextual features work better with stepwise

concatenations. In order to handle rich contextual learning that includes different types of

contextual features, we propose a hybrid LSTM model (HD-LSTM) based on the analysis

above. HD-LSTM aims to cover a wide range of contextual features and utilizes different

modeling layers for different contextual features. With the capability of various layers in

incorporating certain features, HD-LSTM is able to reach a better performance by handling

the contextual features more appropriately.

Figure 4.6 shows a sample design of HD-LSTM that utilizes text input, along with

contextual features of historical information, POS tag sequence, and post time. In particu-

lar, for each tweet component shown in the dashed box, the content sequence and the direct

contextual features are combined with a concatenation of their embeddings.

Figure 4.6: Hybrid-LSTM for text classification

In each dashed box, post time is used to mark the moment when the tweet was written, while POS tag se-

quence helps understand how each word was actually used in the tweet. Then the enriched

sequential representation is fed into an LSTM network that generates a flat vector represen-

tation for the tweet component. At this step, each LSTM module learns the representation

for the semantic, syntactic, and temporal information of the tweet. Next, the enriched flat

representations for all tweets are concatenated to form a larger representation that contains

the information from all inputs. This concatenation further includes the historical informa-

tion of the target tweet to improve the overall understanding of an enriched background.

Finally, the concatenated vector is fed into the output layer and generates the label.

The features that belong to the same category across all tweet components share the

same embedding. In our case, all tweet content sequences, POS tag sequences, and post

times share the same embeddings, respectively. To further boost the proposed hybrid

model, we also add self-attention to all involved LSTM layers.
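The following sketch puts the two mechanisms together: a C-LSTM-style stepwise concatenation inside each tweet component, and a J-LSTM-style flat concatenation across components (illustrative sizes; the self-attention added to the LSTM layers is omitted).

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, NUM_HIST, NUM_LABELS = 40, 5, 6

# Embeddings of the same feature category are shared across all tweet components.
word_emb = layers.Embedding(50000, 200)
pos_emb = layers.Embedding(25, 20)
time_emb = layers.Embedding(28, 20)

def encode_component(words, pos, time):
    # Stepwise concatenation of content and direct contextual features (C-LSTM style).
    w, p, t = word_emb(words), pos_emb(pos), time_emb(time)
    t = layers.Lambda(lambda z: tf.repeat(z, MAX_LEN, axis=1))(t)
    x = layers.Concatenate(axis=-1)([w, p, t])
    return layers.LSTM(200)(x)               # flat vector for this tweet component

inputs, vectors = [], []
for _ in range(NUM_HIST + 1):                # target tweet plus five historical tweets
    wi = layers.Input(shape=(MAX_LEN,))
    pi = layers.Input(shape=(MAX_LEN,))
    ti = layers.Input(shape=(1,))
    inputs.extend([wi, pi, ti])
    vectors.append(encode_component(wi, pi, ti))

merged = layers.Concatenate()(vectors)       # flat-level concatenation (J-LSTM style)
outputs = layers.Dense(NUM_LABELS, activation="softmax")(merged)
model = Model(inputs, outputs)
```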

4.4.4 Illustrative examples

#  Content                                                          True / Hybrid  LSTM      w/ Hist    w/ POS&T
1  Breakfast of champions                                           Traveling      Dining    Traveling  Dining
2  I guess the word has gotten out about E’s ... so crowded today   Dining         Shopping  Dining     Dining
3  Last time I was here was pretty sad. #BaptistHospital            Enhance        Enhance   Entertain  Enhance

Table 4.3: Sample tweets and model predictions

Table 4.3 lists several examples from the development set to show the effect of includ-

ing contextual features on recognizing activities, and the success of the proposed hybrid

model. We use LSTM to show the performance when using content only, J-LSTM to ap-

ply historical tweets, C-LSTM to include both POS tags and post time features, and use

HD-LSTM to combine all these contextual features.

Tweet 1 shows a strong relation to breakfast; however, the true situation is that the

author took a photo of a sandwich while he was waiting at an airport. It is reasonable that

using the tweet content leads to a decision of the “dining” activity, and this holds even

if the post time is considered. The most recent two historical tweets from the author talked

about leaving the hotel and arriving at the airport. Thus, including the historical tweets

becomes very useful in recognizing the correct “traveling” activity. Tweet 2 describes a

situation where the author is surrounded by many people. With only this clue, it is possible

that the author was shopping at a mall, having dinner, or waiting at a train station. The

post time of 12:07 p.m. on a Sunday increases the possibility of having a meal, and the

model makes the correct decision. The true activity of the author is dining in a cafeteria,

and “E” is the name of the place. Because “E” is an unusual name for a cafeteria, it becomes

hard for the content-only model to utilize this information. In addition, recent tweets from

the author discuss having fun with friends, which also helps determine the correct activity.

Tweet 3 has a strong indicator that the author is at a hospital and the content-only model

can generate the correct output. However, including the historical tweets leads to an incorrect prediction of “entertaining.” Several historical tweets mention drinking wine, which

could mislead the historical model. Those historical tweets are all posted at night, while

the post time of the target tweet is early in the morning. Considering this, the hybrid model

is able to give the correct decision by distinguishing the different topics between the target

tweet and the historical tweets.

4.5 Experiments

In this section, we describe our experiments that explore the performance of different

LSTM-based models, focusing on comparing their abilities to incorporate contextual fea-

tures in a tweet classification task. As we have stated throughout, the contextual features

include the POS tag sequence and post time of a tweet, as well as the most recent historical

tweets from the same author. Although author identity has proven to be helpful in many

tasks [7, 19], we do not include it because it could potentially create a strong bias in the

model and is not general enough for ordinary inference tasks.

4.5.1 Data preparation

Although normally desirable for supervised learning, manual labeling was problematic

for labeling tweets with activities for the following two reasons. First, humans are good at

recognizing surface meaning, especially when no background and external information are

required. Thus, manual labeling suffers from the same problem shown by the examples

described in the first section. The activities that cannot be inferred from the content itself

are unlikely to be labeled correctly by humans. Second, a labeled dataset of sufficient size

was highly desirable because the size of the training data is related to the quality of the

model. Although there are certain ways to crowdsource the labeling process, obtaining

sufficient labeled tweets with consistent quality seemed infeasible. Therefore, we label the

activities based on the reported locations.

We started the data collection by defining a list of place categories that are strongly

related to certain activities. Then we used Google Maps API to collect specific places for

each category with detailed coordinates. Finally, we used Twitter API to collect tweets

that are posted with a reported location that is also within a range of 10 meters from the

coordinates of a specific place. We removed duplicates and only included the tweets that

have reported location type as Point of Interest (POI). POI indicates that an activity can be

conducted at this location [87]. To further clean the data, we removed tweets that contain

less than three tokens or tweets where more than 70% of the tokens are mentioned user-

names. Hashtags are useful elements in tweets and sometimes they can be strong indicators

of locations or activities. However, such use of hashtags may also lead to over-fitting the

model, and the unique manner of creating hashtags makes them less useful to unseen ones.

To prevent this problem while preserve the meaning, we removed the hashtag signs and

segmented the hashtag content so that the hashtags are separated into ordinary words.

Table 4.4 shows the relationship between the predetermined place categories and activ-

ities. As mentioned, additional rules are used to improve the labeling quality. For example,

tweets that have the noun keyword “ceremony” at location “stadium” should be labeled as

“enhancement.”

Activity Tweet Count Locations
Enhancement 3848 hospital, library, dentist, doctor, school, university, etc
Traveling 12371 airport, bus station, train station, lodging, etc
Dining 3934 bakery, liquor store, bar, restaurant, meal delivery, cafe
Entertaining 11457 aquarium, movie theater, museum, night club, etc
Shopping 4045 department store, book store, convenience store, etc
Sporting 10028 stadium

Table 4.4: Location - activity label mapping
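A minimal sketch of this mapping with one override rule is shown below; the category lists are abbreviated from Table 4.4, and the helper structure is hypothetical.

```python
# Abbreviated location-category to activity mapping from Table 4.4.
PLACE_TO_ACTIVITY = {
    "hospital": "Enhancement", "library": "Enhancement", "school": "Enhancement",
    "airport": "Traveling", "train station": "Traveling", "lodging": "Traveling",
    "bakery": "Dining", "bar": "Dining", "restaurant": "Dining", "cafe": "Dining",
    "movie theater": "Entertaining", "museum": "Entertaining", "night club": "Entertaining",
    "department store": "Shopping", "book store": "Shopping",
    "stadium": "Sporting",
}

def label_tweet(place_category, tokens):
    """Assign an activity label given a tweet's reported place category."""
    # Override rule: a "ceremony" at a stadium is an enhancement activity, not sporting.
    if place_category == "stadium" and "ceremony" in tokens:
        return "Enhancement"
    return PLACE_TO_ACTIVITY.get(place_category)

assert label_tweet("stadium", ["graduation", "ceremony", "today"]) == "Enhancement"
```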

Although the data collection process is initialized with the same number of requests for

each activity type, the process results in an imbalanced dataset. In our test, down-sampling

or over-sampling the dataset does not reveal any considerable difference in the overall

performance. Therefore, training data are processed with different weights with respect

to different classes, and the metrics are calculated as the weighted average across classes

(one consequence is that F1-score may not fall in between precision and recall values). The

training, development, and test sets are divided randomly with ratios of 0.6, 0.2, and 0.2, respectively.

4.5.2 Experiment design

To show the improvement of using contextual features, we also experiment with other

content-only LSTM-based models, i.e., BiLSTM [142], CNNLSTM [189], and LSTM with

self-attentions (LSTM+Att). Unlike certain previous tasks [156, 29], using a word-level

model results in a better performance than a character-level model in our task. We apply the

idea of transfer learning to initialize tweet content embeddings using GloVe [124] before

training. This creates a more domain-specific word embedding compared with using fixed

pretrained embeddings, and it also generates better performance compared with randomly

initialized embeddings. Additionally, POS embeddings are initialized randomly. Post time

is represented as day of the week and time of the day, and we set four six-hour time periods

per day. Tweet content embeddings have 200 dimensions, while POS tags, time, and day

are all mapped to embeddings of 20 dimensions.

When testing with different numbers of historical tweets, we found that including the

five most recent tweets as the contextual feature yields the optimal performance for most

models. We note that H-LSTM is much more sensitive to the number of historical tweets

compared with other models. POS tags are generated using a tweet-specific tagger [116], and the models are built mainly using Keras [20]; source code is available at https://goo.gl/o9dsBh. We use 200 nodes for all the LSTM

networks in the experiment with a dropout rate of 0.2, categorical cross-entropy as the loss

function, apply Adam optimization for training, and set a mini-batch size of 100. The softmax function is used in all output layers, and each model is tuned over different numbers of epochs for optimal performance.

4.5.3 Experiment results and analysis

Table 4.5 lists the performance of different models. For contextual features, “Direct”

refers to the use of the POS tag sequence and post time features in addition to the target tweet content,

while “All” denotes the use of POS sequence and post time with the content of both the

target tweet and the five most recent historical tweets.

Models that only use the target tweet content generate results with only limited im-

provement over the original LSTM. In contrast, the use of contextual features boosts the

performance. The post time is more useful than the POS tag sequence, and the benefit of

including historical tweets varies with the method of incorporation.

LSTM uses only the content of tweets and reaches a reasonable performance for the

task given that it has six labels. Bi-LSTM adds the ability to understand the content in another order and helps improve the outcome.
Content-only
LSTM BiLSTM CNNLSTM LSTM+Att
Recall 65.62 66.62 65.62 66.99
Precision 65.25 66.02 65.01 66.66
F1 64.96 65.71 65.06 66.56
J-LSTM
Time POS Direct Hist=5 All
Recall 66.65 66.00 66.12 67.30 66.91
Precision 65.76 65.40 65.94 67.03 67.88
F1 65.98 65.54 65.98 67.04 67.19
C-LSTM
Time POS Direct Hist=5 All
Recall 66.73 66.85 67.01 66.77 66.80
Precision 66.62 66.33 66.53 66.74 67.61
F1 66.29 66.30 66.61 66.33 67.06
H-LSTM
Hist=5 Hist=5+Att
Recall 65.16 65.56
Precision 66.62 65.53
F1 65.69 65.44
HD-LSTM w/ Hist=5
Time POS Direct Direct+Att
Recall 67.68 67.70 68.70 69.74
Precision 68.60 67.06 68.13 70.00
F1 68.03 67.22 68.23 69.84

Table 4.5: Comparison of model performance

Adding the convolutional layer does not

provide much improvement. CNN is used to extract information similar to an n-gram

model, and the informal use of words in tweets reduces the usefulness of such information.

As expected, adding attention mechanism considerably helps the performance.

J-LSTM works better for including historical tweets and C-LSTM performs better with direct contextual features, while H-LSTM does not handle historical tweets well.

Because C-LSTM incorporates the contextual features into every token of the input se-

quence, C-LSTM shows benefits from adding more direct contextual features. All three

contextual models are able to benefit from including historical tweets. It is surprising that

C-LSTM generates a certain level of improvement with historical tweets. C-LSTM in-

cludes the tokens from historical tweets with the tokens from the target tweet at each timestep, and it is not intuitive that words from different tweets have direct relationships. We think that some hidden attributes shared across tweets from the same author bring the

improvement, such as the use of certain words while the author is engaged in a particular

activity.

Combining the power of both J-LSTM and C-LSTM, the hybrid model outperforms

both content-only models as well as models that use a fixed method to incorporate con-

textual features. When including all features, the large improvement of HD-LSTM over

J-LSTM and C-LSTM shows the effectiveness of the hybrid model. The reported improve-

ments in performance further strengthen the analysis that was used to build the proposed

model: historical tweets can be handled better by concatenating the complete information

of tweets, and the stepwise concatenation of feature representations works better to include

direct contextual features. It is also obvious that HD-LSTM benefits simply from including

more contextual features. In contrast, using a single method to incorporate more contextual

features does not improve the performance consistently. Finally, HD-LSTM also benefits

from adding a self-attention mechanism to LSTM layers.

4.6 Demonstrating Use of the Approach: A Case Study

In this section, we exhibit a real-world case where activity recognition is applied to a large

volume of tweets. The results validate the effectiveness of the activity recognition model.

We find seven popular accounts that all have a large number of followers but are distinct

in their fields of focus. For each account, we collect 10,000 followers randomly and, for

each follower, we collect the most recent 200 tweets. For each tweet, we apply the hybrid

model with POS sequence, post time and historical tweet features to generate a probability

distribution over activities. Then we generate a distribution of activities for each follower

by combining the distributions of the tweets posted by that follower. Thus, we are able

to accumulate the distribution for each follower to generate a probability distribution of

the activity labels over the collection of followers for each popular account. This activity

distribution is used to represent the follower activity profile for this popular account. In

the equation below, p_{f,t,i} is the probability for the i-th activity label given a single tweet t from follower f. The probability P_i for the i-th activity for the collection of followers for an account would be:

P_i = \frac{1}{Z_0} \sum_{f \in F} \frac{1}{Z_1} \sum_{t \in T} p_{f,t,i} \qquad (4.11)

Because there are duplications and invalid tweets involved in the dataset, the number

of tweets for each follower used for the model may not be the same. Therefore, we have a

normalization factor Z_1 to normalize for each follower, and another factor Z_0 to normalize

for each popular account. In addition, F is the set of followers for the account, and T is the

collection of tweets for a given follower.
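In code, Equation 4.11 amounts to a per-follower mean over tweets followed by a mean over followers; the sketch below assumes a hypothetical layout where each follower maps to a list of per-tweet probability vectors.

```python
import numpy as np

def account_activity_profile(followers):
    """followers: dict of follower id -> list of per-tweet activity probability vectors."""
    follower_means = []
    for tweet_probs in followers.values():
        probs = np.asarray(tweet_probs)              # (num_tweets, num_activities)
        follower_means.append(probs.mean(axis=0))    # (1/Z1) * sum over tweets in T
    return np.mean(follower_means, axis=0)           # (1/Z0) * sum over followers in F

# Followers with different tweet counts contribute equally after normalization.
profile = account_activity_profile({
    "f1": [[0.7, 0.3], [0.5, 0.5]],
    "f2": [[0.1, 0.9]],
})
print(profile)   # [0.35 0.65]
```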

We train the model using the full dataset from the experiment. Figure 4.7 shows the

results of analysis for these popular accounts. To make the graph more understandable,

we present the probability for each activity label across the popular accounts, and therefore the probabilities displayed do not sum up to 1. The imbalanced dataset used to

train the model creates certain trends in different activity labels, but the comparison within

each activity label can still be useful to draw conclusions.

Figure 4.7: Summary of the activity distributions for followers of popular accounts

It is straightforward to see that espn has a high probability for “Sporting” and Trav-

elEditor holds the peak in “Traveling.” khanacademy and ClevelandClinic represent ed-

ucational and medical needs and lead to an obvious result of the highest probabilities in

“Enhancement.” It is interesting that ClevelandClinic has the second-highest follower attention for travel. The clinic's need to expand medical services and its patients' need to travel to medical facilities could explain this elevated attention to “Traveling.” WholeFoods and sprinkles, as a food market chain and famous

cupcake bakery, have the highest involvement of both “dining” and “shopping” for their

followers. Interestingly, the followers of WholeFoods also care about personal enhancement beyond food. YouTube has a high involvement of “Entertaining” for its followers,

while the peak of sprinkles indicates that the interest in cupcakes could lead to the interest

in entertainment.

These observations and conclusions provide validation of the usefulness and effective-

ness of the activity recognition model.

4.7 Summary

We present a methodology for including contextual features to improve the perfor-

mance of content-based LSTM models, with an application of recognizing offline activ-

ities of a user when posting tweets. Our contributions include a location-based method to

label tweets with offline activities, as well as an analysis and exploration of the different

ways of including direct and historical contextual features with LSTM and the effectiveness of

each technique. We propose a hybrid LSTM model that combines and takes advantage of

the various methods to include contextual features. Our experiments show that including

contextual information improves performance over the content-only models. Further, the

hybrid model is able to incorporate the contextual features more effectively than existing

methods. The amount of improvement shows the importance of choosing the right method

for including certain types of contextual features. Finally, we validate our activity recog-

nition model by using it to derive an activity analysis of the followers for several popular

Twitter accounts.

Chapter 5: Constrained Paraphrase Generation for Commercial Tweets

5.1 Introduction

Our last work aims at the core of social media advertising: crafting commercial tweets.

Social media has become an extremely popular platform for corporate marketing and ad-

vertising [153]. Generating attractive yet precise commercial tweets has become a critical

challenge for companies. In order to maximize their effect, multiple commercial posts

containing the same information are often sent to their target audiences. Figure 5.1 gives

an example of multiple commercial tweets containing the same information about a new

product Spicy Chicken McNuggets, posted on the same day. While capturing the same es-

sential information, these tweets are worded differently, in order to not look repetitive and

capture the interest of more potential customers.

At present, all the commercial posts are still crafted manually, making social media

presence management a substantial investment for companies [164]. We believe much of

the work can be assisted by systems that help generate new commercial posts with the

same meanings yet with different phrasing. Such an approach could assist in automatically

creating distinct commercial posts, as well as crafting attractive ones.

Our research focuses on paraphrase generation for commercial tweets that preserve

the original meaning while being diverse. Paraphrase generation has been studied widely,

Figure 5.1: Commercial tweets that are posted for the same product

along with other text generation tasks such as Machine Translation [60], Summarization

[99], Text Simplification [143], Question Answering [34], and others. Recently, the use of

Deep Neural Networks (DNNs) has helped models learn and understand more sophisticated

hidden factors in generating text content [130]. It mainly involves the Seq2seq models [6],

Generative Adversarial Networks [46], and the emerging Transformer-based generation

models [154]. The ability to model the process of text generation, especially the delivery

of semantics, is growing fast.

Controlled paraphrase generation is similar to the problem of text generation but adds

certain specific requirements. The early work focused on attributes such as sentiment or

writing style with the goal of enriching the generated text [58, 185]. In later efforts, more

specific requirements were added, such as the choice of words [18, 186]. Recent work

in this area has incorporated structural information to Graph-to-text generation tasks [28,

145].

The focus of our work is the paraphrasing of commercial tweets. This problem is

distinct from prior work in paraphrase generation in having hard constraints [55, 127, 186].

Unlike the latent controllable attributes, such as writing style, hard constraints are those that

require certain words or phrases to be kept in the generation. For example, the highlighted

parts in Figure 5.1 are considered hard constraints that must be maintained in the generated

paraphrase.

In order to address the problem described above, this chapter proposes a Constraint-

Embedded Language Modeling (CELM) framework for generating paraphrases of com-

mercial tweets in ways that meet hard constraints while encouraging diversity in the gen-

erated text. Specific components of our work include utilizing a large paraphrase dataset

and showing its compatibility by applying the learned knowledge to commercial posts on

social media, introducing an automatic process to identify the hard constraints in a text and embed them directly into the text data, and showing that the embedded

constraint information can help learn a causal language model and results in performance

improvements.

To the best of our knowledge, this is the first work that embeds generation constraints in

the learning process of language models. The proposed constrained generation framework

outperforms the existing CopyNet structure [48] across multiple evaluation metrics. At the

same time, we show that the constraint-embedded data can enhance the performance of

CopyNet.

5.2 Related Work

Over decades, the topic of paraphrase generation has taken a similar research path as

other text generation tasks. Linguistic knowledge was first introduced with hand-crafted

rules to build such systems [103]. Statistical models were also used with shallow linguistic features [187], while syntactic and semantic information was explored to help the modeling

of paraphrase generation [38, 74]. The success of Deep Neural Networks in Machine

Translation has been matched by its efficacy in paraphrasing as well. Learning from a large

parallel corpus, standard encoder-decoder structure can model the source text as a hidden

representation and generate the target paraphrase based on that [128, 95, 33]. Word Embed-

ding Attention was added to better model the semantics of the words [94]. An evaluator was later introduced to build reinforcement learning frameworks to improve the performance of

paraphrase generation [86]. Another approach to improve the generative performance is ap-

plying Variational Autoencoder (VAE) to the encoder-decoder structure [49, 16]. Recently,

the powerful Transformer structure [154] was applied to paraphrasing tasks [169, 37]. In

addition to word sequences, Wang et al.[158] also applied Transformer to the correspond-

ing frame and role label sequences to improve the generation performance. In this chapter,

we build on the general approach of using Transformer-based language models for para-

phrase generation tasks, since these models have the most promising performance.

Conditional or constrained paraphrase generation requires certain attributes or elements

to be included in the output text. This additional step helps the paraphrasing to be more

task-oriented and improves the generation quality. The attention mechanism is utilized to build Pointer Net [155] and CopyNet [48] to specifically locate the relation between the words in the source and target sequences; this can be a promising approach for meeting hard

constraints as generated text can include certain words from the source text. Cao et al.[18]

trained a separate alignment table to limit the vocabulary used in the decoding process.

Hu et al. [58] incorporated discriminators and a latent code to the VAE Encoder-Decoder

model to control the attributes incorporated in the generated text. Chen et al. [21] added

two hidden codes to represent the semantic and syntactic attributes, which they used to

control the semantic similarity and writing style, respectively. To create text content for

adversarial attacks, Wang et al. [159] included a separate controlled attribute in an encoder-

decoder framework. Generative Adversarial Networks (GAN) are combined with Trans-

former to incorporate the writing style that is extracted from a reference text to the output

text [185]. Keskar et al.[66] built an explicit relationship between subsets of training data

and the generative model using control codes. Recent work [105, 186] treated the con-

strained generation in a special way by inserting words based on the pre-defined keywords.

This ensures the persistence of the keywords, and it also fixes the order of the words. In

addition to the keywords, the generated text only relies on the learning domain, and it

cannot take context information per generation task, such as the source sequence. How-

ever, we want a model that can be flexible in terms of the order of these keywords in the

output paraphrase. Several methods [55, 127, 57] have been proposed to handle hard-constrained generation by modifying the decoding and inference stages. Our work focuses on the same requirement of hard constraints, but the constraint information is learned through language models.

5.3 Constraint-Embedded Language Modeling (CELM) for Paraphrase Generation

Figure 5.2: Overview of the constrained generation process

Language models are typically used to understand the writing of natural language text

and to generate natural text based on the learned knowledge. Language models can learn

from the text, including grammar rules, word usage, and writing styles. Towards our goal

of paraphrase generation with hard constraints, we believe the models can also learn these

constraints. To be more specific, we propose to embed the constraint information in the

text content and let the language model learn such constraints.

Figure 5.2 shows the overall workflow of our proposed CELM framework. Specifi-

cally, given the original text, hard constraints are identified automatically, and then these

constraints are embedded into the text sequence. A causal language model is used to gen-

erate the output given the embedded text sequence, and the extracted hard constraints are

realized in the output. The rest of this section describes these steps in more detail.

5.3.1 Constraint identification

Instead of using latent variables to control the constrained generation, we embed the

specific constraint information directly in the content. While it may be feasible to assign

the constraints manually, we want the system to identify the words automatically where the

constraints should be applied.

Figure 5.3: Dependency parsing of a commercial tweet

We explore the constraints for commercial tweets starting from certain nouns and nu-

merical representations. We rely on the syntactic dependency parse tree of a given text

sequence to identify the constraint in commercial tweets, as a dependency parse tree can

give more information than part-of-speech (POS) tags. The structure of commercial tweets

is generally simple and the constraints are usually focused on the proper nouns and num-

bers. For example, Figure 5.3 shows the result of dependency parsing of a commercial

tweet. The name Ridley Scott is clearly a constraint in the text. In order to keep certain

information accurate, we mark the numerical representation 6 as another constraint. We

limit the proper nouns to be the root, subject, or object in a dependency relation, while

allowing numerical representations that are number modifiers. A preliminary test shows that this approach achieves 98% precision and 75% recall in identifying the hard constraints in commercial tweets.
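The rules described above can be sketched with spaCy's dependency parser as follows; the exact dependency labels and the merging of multi-token proper-noun phrases are simplified here, so this is an approximation of the identification step rather than its exact implementation.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def find_constraints(text):
    doc = nlp(text)
    constraints = []
    for tok in doc:
        # Proper nouns acting as the root, subject, or object of a dependency relation.
        if tok.pos_ == "PROPN" and tok.dep_ in {"ROOT", "nsubj", "nsubjpass", "dobj", "pobj"}:
            constraints.append(("PROPN", tok.text))
        # Numerical representations used as number modifiers.
        elif tok.like_num and tok.dep_ == "nummod":
            constraints.append(("NUM", tok.text))
    return constraints

print(find_constraints("Ridley Scott directed 6 films in a single decade."))
```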

5.3.2 Constraint embedding

Figure 5.4: Embed the constraint directly to text sequences

Similar to [120], we embed the constraint in the text sequence by replacing the cor-

responding token or phrases with their constraint types. Initially we mark two constraint

types, which are proper nouns and numerical representations. Figure 5.4 shows an example

of replacing a proper noun phrase with a special token to embed the constraint information.

We treat tokens or phrases that have the same value as the same constraint. Therefore, it

is possible that the same constraint occurs multiple times in a single text sequence.

Although the dependency relation is used to identify the constraints, we do not directly add this relationship information to the constraint type. We believe these relations can be learned by an effective language model. Therefore, we omit these relations when constructing the specific constraint information to increase model flexibility. Placing fewer restrictions on constraint construction also increases the number of training samples in which the same constraint type appears.
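One plausible form of this replacement step is sketched below; placeholder spellings such as <PROPN> and <NUM> are our own illustration, not necessarily the exact tokens used.

```python
def embed_constraints(text, constraints):
    """Replace each constraint value with a placeholder token for its constraint type."""
    values = {}
    for ctype, value in constraints:
        bucket = values.setdefault(ctype, [])
        if value not in bucket:                   # equal values count as one constraint
            bucket.append(value)
        text = text.replace(value, f"<{ctype}>")  # one shared placeholder per type
    return text, values

embedded, values = embed_constraints(
    "Ridley Scott directed 6 films in a single decade.",
    [("PROPN", "Ridley Scott"), ("NUM", "6")],
)
# embedded == "<PROPN> directed <NUM> films in a single decade."
```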

5.3.3 Causal language modeling

Language models try to generate the current word wi given the context words wc :

P(w_i) = P(w_i \mid w_c) \qquad (5.1)

In this work, we rely on the popular causal language model that imitates the writing

habits of humans and utilizes only the information generated previously. Therefore, context

words are the words that have been generated previously in the text sequence.

With the growing capability of deep learning models, it becomes possible for language

models to train on extremely large datasets. GPT-2 [130] has been shown to successfully

fulfill many text-generation tasks, such as summarization and translation. Similar to [169,

50], we utilize the pre-trained GPT-2 model as the language model to generate paraphrase

for a given text sequence. GPT-2 is a pre-trained causal language model that focuses on

generating the most appropriate token to form coherent writing. Therefore, we form single

sequences from the paraphrase pairs to fine-tune the model, so that the model learns

to perform paraphrase generation. Paraphrase pairs are concatenated with a special token

(such as ”>>><<<”) to identify the paraphrase activity as well as the separation of the

two text sequences.

We treat the tokens that represent constraint types like ordinary tokens in language

modeling. Along with the paraphrase separation token, these special tokens become part of

the corpus. Because these special tokens have a much higher occurrence than regular words

in the dataset, the paraphrase activity and corresponding constraints are learned more easily

through fine-tuning.
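The data formatting for fine-tuning can be sketched as follows using the HuggingFace GPT-2 tokenizer; the training loop itself is omitted, and the separator and placeholders are left as plain text so the model learns them from the corpus.

```python
from transformers import GPT2Tokenizer

SEP = ">>><<<"   # the paraphrase separation token described above
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def to_training_text(source, target):
    # Linearize one paraphrase pair into a single causal-LM training sequence.
    return f"{source} {SEP} {target}{tokenizer.eos_token}"

sample = to_training_text(
    "he put it up for sale in his stand in <PROPN>.",
    "and he put it on sale at a trade fair in <PROPN>.",
)
input_ids = tokenizer(sample)["input_ids"]
```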

5.3.4 Decoding and generation

One of the goals of the CELM framework is to generate paraphrased commercial tweets with sufficient diversity. The model should avoid reusing many of the tokens from the source

text. Therefore, instead of a greedy approach or a beam search [147], we sample the tokens

to generate multiple paraphrases for each source sequence. Greedy approaches focus on

the output sequences that have the highest probabilities. However, these approaches often

result in generating outputs that have similar text content.

In this work, we apply Top-k sampling [39] that generates each token randomly based

on the conditional probability of the most likely k tokens. The probabilities of the top k

tokens are redistributed.

w_i \sim P(w \mid w_{1:i-1}) \quad \text{for } w \in V_{top-k} \qquad (5.2)

We also introduce the use of the Top-p method, which limits the sampling pool to the smallest possible set of tokens whose summed probability exceeds the probability p. Renormalization is also applied to the limited probability set. Unlike the Top-k approach, Top-p builds a dynamic sampling pool in which fewer tokens are included when the entropy of the probability distribution is lower.

V_{top-p} = \arg\min_{V_p \subseteq V} |V_p| \quad \text{such that } \sum_{w \in V_p} P(w \mid w_{1:i-1}) \geq p \qquad (5.3)

Combining these, each token is inferred by sampling from a dedicated set that meets

both Top-k and Top-p requirements.

w_i \sim P(w \mid w_{1:i-1}) \quad \text{for } w \in V_{top-k} \cap V_{top-p} \qquad (5.4)
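At inference time, this combined Top-k and Top-p sampling corresponds directly to standard sampling options; a minimal sketch with HuggingFace's generate() is shown below, where the k and p values are illustrative rather than the tuned settings.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")   # in practice, the fine-tuned model

prompt = "Looks like most Americans lack savings to cover emergencies >>><<<"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,           # sample instead of greedy decoding or beam search
    top_k=50,                 # keep only the k most likely tokens (Eq. 5.2)
    top_p=0.95,               # then the smallest set with cumulative mass >= p (Eq. 5.3)
    num_return_sequences=3,   # several candidate paraphrases per source tweet
    max_length=140,           # token cap; the experiments cap output at 140 characters
    pad_token_id=tokenizer.eos_token_id,
)
candidates = [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```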

Because the constraints are embedded in the text sequence, the generated sequences

are expected to have the constraint tokens so that they can be realized to the actual values.

Based on the number of constraints for each constraint type, the final realization is handled

in different ways. Figure 5.5 shows the examples for both cases.

Figure 5.5: Examples of text content with single and multiple constraints

Single-constraint Realization: For cases where only one constraint is identified in each

text content, the model simply replaces the constraint tokens with their actual values.

Multiple-constraint Realization: When multiple constraints are extracted for one con-

straint type, the model goes through all possible permutations of the actual values to replace the constraint tokens. This ensures that every actual value takes at least one constraint

token of its type. Then, the output sequence that has the highest semantic similarity to the

original text is picked as the final generation.
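A sketch of the multiple-constraint case is given below; it assumes the number of placeholder slots equals the number of distinct values (the common case), and `similarity` stands in for the sentence-embedding cosine score used above.

```python
from itertools import permutations

def realize(generated, ctype, values, original, similarity):
    """Fill <ctype> placeholders with actual values; keep the most similar candidate."""
    token = f"<{ctype}>"
    best, best_score = generated, float("-inf")
    for perm in permutations(values):            # every ordering of the actual values
        candidate = generated
        for value in perm:
            candidate = candidate.replace(token, value, 1)  # fill one slot at a time
        score = similarity(candidate, original)
        if score > best_score:
            best, best_score = candidate, score
    return best
```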

5.4 Experiments

In this section, we report results from a set of experiments designed to demonstrate the

capability of learning and incorporating hard constraints through causal language models.

Our proposed CELM framework identifies hard constraints automatically from commercial

tweets, embeds the constraints in the tweet, relies on GPT-2 for training and inference, and

realizes the constraint tokens to generate the final output.

To form the comparison, we utilize CopyNet [48] as the baseline. CopyNet is a state-of-the-art language model designed specifically to ensure that certain tokens from the input sequence are kept in the output sequence. CopyNet can

learn the constraint relation through pairs of raw text samples. Additionally, we provide

the constraint-embedded data and apply constraint realization to CopyNet, as well as test

CELM without embedded constraints.

5.4.1 Data preparation

Unlike other text-generation tasks, sources for sentential paraphrase datasets are lim-

ited. In particular, the sizes of the datasets are often small. Popular datasets such as the

one reported by Dolan et al.[32] are generated by human annotators in the news domain.

For the Twitter domain, two paraphrase datasets [174, 75] are constructed either manually

or by relying on a strong assumption of sharing the same links. They suffer from size and domain limitations, which make them less than ideal for training a generation model for commercial

tweets.

Fortunately, we find that the writing style of commercial tweets is more formal and

closer to day-to-day writing. Therefore, we trained the model using the parallel machine-

translated (PMT) paraphrase dataset [167]. This dataset is automatically constructed using

a neural machine translation model, and we are able to use about 5 million sentential para-

phrase pairs for training. Constructed based on CzEng [14], the PMT dataset covers a wide

range of fields including tweets. Because of the formal writing style seen in commercial

tweets, we find that the cases in PMT are highly compatible with commercial tweets.

Table 5.1 lists certain examples that demonstrate this compatibility in terms of the use of

proper nouns and imperative sentences.

Dataset     Sample Content
PMT pair    he put it up for sale in his stand in Hong Kong. /
            and he put it on sale at a trade fair in Hong Kong.
CommTweet   Take a look at the recent PRIV launch in Hong Kong
PMT pair    remember that promotion we were talking about? - yeah. /
            remember when we talked about the promotion?
CommTweet   Sliding into your “promotions” email folder like. . .

Table 5.1: Examples for data compatibility between PMT and CommTweet

We apply the dependency parser from spaCy [56] and embed the constraint tokens

where the same hard constraint words are located in both sequences of a paraphrase pair.

We build two training sets from the result: 1) include only the constraint-embedded para-

phrase pairs where at least one matched constraint pair is located (ONLY); 2) besides the

pairs from ONLY, include the original text from PMT for all other paraphrase pairs (MIX).

The ONLY set focuses on learning the hard constraint, whereas the MIX set also provides

more sources to learn general paraphrase generation. Additionally, to compare the perfor-

mance, we use the entire original dataset as the third training set (ORI). The ORI and MIX

sets both contain about 5,000,000 sentence pairs each, and the ONLY set has 800,000 pairs.

We utilize the knowledge learned from the PMT dataset and apply it to our commercial

tweets dataset (CommTweet). CommTweet comprises original tweets (not retweets or

comments) from 35 verified official accounts of popular brands. The same constraint iden-

tification method is applied, and we further split the dataset into two subsets. One subset

(SINGLE) contains the commercial tweets where only one constraint is identified for each

constraint type (proper nouns and numerical representations). The other subset (MULTI)

contains cases where more than one constraint is found for each type. SINGLE has 31,922

tweets and MULTI includes 13,191 tweets.

Two preprocessing steps were added to these datasets. First, links and hashtags are

special elements in commercial tweets, and they are intended to be the same regardless of

the content of the tweet. Therefore, we remove the links from the data and segment the hashtags so that they become part of the content and can participate in paraphrase generation.

Second, from all datasets, we remove the tweets that are extremely short.

5.4.2 Experiment design

Consistent with the goals of paraphrasing, we use several metrics to capture the di-

versity and the semantic similarity of the generated text compared to the original text.

Measuring the word usage is an effective way to demonstrate the diversity of the writing,

and particularly, Witteveen et al. [169] show that ROUGE-L [90] is useful in determining

the uniqueness of the generation. We also include uni-gram BLEU [118] to measure the di-

versity. Meanwhile, the semantic similarity is measured by computing the cosine similarity

between the sentence embeddings [137] of the text sequences.

Besides these metrics, we introduce coverage measurement to check the percentage of

the hard constraints that have been observed, and perplexity to quantify the coherency of the

writing of the generated paraphrase. The perplexity score is generated by running inference

on the pre-trained GPT2-medium model. Note that we use GPT2-small to generate this part

of the experiment results due to its efficiency. Therefore, the perplexity score can only be

used to compare the performance results that are generated using the same language model

(CopyNet or GPT-2).

As suggested in [169], we fine-tune the GPT-2 model for a small number of epochs

to give the model enough exposure to the task while avoiding over-fitting on the new data.

CopyNet is trained from scratch, but we also leave some allowance when training and

validating so that the model is not overfitted and can be applied to the test data in another

domain. The training sets (ORI, ONLY, MIX) and test sets (SINGLE, MULTI) are identical

when used for experiments related to CopyNet and GPT-2 models. We set a maximum

length of 140 characters for the generated paraphrases.

5.4.3 Experiment results and analysis

Tables 5.2 and 5.3 list the performance of the CopyNet baseline model and the proposed

CELM framework with GPT-2 using original and constraint-embedded data. As stated

earlier, BLEU and ROUGE-L are used to measure the diversity of the generation, and

a lower score represents a more diverse generation. Similarity shows the quality of the

paraphrase in terms of how much semantic information it keeps from the original content,

Baseline (CopyNet) CELM (GPT-2)
ORI ONLY MIX ORI ONLY MIX
BLEU 0.532 0.587 0.637 0.213 0.267 0.262
ROUGE-L 0.750 0.817 0.863 0.234 0.295 0.294
Similarity 0.774 0.843 0.877 0.781 0.828 0.817
Coverage 0.477 0.876 0.845 0.261 0.912 0.865
Perplexity* 1300 1144 863 177 306 309

Table 5.2: Model performance on SINGLE set

Baseline (CopyNet) CELM (GPT-2)


ORI ONLY MIX ORI ONLY MIX
BLEU 0.489 0.458 0.582 0.196 0.324 0.330
ROUGE-L 0.782 0.593 0.774 0.214 0.318 0.328
Similarity 0.739 0.761 0.840 0.725 0.842 0.837
Coverage 0.419 0.787 0.717 0.180 0.880 0.896
Perplexity* 1464 1440 1254 181 536 760

Table 5.3: Model performance on MULTI set

and coverage checks the ability of the model to meet the hard constraint. Finally, perplexity

compares the writing of the generation against the knowledge of the language model, and

a lower value corresponds to more coherent writing.

In general, CopyNet tends to reuse a lot of word sequences directly from the input

content, and it results in much higher BLEU and ROUGE-L scores. We cannot rely on

perplexity scores to compare the quality of generated content between the two language

models, but it seems to us that CopyNet can output text content that is more fluent and

coherent. CopyNet uses word-level embeddings, whereas GPT-2 is built on sub-word level

representations, so GPT-2 is more likely to generate more morphological errors.

When CopyNet and CELM are handling data without specific constraints embedded,

their generations are comparable in terms of similarity to the original content. Because

CopyNet is designed specifically to handle sequences with hard constraints, it has a better

chance than CELM to correctly keep the designated tokens. On the other hand, the genera-

tions of CopyNet are much less diverse than those of CELM. The fact that CopyNet is more

likely to repeat large portions of sub-sequences from the source in the output improves the

coverage but also harms the diversity.

When constraint information is embedded into the data, CELM shows a much larger

improvement over CopyNet in terms of coverage. This shows that neither model is well-designed to handle hard constraints directly from the original content, so both improve when the constraints are embedded in the content. Meanwhile, CELM keeps the

major advantage of generation diversity over CopyNet. We believe the pre-trained knowl-

edge of GPT-2 contributes to this advantage of CELM. Despite the remarkable coverage improvement, constraint embedding still does not ease CopyNet's tendency to repeat sub-sequences from the input in the output.

Comparing the two types of data where constraints are embedded reveals that training

on only the pairs (ONLY) where hard constraints are found offers the best performance

in terms of coverage score. Including additional pairs where no constraint information

is embedded (MIX) helps CopyNet improve the similarity score, but further reduces the

diversity in the generation. Equipped with pre-trained knowledge, CELM shows little difference in similarity and diversity scores when MIX data is used. The changes in

perplexity scores when using ORI, ONLY and MIX datasets are reversed between CopyNet

and CELM. For CELM, we think the embedded constraint tokens break the pre-trained

knowledge of language modeling, resulting in a worse perplexity. CopyNet, on the other

hand, is trained from scratch, so the embedded constraints help it understand the paraphrase generation task and improve the overall writing.

Most observations from the test on the SINGLE dataset remain the same when test-

ing on the MULTI dataset. Both models get lower coverage scores because it is harder to

handle more constraints in the content. CopyNet has better diversity but worse similarity

measurement on the MULTI test set, whereas CELM shows the opposite. CopyNet struggles to reproduce the corresponding constraints when more are embedded, but this results in a more diverse generation. CELM has more power to learn from the additional

constraint information, but includes more repeated tokens from the input as well.

Model               Content
Original            No fork, no fire, no problem. smores day
Baseline (CopyNet)  No fork, no fire , no no.
CELM (GPT-2)        No fire and smores, no trouble. sm hours are coming up. it’s like the days of the day.
Original            Looks like most Americans lack savings to cover emergencies
Baseline (CopyNet)  Looks like most Americans are missing savings
CELM (GPT-2)        People like most of the Americans lack the savings to pay for emergency situations.

Table 5.4: Sample paraphrase generations from the models

Table 5.4 lists two samples from the CommTweet dataset to demonstrate the characteristics of the generation of both models. The first commercial tweet does not contain any identified hard constraints. It is obvious that CopyNet repeats many words from the original text, whereas CELM generates more diverse writing. The second example contains one hard constraint, and both models cover it. CopyNet keeps the main structure of the content and replaces some tokens. CELM rewrites most of the text while still maintaining the constraint as well as the meaning.

Overall, the experiments show that embedding the hard constraints into the content can dramatically increase a model's ability to keep the constraints in the output, regardless of whether the language model is designed specifically to handle such a requirement. A pre-trained large language model can generate paraphrases with greater diversity and comparable similarity. With more constraints identified in the original content, a model may face a trade-off between diversity and similarity.

5.5 Summary

In this chapter, we introduce a framework for constrained paraphrase generation of commercial tweets. The framework identifies hard constraints, embeds the constraints in the text content, generates output using a causal language model, and then realizes the constraints to form the final paraphrase. We also show that a model trained on a general-domain dataset transfers well to a dataset of commercial tweets. The experiments demonstrate that constraint-embedded data helps generation models create better paraphrases in terms of semantic similarity and diversity while meeting the constraints. The improvement applies both to general language models and to models specifically designed for constrained generation.
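
To make the embed-and-realize steps concrete, the sketch below illustrates the idea with a hypothetical placeholder scheme (<C0>, <C1>, ...) and a simplified constraint pattern covering mentions, hashtags, and URLs; the framework's actual constraint identification is described earlier in this chapter.

    import re

    # Simplified stand-in for hard-constraint identification.
    CONSTRAINT_PATTERN = re.compile(r"(@\w+|#\w+|https?://\S+)")

    def embed_constraints(text):
        # Replace each hard constraint with a placeholder token before generation.
        constraints = CONSTRAINT_PATTERN.findall(text)
        for i, c in enumerate(constraints):
            text = text.replace(c, f"<C{i}>", 1)
        return text, constraints

    def realize_constraints(generated, constraints):
        # Restore the original constraints in the generated paraphrase.
        for i, c in enumerate(constraints):
            generated = generated.replace(f"<C{i}>", c)
        return generated

    embedded, cons = embed_constraints("Save big this weekend at #BestBuy https://example.com")
    # ... the embedded text is fed to the causal language model (e.g., GPT-2) ...
    print(realize_constraints("Huge weekend savings at <C0>: <C1>", cons))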

Chapter 6: Conclusions and Future Work

6.1 Conclusions

Our work focuses on different aspects of utilizing Twitter for better advertising. It

involves analyzing user feedback, predicting the influence of commercial tweets, profiling

users based on their offline activities, and generating paraphrases for commercial tweets.

These contributions help reduce human effort and increase efficiency across multiple steps of social media advertising.

We propose an ensemble model that combines the predicted probabilities of multiple models to improve performance on a mixed-classification task. The approach addresses the difficulty caused by distinct requirements and definitions for class labels across domains, and it exploits the fact that different models treat these labels differently. The ensemble further improves its performance by combining a tweet vector with the probabilistic outputs of several classifiers.
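
The sketch below illustrates the general idea of such a combination with scikit-learn, using stand-in features; it is not the exact architecture, and in practice the base-model probabilities should come from held-out (e.g., cross-validated) predictions to avoid leakage.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.abs(rng.normal(size=(200, 20)))   # stand-in tweet vectors
    y = rng.integers(0, 3, size=200)         # stand-in mixed class labels

    # Base classifiers each produce a probability distribution over the labels.
    base_models = [MultinomialNB(), SVC(probability=True)]
    probas = []
    for m in base_models:
        m.fit(X, y)
        probas.append(m.predict_proba(X))

    # The ensemble input concatenates the tweet vector with those distributions.
    meta_features = np.hstack([X] + probas)
    ensemble = LogisticRegression(max_iter=1000).fit(meta_features, y)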

We define an influence score to measure the level of attention a commercial post draws from its audience. To predict whether a commercial tweet will have enough influence, we create a set of style features and apply them to a classifier. The style features focus on how a tweet is written and do not capture its inherent meaning; the model therefore generalizes to any commercial tweet. The ablation analysis of these meta and linguistic features reveals what makes commercial tweets successful, and it can be used to guide the modification of commercial tweets.
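
As an illustration, content-independent style features of this kind can be extracted as follows; these particular features are illustrative choices, not the complete set used in our experiments.

    def style_features(tweet):
        words = tweet.split()
        return {
            "char_length": len(tweet),
            "num_words": len(words),
            "num_hashtags": sum(w.startswith("#") for w in words),
            "num_mentions": sum(w.startswith("@") for w in words),
            "has_url": any(w.startswith("http") for w in words),
            "num_exclamations": tweet.count("!"),
            "num_all_caps": sum(w.isupper() and len(w) > 1 for w in words),
        }

    # These feature dictionaries can then be vectorized and fed to a classifier.
    print(style_features("Save BIG this weekend! #sale @BestBuy https://example.com"))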

The recognition of offline activities provides a unique view for user profiling. We explore existing LSTM-based structures that can include features in addition to the target tweet content, and we propose a hybrid-LSTM model that efficiently combines contextual and historical information with the target tweet to recognize user activities. A case study using the model reveals that a company's field is reflected in the major activities of its followers.
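
A minimal PyTorch sketch of this combination idea is shown below, with hypothetical dimensions; the actual hybrid-LSTM architecture and its contextual inputs are described earlier in this dissertation.

    import torch
    import torch.nn as nn

    class HybridLSTM(nn.Module):
        def __init__(self, vocab_size, embed_dim, hidden_dim, context_dim, num_classes):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            # Contextual/historical features join the tweet encoding here.
            self.fc = nn.Linear(hidden_dim + context_dim, num_classes)

        def forward(self, token_ids, context_vec):
            _, (h_n, _) = self.lstm(self.embed(token_ids))
            combined = torch.cat([h_n[-1], context_vec], dim=1)
            return self.fc(combined)

    # Hypothetical sizes; six classes matching the activity labels in Table A.2.
    model = HybridLSTM(vocab_size=5000, embed_dim=100, hidden_dim=128,
                       context_dim=10, num_classes=6)
    logits = model(torch.randint(0, 5000, (2, 20)), torch.randn(2, 10))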

We demonstrate the effectiveness of embedding constraint information into the text content and generating paraphrases for commercial tweets using a causal language model such as GPT-2. The hard constraints are critical in paraphrase generation because they allow the key information to be preserved intact in the generated commercial tweets. We show that knowledge learned from a general domain can be transferred and applied to the domain of commercial tweets. The proposed CELM framework can generate paraphrase tweets that are semantically similar to the original tweet yet textually diverse, providing more choices.

Our work covers the use of statistical classification models such as SVM, neural-network classification models such as LSTM, and sequence generation models such as GPT-2. We explicitly explore the effectiveness of including contextual information in these models; this contextual information offers additional help by incorporating tweet content, part-of-speech tags, post time, or historical data. We also demonstrate embedding constraints directly into the text content and generating paraphrases that satisfy these hard constraints. We discover ways to map outputs from one specific domain to a different one and show that learned knowledge can be transferred to a compatible domain. Finally, our contributions include the collection of the datasets of commercial tweets and of tweets with user activity labels.

6.2 Future Work

The work on mixed-classification models for tweets can be extended in several directions. Additional variants of the ensemble classifier can be explored, with the goal of better utilizing the probability distributions generated by the individual models for a more effective combination. For example, one can focus on improving the generality of the ensemble method in handling additional types of probability distributions as input, or on developing methods that learn the characteristics of each classification model on a given dataset and use this knowledge when combining multiple models. Another direction is to incorporate more individual models as well as additional context features into the ensemble method.

On the basis of the influence prediction system for commercial tweets, a suggestion system could be built to help companies write commercial tweets with greater influence on their audiences. Similar to the example shown in the case study, the suggestion system could propose potential modifications and use the prediction system to determine which modifications lead to a more successful post. Other techniques can be explored to separate the writing of a tweet from the commercial information it carries; composing tweets around a fixed informational basis is an essential task when advertising through social platforms.

Building on our activity recognition model, we intend to identify more contextual features and explore their usefulness with additional models. The current labeling process relies on the reported location of each tweet, so finding better ways to improve location accuracy could increase the quality of our work. During our experiments, we found that some images attached to tweets may be useful in identifying activities. While it is currently not common to include images as contextual features for text data, we believe this could be a promising direction of research.

To extend the constrained paraphrase generation model, we plan to explore more types of hard constraints and their impact when embedded into the content. One approach to distinguishing constraint types is to cluster all candidates for the hard constraints and assign type tokens accordingly. We also want to discover ways to include dependency information in the constraint itself while maintaining enough flexibility in utilizing the constraints. Exploring better solutions for handling multiple constraints of each type in a sequence is another direction of potential research.

Furthermore, our work relies on the assumption of author independence and does not take into consideration any author identity information. We have concerns about model bias when author identity is included; however, a well-designed author representation could incorporate characteristic-related information while keeping the model unbiased. Influencers are the usernames mentioned in posts to draw attention to advertisements, and more companies have realized the importance of using influencers in their commercials. Exploring the relation between the characteristics of an influencer and the success of commercial tweets will become a meaningful topic. With the emerging use of graph models, our solutions can also benefit from incorporating tweet-level or author-level relationships: some of the proposed features or models can be converted and applied to Graph Neural Network (GNN) models [172], where the propagation of feature information can help improve performance. In addition, the datasets we created for this work are based on certain special requirements and focus on unique perspectives; besides the standard evaluation methods, human evaluation could improve the reliability of the experimental results as well as the quality of the datasets.

Our work addresses only a small part of the field of social media marketing and advertising. We tackle several problems in the field, but many meaningful questions and challenges remain open. We hope that our work gives a clear direction for research in this field and leads to further accomplishments in the future.

Appendix A: Implementation and Datasets

The implementation code for all four steps of the social media loop is published on my personal GitHub page. It is organized as four separate projects, each containing the complete implementation for the corresponding experiments and analysis.

The two datasets created and used in this work are also publicly available. The CommTweet dataset contains commercial tweets posted by the official accounts of the 36 companies listed in Table A.1. Commercial tweets refer to original tweets that are not retweets or comments; they are the tweets companies use to post marketing or advertising information.

Gap                Amazon             Gilt                BlackBerry
Nordstrom          Best Buy           Jeep                KraftFoods
AT&T               Applebee’s         Dell                Comcast
Macy’s             AppStore (Apple)   JC Penney           Delta
Starbucks          Travel Channel     FedEx               Yahoo
SamsungMobile      Microsoft          Target              Sears
Netflix            GEICO              WholeFoods          Google
Disney             LEVIS              H&M                 Motorola
AmericanExpress    McDonald’s         American Airlines   JetBlue

Table A.1: Companies that are included in the CommTweet dataset

The ActivityTweet dataset includes ordinary tweets with the activity labels listed in Table A.2. These labels represent the offline activity the author was engaged in when the tweet was posted. All tweets in the dataset have reported locations, which are used to determine the corresponding activity labels.

Enhancement    Traveling   Dining
Entertaining   Shopping    Sporting

Table A.2: Activity labels in ActivityTweet dataset

Finally, the links to the implementation code and datasets are listed in the following table.

Implementation Codes
https://github.com/renhaocui
CommTweet Dataset
https://1drv.ms/u/s!AhCHbLu6TCc8hMYirS6lFOVRUDSttw?e=WIenmB
ActivityTweet Dataset
https://1drv.ms/u/s!AhCHbLu6TCc8hMYju-IT9PDbt9LIKg?e=EnLhmY

Table A.3: Links to the related resources

Bibliography

[1] Fabian Abel, Qi Gao, Geert-Jan Houben, and Ke Tao. Analyzing user modeling
on twitter for personalized news recommendations. In International Conference on
User Modeling, Adaptation, and Personalization, pages 1–12. Springer, 2011.

[2] Eugene Agichtein, Carlos Castillo, Debora Donato, Aristides Gionis, and Gilad
Mishne. Finding high-quality content in social media. In Proceedings of the 2008
International Conference on Web Search and Data Mining, pages 183–194. ACM,
2008.

[3] Isabel Anger and Christian Kittl. Measuring influence on twitter. In Proceedings
of the 11th International Conference on Knowledge Management and Knowledge
Technologies, page 31. ACM, 2011.

[4] Yoav Artzi, Patrick Pantel, and Michael Gamon. Predicting responses to microblog
posts. NAACL HLT 2012, page 602, 2012.

[5] Mohamed Faouzi Atig, Sofia Cassel, Lisa Kaati, and Amendra Shrestha. Activity
profiles in online social media. In Advances in Social Networks Analysis and Mining
(ASONAM), 2014 IEEE/ACM International Conference on, pages 850–855. IEEE,
2014.

[6] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine trans-
lation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473,
2014.

[7] Eytan Bakshy, Jake M Hofman, Winter A Mason, and Duncan J Watts. Everyone’s
an influencer: quantifying influence on twitter. In Proceedings of the fourth ACM
international conference on Web search and data mining, pages 65–74. ACM, 2011.

[8] Fabrı́cio Benevenuto, Tiago Rodrigues, Meeyoung Cha, and Virgı́lio Almeida. Char-
acterizing user behavior in online social networks. In Proceedings of the 9th ACM
SIGCOMM conference on Internet measurement conference, pages 49–62. ACM,
2009.

[9] Adam Bermingham and Alan Smeaton. On using twitter to monitor political sen-
timent and predict election results. In Proceedings of the Workshop on Sentiment
Analysis where AI meets Psychology (SAAIP 2011), pages 2–10, 2011.
[10] Parantapa Bhattacharya, Muhammad Bilal Zafar, Niloy Ganguly, Saptarshi Ghosh,
and Krishna P Gummadi. Inferring user interests in the twitter social network. In
Proceedings of the 8th ACM Conference on Recommender systems, pages 357–360.
ACM, 2014.
[11] Monica Billio, Roberto Casarin, Francesco Ravazzolo, and Herman K Van Dijk.
Bayesian combinations of stock price predictions with an application to the amster-
dam exchange index. 2011.
[12] István Bı́ró, Dávid Siklósi, Jácint Szabó, and András A Benczúr. Linked latent
dirichlet allocation in web spam filtering. In Proceedings of the 5th International
Workshop on Adversarial Information Retrieval on the Web, pages 37–40. ACM,
2009.
[13] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the
Journal of machine Learning research, 3:993–1022, 2003.
[14] Ondřej Bojar, Ondřej Dušek, Tom Kocmi, Jindřich Libovický, Michal Novák, Mar-
tin Popel, Roman Sudarikov, and Dušan Variš. Czeng 1.6: enlarged czech-english
parallel corpus with processing tools dockered. In International Conference on Text,
Speech, and Dialogue, pages 231–238. Springer, 2016.
[15] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock mar-
ket. Journal of Computational Science, 2(1):1–8, 2011.
[16] Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz,
and Samy Bengio. Generating sentences from a continuous space. arXiv preprint
arXiv:1511.06349, 2015.
[17] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller,
Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grob-
ler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, and Gaël Varoquaux.
API design for machine learning software: experiences from the scikit-learn project.
In ECML PKDD Workshop: Languages for Data Mining and Machine Learning,
pages 108–122, 2013.
[18] Ziqiang Cao, Chuwei Luo, Wenjie Li, and Sujian Li. Joint copying and restricted
generation for paraphrase. arXiv preprint arXiv:1611.09235, 2016.
[19] Meeyoung Cha, Hamed Haddadi, Fabricio Benevenuto, P Krishna Gummadi, et al.
Measuring user influence in twitter: The million follower fallacy. Icwsm, 10(10-
17):30, 2010.

[20] P.W.D. Charles. Project title. https://github.com/charlespwd/project-title, 2013.

[21] Mingda Chen, Qingming Tang, Sam Wiseman, and Kevin Gimpel. Controllable
paraphrase generation with a syntactic exemplar. arXiv preprint arXiv:1906.00565,
2019.

[22] Justin Cheng, Lada Adamic, P Alex Dow, Jon Michael Kleinberg, and Jure
Leskovec. Can cascades be predicted? In Proceedings of the 23rd international
conference on World wide web, pages 925–936. ACM, 2014.

[23] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi
Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representa-
tions using rnn encoder-decoder for statistical machine translation. arXiv preprint
arXiv:1406.1078, 2014.

[24] Meri Coleman and Ta Lin Liau. A computer readability formula designed for ma-
chine scoring. Journal of Applied Psychology, 60(2):283, 1975.

[25] Michael D Conover, Bruno Gonçalves, Jacob Ratkiewicz, Alessandro Flammini, and
Filippo Menczer. Predicting the political alignment of twitter users. In 2011 IEEE
third international conference on privacy, security, risk and trust and 2011 IEEE
third international conference on social computing, pages 192–199. IEEE, 2011.

[26] Michael D Conover, Jacob Ratkiewicz, Matthew Francisco, Bruno Gonçalves, Fil-
ippo Menczer, and Alessandro Flammini. Political polarization on twitter. In Fifth
international AAAI conference on weblogs and social media, 2011.

[27] Nadia FF Da Silva, Eduardo R Hruschka, and Estevam R Hruschka Jr. Tweet sen-
timent analysis with classifier ensembles. Decision Support Systems, 66:170–179,
2014.

[28] Marco Damonte and Shay B Cohen. Structural neural encoders for amr-to-text gen-
eration. arXiv preprint arXiv:1903.11410, 2019.

[29] Bhuwan Dhingra, Zhong Zhou, Dylan Fitzpatrick, Michael Muehl, and William W
Cohen. Tweet2vec: Character-based distributed representations for social media.
arXiv preprint arXiv:1605.03481, 2016.

[30] Thomas Dickinson, Miriam Fernández, Lisa A Thomas, Paul Mulholland, Pam
Briggs, and Harith Alani. Identifying important life events from twitter using se-
mantic and syntactic patterns. 2016.

[31] Thomas G Dietterich. Ensemble methods in machine learning. In Multiple classifier
systems, pages 1–15. Springer, 2000.

[32] William B Dolan and Chris Brockett. Automatically constructing a corpus of sen-
tential paraphrases. In Proceedings of the Third International Workshop on Para-
phrasing (IWP2005), 2005.

[33] Li Dong, Jonathan Mallinson, Siva Reddy, and Mirella Lapata. Learning to para-
phrase for question answering. arXiv preprint arXiv:1708.06022, 2017.

[34] Nan Duan, Duyu Tang, Peng Chen, and Ming Zhou. Question generation for ques-
tion answering. In Proceedings of the 2017 Conference on Empirical Methods in
Natural Language Processing, pages 866–874, 2017.

[35] Yogesh K Dwivedi, Kawaljeet Kaur Kapoor, and Hsin Chen. Social media marketing
and advertising. The Marketing Review, 15(3):289–309, 2015.

[36] Paul S Earle, Daniel C Bowden, and Michelle Guy. Twitter earthquake detection:
earthquake monitoring in a social world. Annals of Geophysics, 54(6), 2012.

[37] Elozino Egonmwan and Yllias Chali. Transformer and seq2seq model for para-
phrase generation. In Proceedings of the 3rd Workshop on Neural Generation and
Translation, pages 249–255, 2019.

[38] Michael Ellsworth and Adam Janin. Mutaphrase: Paraphrasing with framenet. In
Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphras-
ing, pages 143–150, 2007.

[39] Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation.
arXiv preprint arXiv:1805.04833, 2018.

[40] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of computer and system sciences,
55(1):119–139, 1997.

[41] Shuai Gao, Jun Ma, and Zhumin Chen. Modeling and predicting retweeting dynam-
ics on microblogging platforms. In Proceedings of the Eighth ACM International
Conference on Web Search and Data Mining, pages 107–116. ACM, 2015.

[42] Shalini Ghosh, Oriol Vinyals, Brian Strope, Scott Roy, Tom Dean, and Larry
Heck. Contextual lstm (clstm) models for large scale nlp tasks. arXiv preprint
arXiv:1602.06291, 2016.

[43] Kevin Gimpel, Nathan Schneider, Brendan O’Connor, Dipanjan Das, Daniel Mills,
Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A
Smith. Part-of-speech tagging for twitter: Annotation, features, and experiments.
In Proceedings of the 49th Annual Meeting of the Association for Computational
Linguistics: Human Language Technologies: short papers-Volume 2, pages 42–47.
Association for Computational Linguistics, 2011.

[44] Tilmann Gneiting and Adrian E Raftery. Weather forecasting with ensemble meth-
ods. Science, 310(5746):248–249, 2005.

[45] Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using
distant supervision. CS224N Project Report, Stanford, 1:12, 2009.

[46] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,
Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in neural information processing systems, pages 2672–2680, 2014.

[47] Klaus Greff, Rupesh K Srivastava, Jan Koutnı́k, Bas R Steunebrink, and Jürgen
Schmidhuber. Lstm: A search space odyssey. IEEE transactions on neural networks
and learning systems, 2017.

[48] Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. Incorporating copying mech-
anism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393, 2016.

[49] Ankush Gupta, Arvind Agarwal, Prawaan Singh, and Piyush Rai. A deep generative
framework for paraphrase generation. arXiv preprint arXiv:1709.05074, 2017.

[50] Chaitra Hegde and Shrikumar Patil. Unsupervised paraphrase generation using pre-
trained language models. arXiv preprint arXiv:2006.05477, 2020.

[51] Geoffrey E Hinton. Products of experts. In Artificial Neural Networks, 1999. ICANN
99. Ninth International Conference on (Conf. Publ. No. 470), volume 1, pages 1–6.
IET, 1999.

[52] Geoffrey E Hinton. Training products of experts by minimizing contrastive diver-
gence. Neural computation, 14(8):1771–1800, 2002.

[53] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural com-
putation, 9(8):1735–1780, 1997.

[54] Jennifer A Hoeting, David Madigan, Adrian E Raftery, and Chris T Volinsky.
Bayesian model averaging: a tutorial. Statistical science, pages 382–401, 1999.

[55] Chris Hokamp and Qun Liu. Lexically constrained decoding for sequence generation
using grid beam search. arXiv preprint arXiv:1704.07138, 2017.

[56] Matthew Honnibal and Ines Montani. spacy 2: Natural language understanding
with bloom embeddings, convolutional neural networks and incremental parsing. To
appear, 7(1), 2017.

[57] J Edward Hu, Huda Khayrallah, Ryan Culkin, Patrick Xia, Tongfei Chen, Matt Post,
and Benjamin Van Durme. Improved lexically constrained decoding for translation
and monolingual rewriting. In Proceedings of the 2019 Conference of the North

American Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, Volume 1 (Long and Short Papers), pages 839–850, 2019.

[58] Zhiting Hu, Zichao Yang, Xiaodan Liang, Ruslan Salakhutdinov, and Eric P Xing.
Toward controlled generation of text. arXiv preprint arXiv:1703.00955, 2017.

[59] Minlie Huang, Yujie Cao, and Chao Dong. Modeling rich contexts for sentiment
classification with lstm. arXiv preprint arXiv:1605.01478, 2016.

[60] William John Hutchins and Harold L Somers. An introduction to machine transla-
tion, volume 362. Academic Press London, 1992.

[61] Kazushi Ikeda, Gen Hattori, Chihiro Ono, Hideki Asoh, and Teruo Higashino.
Twitter user profiling based on text and community mining for market analysis.
Knowledge-Based Systems, 51:35–47, 2013.

[62] Thorsten Joachims. Text categorization with support vector machines: Learning
with many relevant features. Springer, 1998.

[63] A Jordan. On discriminative vs. generative classifiers: A comparison of logistic
regression and naive bayes. Advances in neural information processing systems,
14:841, 2002.

[64] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, and Amit Sheth. User in-
terests identification on twitter using a hierarchical knowledge base. In European
Semantic Web Conference, pages 99–113. Springer, 2014.

[65] Robert E Kass and Adrian E Raftery. Bayes factors. Journal of the american statis-
tical association, 90(430):773–795, 1995.

[66] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard
Socher. Ctrl: A conditional transformer language model for controllable generation.
arXiv preprint arXiv:1909.05858, 2019.

[67] Dan Klein, Kristina Toutanova, H Tolga Ilhan, Sepandar D Kamvar, and Christo-
pher D Manning. Combining heterogeneous classifiers for word-sense disambigua-
tion. In Proceedings of the ACL-02 workshop on Word sense disambiguation: recent
successes and future directions-Volume 8, pages 74–80. Association for Computa-
tional Linguistics, 2002.

[68] Johannes Knoll. Advertising in social media: a review of empirical evidence. Inter-
national journal of Advertising, 35(2):266–300, 2016.

[69] Daphne Koller and Mehran Sahami. Hierarchically classifying documents using
very few words. Technical report, Stanford InfoLab, 1997.

[70] J Zico Kolter and Marcus A Maloof. Dynamic weighted majority: An ensemble
method for drifting concepts. The Journal of Machine Learning Research, 8:2755–
2790, 2007.
[71] Jeremy Z Kolter, Marcus Maloof, et al. Dynamic weighted majority: A new ensem-
ble method for tracking concept drift. In Data Mining, 2003. ICDM 2003. Third
IEEE International Conference on, pages 123–130. IEEE, 2003.
[72] Lingpeng Kong, Nathan Schneider, Swabha Swayamdipta, Archna Bhatia, Chris
Dyer, and Noah A Smith. A dependency parser for tweets. In Proceedings of the
Conference on Empirical Methods in Natural Language Processing, Doha, Qatar,
to appear, volume 4, 2014.
[73] Efthymios Kouloumpis, Theresa Wilson, and Johanna D Moore. Twitter sentiment
analysis: The good the bad and the omg! Icwsm, 11:538–541, 2011.
[74] Raymond Kozlowski, Kathleen F McCoy, and K Vijay-Shanker. Generation
of single-sentence paraphrases from predicate/argument structure using lexico-
grammatical resources. In Proceedings of the second international workshop on
Paraphrasing, pages 1–8, 2003.
[75] Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. A continuously growing dataset of
sentential paraphrases. arXiv preprint arXiv:1708.00391, 2017.
[76] Leah S Larkey and W Bruce Croft. Combining classifiers in text categorization. In
Proceedings of the 19th annual international ACM SIGIR conference on Research
and development in information retrieval, pages 289–297. ACM, 1996.
[77] Quoc Le and Tomas Mikolov. Distributed representations of sentences and docu-
ments. In Proceedings of the 31st International Conference on Machine Learning
(ICML-14), pages 1188–1196, 2014.
[78] Kathy Lee, Diana Palsetia, Ramanathan Narayanan, Md Mostofa Ali Patwary, Ankit
Agrawal, and Alok Choudhary. Twitter trending topic classification. In Data Mining
Workshops (ICDMW), 2011 IEEE 11th International Conference on, pages 251–258.
IEEE, 2011.
[79] Kyumin Lee, Jalal Mahmud, Jilin Chen, Michelle Zhou, and Jeffrey Nichols. Who
will retweet this?: Automatically identifying and engaging strangers on twitter to
spread information. In Proceedings of the 19th international conference on Intelli-
gent User Interfaces, pages 247–256. ACM, 2014.
[80] Ryong Lee and Kazutoshi Sumiya. Measuring geographical regularities of crowd
behaviors for twitter-based geo-social event detection. In Proceedings of the 2nd
ACM SIGSPATIAL international workshop on location based social networks, pages
1–10. ACM, 2010.

[81] Won-Jo Lee, Kyo-Joong Oh, Chae-Gyun Lim, and Ho-Jin Choi. User profile extrac-
tion from twitter for personalized news recommendation. In Advanced Communica-
tion Technology (ICACT), 2014 16th International Conference on, pages 779–783.
IEEE, 2014.

[82] David D Lewis and William A Gale. A sequential algorithm for training text clas-
sifiers. In Proceedings of the 17th annual international ACM SIGIR conference
on Research and development in information retrieval, pages 3–12. Springer-Verlag
New York, Inc., 1994.

[83] Jia Li, Hua Xu, Xingwei He, Junhui Deng, and Xiaomin Sun. Tweet modeling with
lstm recurrent neural networks for hashtag recommendation. In Neural Networks
(IJCNN), 2016 International Joint Conference on, pages 1570–1577. IEEE, 2016.

[84] Rui Li, Kin Hou Lei, Ravi Khadiwala, and Kevin Chen-Chuan Chang. Tedas: A
twitter-based event detection and analysis system. In 2012 IEEE 28th International
Conference on Data Engineering, pages 1273–1276. IEEE, 2012.

[85] Yung-Ming Li, Ya-Lin Shiu, et al. A diffusion mechanism for social advertising over
microblogs. DECISION SUPPORT SYSTEMS, 54(1):9–22, 2012.

[86] Zichao Li, Xin Jiang, Lifeng Shang, and Hang Li. Paraphrase generation with deep
reinforcement learning. arXiv preprint arXiv:1711.00279, 2017.

[87] Defu Lian and Xing Xie. Collaborative activity recognition via check-in history.
In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Location-
Based Social Networks, pages 45–48. ACM, 2011.

[88] Dongliang Liao, Weiqing Liu, Yuan Zhong, Jing Li, and Guowei Wang. Predicting
activity and location with multi-task context aware recurrent neural network. In
IJCAI, pages 3435–3441, 2018.

[89] Andy Liaw and Matthew Wiener. Classification and regression by randomforest. R
news, 2(3):18–22, 2002.

[90] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text
summarization branches out, pages 74–81, 2004.

[91] Yang Liu, Chengjie Sun, Lei Lin, and Xiaolong Wang. Learning natural lan-
guage inference using bidirectional lstm model and inner-attention. arXiv preprint
arXiv:1605.09090, 2016.

[92] Elena Lloret and Manuel Palomar. Towards automatic tweet generation: A compar-
ative study from the text summarization perspective in the journalism genre. Expert
Systems with Applications, 40(16):6624–6630, 2013.

[93] Chunliang Lu, Wai Lam, and Yingxiao Zhang. Twitter user modeling and tweets
recommendation based on wikipedia concept graph. In Workshops at the Twenty-
Sixth AAAI Conference on Artificial Intelligence, 2012.

[94] Shuming Ma, Xu Sun, Wei Li, Sujian Li, Wenjie Li, and Xuancheng Ren. Query
and output: Generating words by querying distributed word representations for para-
phrase generation. arXiv preprint arXiv:1803.01465, 2018.

[95] Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. Paraphrasing revisited with
neural machine translation. In Proceedings of the 15th Conference of the European
Chapter of the Association for Computational Linguistics: Volume 1, Long Papers,
pages 881–893, 2017.

[96] R Dean Malmgren, Jake M Hofman, Luis AN Amaral, and Duncan J Watts. Char-
acterizing individual communication patterns. In Proceedings of the 15th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
607–616, 2009.

[97] Christopher D Manning, Hinrich Schütze, et al. Foundations of statistical natural
language processing, volume 999. MIT Press, 1999.

[98] Alice Marwick et al. To see and be seen: Celebrity practice on twitter. Convergence:
the international journal of research into new media technologies, 17(2):139–158,
2011.

[99] Mani Maybury. Advances in automatic text summarization. MIT press, 1999.

[100] Jon D Mcauliffe and David M Blei. Supervised topic models. In Advances in neural
information processing systems, pages 121–128, 2008.

[101] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive
bayes text classification. In AAAI-98 workshop on learning for text categorization,
volume 752, pages 41–48. Citeseer, 1998.

[102] Michael Mccord and M Chuah. Spam detection on twitter using traditional classi-
fiers. In Autonomic and trusted computing, pages 175–186. Springer, 2011.

[103] Kathleen McKeown. Paraphrasing questions using given and new information.
American Journal of Computational Linguistics, 9(1):1–10, 1983.

[104] Rishabh Mehrotra, Scott Sanner, Wray Buntine, and Lexing Xie. Improving lda
topic models for microblogs via tweet pooling and automatic labeling. In Proceed-
ings of the 36th international ACM SIGIR conference on Research and development
in information retrieval, pages 889–892. ACM, 2013.

[105] Ning Miao, Hao Zhou, Lili Mou, Rui Yan, and Lei Li. Cgmh: Constrained sentence
generation by metropolis-hastings sampling. In Proceedings of the AAAI Conference
on Artificial Intelligence, volume 33, pages 6834–6842, 2019.

[106] Matthew Michelson and Sofus A Macskassy. Discovering users’ topics of interest
on twitter: a first look. In Proceedings of the fourth workshop on Analytics for noisy
unstructured text data, pages 73–80. ACM, 2010.

[107] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[108] Tom M Mitchell. Machine learning. 1997. Burr Ridge, IL: McGraw Hill, 45:995,
1997.

[109] Jacob M Montgomery, Florian M Hollenbach, and Michael D Ward. Improving pre-
dictions using ensemble bayesian model averaging. Political Analysis, 20(3):271–
291, 2012.

[110] Mor Naaman. Social multimedia: highlighting opportunities for search and mining
of multimedia data in social media applications. Multimedia Tools and Applications,
56(1):9–34, 2012.

[111] Preslav Nakov, Alan Ritter, Sara Rosenthal, Fabrizio Sebastiani, and Veselin Stoy-
anov. Semeval-2016 task 4: Sentiment analysis in twitter. In Proceedings of the 10th
international workshop on semantic evaluation (semeval-2016), pages 1–18, 2016.

[112] Nasir Naveed, Thomas Gottron, Jérôme Kunegis, and Arifah Che Alhadi. Bad news
travel fast: A content-based analysis of interestingness on twitter. In Proceedings of
the 3rd International Web Science Conference, page 8. ACM, 2011.

[113] Finn Årup Nielsen. A new anew: Evaluation of a word list for sentiment analysis in
microblogs. arXiv preprint arXiv:1103.2903, 2011.

[114] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy
for text classification. In IJCAI-99 workshop on machine learning for information
filtering, volume 1, pages 61–67, 1999.

[115] Anastasios Noulas, Salvatore Scellato, Cecilia Mascolo, and Massimiliano Pontil.
An empirical study of geographic user activity patterns in foursquare. ICwSM,
11:70–573, 2011.

[116] Olutobi Owoputi, Brendan O’Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider,
and Noah A Smith. Improved part-of-speech tagging for online conversational text
with word clusters. In Proceedings of the 2013 conference of the North American
chapter of the association for computational linguistics: human language technolo-
gies, pages 380–390, 2013.

[117] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment clas-
sification using machine learning techniques. In Proceedings of the ACL-02 confer-
ence on Empirical methods in natural language processing-Volume 10, pages 79–86.
Association for Computational Linguistics, 2002.

[118] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method
for automatic evaluation of machine translation. In Proceedings of the 40th annual
meeting of the Association for Computational Linguistics, pages 311–318, 2002.

[119] Ravi Parikh and Matin Movassate. Sentiment analysis of user-generated twitter up-
dates using various classification techniques. CS224N Final Report, pages 1–18,
2009.

[120] Md Rizwan Parvez, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. Build-
ing language models for text with named entities. arXiv preprint arXiv:1805.04836,
2018.

[121] Michael J Paul and Mark Dredze. You are what you tweet: Analyzing twitter for
public health. Icwsm, 20:265–272, 2011.

[122] Huan-Kai Peng, Jiang Zhu, Dongzhen Piao, Rong Yan, and Ying Zhang. Retweet
modeling using conditional random fields. In 2011 IEEE 11th International Confer-
ence on Data Mining Workshops, pages 336–343. IEEE, 2011.

[123] Marco Pennacchiotti and Ana-Maria Popescu. A machine learning approach to twit-
ter user classification. In Fifth International AAAI Conference on Weblogs and Social
Media, 2011.

[124] Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global vec-
tors for word representation. In Proceedings of the 2014 conference on empirical
methods in natural language processing (EMNLP), pages 1532–1543, 2014.

[125] Sasa Petrovic, Miles Osborne, and Victor Lavrenko. Rt to win! predicting message
propagation in twitter. In ICWSM, 2011.

[126] Ana-Maria Popescu, Marco Pennacchiotti, and Deepa Paranjpe. Extracting events
and event descriptions from twitter. In WWW (Companion Volume), pages 105–106,
2011.

[127] Matt Post and David Vilar. Fast lexically constrained decoding with dynamic beam
allocation for neural machine translation. arXiv preprint arXiv:1804.06609, 2018.

[128] Aaditya Prakash, Sadid A Hasan, Kathy Lee, Vivek Datla, Ashequl Qadir, Joey
Liu, and Oladimeji Farri. Neural paraphrase generation with stacked residual lstm
networks. arXiv preprint arXiv:1610.03098, 2016.

[129] Daniele Quercia, Harry Askham, and Jon Crowcroft. Tweetlda: supervised topic
classification and link prediction in twitter. In Proceedings of the 4th Annual ACM
Web Science Conference, pages 247–250. ACM, 2012.

[130] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever. Language models are unsupervised multitask learners. OpenAI blog,
1(8):9, 2019.

[131] Adrian E Raftery, Tilmann Gneiting, Fadoua Balabdaoui, and Michael Polakowski.
Using bayesian model averaging to calibrate forecast ensembles. Monthly Weather
Review, 133(5):1155–1174, 2005.

[132] Daniel Ramage, Susan T Dumais, and Daniel J Liebling. Characterizing microblogs
with topic models. ICWSM, 5(4):130–137, 2010.

[133] Daniel Ramage, David Hall, Ramesh Nallapati, and Christopher D Manning. La-
beled lda: A supervised topic model for credit attribution in multi-labeled corpora.
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language
Processing: Volume 1-Volume 1, pages 248–256. Association for Computational
Linguistics, 2009.

[134] Adithya Rao, Nemanja Spasojevic, Zhisheng Li, and Trevor Dsouza. Klout score:
Measuring influence across multiple social networks. In 2015 IEEE International
Conference on Big Data (Big Data), pages 2282–2289. IEEE, 2015.

[135] Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. Classifying
latent user attributes in twitter. In Proceedings of the 2nd international workshop on
Search and mining user-generated contents, pages 37–44. ACM, 2010.

[136] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with
Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges
for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
http://is.muni.cz/publication/884893/en.

[137] Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using
siamese bert-networks. arXiv preprint arXiv:1908.10084, 2019.

[138] Alan Ritter, Oren Etzioni, Sam Clark, et al. Open domain event extraction from twit-
ter. In Proceedings of the 18th ACM SIGKDD international conference on Knowl-
edge discovery and data mining, pages 1104–1112. ACM, 2012.

[139] Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. The
author-topic model for authors and documents. In Proceedings of the 20th con-
ference on Uncertainty in artificial intelligence, pages 487–494. AUAI Press, 2004.

[140] Hassan Saif, Yulan He, and Harith Alani. Semantic sentiment analysis of twitter. In
International semantic web conference, pages 508–524. Springer, 2012.

[141] Takeshi Sakaki, Makoto Okazaki, and Yutaka Matsuo. Earthquake shakes twitter
users: real-time event detection by social sensors. In Proceedings of the 19th inter-
national conference on World wide web, pages 851–860. ACM, 2010.

[142] Mike Schuster and Kuldip K Paliwal. Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11):2673–2681, 1997.

[143] Matthew Shardlow. A survey of automated text simplification. International Journal
of Advanced Computer Science and Applications, 4(1):58–70, 2014.

[144] Priya Sidhaye and Jackie Chi Kit Cheung. Indicative tweet generation: An extractive
summarization problem? In Proceedings of the 2015 Conference on Empirical
Methods in Natural Language Processing, pages 138–147, 2015.

[145] Linfeng Song, Ante Wang, Jinsong Su, Yue Zhang, Kun Xu, Yubin Ge, and Dong
Yu. Structural information preserving for graph-to-text generation. In Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, pages
7987–7998, 2020.

[146] Yangqiu Song, Zhengdong Lu, Cane Wing-ki Leung, and Qiang Yang. Collaborative
boosting for activity classification in microblogs. In Proceedings of the 19th ACM
SIGKDD international conference on Knowledge discovery and data mining, pages
482–490. ACM, 2013.

[147] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan
Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
The journal of machine learning research, 15(1):1929–1958, 2014.

[148] Bongwon Suh, Lichan Hong, Peter Pirolli, and Ed H Chi. Want to be retweeted?
large scale analytics on factors impacting retweet in twitter network. In Social com-
puting (socialcom), 2010 ieee second international conference on, pages 177–184.
IEEE, 2010.

[149] Duyu Tang, Furu Wei, Nan Yang, Ming Zhou, Ting Liu, and Bing Qin. Learning
sentiment-specific word embedding for twitter sentiment classification. In ACL (1),
pages 1555–1565, 2014.

[150] Grigorios Tsoumakas, Lefteris Angelis, and Ioannis Vlahavas. Selective fusion of
heterogeneous classifiers. Intelligent Data Analysis, 9(6):511–525, 2005.

[151] Andranik Tumasjan, Timm O Sprenger, Philipp G Sandner, and Isabell M Welpe.
Predicting elections with twitter: What 140 characters reveal about political senti-
ment. In Fourth international AAAI conference on weblogs and social media, 2010.

[152] Tracy L Tuten. Advertising 2.0: social media marketing in a web 2.0 world: social
media marketing in a web 2.0 world. ABC-CLIO, 2008.

[153] Tracy L Tuten. Social media marketing. SAGE Publications Limited, 2020.

[154] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.
In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[155] Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. Pointer networks. In Advances
in neural information processing systems, pages 2692–2700, 2015.

[156] Soroush Vosoughi, Prashanth Vijayaraghavan, and Deb Roy. Tweet2vec: Learning
tweet embeddings using character-level cnn-lstm encoder-decoder. In Proceedings
of the 39th International ACM SIGIR conference on Research and Development in
Information Retrieval, pages 1041–1044, 2016.

[157] Sida Wang and Christopher D Manning. Baselines and bigrams: Simple, good sen-
timent and topic classification. In Proceedings of the 50th Annual Meeting of the
Association for Computational Linguistics: Short Papers-Volume 2, pages 90–94.
Association for Computational Linguistics, 2012.

[158] Su Wang, Rahul Gupta, Nancy Chang, and Jason Baldridge. A task in a suit and a
tie: paraphrase generation with semantic augmentation. In Proceedings of the AAAI
Conference on Artificial Intelligence, volume 33, pages 7176–7183, 2019.

[159] Tianlu Wang, Xuezhi Wang, Yao Qin, Ben Packer, Kang Li, Jilin Chen, Alex Beutel,
and Ed Chi. Cat-gen: Improving robustness in nlp models via controlled adversarial
text generation. arXiv preprint arXiv:2010.02338, 2020.

[160] Xiaofeng Wang, Matthew S Gerber, and Donald E Brown. Automatic crime predic-
tion using events extracted from twitter posts. In International conference on social
computing, behavioral-cultural modeling, and prediction, pages 231–238. Springer,
2012.

[161] Xin Wang, Yuanchao Liu, SUN Chengjie, Baoxun Wang, and Xiaolong Wang. Pre-
dicting polarities of tweets by composing word embeddings with long short-term
memory. In Proceedings of the 53rd Annual Meeting of the Association for Compu-
tational Linguistics and the 7th International Joint Conference on Natural Language
Processing (Volume 1: Long Papers), volume 1, pages 1343–1353, 2015.

[162] Yequan Wang, Minlie Huang, Li Zhao, et al. Attention-based lstm for aspect-level
sentiment classification. In Proceedings of the 2016 conference on empirical meth-
ods in natural language processing, pages 606–615, 2016.

[163] Yu Wang, Eugene Agichtein, and Michele Benzi. Tm-lda: efficient online modeling
of latent topic transitions in social media. In Proceedings of the 18th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 123–131.
ACM, 2012.
[164] Yuan Wang and Yiyi Yang. Dialogic communication on social media: How organi-
zations use twitter to build dialogic relationships with their publics. Computers in
Human Behavior, 104:106183, 2020.
[165] Wouter Weerkamp, Maarten De Rijke, et al. Activity prediction: A twitter-based
exploration. In SIGIR Workshop on Time-aware Information Access, 2012.
[166] Jianshu Weng and Bu-Sung Lee. Event detection in twitter. In Fifth international
AAAI conference on weblogs and social media, 2011.
[167] John Wieting and Kevin Gimpel. Paranmt-50m: Pushing the limits of paraphras-
tic sentence embeddings with millions of machine translations. arXiv preprint
arXiv:1711.05732, 2017.
[168] Alistair Willis, Ali Fisher, and Ilia Lvov. Mapping networks of influence: tracking
twitter conversations through time and space. Participations: Journal of Audience
& Reception Studies, 12(1):494–530, 2015.
[169] Sam Witteveen and Martin Andrews. Paraphrasing with large language models.
arXiv preprint arXiv:1911.09661, 2019.
[170] David H Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.
[171] Jonathan H Wright. Bayesian model averaging and exchange rate forecasts. Journal
of Econometrics, 146(2):329–341, 2008.
[172] Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu
Philip. A comprehensive survey on graph neural networks. IEEE Transactions on
Neural Networks and Learning Systems, 2020.
[173] Heng Xu, Lih-Bin Oh, and Hock-Hai Teo. Perceived effectiveness of text vs. multi-
media location-based advertising messaging. International Journal of Mobile Com-
munications, 7(2):154–177, 2009.
[174] Wei Xu, Alan Ritter, Chris Callison-Burch, William B Dolan, and Yangfeng Ji. Ex-
tracting lexically divergent paraphrases from twitter. Transactions of the Association
for Computational Linguistics, 2:435–448, 2014.
[175] Zhiheng Xu, Long Ru, Liang Xiang, and Qing Yang. Discovering user inter-
est on twitter with a modified author-topic model. In Proceedings of the 2011
IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent
Agent Technology-Volume 01, pages 422–429. IEEE Computer Society, 2011.

[176] Zhiheng Xu and Qing Yang. Analyzing user retweet behavior on twitter. In Proceed-
ings of the 2012 International Conference on Advances in Social Networks Analysis
and Mining (ASONAM 2012), pages 46–50. IEEE Computer Society, 2012.
[177] Dingqi Yang, Daqing Zhang, Vincent W Zheng, and Zhiyong Yu. Modeling user
activity preference by leveraging user spatial temporal characteristics in lbsns. IEEE
Transactions on Systems, Man, and Cybernetics: Systems, 45(1):129–142, 2015.
[178] Shuang-Hong Yang, Alek Kolcz, Andy Schlaikjer, and Pankaj Gupta. Large-scale
high-precision topic modeling on twitter. In Proceedings of the 20th ACM SIGKDD
international conference on Knowledge discovery and data mining, pages 1907–
1916, 2014.
[179] Zi Yang, Jingyi Guo, Keke Cai, Jie Tang, Juanzi Li, Li Zhang, and Zhong Su. Un-
derstanding retweeting behaviors in social networks. In Proceedings of the 19th
ACM international conference on Information and knowledge management, pages
1633–1636. ACM, 2010.
[180] Jihang Ye, Zhe Zhu, and Hong Cheng. What’s your next move: User activity pre-
diction in location-based social networks. In Proceedings of the 2013 SIAM Inter-
national Conference on Data Mining, pages 171–179. SIAM, 2013.
[181] Shaozhi Ye and S Felix Wu. Measuring message propagation and social influence
on Twitter.com. Springer, 2010.
[182] An-Zi Yen, Hen-Hsen Huang, and Hsin-Hsi Chen. Detecting personal life events
from twitter by multi-task lstm. In Companion of the The Web Conference 2018 on
The Web Conference 2018, pages 21–22. International World Wide Web Conferences
Steering Committee, 2018.
[183] Zibin Yin, Ya Zhang, Weiyuan Chen, and Richard Zong. Discovering patterns of
advertisement propagation in sina-microblog. In Proceedings of the Sixth Inter-
national Workshop on Data Mining for Online Advertising and Internet Economy,
page 1. ACM, 2012.
[184] Tauhid R Zaman, Ralf Herbrich, Jurgen Van Gael, and David Stern. Predicting
information spreading in twitter. In Workshop on computational social science and
the wisdom of crowds, nips, volume 104, pages 17599–601. Citeseer, 2010.
[185] Kuo-Hao Zeng, Mohammad Shoeybi, and Ming-Yu Liu. Style example-
guided text generation using generative adversarial transformers. arXiv preprint
arXiv:2003.00674, 2020.
[186] Yizhe Zhang, Guoyin Wang, Chunyuan Li, Zhe Gan, Chris Brockett, and Bill Dolan.
Pointer: Constrained text generation via insertion-based generative pre-training.
arXiv preprint arXiv:2005.00558, 2020.

[187] Shiqi Zhao, Xiang Lan, Ting Liu, and Sheng Li. Application-driven statistical para-
phrase generation. In Proceedings of the Joint Conference of the 47th Annual Meet-
ing of the ACL and the 4th International Joint Conference on Natural Language
Processing of the AFNLP, pages 834–842, 2009.

[188] Wayne Xin Zhao, Jing Jiang, Jianshu Weng, Jing He, Ee-Peng Lim, Hongfei Yan,
and Xiaoming Li. Comparing twitter and traditional media using topic models. In
Advances in Information Retrieval, pages 338–349. Springer, 2011.

[189] Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. A c-lstm neural net-
work for text classification. arXiv preprint arXiv:1511.08630, 2015.

[190] Qianrong Zhou, Liyun Wen, Xiaojie Wang, Long Ma, and Yue Wang. A hierarchical
lstm model for joint tasks. In China National Conference on Chinese Computational
Linguistics, pages 324–335. Springer, 2016.
