Aether: Real-Time Recovery and Visualization of Deleted Tweets

by Pehr Lorand Hovey
Santa Barbara, June 2010

Committee in charge:
George Legrady, Committee Chair
Lisa Jevbratt
Alan Liu

The project of Pehr Lorand Hovey is approved.

Copyright © 2010 by Pehr Lorand Hovey
ACKNOWLEDGEMENTS
Working with a topic that spans multiple disciplines can be a daunting task. My
project committee was instrumental in helping me approach the topic from multiple
angles. Professor Legrady continually pushed me to improve the visual component, which greatly increased its clarity and impact. Professor Jevbratt helped me consider protocols as art, and how the behind-the-scenes data systems are just as important as
the visual product. Professor Liu provided a digital humanist perspective on the data
analysis and supported me on the RoSE project, which gave me the programming
experience necessary to make this all happen.
I owe thanks to Larry Zins for providing network support to get my system online
and serving data to the world. Allison Schifani and Dana Solomon helped start all
this Twitter stuff when we worked together on the Urban Sensorium project in
Professor Liu’s Literature Plus class. Karl Yerkes helped me consider how the users
would feel about machines breaking social contracts, an interest we share. Ali Van
Dam helped proofread and put up with the never-ending stream of deleted tweets on
my screen.
ABSTRACT
Aether
Real-time Recovery and Visualization of Deleted Tweets
by
Pehr Lorand Hovey
TABLE OF CONTENTS

1. Introduction
   1.1. Problem Statement
   1.2. Relevance of the Research
2. Related Work
   2.1. Human Condition Online
   2.2. Research on Twitter
   2.3. Saving Online Data
   2.4. Software for Data Aggregation & Visualization
3. Recovering Deleted Tweets
   3.1. Twitter API Overview
   3.2. Metrics for Understanding Deleted Tweets
   3.3. Designing a Data Gathering Framework
        3.3.1. Grab: Receiving and Storing Tweets & Users
        3.3.2. Retrieve: Looking Back into the Archive
        3.3.3. React: Interacting with Client Applications
        3.3.4. Utility Stages
   3.4. Client Applications
4. Visualizing Deleted Tweets
   4.1. User-Centric Visualization
        4.1.1. SparkTimeline
        4.1.2. UserDetails
        4.1.3. Lifetime
        4.1.4. Departure
   4.2. Content-Centric Visualization
        4.2.1. RecentText
        4.2.2. WordFrequency
5. Discussion
   5.1. The Nature of Internet Speech
   5.2. Analyzing Deleted Tweets
   5.3. Natural Language Processing for Twitter Data
   5.4. Design Assumptions
   5.5. Technical Performance Considerations and Results
   5.6. Future Work
6. Conclusion
References
Appendix
   A) Twitter Data
   B) Aether Data
   C) Aether OSC API
        C.1) Server side (messages from client)
        C.2) Client side (messages from server)
1. Introduction

audience participation without requiring the development of one-off mobile applications that may be platform-dependent. Instead, users can simply send a tweet using software they may already be familiar with.
1.2. Relevance of the Research
Current academic research on Twitter and social networks in general primarily
focuses on how users interact with others and what types of things they say. Far less attention has been paid to why people delete things after they have been published. Aether takes on this question while engaging other social aspects that have been mapped onto the online space, such as expectations of privacy.
On the technical side, Aether represents research into methods for high-volume
real-time data acquisition and processing. It tests various methods for implementing
a multi-stage pipeline and dealing with a never-ending data stream. The client-server
setup of separating the data gathering components from the end product provides an
example for other similar systems that need to have independent components
situated on multiple networks.
2. Related Work
This project operates in both the technical and social realms and has influences and
parallels in multiple fields. Though Twitter is the medium for data acquisition, the
core principle of Aether is examining people and their actions online. While there
has been much research focused on online publishing, not much has been said about
online self-erasure and the implications of people regretting what they said online.
Deleting a tweet can be considered an implicit expression of regret. Secret
Regrets [2] posts user-submitted regrets in a blog format. PostSecret [3] publishes
pictures of user-submitted secrets, usually in the form of anonymous postcards sent
through the mail. While the content and tone varies widely, many of the published
cards consist of an admission of something the sender regrets in some fashion.
REGRETS [5] by Mulfinger and Budgett is a multi-year project that asks people
to anonymously submit short descriptions of things that they regret. The project
initially solicited input from physical booths and mobile computing platforms that
went directly to people. The website continues to accept new input and serves as an
ever-growing archive of regret around the world. REGRETS approaches this subject
by focusing on specific, quantifiable regret as vocalized by the feeler. Aether takes a
different angle by looking at implicit examples of regret (people going back and
erasing something they said), and leaves it to the viewer to decide what the person
was regretting. Unlike REGRETS, the subjects do not realize their actions are being recorded, which may avoid response bias at the cost of precision, since we can only infer what they may have been thinking.
symbol to denote a screen name [7]. They reveal how people use Twitter for
conversations and to what extent the current design of Twitter may hamper true
collaboration.
back to Twitter’s nascence in 2006. While they will likely provide interface tools for
the data, it remains to be seen if they will make deleted tweets available.
Tweleted [10] was a service that exploited a former bug in Twitter’s systems to
find deleted tweets for a specific user, entered on the website. Sometimes deleted
tweets were not removed from the search API so they would turn up in a comparison
with the regular lookup API. Twitter has since fixed the issue and the service no
longer works.
In some sense, we are all temporary archivists of online data, since our web
browsers store temporary copies of every visited webpage in a cache on disk. For
some users, cache clearing is infrequent and they can build up a large collection of
outdated webpages.
scene graph, data graph and timing graph to give the developer control over every
aspect of a complex dynamic visualization piece.
Prefuse [1] is a Java library that provides tools for creating interactive data
visualizations. It provides several data structures for storing and searching the data as
well as several different kinds of data plots and filters to transform the data and the
visual presentation. Prefuse Flare is a recent Flash implementation of the same
concept.
potentially unlimited amount of time. Once connected, tweets stream in continuously until the client disconnects. The full Twitter stream (dubbed the Firehose) is only
available to large-scale “partners” such as Google or the Library of Congress but the
sampled stream still provides an avalanche of data. The publicly available stream is
said to contain a uniform 1/20th sample of all tweets [15]. This interface provides
parameters to filter by keyword or by a specific group of users. This API is the most
difficult to use efficiently since it involves maintaining a potentially unreliable
connection and processing data in real-time as it arrives. If data is not processed fast
enough, Twitter reserves the right to terminate the connection.
Aether uses the Streaming API to get real-time tweet data and deletion notices.
The REST API is used to look up additional information about a user.
ephemeral and subject to ambiguity (the same tag meaning multiple things) or
fragmentation (multiple different tags for the same thing) but some research has
looked into using them for more powerful semantic applications [18].
Technically, a hashtag is any text without spaces, preceded by a “#” symbol.
They are often used to tag a tweet about a specific timely event, such as The Oscars
(#oscars) or a particular concert (#okgomay222010). The presence of one or more
hashtags in a tweet would suggest that this person was engaging in a global
conversation beyond their close circle of followers. In effect, they amplified their
voice prior to deleting what they said.
Mentions are a way of tagging a specific user by prefixing their screen name with
the ‘@’ symbol. On most Twitter clients there is a column or view that automatically
displays other people’s tweets that mention you. Mentioning users is primarily how
people engage in a conversation directly with one or more people. Deleting a tweet
with a mention in it is the equivalent of trying to take back something that was said
face-to-face in real life, though there is no guarantee that the other person saw the
mention.
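As a concrete illustration, hashtags and mentions can be pulled out of the raw tweet text with simple regular expressions. The following Ruby sketch shows one way to do this; the patterns are illustrative and not necessarily the exact ones Aether uses.

# Sketch: extracting hashtags and mentions from tweet text with regular
# expressions. The patterns are illustrative, not Aether's exact ones.
text = "Loved the show tonight! #okgomay222010 thanks @fifinoir"

hashtags = text.scan(/#(\w+)/).flatten   # => ["okgomay222010"]
mentions = text.scan(/@(\w+)/).flatten   # => ["fifinoir"]

engaged_globally  = !hashtags.empty?     # joined a wider, tagged conversation
addressed_someone = !mentions.empty?     # spoke directly to another user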
Followers_count & friends_count: The number of followers is useful since it is
the number of people that will immediately receive any tweet sent, prior to it being deleted (though there is no guarantee they will see it before it is deleted). Follower counts
are also interesting because the number is out of direct control of the user—other
people have to voluntarily follow them and can always un-follow them, driving the
number down. Friends are the people this user is following. Some researchers have
looked beyond raw counts to see how the ratio of followers to friends may be a
predictor for user behavior on Twitter [8] and spam detection [19].
Lifetime: The length of time that a tweet was available for the world to see
before it was deleted is an interesting metric because it alludes to the (unknowable) number of people that may have seen it before it was rescinded. Longer lifetimes suggest more compelling reasons for deletion than fixing a simple typo, which would have been noticed immediately.
Location: Twitter users can enter their physical location to be displayed on their
profile, though there is no requirement for accuracy or even that it be a real place.
Unlike Twitter’s tweet geotagging features, this location typically refers to a
hometown or other regional designation. The location is geocoded to precise latitude and longitude using the Google Maps geocoding service when possible.
Source: The specific software used to post the message to Twitter can be used to estimate whether the user was at a desktop computer or on a mobile device.
Time sent, time deleted: The creation time is recorded from the initial tweet
data (reported in UTC time). The deletion time is recorded when the deletion notice
is received, also in UTC time. Times can be converted to the user's local time zone by using the UTC offset.
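A minimal Ruby sketch of these time calculations, using the field names from the streaming API example in the Appendix (the deletion timestamp is simply the arrival time of the delete notice):

# Sketch: deriving the tweet lifetime and the user's local send time.
# Field values are taken from the sample status object in the Appendix.
require 'time'

created_at = Time.parse("Thu Apr 29 22:53:35 +0000 2010")  # tweet "created_at", UTC
deleted_at = Time.now.utc                                   # recorded when the delete notice arrives
utc_offset = 0                                              # user "utc_offset", in seconds

local_sent = created_at + utc_offset                        # shift into the user's local time
lifetime   = (deleted_at - created_at).to_i                 # seconds the tweet was visible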
Typo text: The Levenshtein edit distance [18] is used to compare the tweet to the set of other tweets that were sent by this user before and after this one. If the distance to any of them is low (less than 5), the deleted tweet was likely a typo that the user corrected and re-sent. The tweet that replaced this one is retained for analysis.
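A sketch of this typo test in Ruby, using the 'text' gem for the Levenshtein distance (an assumed choice; the thesis does not name the implementation Aether uses):

# Sketch: flag a deleted tweet as a probable typo correction by comparing it
# to the user's neighboring tweets. The threshold mirrors the one described
# above; the 'text' gem is an assumed choice of Levenshtein implementation.
require 'text'

TYPO_THRESHOLD = 5

def probable_typo?(deleted_text, neighboring_texts)
  neighboring_texts.any? do |other|
    Text::Levenshtein.distance(deleted_text, other) < TYPO_THRESHOLD
  end
end

probable_typo?("Ok back not geting too bad.",
               ["Ok back not getting too bad."])   # => true (distance 1)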
User description: The self-reported “about me” section helps humanize the user and gives a sense of their writing style.
UTC Offset: The time difference (in seconds) between the user’s local time and
UTC / Greenwich Mean Time is used to estimate the user’s current time zone.
Word Frequencies: Counts of how many times every word has been seen in all
deleted tweets are useful to see if a tweet is similar in content to other tweets that
have been deleted.
Additional data beyond what is described above is retained to round out the
visualizations and facilitate experimentation with the data. A detailed rundown of all
data is in the Appendix.
platform compatibility and can run on most modern operating systems. There are
many implementations of the Ruby language including one that runs on the Java
Virtual Machine (JRuby) and one by Microsoft that can run in a web browser
(IronRuby). The “standard” implementation is called Matz’s Ruby Interpreter (MRI).
The current stable version of MRI is 1.9.1 while the 1.8 branch is still widely used
and supported. Aether was developed on MRI 1.9.1.
The Aether server has the joint duty of continuously gathering and storing new
data while also serving data to multiple visualization clients, a structure that requires
concurrent programming. Data is shuffled through multiple steps in a modular
pipeline that all execute in parallel. Each stage runs as a completely separate Ruby
process and communicates using Distributed Ruby (DRb) [19], a core Ruby
component that allows any Ruby object to be shared over the network. In theory
each stage could run on a separate computer, though for simplicity they are all run
together on the same testing server.
The Aether pipeline consists of three primary stages: Grab, Retrieve and React.
These stages operate in Producer/Consumer relationships where each stage produces
data that is consumed by the next stage in line. Stages receive data via a push
paradigm whereby the producing stage remotely populates the target stage’s input
queue, rather than a poll method where the target stage would periodically check
upstream for new data.
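The following Ruby sketch shows the push pattern between two stages over DRb. The stage names, URI and Queue-based interface are illustrative; Aether's actual stage objects are more elaborate.

# Consumer process (e.g. the Retrieve stage): expose an input queue over DRb.
require 'drb/drb'
require 'thread'

input_queue = Queue.new                                   # thread-safe FIFO
DRb.start_service('druby://localhost:9001', input_queue)  # illustrative URI

worker = Thread.new do
  loop do
    item = input_queue.pop          # blocks until the upstream stage pushes data
    # ... process the status or delete notice here ...
  end
end
DRb.thread.join

# Producer process (e.g. the Grab stage) would do, in its own script:
#   DRb.start_service
#   retrieve_queue = DRbObject.new_with_uri('druby://localhost:9001')
#   retrieve_queue << delete_notice   # push, rather than making Retrieve poll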
Unlike some Producer/Consumer systems, there is no way for a consumer to tell
the initial producer, Twitter, to temporarily pause data production in the event that
the consumer cannot keep up. Moreover, the volume of data coming into the system
is not constant and may include temporary spikes that must be accounted for to avoid
data loss. Thus it is imperative that each stage process data as fast as possible to
avoid losing data. Each stage must process different volumes of data with the input
volume generally decreasing in later stages as unneeded data is discarded. Figure 1
shows the data flow in the Aether server with sample volumes from a thirty minute
test run at mid-day.
Figure 1: Aether server data flow
{:delete=>{:status=>{:user_id=>69114405, :id=>8408482002}}}
All delete notices are immediately forwarded to the Retrieve stage for further
processing.
Each status object contains metadata for the status update as well as information
regarding the sending user. An excerpt of a typical status object is in Figure 3.
{
  "id": 13094530000,
  "text": "Ok back not getting too bad. Hopefully I'll get some sleep.",
  "created_at": "Thu Apr 29 22:53:35 +0000 2010",
  "source": "<a href=\"http://ubertwitter.com\" rel=\"nofollow\">UberTwitter</a>",
  "user": {
    "friends_count": 127,
    "lang": "en",
    "created_at": "Thu Nov 20 02:36:19 +0000 2008",
    "statuses_count": 7038,
    "time_zone": "Edinburgh",
    "profile_link_color": "0000ff",
    "geo_enabled": true,
    "followers_count": 63,
    "location": "Anywhere you are!!",
    "screen_name": "fifinoir",
    "name": "fiona",
    "id": 17502304,
    "utc_offset": 0
  }
}
An important function of the Grab stage is to discard statuses we are not interested
in to save resources. Subsequent stages in the pipeline may perform computationally
expensive processes on each status so it is imperative to minimize the amount of
work to be done and not lose desirable data due to buffer overflows. Each status
object is run through a special skip function that determines if we should discard it or
pass it on to later stages.
The criteria for skipping a status depend on the scope of the project and the
intended uses of the resulting data. It is also important to minimize the complexity of
the skip function since it will be executed more than any other processing function in
the data pipeline. More specific and intensive tests can be deferred to later stages that
process comparatively fewer data objects. The skip function has been shown in
practice to discard as much as 60% of the data, saving considerable resources.
Automatically discarding data can be risky depending on the application
requirements so it is important to choose the skip criteria carefully. Details on the
assumptions underlying the skip function criteria are presented in the Design
Assumptions section.
All statuses that pass the skip test are stored in the local memcache instance for
later retrieval, keyed to the unique tweet id.
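A simplified Ruby sketch of the skip test and the cache write follows. The criteria shown (an English-language profile and a plausible US time-zone offset) are condensed from the Design Assumptions section, and the dalli gem is an assumed stand-in for whatever memcached client Aether actually uses:

# Sketch of the Grab stage's skip test and cache write. Criteria are
# simplified from the Design Assumptions section; the dalli gem is an
# assumption, not necessarily Aether's memcached client.
require 'dalli'

CACHE      = Dalli::Client.new('localhost:11211')
US_OFFSETS = (-10 * 3600)..(-5 * 3600)      # rough UTC offsets, Hawaii to Eastern

def skip?(status)
  user = status['user']
  return true unless user['lang'] == 'en'                        # English-language profiles only
  return true unless US_OFFSETS.cover?(user['utc_offset'].to_i)  # likely US time zone
  false
end

def handle_status(status)
  return if skip?(status)                   # cheap test, run on every incoming status
  CACHE.set(status['id'].to_s, status)      # keyed to the unique tweet id
end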
database on startup. The total size of the history list is capped so the oldest events are
eventually discarded from the buffer.
Each fresh Event is popped from the incoming queue and immediately broadcast
to all active clients. This enables each client to receive new deleted data in near real-
time. The new Event is also added to the top of the history list to keep it up-to-date.
React also immediately broadcasts server DataRate updates since they are time-
sensitive.
RateMonitor runs once per second to get an estimate of current data rates. The
current timestamp is bundled with the rates in a Ruby Hash and forwarded to the
React stage for transmission to clients.
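A self-contained Ruby sketch of this once-per-second monitor; the counters and the queue standing in for the React stage are illustrative, not Aether's actual objects:

# Sketch of RateMonitor: sample per-second counters and forward them,
# with a timestamp, as a Ruby Hash. Counter and queue names are illustrative.
require 'thread'

class Counter
  def initialize; @n = 0; @lock = Mutex.new; end
  def increment;  @lock.synchronize { @n += 1 }; end
  def reset;      @lock.synchronize { v = @n; @n = 0; v }; end
end

grab_counter     = Counter.new   # incremented by Grab for each stored tweet
retrieve_counter = Counter.new   # incremented by Retrieve for each checked delete
react_queue      = Queue.new     # stand-in for the DRb handle to the React stage

Thread.new do
  loop do
    react_queue << {
      :timestamp       => Time.now.utc,
      :saved_per_sec   => grab_counter.reset,
      :checked_per_sec => retrieve_counter.reset
    }
    sleep 1
  end
end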
or even hardware devices with scrolling LCD screens. This project does not attempt
to dictate exactly how a client application should function, but the components
needed for successful interaction with the server API are documented.
The Aether server is designed to maintain connections with an arbitrary number
of client applications that have registered to receive data. The server pushes new
deleted tweets to all registered clients as they are recovered, enabling near-real-time
visualization. The client-server API is built on Open Sound Control (OSC) since it
provides for push-style data flow [20]. Pushing data to the clients as it is created is
important since the flow rate is not consistent, and it avoids wasting resources by constantly polling the server for new information. The server has a well-defined OSC
address space that acts as the external API for the deleted tweet data. Clients must
implement a similarly structured OSC address space to provide specific endpoints to
receive new data. All OSC data is sent across the network using the User Datagram
Protocol (UDP) since it is simple, well supported and efficient.
Clients must first register with the server to enable the server to send data back
automatically. This is accomplished by sending the client’s IP address and active
port number to the server’s /register address. These details are used to assemble a
unique identifying key, which is required for all subsequent server interactions.
Multiple clients from the same IP address can operate simultaneously so long as the
ports are unique. The visualization clients developed for this project use random ports within a certain range to minimize the chance of collisions.
The IP address and port must be accessible by the server in order to establish
successful bi-directional communication. This is not always possible, such as when
the client is behind a router on a private network. In these situations a server-side
technique known as UDP Hole Punching [16] can be used to trick the network into
allowing the communication. On the client side, Universal Plug-n-Play can be used
to automatically open the required port if the network hardware supports it. This
method was successfully used in the Java-based client applications via UPnPlib from
SBBI [23]. These methods are not needed when both the client and server are on the
same network but may be required if a client is being run offsite such as at a gallery.
Clients should send periodic /hello keep-alive messages to inform the server that
they are still running and expecting data. This is useful if the server restarted and no
longer has a record of the active client. Messages from unknown clients will cause
the client to be re-registered using the provided client key, re-establishing the
connection. The time of the most recent keep-alive message is recorded for auditing
purposes. This feature is designed to maintain reliable communication between the
server and many clients when left unattended for long periods of time.
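A client-side sketch of registration and keep-alive in Ruby using the osc-ruby gem. The gem choice, server address, key format and message arguments are assumptions based on the description above (the project's own clients were written in Java and Processing):

# Sketch: register with the Aether server and send periodic keep-alives.
# Host, port, key format and message arguments are assumptions.
require 'osc-ruby'

SERVER_HOST = 'aether.example.edu'        # hypothetical server address
SERVER_PORT = 5555                        # hypothetical server OSC port
CLIENT_PORT = 6000 + rand(1000)           # random local port to avoid collisions
MY_IP       = '192.168.1.50'              # address the server should push data to

server_out = OSC::Client.new(SERVER_HOST, SERVER_PORT)
client_key = "#{MY_IP}:#{CLIENT_PORT}"    # assumed form of the identifying key

# Register so the server can push deleted-tweet Events back to this client.
server_out.send(OSC::Message.new('/register', MY_IP, CLIENT_PORT))

# Periodic /hello keep-alive so a restarted server re-registers this client.
Thread.new do
  loop do
    server_out.send(OSC::Message.new('/hello', client_key))
    sleep 60
  end
end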
There are three main OSC methods for requesting event data from the server.
/event/latest will send the most recent event back, while /event/random will return a
random event from the history buffer. Since the history buffer is a fixed size and
constantly updated, even the random event will be relatively recent. /event/history
will request the entire history buffer. This is useful to bootstrap a fresh visualization
so it does not start empty.
Event objects from the server are transmitted as serialized JSON, as shown in
Figure 5. The client must parse this into a usable representation. The event contains
all the fields present in the server’s Event object stored in the database.
Periodic updates from RateMonitor are sent as serialized JSON to the /rate
address of each client.
Figure 5: Serialized Event JSON sent to a client (excerpt; includes fields such as "created_at" and a "history" flag).
These initial visualizations were developed in Java and Processing [23] and are
targeted for autonomous gallery presentation instead of detailed user interaction.
This means that the programs will run unattended for long periods of time and
update without any requirement or opportunity for user input. This approach deviates from Ben Shneiderman's Visual Information-Seeking Mantra of "Overview first, zoom and filter, then details on demand" [16] since we are not allowing users to interact directly with the visualization. Even so, the design process took into account the overarching goal of Shneiderman's theory: ensuring the information is conveyed
efficiently to those who want to see it. This puts extra pressure on the use of space
and time to ensure it is interesting and not too cluttered since there is no opportunity
for viewers to filter or modify the presentation.
With so many potential variables to investigate, it was decided to segment the problem into two separate but complementary tracks – user-centric and content-centric. When both visualizations are displayed simultaneously side-by-side they provide a detailed picture of real-time deleted tweet data.
4.1.1. SparkTimeline
This widget is common to both visualizations. Data rates coming from the server are
rendered as a continuously updating sparkline. Sparklines are described by Edward
Tufte as “small, high resolution graphics embedded in a context of words, numbers,
images” [20].
Time progresses to the right and the previous data points are plotted before the
graph reaches the right side of the screen (the end of the display period), where it
rolls over and starts from the left side again. The display period is variable and
determines how fast the graph moves across the screen, as well as how dense the
resulting lines appear. A period of 120 seconds is the default.
The white line depicts the rate of tweets being stored by the Grab stage and the
colored line shows the deletes being checked by the Retrieve stage. All data points
are graphed in terms of items per second. New events are marked on the timeline
upon arrival as vertical lines. Previous events remain on the timeline until it rolls
over.
4.1.2. UserDetails
This widget displays some basic user information and puts the deleted tweet in
context. Metadata such as user screen name, location and number of followers helps
build an image of who this person is in real life. The user’s profile picture and
biographical description are displayed to further humanize this person.
The deleted tweet is displayed in context with a few tweets sent around the same
time that remain undeleted. The age of the tweet (how long ago it was originally
sent) is also displayed. This reinforces the “lifetime” of the tweet since viewers can
see how long ago this tweet was sent, and how far back in history the user had to go
to find the delete button.
Figure 8: Tweet context in UserDetails widget
4.1.3. Lifetime
The length of time that each tweet was publicly available is plotted as a one-dimensional plot, with time progressing to the right. Time is apportioned along the
axis using a logarithmic scale due to the large range of values involved.
Labels are provided at meaningful times like Minute, Hour, Day and Week. Events
are drawn as translucent lines (with the current event highlighted), allowing for
relative density to be estimated where there are clusters of overlapping events.
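The mapping itself is straightforward; a small Ruby sketch, assuming the axis runs from one second to one week:

# Sketch of the logarithmic placement of a lifetime along the axis.
# The one-second-to-one-week axis range is an assumption for illustration.
MIN_LIFE = 1.0               # seconds
MAX_LIFE = 7 * 24 * 3600.0   # one week

def lifetime_to_x(lifetime_seconds, axis_width)
  s = [[lifetime_seconds.to_f, MIN_LIFE].max, MAX_LIFE].min
  axis_width * Math.log(s / MIN_LIFE) / Math.log(MAX_LIFE / MIN_LIFE)
end

lifetime_to_x(60, 800)     # one minute on an 800-pixel axis => ~246
lifetime_to_x(3600, 800)   # one hour                        => ~492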
4.1.4. Departure
The two primary events of a tweet’s lifetime – being launched into the web and
subsequently being snatched back – are modeled as a transit departure graph similar
to those highlighted by Tufte [21]. Two days of time are displayed along horizontal axes at the top and bottom. Some events span more than one day, such as tweets that
were sent in the evening and deleted the next morning. By displaying two days we
can accurately place them on the timeline without the deleted time coming before the
sent time. This follows the wrap around principle for schedules that Tufte discussed
[22].
Figure 10: Departure widget
Each event is drawn as a line connecting the sent time on top to the time of deletion
on the bottom. The graph primarily investigates the relative difference between time-
of-day that a user sent their tweet and when they chose to take it back. This can show
if tweets that were sent late at night are often deleted in the mornings.
4.2.1. RecentText
The content of each tweet is the center of attention for this visualization, so the
previous several tweets are displayed in rows with the current tweet highlighted. The
user’s profile picture is displayed next to each tweet to stamp it as unique –
reminding viewers that unlike the rows of contextual tweets in the User
visualization, each tweet comes from a different user.
4.2.2. WordFrequency
This widget displays the most frequently seen words over the course of the
visualization. Words in the current tweet that also appear in the most frequent words
are highlighted which makes it easy to see how typical this tweet is. In this context,
“words” are any tokens that are not URLs or emoticons and do not contain non-word punctuation. There is no requirement that tokens be actual dictionary words (since many are
slang or abbreviations). These are entered into a RiConcorder from the RiTa library
[27], which calculates cumulative word frequencies. Each time a new event arrives
the word frequencies are updated and a new set of top words is extracted.
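A Ruby stand-in for this logic (the actual client uses RiTa's RiConcorder in Processing/Java; the token filter here is a simplification of the rules described above):

# Sketch: cumulative word counts over incoming deleted tweets, with URLs,
# emoticons and punctuated tokens filtered out. Simplified stand-in for RiConcorder.
word_counts = Hash.new(0)

def tokens(text)
  text.downcase.split(/\s+/).reject do |t|
    t =~ %r{\Ahttps?://} ||   # drop URLs
      t =~ /[^a-z0-9']/       # drop emoticons, hashtags and punctuated tokens
  end
end

def add_event(word_counts, tweet_text)
  tokens(tweet_text).each { |w| word_counts[w] += 1 }
  word_counts.sort_by { |_, n| -n }.first(20)   # current top-20 words
end

add_event(word_counts, "whats up? #oscars :) http://bit.ly/x")   # => [["whats", 1]]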
Figure 13: Full User-centric Visualization
5. Discussion
“right now” theme of Twitter encourages us to just send it immediately so we can
ready ourselves to send another. There are so many tweets being sent all the time it
would seem quaint to agonize over individual details—setting us up to potentially
regret things said in haste.
People are becoming more open with what they share online and yet
simultaneously clinging to an expectation of privacy or stealth, dubbed the “privacy
paradox” by Susan Barnes [17]. When things that were publicly viewable all along
are suddenly thrust into the spotlight—such as when parents read their children’s
online postings—the immediate reaction is often shock and surprise rather than
responsibility for what was said. We forget that the Internet will never be a truly
private space.
A topic of interest when considering Internet publication and the notion of
responsibility is the question of single vs. multiple publication. Is mass-publication
defined by a single event (such as hitting the “submit” button on a blog post), or is
the act replayed every time someone else views the work? This can be a big
distinction on the Internet when items can be viewed by thousands of people, and it
is not always possible to get an exact count.
This question has roots in defamation disputes dating back to the very first printed publications. When assessing damages, courts had to consider whether the publisher was
liable for just the initial act of publication (regardless of the number of copies
produced) or could be assessed for each instance of the printed work (perhaps
thousands of times). The current legal precedent has converged on the single
publication rule, which stipulates that a single edition of something can make the
publisher liable for only one case of libel [31]. There is still debate as to how to map
the concept of an edition to the digital realm.
Though primarily a legal concept, the single publication rule can influence how
we think of deleted tweets. It may caution us to avoid ascribing too much importance
to tweets that were deleted after a long time. Though the tweet may have been
viewable for days or even weeks, a user trying to rescind their speech out of fear of
causing offense is no more liable than if they had deleted it right away.
Since Twitter is often described as a form of “microblogging” it is important to
consider how even normal-sized blogs are different from traditional publishing.
“Weblogs” have become quite popular in part because of the low barrier to entry—
blogging platforms are usually free and easy for novices to use. Writers are expected
to “be themselves” and not conform to rigid standards of content and refinement. An
interesting characteristic of blogs is that entries are arranged chronologically and not
alphabetically or in a logical order that supports an overall argument [27]. Twitter
takes this point to the extreme as several consecutive tweets by one person may all
belong to different conversational threads or may be independent expressions. This
limits the amount of automatic contextual interpretation we can do based on the data
presented in the visualization – only other humans can really decide how a tweet fits
into the surrounding context, if it fits at all.
distant reading (Franco Moretti’s idea of poring over large datasets to gain insight
through aggregation [34]). The text seems at once noisy and heavily encoded—typos
and sentence fragments co-exist with emoticons, URLs and other semantic elements
employed to pack as much meaning into 140 characters as possible. Twitter users’
fluid grammar and hit-or-miss spelling thwart many efforts at parts-of-speech tagging and classification. Social datasets like Twitter are likely only to grow in size
and ubiquity, leaving much work for digital humanists and computer science
researchers.
While computational tools have a ways to go before they can be reliably applied
to random sets of tweets, there are other angles that can be examined at present.
Placing the rescinded tweet in context with ones sent immediately prior and afterward helps us envision the user's behavior patterns. Though many tweets contain typos, it is possible to estimate whether a typo was the reason for deletion: Aether compares each deleted tweet to the other tweets the user sent around that time, and a close match indicates that the user re-sent the tweet with only minor corrections.
Incidentally, out of a sample of 20,000 deleted tweets, only 16% were found to be
typos using these criteria.
Specific dates can be correlated with events in the news, or days of the week (do people spend Monday complaining about the new week, only to have second thoughts and delete them later in the week?). Time of day can be examined as well, such as whether tweets sent on Friday nights out are often deleted on Saturday mornings.
Time continues to play a central role with the consideration of the tweet lifetime
(the number of seconds that elapsed between when the tweet was sent and when it
was deleted). During this time the user was doing other things while their tweet
floated through cyberspace being seen by an unknowable number of entities. When
plotted in the Departure widget of the User visualization it is clear that many tweets
are deleted within an hour of being sent. And yet, there are ones that were deleted
days, even weeks later. It took effort for the user to even find the tweet in the Twitter
interface in order to hit the delete button, and it is likely that no human Twitter users were looking at that tweet anymore. These long-tail deletions offer one of the most
interesting points of investigation in Aether.
Though machine-learning approaches to content categorization were not
successful (see the discussion of Natural Language Processing tools in 5.3), some
basic classification in Aether was still possible using regular expressions. Dividing
content into categories can provide a new angle on the data, exploring questions such as whether explicit tweets are deleted more often than tweets with non-offensive language. Regardless of the efficacy of classification and the number of meaningful categories employed, there will always be some tweets that can only be labeled as idle chatter.
This chatter takes the form of such vacuous statements as “whats up?” and “Awake.”
Both are real tweets that were really deleted. Why would someone bother to delete a
short snippet of conversation that may not have had a purpose to begin with? That
question alone may stump the viewers of the system.
Recent legal cases also suggest that some deletions may be motivated by self-preservation. There are frequent media reports of people getting into trouble as a result of immodesty online, such as losing a job due to scandalous postings on Facebook. Recently, tweets have started to be used as evidence in court – to the
chagrin of their senders. The case of Amanda Bonnen vs. Horizon Group is an
example that demonstrated both the use of Twitter in the courtroom, and the risk that corporations take in pursuing individual people through the legal system [36]. In
this case Bonnen used Twitter to complain about mold in her Horizon Group-owned
apartment. Horizon sued her for defamation, promptly elevating the case to national
attention. Bonnen was a small-scale Twitter user with fewer than a few dozen followers who saw her original tweet. The tweet in question ended up being read by
millions as a result of media publicity, giving substantial negative publicity to
Horizon. In the end the court dismissed the case, citing the tweet as being “too vague
to meet the strict definition of libel.” While this episode ended happily for the
defendant it is entirely plausible that people may have second thoughts about a
pointed comment they made and take it back in hopes of avoiding litigation or other
negative consequences.
The autonomous Aether system is not the only actor in this equation. Like any
software it is unfeeling and does not care about the semantic and emotional content
of other people's tweets; it just calmly collects them as directed. The viewers play the role of interpreter, since they may feel empathy toward those whose thoughts and feelings are being sampled and dissected. When we see these rescinded statuses on the wall we feel a sense of voyeurism, as if we are spying on their authors.
Some viewers of the installation have questioned the legality and ethics of storing and presenting other users' deleted tweets. While the moral and ethical
considerations surrounding this process are an enduring question (and provide
purpose to this project), the legality can be addressed by looking at the Terms of
Service and other documentation published by Twitter. The primary Terms of
Service (aimed at general users of Twitter) does not cover deleted tweets [39]. The
Developer Rules of the Road specifically mention deletion, saying:
Respect the privacy and sharing settings of Twitter Content. Promptly change
your treatment of Twitter Content (for example, deletions, modifications, and
sharing options) as changes are reported through the Twitter API [38].
Though not clearly spelled out, this would hint that Twitter does not want developers
publicly displaying tweets that have been deleted. Twitter’s help pages mention that
while you can delete your tweets from their system, they may remain in search indexes [17], and other site language also warns that they may remain in third-party applications (such as this one). Future work includes an idea to operate a "lost and
found” that catches these tweets and sends them back to the person that deleted
them. A compelling element to this plan is the unknown level of user response or
backlash. Users reacting strongly to such a system may in fact cause Twitter to
tighten and enforce their policies with regards to the use of deletion requests.
are difficult to process using higher-level tools because each individual update is
short (less than 140 characters) and is low in content.
Automatic content-based categorization and classification was investigated to
shed light on the nature of the tweet’s content. These tools typically require
assembling a large training set of data for each possible category. These sets are used
to generate a ‘fingerprint’ that characterizes the items that are expected to fall into
each category. New items can then be categorized by comparing their contents to
each fingerprint using statistical tests. Some researchers have found success trying
to classify tweets based on content and sentiment using a variety of techniques [39]
[40]. The Java Text Categorization Library [41] was used to experiment with text
classification for Aether but results were not reliable due to the limited size of input
text.
Parts-of-Speech (POS) tagging involves labeling each word in a sentence with its
linguistic part of speech (such as noun or verb). Digital humanists use POS tagging
to investigate individual sentence structure and dialect as well as inform research in
linguistic trends. POS tagging efforts with tweet data were mildly successful but ran
into contextual inaccuracies. Words that could be a noun or a verb were often
mislabeled, an error that can be avoided with contextual awareness not easily
available with such limited, noisy data. Many tokens found in tweets do not readily
map onto our notion of part-of-speech, such as emoticons and hashtags. The RiTa
library [30] was used to explore POS tagging since it is designed for Processing and
has many different text processing features.
One interesting NLP tool that deserves further study is the Markov-chain
sentence generator. This works by analyzing a large group of sentences to produce a
statistical model that can be used to generate similarly structured sentences. While it
can be difficult for these tools to generate truly realistic sentences, a cursory
investigation suggested that they are well suited for use with Twitter content. Since
tweets are sentence-like in length and structure but typos and grammatical issues are
tolerated, the typical shortcomings of a Markov-chain generator are accommodated.
Generated pseudo-tweets appeared to be very similar to actual tweets. This could be
used in a future work that looks at the fluidity of language evidenced in status
updates, and how possible it is to fool people with fake updates.
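To make the idea concrete, a minimal word-level Markov chain in Ruby is sketched below; the training strings are placeholders, not recovered data:

# Minimal word-level Markov chain over tweet-like text. Training strings
# are placeholders, not real recovered tweets.
training = [
  "ok back not getting too bad hopefully i'll get some sleep",
  "hopefully i'll get some coffee before the meeting",
  "not getting too much sleep this week"
]

chain = Hash.new { |h, k| h[k] = [] }
training.each do |tweet|
  tweet.split.each_cons(2) { |a, b| chain[a] << b }   # record which words follow which
end

def generate(chain, start, max_len = 12)
  out = [start]
  (max_len - 1).times do
    nxt = chain[out.last].sample
    break if nxt.nil?
    out << nxt
  end
  out.join(' ')
end

puts generate(chain, "hopefully")   # e.g. "hopefully i'll get some coffee before the meeting"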
while still keeping a coherent grouping (instead of discarding them randomly). The
written language requirement was developed since the audience viewing the
visualizations will be predominantly English speakers.
Country of origin is determined via the user profile’s location as well as UTC
offset (time zone). Both of these metrics have the potential to be misleading since
they are user-supplied via the Twitter settings page. Relying on the estimated
location is fraught with peril since the location string is free-text, may not be their
actual location, and may not even be a real place. Since geocoding takes non-negligible time, it is deferred until after retrieval and cannot be used to skip a tweet outright before saving, which wastes some memory. The UTC offset is a more effective test
since it is quick to check and it is probable that most users have filled it in correctly
since this setting affects how Twitter displays the times of all the tweets on their
website.
Written language is checked by looking at the user profile “language” field. This
too can be prone to errors as not all users have picked the right language in their
Twitter preferences, and some people write in more than one language. A Ruby
library called ‘whatlanguage’ is used in the Retrieve stage to estimate the language
based on the actual textual content.
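A small sketch of this check; the WhatLanguage interface shown is the gem's commonly documented usage, and treating any non-English result as skippable is an assumed policy:

# Sketch of the Retrieve-stage language check using the 'whatlanguage' gem.
# Assumes the gem's WhatLanguage#language interface; results are heuristic.
require 'whatlanguage'

wl = WhatLanguage.new(:all)

wl.language("Ok back not getting too bad. Hopefully I'll get some sleep.")  # likely :english -> keep
wl.language("Hoy no tengo ganas de trabajar en la oficina")                 # likely :spanish -> discard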
and the database-storing stage constantly overflowed, causing up to 50% of all data
to be lost. Additionally, Twitter throttled the amount of data being sent since it was
not being consumed fast enough. This resulted in a far lower deleted tweet recovery
rate while still using almost all of the server’s resources and negatively impacting the
other Aether components.
These MySQL performance issues resulted in a switch to Memcached [38], a technology originally developed by LiveJournal as a distributed caching system to increase the performance of high-traffic websites. Memcached stores all data
in volatile RAM instead of on disk, so it is not persistent, meaning the data disappears if Memcached or the computer is restarted. This contrasts with
regular databases such as MySQL, which maintain data until it is intentionally
deleted. Memcached instances on several computers can be joined together to
provide a larger storage space and potentially increased performance as the load is
shared across machines. It was decided that not losing any data when it arrives from
Twitter is a better short-term goal than having an infinitely large archive but not
knowing how much data was never saved.
Unrecovered tweets are currently being held in a 768MB Memcached instance
running on the test server. In practice only about 88% of the capacity is available for
data items—the rest is used by Memcached for recordkeeping and other tasks. As the
cache fills up, older tweets get evicted which limits how far back in time Aether can
look. In practice, the limit is about 500,000 tweets, covering about 12 hours of
history assuming a nominal saved tweet rate of 12 per second. Increasing the size of
the local Memcached instance or expanding the pool to include Memcached
instances on other servers would increase the maximum look-back time.
While data rates fluctuate, about 4% of all data items received were deletion requests; the rest were new tweets. On average, about 5% of these deletion requests
were successfully retrieved. The other 95% of missing tweets can be explained by a
few factors. First, we discard up to 60% of all tweets received in order to save
resources and limit the data to the desired demographics, so these tweets will never be recovered. As mentioned previously, increasing the size of the Memcached
instance or using a different database system would increase the number of tweets
that can be archived at once, increasing the recovery rate.
Besides technical implementation concerns, there are a few known systemic
limitations to the current approach. Aether may not be suitable for academic statistical analysis since the sample stream contains only 5% of all tweets sent and we discard a
large percentage of them to save resources. Additionally, the design decision of
pursuing real-time analysis effectively prohibits doing heavy processing on every
single tweet received (such as geocoding). This makes it difficult to compare
recovered tweets to the larger population since unrecovered tweets do not have as
much derived metadata.
issues can be addressed in order to not lose data.
There are many other applications of the deleted tweet data besides ephemeral
visualization. A Twitter account named @tweet_morgue has been created to collect
these “dead” tweets and make them available for others to see using Twitter’s own
infrastructure. If the dead tweets are prefixed with the original sender's Twitter screen name (a mention) then the sender will likely see their deleted tweet come back onto
their timeline, provoking interesting user reactions. By sending deleted tweets back
to the originator the @tweet_morgue also acts as a sort of lost & found for errant
tweets. A primary theme of this project so far is that the senders do not realize they
are being observed. Would their behavior change if they were made aware that their
deletions are being archived and dissected?
6. Conclusion
Aether has proven to be a robust platform for real-time processing and recovery of
deleted tweets. The multi-stage pipeline approach and separated client-server
architecture has allowed for a flexible system that can support several heterogeneous
clients running simultaneously at multiple locations. The push-style client
communication over OSC allows for clients to display deleted tweet information
within seconds of the real-world deletion event.
The data produced has provided an interesting window into an often-ignored part
of the data lifecycle. Initial visualizations made the data accessible to the general
public and often provoked questions and observations as people viewed the
previously hidden actions of Twitter users around the country.
Future work will continue to make the data available in more formats and to
more people. As people realize their online data is not completely within their control, they may reconsider what they say and give up deleting altogether. Until then,
Aether will be watching.
References
[1] Lisa Jevbratt. (2006, June) Searching traces of We: Mapping Unintended
Collectives. [Online]. http://jevbratt.com/the_voice/the_voice_cat.pdf
[2] Jonathan Harris and Sep Kamvar. (2005, August) We Feel Fine. [Online].
http://www.wefeelfine.org
[7] Bernardo Huberman, Daniel Romero, and Fang Wu, "Social Networks
that matter: Twitter under the microscope," 2008.
[10] Steve Lohr, "Library of Congress Will Save Tweets," New York Times,
April 2010.
[13] IBM Visual Communication Lab. (2004) Manyeyes. [Online].
http://manyeyes.alphaworks.ibm.com
[17] Martin Hepp. (2010, February) Hyper Twitter: Weaving a web of linked
data, tweet by tweet. [Online]. http://semantictwitter.appspot.com/
[18] Teng-Sheng Moh and Alexander J. Murmann, "Can You Judge a Man by
His Friends? - Enhancing Spammer Detection on the Twitter
Microblogging Platform Using Friends and Followers," in Information
Systems, Technology and Management, vol. 4, Bangkok, 2010.
[26] Antony Unwin, "Interacting With Graphics," in Graphics of Large
Datasets.: Springer, 2006, p. 77.
[28] Ben Shneiderman, "The Eyes Have It: a Task by Data Type Taxonomy
for Information Visualizations," in Proceedings of the 1996 IEEE
Symposium on Visual Languages, College Park, MD, 1996, p. 336.
[33] David Crystal, Language and the Internet.: Cambridge University Press,
2001.
[35] Sapna Kumar, "Website Libel and the Single Publication Rule," The
University of Chicago Law Review, no. Spring, pp. 639-662, 2003.
[Online]. http://www.jstor.org/stable/1600592
[37] Franco Moretti, Graphs, Maps, Trees: abstract models for a literary
history. London: Verso, 2007.
[39] Twitter, Inc. (2009, September) Twitter Terms of Service. [Online].
http://twitter.com/tos
[42] Alec Go, Richa Bhayani, and Lei Huang, "Twitter Sentiment
Classification using Distant Supervision," Stanford University, Paper
2009.
Appendix
A) Twitter Data
Below is an example of data that Twitter sends for each tweet in the streaming API,
in JSON format. It includes the nested User object as well as an example of the
geotagging feature, which is not yet widely adopted by Twitter users.
{
  "contributors": null,
  "created_at": "Thu Apr 29 22:53:35 +0000 2010",
  "source": "<a href=\"http://ubertwitter.com\" rel=\"nofollow\">UberTwitter</a>",
  "in_reply_to_status_id": null,
  "place": null,
  "geo": {
    "type": "Point",
    "coordinates": [
      56.419056,
      -3.40316
    ]
  },
  "in_reply_to_screen_name": null,
  "user": {
    "profile_background_tile": false,
    "friends_count": 127,
    "description": "",
    "lang": "en",
    "favourites_count": 26,
    "verified": false,
    "created_at": "Thu Nov 20 02:36:19 +0000 2008",
    "profile_background_color": "9ae4e8",
    "following": null,
    "profile_text_color": "000000",
    "url": null,
    "statuses_count": 7038,
    "time_zone": "Edinburgh",
    "profile_link_color": "0000ff",
    "profile_image_url": "http://a1.twimg.com/profile_images/456261026/twitterProfilePhoto_normal.jpg",
    "geo_enabled": true,
    "notifications": null,
    "followers_count": 63,
    "protected": false,
    "location": "Anywhere you are!!",
    "contributors_enabled": false,
    "profile_sidebar_fill_color": "e0ff92",
    "screen_name": "fifinoir",
    "name": "fiona",
    "profile_background_image_url": "http://s.twimg.com/a/1272044617/images/themes/theme1/bg.png",
    "id": 17502304,
    "utc_offset": 0,
    "profile_sidebar_border_color": "87bc44"
  },
  "in_reply_to_user_id": null,
  "coordinates": {
    "type": "Point",
    "coordinates": [
      -3.40316,
      56.419056
    ]
  },
  "truncated": false,
  "id": 13094530000,
  "favorited": false,
  "text": "Ok back not getting too bad. Hopefully I'll get some sleep."
}
B) Aether Data

B.1) Tweet
Every tweet received is stored as its own object along with a small subset of user information to make retrieval easier.
Identifiers: Tweet ID; user ID
User details: Tweet source (program used to send the tweet)
Time: Time sent and time deleted (UTC); user UTC offset for adjusting to local time; lifetime (number of seconds before the tweet was deleted)

B.2) User
User objects contain a subset of the fields that Twitter sends with each tweet. Users are stored separately to track how many times each person has deleted a status.
Identifier: Twitter user account ID
User details: Screen name (account name); display name (supposedly their real name); location string (not geocoded); biographical description; profile image URL
Time: UTC offset in seconds; created_at (time registered)
Counts: Friends count; followers count; statuses count; count of deleted tweets from this user (that we know of)

B.3) Event
Events are created from a recovered deleted tweet. They contain a mixture of official Tweet and User data as well as derived metrics.
Identifiers: Tweet ID; user ID
Text: Original text; "word text" (original text minus punctuation and non-word tokens such as URLs and hashtags); "typo text" (the tweet that replaced this one, if it is a typo); extracted hashtags, URLs and mentions
Time: Time sent and time deleted (UTC); user UTC offset for adjusting to local time; lifetime (number of seconds before the tweet was deleted)
User details: Screen name; biographical description; location string; geocoded location (latitude & longitude); tweet source (program used to send the tweet); profile image URL
Counts: Friends count; followers count; count of deleted tweets from this user (that we know of)
Other tweets: A set of tweets sent before and after the deleted tweet, retrieved at recovery time; useful for building context.