
An Exploratory Approach to Social Network

Analysis
PhD Thesis Proposal
Joshua White
Engineering Science
Clarkson University
[email protected]
January 27, 2011
Abstract
Social network analysis is the study of human inter-connectivity; social media sites are those tools which have been created to enable easier and more prolific communication within these networks. The field of social media network analysis has emerged over the past two years as a viable means of understanding the sentiment of individuals and groups as it pertains to particular events.
Social media networks make up a large percentage of the content available on the Internet. Most of the time users spend online today is spent interacting with them. All of the seemingly small pieces of information, added by billions of people, result in an enormous and rapidly changing data-set. Searching, correlating, and understanding the billions of individual posts is a significant technical problem; even the data from a single site, such as Twitter, can be difficult to manage.
We propose a system and process for social network analysis. We describe the overall architecture, including the capture, storage, search, and graph components. We present various use cases for the data, including the detection of phishing websites, analysis of the spread of trends, and the detection of malicious content. We describe our experiences with managing a 100+ terabyte data-set. We discuss the complications of large data-set management and our experiences dealing with the evolution of the Twitter REST API, and present strategies for maximizing the amount of data collected.
This proposal presents the additional work proposed in fulfillment of the Engineering Science PhD requirements. This work includes a method for the analysis of trending events, such as the #KONY2012 meme, and for understanding the social roles that individuals play as part of these networks. This work will include large-scale graph analysis, aided by custom map-reduce scripts, which significantly increase the efficiency of the system.
Contents
1 Background and Related Work
1.1 Traditional Social Networks
1.1.1 Definition of Actors In a Social Network
1.2 Online Social Networks
1.3 Privacy and Legal
1.4 Current Publications
1.4.1 Coalmine: An Experience in Building a System for Social Media Analytics
1.4.2 A Method For Automated Detection of Phishing Websites Through Both Site Characteristics and Image Analysis
2 Detailed Research Plan
2.1 Motivation
2.2 Data Collection System
2.2.1 Twitter Collection Rate Limitations
2.2.2 Insights Into Twitter API
2.2.3 Capture System Network Design
Sampling Method
2.3 Our Research Data-set
2.4 Data Analysis System
2.4.1 Data Analysis
2.5 Active Research Area
2.5.1 Tracking the Rise and Fall of KONY2012
2.6 Schedule for completion of this research
3 Citations
List of Figures
1 Coalmine System Diagram
2 Coalmine Example Query Table
3 Phishing Detection Process Overview
4 Basic Onion Routing Diagram
5 Logical Capture System Network Flow
6 DISCO DDFS File System Infrastructure Diagram [6]
List of Tables
1 Twitter Collection Rates
2 Plan for completion of my research
List of Code Blocks
1 Setting Up Multiple TOR Circuits
2 Basic Twitter API Subscription Script
3 Basic Twitter API Subscription Script With Proxy Support
4 Collection System Restart Script for Cron
1 Background and Related Work
This work builds on a significant history of research in the area of SNA (Social Network Analysis) as it applies to the social sciences, as well as large-scale data analysis. We present techniques that combine methods from the different disciplines in new and unique ways. The following subsections present an overview of the key areas on which we have focused our research.
1.1 Traditional Social Networks
Traditional network science is focused on studying connections in various types of networks. This includes everything from biological networks to the electronic networks we rely on in the computing realm. Similarly, social media network analysis is the study of how connections occur in human relationship networks. These networks have existed since humans began socializing. The term social network simply describes the interaction between all humans in every form of communication.
Our own work was inspired in part by Malcolm Gladwell's book, The Tipping Point, in which he discusses how life can be thought of as an epidemic [1]. Thinking in this way leads Gladwell and those who read his work to think about the spread of information and trends in terms of an outbreak where key people, Mavens, Connectors, and Salesmen, are primarily responsible. Communication methods have now become digital, and new marketplaces, as Gladwell describes them, have sprung up, such as Facebook, LinkedIn, and Twitter. These marketplaces limit and force certain communication to fit with certain computing standards, which in turn influences the way social networks are shaped.
Both the Behavioral Science and Social Science disciplines have what they term SNA. These fields have similar goals and methods; however, they concentrate on different units of study. These units consist of relationships between groups, networks, individuals, etc. This line of work was started by Auguste Comte, arguably the first sociologist [2]. Comte stated that society existed when individuals had the power to influence each other.
1.1.1 Definition of Actors In a Social Network
Actors are defined in a social network context as being individuals, dyads (two individuals), groups larger than two, and organizations. Gladwell was not the first to attempt to classify the actors of a social group. The most compelling and applicable classification, "The hidden organizational chart," was proposed in 1976 by Allen Harrell as follows [9]:
Actors:
Stars:
This term was first used in "An Experimental Study of the Small World Problem" [10] in 1969, but was adopted by Harrell. Stars are actors with the largest number of interactions within a particular group. This is a relative measure, based upon the number of interactions everyone else in the group has. The small-world experiment itself was actually a series of experiments by Travers and Milgram that examined the average length of a connection path in a social network made up of people in the United States. This series of experiments suggests that, from a societal view, a person is connected to every other person on the planet through very short connection paths. This work was later referred to as the principle of six degrees of separation.
Bridge:
A bridge is a type of actor which has relationships outside of a focal group, and specifically connects that focal group to another actor or group. Bridges have weak ties, usually only two connections; however, they provide the shortest path between two groups or individuals.
Liaison:
Liaisons are similar to bridges; however, instead of being weakly connected, they link many groups together through their individual connections. Again, this actor type provides the shortest path between groups.
Isolates:
Isolates are antisocial actor types, typically not by choice. This is a term derived from developmental psychology which describes the members of a study group who are not part of any group. In some cases this is purposeful, but many times it is not. Isolates do befriend and interact with members of other social groups, but they do not tie their
identity to any particular group. For the sake of our social network analysis, an isolate is a person who doesn't direct their messages at anyone in particular or repeat anyone's sentiment directly (i.e., retweet).
We can take the classification of actor types further by incorporating Gladwell's three prominent types: Connectors, Mavens, and Salesmen [1]. Gladwell states that for a social epidemic to be successful, actors with a specific set of social traits must be involved.
Connectors:
A connector is someone who knows a lot of people. On some social networks, like Facebook, this would be someone who has a lot of friends. On more complex online social networks, such as Twitter, this will be someone who has a lot of messages directed at them (@username). While Twitter does support following and being followed, this is not the same as the so-called friending that occurs on other networks. On a site such as Facebook, friending requires acceptance of that friendship by both the friend-er and the friend-ee. Twitter's system is open, and as such, the following relationship cannot be construed as a friendship.
It is important to note that connectors must have a large number of connections and they must also be willing to introduce them to one another. In doing so, the connector becomes a social group bridge. These individuals have the innate ability to be friends with people from different backgrounds and with different views. Connectors are confident, energetic, and social. These traits are hard to measure in a system such as Twitter; however, it is not impossible.
Mavens:
Mavens are those individuals who others rely on for new information. A maven is an actor who collects new knowledge, usually from other individuals, and is willing to share it when asked. Mavens gather information, many times in an effort to solve a problem that they themselves are having. However, due to their need to share and their better-than-average social and communication skills, they are able to pass on the knowledge they have collected. What's more, a maven will generally not stop at passing this knowledge on to just one person; they usually will repeat it over and over again to many individuals until it becomes
an epidemic. Gladwell notes that mavens can be thought of as information brokers, since they often trade or ask for more information in return.
Salesmen:
Salesmen are the most aptly named actor type that Gladwell defines. These individuals are charismatic persuaders who have the ability to convince others into agreeing with them. They are well spoken, typically good-looking socialites, with great negotiating skills. Gladwell lists all the typical factors that make up a good salesman, but he also goes on to state that most salesmen are aware of the stickiness factor. This idea states that the packaging of the message must be memorable.
1.2 Online Social Networks
Social media networks all share one thing in common: they allow users to contribute their own content. However, that is where the similarity stops; these systems come in various forms:
General search and browsing sites allow users to recommend or suggest content for other users.
Video, audio, and image sharing sites allow users to share their own content or review others'; the most popular of these sites are YouTube, Flickr, Vimeo, and Hulu.
Geo-mapping services allow users to share maps about things going on in their communities on a wide range of topics, from politics to health. One such example is Google Flu, which tracks user reports of flu outbreaks.
Blogging and micro-blogging sites are the most prolific and the hardest to track; these include everything from individual personal blogs to mass micro-blog aggregation sites such as Twitter.
Profile sharing sites are of the most concern for privacy because they encourage users to share personal information with individuals they only casually know or don't know at all. The largest of these sites is Facebook.
These categories for social media networks are regularly accepted classifications based on a number of works, the most controversial of which is arguably entitled "Publicly Available Social
Media Monitoring and Situational Awareness Initiative," which was published by the Department of Homeland Security in 2010. It outlined a potential program to use social network data to monitor potential threats to the United States [3].
While an analysis of the privacy concerns that social media network data mining poses is outside the scope of this work, it is worth pointing out that we are continually monitoring the literature and law surrounding this. All the seemingly small pieces of information added by billions of people result in a huge, rapidly changing dataset. Searching, correlating, and understanding billions of individual posts is a significant technical problem; even all the data from a single site such as Twitter can be difficult to manage.
1.3 Privacy and Legal
There are a lot of emerging statistics on the lack of privacy in social media networks like Facebook and Twitter. The most obvious issue that we've seen in our work is the ability to geo-locate a large percentage of users on Twitter and other social networks. This location information comes from additional message data that is appended to each tweet automatically by the end user's application. Additionally, since all of this data is available to the public, tracking a person's habits, movements, and relationships can be done in real time. This, combined with geo-location, leaves individuals open to being tracked and their movements predicted.
There is a shocking lack of laws regarding the use of this data currently, whether by individuals, corporations, or governments. In 2010, Michael Zimmer published a rebuttal to the public outcry over Harvard's use of students' Facebook profiles for sociological research, titled "But the data is already public: on the ethics of research in Facebook" [7]. The Department of Homeland Security published "Publicly Available Social Media Monitoring and Situational Awareness Initiative Update" [3], which attracted a lot of negative attention on privacy blogs. Along those same lines, the DOJ paper "Obtaining and Using Evidence From Social Networking Sites" stated that the DOJ uses fake user accounts created by investigators to trick people into friending them [8]. Joseph Calandrino, author of "You Might Also Like: Privacy Risks of Collaborative Filtering," states that functioning in a society as a normal human being automatically exposes a person to certain privacy risks.
1.4 Current Publications
The following subsections consist of excerpts from the overview portions of our 2012 conference proceedings from the SPIE Defense, Security, and Sensing Conference.
1.4.1 Coalmine: An Experience in Building a System for Social Media Analytics
Coalmine was designed to be a flexible data mining architecture that can process large amounts of streaming social media data for data-mining applications. The initial focus of Coalmine was to process live Twitter updates, or tweets, for the purpose of discovering botnet activity. Botnet activity is defined, for the purposes of this paper, as command and control traffic that is carried over Twitter updates.
When processing a stream of Twitter traffic, there is a lot more than the 140 characters that make up a user's message. Each tweet represents about 1 KB of data. The tweet text is at the center, and a rich set of metadata surrounds it. This additional tweet metadata includes information about the user, the user's account, and potentially the user's location at the time of the tweet. Since Twitter is a social networking site, many of the connections between users can be gleaned from the included post data.
Our system consists of both a back end, which can span multiple headless servers, and a front end, which can access the indexed data. Each component of our system is shown in Figure 1. The four primary components are: (1) the data collection and storage component, (2) the ad-hoc query tool, (3) the batch processing component, and (4) live filtering, classification, and alerting.
1. The data collection and storage component consists of third-party APIs, such as the Twitter REST interface. Our system has been developed in both Java and Python; as such, we have chosen Tweetstreamer as our first connection library. This client provides the basic wrapper around the Twitter API and facilitates authentication to the Twitter servers [28].
The development team extended the simple TwitterClient to provide file I/O, error handling, and multiple client connections to Twitter, based on keywords and specific users. The data is received from Twitter as a stream of JSON objects. As the individual streams are received, they are combined and buffered for file output in gzip format.
Figure 1: Coalmine System Diagram
At this time, the data is passed through an inline compression routine before it is written to the output file. The compression ratio is high, averaging 1:4. Based on this compression ratio, it is calculated that, in the case of Twitter, all of the tweets in a given day amount to 150 gigabytes of compressed data.
2. The query tool provides a user with the ability to index one or more data files, to provide Google-like search capability.
The core of the query tool component uses the Apache Lucene data indexing library. Each field of the tweet metadata is stored within a large indexed repository, created when the query tool processes the data file. The user then queries the repository by selecting a metadata field, such as text for the text of the tweet, and providing a term to match.
For example, if a user is looking for a tweet containing the word Twitter, the user would enter the string text:Twitter into the query text box. The query syntax is very similar to that of Google and allows for order of operations within the query. Figure 2 depicts an example of the query panel and the resulting table generated from a query.
3. The batch processing environment allows for longer-term usability of the collected data. The core of this component is the Apache Hadoop open-source framework [29].
Hadoop allows for the efficient processing of large amounts of data. It provides developers
Figure 2: Coalmine Example Query Table
with two key sub-components. The first is a distributed file system for the organization, storage, and transfer of data associated with data processing. The second is a distributed process execution environment based on the MapReduce method [25].
This environment allows for efficient processing of data over multiple machines that are aligned in a cluster configuration, and it can scale as new machines are added to the cluster. MapReduce allows data to be chopped into tiny pieces and processed by any machine within the cluster. The results can be recombined when the processing is complete. To use the framework, the developer has to implement the processing logic specific to their data-sets. The execution process is handled completely within Hadoop.
On small batches, Hadoop can add considerable overhead and is not recommended. However, in our work, where batches are 10-100 gigabytes in size, the Hadoop execution environment is the only system that we have found that is flexible and efficient at processing such large amounts of data.
4. The final component of the Coalmine system is the live filtering, classification, and alerting functionality. Once a user has made a query and the front end returns the intended results, those results can then be used to form a filter. These filters can be set to run continuously on all incoming data streams. To do so, we have created a rudimentary classification system for the incoming data. Since it takes too much time and computational power to filter across
multiple posts, our filtering system for now only works on a post-by-post basis. That is to say, we can set up a filter that will look for all instances of a certain word, but we cannot set up a live filter that correlates all instances of a specific tweet across multiple accounts. To do so would require keeping an enormous amount of correlative data in memory. This functionality is better left to the batch processing component of our system. Currently, when our system does find a match using one of our live filters, we report that match using a text-based log file in syslog format. It is our plan that, as the system matures, these logs can be read by a standard SIEM (Security Information and Event Manager).
Automated Analysis System Approach
While our work on automated analysis is only in its initial phases, we have determined the basic approach for doing so. We first use the NLTK (Natural Language Toolkit) to analyze all text within a single object, such as a tweet, and define the following objects:
User Profiles: These contain a combination of lists containing unigrams, bigrams, and complete messages. Unigrams refer to single uses of meaningful words that a user has posted; likewise, bigrams are combinations of two words, while complete messages are just that.
Account Profiles: These contain the complete historical record for an individual user. These profiles consist of only the posts that we have recorded with our tool and as such may not be a complete record of everything an individual user has ever posted.
Corpus: This consists of all posts that we have ever recorded in our entire data-set.
To make use of this data for automated analysis, we typically list all unigrams and bigrams by the frequency at which they occur within a particular account profile. While the exact count of each is useful for determining how often a particular user discusses a topic, it is not necessarily all that interesting if we are trying to look for more covert communications.
Using the statistics from an individual user's account profile, we can start to determine a trend in speech pattern and thus develop a system to look for anomalies in those trends. The same can be accomplished using the complete corpus: as we analyze the sentiment of the entire corpus collected, we start to see fluctuations in the usage of specific terms, which could eventually lead us
to an automated system for analyzing the sentiment of large groups of people.
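For illustration, a minimal account-profile builder along these lines might look as follows; the structure and the whitespace tokenization are assumptions, and the real profiles use NLTK's tokenizers and retain more state.

from nltk import FreqDist, bigrams

def account_profile(messages):
    # messages: list of post strings collected for a single account
    unigrams = []
    for msg in messages:
        unigrams.extend(msg.lower().split())   # nltk.word_tokenize could be used instead
    return {
        "unigram_freq": FreqDist(unigrams),
        "bigram_freq": FreqDist(bigrams(unigrams)),
        "messages": messages,
    }

profile = account_profile(["kony 2012 must be stopped", "stop kony 2012 now"])
print(profile["unigram_freq"].most_common(3))
print(profile["bigram_freq"].most_common(3))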
One of the issues we face as we continue the development of Coalmine is the determination of what is relevant and what is not. Current estimates have stated that SPAM accounts for as much as 50% of all Twitter messages, although that number appears to be receding in recent months [30, 31, 32]. A number of works have been published recently on just this problem. The approach that we are currently integrating is one that utilizes information diversity and user cognition to weed out the noise [35]. This method utilizes entropy-based calculations, which we have already added to the output of Coalmine. These calculations can be used to calculate the diversity of the complete corpus for a particular site such as Twitter. Our next steps include the addition of entropy distortion minimization algorithms to reduce the number of results. We have a lot of work left to completely integrate this technique; however, based on our initial grouping results, we know that entropy-based clustering of our data does work.
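As a small worked example of the entropy-based diversity idea (not the integrated Coalmine routine), the Shannon entropy of the term distribution in a set of posts can be computed directly; low entropy signals repetitive, SPAM-like content, while higher entropy signals more diverse discussion.

import math
from collections import Counter

def term_entropy(posts):
    counts = Counter(w for p in posts for w in p.lower().split())
    total = sum(counts.values())
    # Shannon entropy in bits: H = -sum(p_i * log2(p_i))
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(term_entropy(["free followers click here", "free followers click here"]))           # low diversity
print(term_entropy(["breaking news from the capital", "concert tickets on sale today"]))  # higher diversity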
Manual Search Criteria
As discussed previously, Coalmine supports manual searching similar to a typical Boolean search engine. We have added the ability to specify the fields from which to search as well as logical operators. In addition, null and non-null operators are expressed as None, which is useful for mass filtering on criteria such as latitude and longitude coordinates. For instance, a search for Geo-Lat:!=None would return all posts where the latitude field was present, meaning the post could be attributed to an actual physical location.
To support these search terms, we have implemented a standard Lucene index rather than a more custom system like TweetQL, which in our studies did not scale as well when performing queries across billions of records [33]. Lucene is a Java-based solution for fast searching of indexed data. Currently, the Coalmine front-end interface creates queries through Lucene using PyLucene [34].
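The query semantics can be illustrated without the Lucene machinery. The sketch below is an assumption made for illustration, not the PyLucene code; it shows the field:term and !=None behavior over decoded tweet records.

def matches(tweet, field, term):
    value = tweet.get(field)
    if term == "!=None":
        return value is not None          # e.g., Geo-Lat:!=None keeps geolocated posts
    return value is not None and term.lower() in str(value).lower()

tweets = [{"text": "join us", "Geo-Lat": 44.66}, {"text": "hello Twitter", "Geo-Lat": None}]
print([t for t in tweets if matches(t, "Geo-Lat", "!=None")])
print([t for t in tweets if matches(t, "text", "Twitter")])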
Botnet C2 Detection Case Study
One of our initial focuses for Coalmine use was the detection of botnet command and control channels on Twitter. A botnet is a network of infected systems known as bots (based on the word robots) that are remotely controlled by a malicious individual known as the botmaster. Botnets allow attackers to wage war on a much larger scale, which potentially can be led from a single system. Typically, they are not employed by the average attacker, but instead by so-called nation states, criminal organizations, hacktivists, and those seeking profit [36, 37]. Botnets proliferate for both technical and non-technical reasons. The primary reason is often that owners of these malicious networks sell access to them as a service. Like any good businessman, they attempt to attract sales by beating out the competition on stats and features. They must therefore grow their networks by taking over more systems to use as bots.
Another popular reason is perhaps the simplest: revenge. A botmaster will grow their armies in order to carry out a vengeful attack. This was proven true during the Low Orbit Ion Cannon (LOIC) attacks carried out by the group Anonymous, which occurred during the first quarter of 2010 [38]. While that specific incident was non-traditional, in that systems were infected because their owners purposefully downloaded the malicious software, it nonetheless proves the point.
Before we go further, it is prudent to discuss the following common terms, which are used
throughout this section.
Botnet: A network of infected machines known as bots that are controlled by a malicious entity. This network is used to attack victim systems by exploiting a weakness, such as a maximum service connection limit.
Bot: Once known as zombie computers, bots are the infected systems that do a botmaster's bidding. Bots are modular programs that can be updated with new commands and payloads to help them carry out orders. The difference between a bot and a typical remote-controlled system is the legitimacy of access. Bots spread through infection; while the infection means may vary, the end result is that the owner of a system has not authorized the control.
Botmaster: Once known as a bot herder, the botmaster is the human or group that controls the activities of a botnet. This individual or group may not be the one who originally created the
malicious software or the botnet itself. These individuals or groups were once motivated by the traditional hacker cultural directive of seeing what can be done. Nowadays, money, power, revenge, and politics are the primary drivers.
Command and Control Channel: The command and control channels are the primary systems, sites, and protocols that allow a botmaster to communicate orders to their army. The channels are typically obscured from normal view through means of obfuscation, encryption, and atypical (for the average user) protocols. A botmaster doesn't communicate directly with each node in the army because doing so would be easily traceable.
Like any other network build-out, a botnet has a fairly standard creation procedure, consisting of what Zhu et al. discuss in their work A Botnet Research Survey [39] as a four-phased approach. While this approach is very basic, it aids in understanding the process by which a botnet is created.
1. Initial Infection: The exploitation of a system that allows it to become infected by a bot program. This stage results in the basic ability to remotely control a victim.
2. Secondary Infection: This usually consists of the botmaster sending a command that instructs the bot to download a more sophisticated version of itself, a payload of some sort, or even carry out a simplistic initial attack.
3. Malicious Activity: This is the act that the bot carries out, due to the command and/or payload that was sent. This may be as simple as spreading the infection, or as complex as a distributed computing [17] function.
4. Maintenance and Upgrade: More often than not, a botmaster will send a command to a bot that includes instructions for upgrading or regular maintenance reporting.
Detection
As we discuss later, a distinction can be made as to not only the methods used for botnet detection, but also the data sources that feed these methods. In Bailey et al.'s A Survey of Botnet
Technology and Defenses, this distinction has been made for us [40]. The paper describes data sources as being limited depending on sensor location. In this way, the internal enterprise network has more access to data sources, such as DHCP and DNS resolutions, than the carrier network provider would normally allow. While we disagree with Bailey et al. that one network location is better than another, it is better to think of each location as having unique insight into the problem, and of the fusion of multiple data sources as being a better solution for traditional detection tools.
Currently, the most popular tools on the market for detecting botnets don't actually detect the bots at all. Instead, they focus on their command and control channels. Bailey et al. define a number of C2 channel detection methodologies. They make the same points that we had in turn discovered through our own initial testing. Three major competing detection technologies are available: (1) Bothunter, (2) Botsniffer, and (3) Botminer. They all use the same method of detection through cooperative behaviors. In their own ways, each of these tools looks for the grouping patterns that occur on these malicious networks, based on either statistical methods or comparative models.
1. Bothunter was funded by the Army Research Office and developed by SRI International. It is a tool which detects bots using Snort as a dialog event generation engine. These events are fed into Bothunter's modeling and comparison programs, which look for significant correlations between real events and recorded malicious events [41].
2. Botsniffer attempts to detect botnets using network temporal-spatial detection and passing the output through a series of statistical algorithms that look for grouping behavior [43].
3. Botminer is really an add-on for Botsniffer that allows cross-correlation to be done on the data by breaking it into cluster series. This is very useful for finding channels if you know what an existing one looks like, but can generate a high number of false positives if you do not [42].
1.4.2 A Method For Automated Detection of Phishing Websites Through Both Site Characteristics and Image Analysis
Phishing website analysis is largely still a time-consuming manual process of discovering potential phishing sites, verifying whether suspicious sites truly are malicious spoofs, and, if so, distributing their URLs to the appropriate blacklisting services. Attackers increasingly use sophisticated systems for bringing phishing sites up and down rapidly at new locations, making automated response essential. In this paper, we present a method for rapid, automated detection and analysis of phishing websites. Our method relies on near real-time gathering and analysis of URLs posted on social media sites. We fetch the pages pointed to by each URL and characterize each page with a set of easily computed values, such as the number of images and links. We also capture a screenshot of the rendered page image, compute a hash of the image, and use the Hamming distance between these image hashes as a form of visual comparison. We provide initial results demonstrating the feasibility of our techniques by comparing legitimate sites to known fraudulent versions from Phishtank.com, by actively introducing a series of minor changes to a phishing toolkit captured in a local honeypot, and by performing some initial analysis on a set of over 2.8 million URLs posted to Twitter over 4 days in August 2011. We discuss the issues encountered during our testing, such as the resolvability and legitimacy of URLs posted on Twitter, the data sets used, the characteristics of the phishing sites we discovered, and our plans for future work.
Background
Phishing [44] is the act of convincing users to give up critical personal information, either through conversation or some form of content manipulation. Most modern-day phishing attacks occur by luring users into visiting a malicious web page that looks and behaves like the original. Once there, the user, if convinced that the page is authentic, may give up private information including authentication credentials or banking information. This information is typically used to commit some form of identity theft or fraud. The most common methods used today for the detection and analysis of phishing websites are:
Manual view-and-report services such as Phishtank.com [45]. This is by far the most common method, which first requires actual users to find, identify, and report suspicious web pages, and second requires additional people to verify the status of reported pages.
Correlating links in known SPAM email to phishing sites [46]. Industry standard SPAM
email detection techniques are used to identify SPAM email, and then links from those emails are examined. Typical examination includes looking for URL redirection or a variety of DNS tricks that might signify a phishing site. The emphasis is typically on examining characteristics of the URL itself. Unless a URL is malformed or some other SPAM-like characteristic is identified, phishing sites are often not identified in this way.
Crawler classification of websites. Pages themselves are examined for suspicious characteristics like misspelled words, link obfuscation, right-click menu disabling, DNS redirection, etc. [47]. This method is similar to our own in that it places a strong emphasis on fetching and analyzing the actual page rather than just the URL. However, it looks for suspicious content in individual pages, whereas we focus on identifying groups of pages that would look the same to an unsuspecting user.
Method
Our work takes inspiration from this manual process of finding pages that look the same. Specifically, we automate some of what the human does to recognize duplication of an original page. To do this, we analyze a web page based on some of its structural characteristics and based on the way it looks visually. Specifically, we record a number of page markup characteristics, including the page title text and the number of links, images, forms, iframes, and metatags. Throughout the rest of this paper we will refer to this 5-tuple as a page's structural fingerprint. The image analysis portion of this work, at its simplest, is done by setting a fixed dimension and quality setting for rendering a page within a headless browser and then taking a page screenshot. We compute a hash of the resulting image and compare the hash values using the Hamming distance equation. This process takes on average 4 seconds per page, including software build-up/tear-down time. To speed up the process further, we could prioritize image analysis for pages with matches in the most easily parsed/computed items from the page's markup.
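A minimal sketch of these two comparisons is given below, assuming Python's standard HTML parser and hex-encoded screenshot hashes; the helper names are illustrative, not our production code.

from html.parser import HTMLParser

class FingerprintParser(HTMLParser):
    COUNTED = ("a", "img", "form", "iframe", "meta")
    def __init__(self):
        super().__init__()
        self.counts = dict.fromkeys(self.COUNTED, 0)
        self.title = ""
        self._in_title = False
    def handle_starttag(self, tag, attrs):
        if tag in self.counts:
            self.counts[tag] += 1
        self._in_title = (tag == "title")
    def handle_data(self, data):
        if self._in_title:
            self.title += data

def structural_fingerprint(html):
    p = FingerprintParser()
    p.feed(html)
    return (p.title.strip(), p.counts["a"], p.counts["img"],
            p.counts["form"], p.counts["iframe"], p.counts["meta"])

def hamming_distance(hash_a, hash_b):
    # bitwise Hamming distance between two equal-length hex digests of screenshots
    a, b = int(hash_a, 16), int(hash_b, 16)
    return bin(a ^ b).count("1")

print(structural_fingerprint("<html><title>Bank Login</title><form><img></form></html>"))
print(hamming_distance("9f3b", "9f3f"))   # 1 differing bit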
Figure 3 shows a high-level overview of the full process. The first step is the real-time gathering of URLs from social media sites like Twitter. We used a modification of the get_tweets.php script provided by the 140Dev Twitter tool [48] to fetch the raw JSON version of each tweet and store it in a MySQL table. Next, we parse the tweets looking for URLs. Since Twitter requires all URLs to begin with the standard http:// to be hot-linked, it was as simple as parsing the JSON data for that expression. We fetch the page content for each URL using PHP's LoadHTML and
Figure 3: Phishing Detection Process Overview
DOM object walk-through functions to visit the site, load the HTML into memory, and finally count up each of the specific DOM objects. Each time a page has been parsed, it is then logged in the master url_stats.csv file for later analysis. For each new URL, we also use XVFB and CutyCapt to render the resulting page and capture a screenshot. Finally, we compute a variety of hash values on the resulting image.
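The URL-extraction step can be sketched as follows; this is an assumed illustration in Python, whereas the actual pipeline uses the PHP tooling described above.

import json
import re

URL_RE = re.compile(r"http://\S+", re.IGNORECASE)

def urls_from_tweet(raw_json):
    tweet = json.loads(raw_json)
    return URL_RE.findall(tweet.get("text", ""))

print(urls_from_tweet('{"text": "free gift card at http://examp1e-bank.com/login"}'))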
Phishing Website Detection - Related Work
Anti-phishing research has been done for a number of years and falls into a number of categories. A. P. E. Rosiello et al. define these categories as Email-Based, Blacklist-Based, Visual-Clue-Based, Website-Feature-Based, and Information-Flow-Based. The method employed in their work could best be categorized as Website-Feature-Based and consists of a string search technique which identifies the DOM objects within a page and builds a tree of those objects. This tree is directly compared to those of known legitimate pages. When a page claims to be a specific site,
all DOM trees are compared with a known good sample to look for dissimilarities [49]. This approach works as a client-side browser plugin and thus is useful on an individual user basis, but it is not a conducive method for broader analysis.
To date, the most widely deployed approaches for protecting users from phishing attacks are blacklist-based. Blacklist-based tools enable users to be warned away from sites that are known to be malicious but have yet to be taken down. These tools rely on known blacklisting services such as Phishtank.com, which rely on the submission of suspicious URLs for analysis. After a suspicious URL is submitted, a variety of techniques are employed to triage the link, including suspicious DNS reputation, suspicious URL format, URL containment of other domain names in a directory field, and actual matching of known URL terms using lexical analysis [50, 51, 52, 53, 54]. These methods reduce but do not eliminate the manual work required. Most have high false positive detection rates, and all methods get the URLs that they are analyzing from other services, which feed them primarily malicious or at least suspicious URLs to analyze. They have not been shown to be effective at finding malicious URLs in a large live data set of predominantly legitimate URLs.
Google has developed and deployed a hybrid approach. Their system relies on both Google's own page ranking system and an email SPAM filtering system to pre-identify potentially malicious pages before analysis [55]. They also apply a machine learning classifier which considers the characteristics of the URL and the website or message content. While this method works well in Google's own environment, it is unreasonable to assume that an individual institution, or even a government, would be able to deploy their own version of it, given Google's unique place in the Internet infrastructure.
Some other methods for detecting phishing websites that are related to our own are Website-Feature-Based and Visual-Clue-Based. C. Ludl et al. presented evaluation results for a number of anti-phishing tools and determined that the best relied on both blacklist sites that used manual reviewers and those that did some sort of analysis using specific page characteristics. We validate our choices of page characteristics in our own method against those stated in their work [56]. We further validate our method by considering the effectiveness of methods such as the use of page color histogram comparison matching and color vector distance calculations, which have
been proven to be very effective at detecting matches between known phishing and legitimate sites [57, 58]. The main issue with both of these methods is the load that they put on a system, which restricts them from scaling. Comparisons done using these systems range from 0.02 to 11.2 seconds per page while using up to 512 MB of RAM. Our method is able to process each page on average within 4 seconds using less than 32 MB of RAM. Additionally, our design easily facilitates simultaneous processing of pages.
Phishing Website Detection - Conclusion and Future Work
Our work in developing an automated system for phishing website detection using social media data-sets has shown promising initial results. We were able to find a number of phishing sites in a live dataset, and we also gained critical insight into how to effectively and efficiently gather, format, and analyze this social data.
As this work has matured, we have continued building on our general social media analytic collection and analysis process. To date, we have improved our collection process and storage method to the point where we are consuming and storing upwards of 70 million messages a day. Our current work includes the creation of routines that automatically find correlations between potential phishing pages and known trusted sites. Specifically, we use cluster analysis to identify clusters of pages with similar characteristics.
We are in the process of completing the missing pieces of the overall analysis architecture to enable full-scale, real-time analysis of the Twitter data feed. Specifically, we are completing the automated characteristic comparison routines as well as the alerting functionality illustrated in Figure 1. In addition, we plan to implement a true multi-threaded design for simultaneous processing of pages, which will replace our current parallel page-processing instance launching code. We are also working on cluster analysis software that will identify the closest matches in the whole data set without requiring a legitimate site to match against.
As we continue this work, we intend to implement functions for dealing with inconsistent page load times. Currently, we stop trying to process a page after 3000 ms; however, in some select cases pages may need more time to load. We intend to implement a system whereby pages that
have not loaded after 3000 ms will be handed off to a child process that will continue to wait an additional period of time. This will stop a single long-loading page from slowing down the rest of our analytic process.
2 Detailed Research Plan
Based on the lessons learned during our work on Coalmine (see Section 1.4.1), the following subsections detail the system design and operation procedures for our prototype social media network analytics platform. This platform requires a relatively small amount of resources for the capture of the necessary data; however, those resources balloon in size when data retention and processing come into play.
2.1 Motivation
This work was inspired in part by Malcolm Gladwell's book, The Tipping Point, in which he discusses how life can be thought of as an epidemic [1]. Thinking in this way leads Gladwell and those who read his work to think about the spread of information and trends in terms of an outbreak where key people, Mavens, Connectors, and Salesmen, are primarily responsible. In our own work, the idea of trends translates directly to the idea of memes. Memes are phenomena that manifest themselves in online cultural environments like Tumblr, 4chan, Twitter, Facebook, and other social media networks. It is the goal of our research to better understand the key players that Gladwell discussed in his work in terms of descriptive statistics, behavior, and other characteristics as they can be determined through social media network analysis.
A lot of research has been done in the area of social network analysis. Most of this research has been on a relatively small scale. Based on our ongoing literature review, we have determined that the application of data analysis techniques to online social media networks is an understudied area of research. Within this work we propose to answer the following questions:
Can we reliably classify social media/network users as connectors, mavens, and salesmen?
What are the privacy implications of social network analysis?
What potentially harmful information about end users can be gleaned from these networks?
Do social networks have the impact that some believe they do on traditional mass media?
Who are the opinion leaders, and can we measure their influence?
Is there a way to predict changes in public opinion through social media data mining?
Is there a reliable way to detect major events as they happen?
Can we predict how memes/news will spread in and out of social media network sites?
Are individuals or groups covertly manipulating mass media through social network trend creation?
It is the goal of this research to develop tools and methods for answering these questions and
more.
2.2 Data Collection System
The capture infrastructure discussed herein solves a number of issues that did not exist prior to
spring 2011. These include rate limitations that were put on the Twitter API [12].
2.2.1 Twitter Collection Rate Limitations
The current issue stems from the change in API access that Twitter has put in place as of mid-2011 [13]. This change means that normal users/researchers can no longer access the Gardenhose (i.e., 10%) stream unless they have a really good reason, and absolutely no one can get a special code for the 30-40% stream. This is in the wake of Twitter licensing their feeds to GNIP, which resells them at approximately $30,000 a month [17]. What this leaves us with is access to the Twitter service known as Spritzer and the streaming Twitter API. The Spritzer feed is approximately 1% of all tweets happening in real time. The streaming API limits this further by allowing 350 requests to Twitter an hour with a response of no more than 200 tweets each [14, 15, 16]. That is a grand total of 70,000 tweets an hour per application. An application is denoted as any tool written and distributed to multiple users but using the same account credentials. However, Twitter also imposes a cap of 20,000 tweet responses an hour for any single IP address.
This limitation means that, based on a current estimated rate of 200,000,000 tweets occurring daily and the average application, user, and IP combination able to receive no more than 480,000 tweets daily (20,000 * 24 hours), normal users can only collect 0.24% of all tweets. This is too small a sample size to do any real analysis on.
Table 1: Twitter Collection Rates
Authentication Type | Library Used | Speed of Access
No Authentication (Public Stream) | No library, just urllib | 100-300 tweets per minute
Basic Authentication | TweetStream Python library | 1,500-2,000 tweets per minute
OAuth | Tweepy Python library | 4,000 tweets per minute
Whitelist Access with OAuth* | Tweepy Python library | 8,000-10,000 tweets per minute
2.2.2 Insights Into Twitter API
One lesson learned is that Twitter does not do a good job of maintaining their documentation: we found multiple pages with dead links, API functions that have been shut down but are still in the documentation, broken forms, email addresses that don't work, and conflicting statements about rates, limits, and processes. Additionally, we've learned that the community as a whole has developed dozens of approaches to deal with the misinformation and frequent changes. No matter what solution is used to get the data, it most likely won't work six months from now.
2.2.3 Capture System Network Design
We have built the following system to overcome the combination of rate limitations imposed by the various social media networks as well as the network session throttling imposed by carriers of residential Internet service. This system minimizes the number of outgoing connections by wrapping them in an encrypted tunnel through the use of the TOR network. As a result, the IP and MAC address of the requesting computer is hidden from the social media network providers, which provides research anonymity.
TOR (The Onion Router) is a multi-node traffic relay network which requires a trust-based system where the user relays their traffic through an encrypted network. This system is not designed with confidentiality in mind; it is instead intended to provide anonymity. Standard Geo-IP-type packages cannot be used to locate a user on the TOR network because their traffic is bounced multiple times. This method isn't foolproof, since it relies on trust of the relay nodes, the directory nodes, and the exit nodes.
TOR works by creating a private network within the public network in such a way that it provides a means of both anonymity and defense against traffic analysis. It does this by using a multi-
Figure 4: Basic Onion Routing Diagram
node relay system of pseudo-routers: PCs that act as routers. Onion routing works by establishing a virtual route through a tunnel-based network. This network of tunnels is secured using public key encryption in a reverse layering format. The head-end onion router selects a path of nodes that the traffic will be relayed through; in doing so, it generates a symmetric key using the public key of each of the corresponding nodes. The end user's data is then wrapped layer by layer with each node's encryption [18]. Figure 4 shows a basic example of how this works. As can be seen from the image, the layering is done in the reverse order that the traffic will be moving. As the onion is passed to each onion router, a layer is stripped off, revealing the next hop of the journey. This is intended to protect the sender from being identified if one or more of the routers is compromised.
The problem with TOR is that it requires you to establish a circuit through the network to a jump-off point. Establishment of the circuit can take between 10 seconds and 10 minutes, depending on how it is set up. The additional problem is that once you have established a circuit and your application is directed to use it, that jump-off point then becomes subject to the same single-IP rate limitation. To solve this issue, the most obvious thing to do was to first launch multiple instances of TOR and connect to multiple circuits. Code Block 1, a simple Bash script derived from that used by the Data Big Bang group [19], launches multiple TOR instances and connects each of them to an individual circuit.
This moves us one step closer to our goal; however, each time we reach the maximum limit for a single IP with our application, we have to point it at a new circuit's SOCKS port. The easiest way to deal with this was also discussed on the Data Big Bang blog. Each TOR circuit can be bound to HAProxy [20] in a round-robin fashion using a third software package called Delegate [21], which was first sponsored by the National Science Foundation for just this sort of work. Delegate combines all the TOR circuits into one logical service, while HAProxy sends one command to each using round-robin scheduling. This means that a single application can make multiple calls to Twitter without running into an IP limitation and continue to do so at an indefinite rate, limited only by the number of circuits and any other bandwidth limitations that may exist. And exist they do: Twitter only allows a maximum of 100 KBps on average for each connection made.
Code Block 1: Setting Up Multiple TOR Circuits
#!/bin/bash
# Base ports are assumed defaults here; the original script defines them before this loop.
base_socks_port=9050
base_control_port=8118
base_http_port=8080

for i in {0..29}
do
    j=$((i+1))
    socks_port=$((base_socks_port+i))
    control_port=$((base_control_port+i))
    http_port=$((base_http_port+i))
    if [ ! -d data/tor$i ]; then
        echo "Creating directory data/tor$i"
        mkdir -p data/tor$i
    fi
    echo "Running: tor --RunAsDaemon 1 --CookieAuthentication 0 \
        --HashedControlPassword \"\" --ControlPort $control_port \
        --PidFile tor$i.pid --SocksPort $socks_port --DataDirectory data/tor$i"
    tor --RunAsDaemon 1 --CookieAuthentication 0 \
        --HashedControlPassword "" --ControlPort $control_port \
        --PidFile tor$i.pid --SocksPort $socks_port \
        --DataDirectory data/tor$i
done
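The same round-robin idea can also be illustrated in Python without HAProxy or Delegate. The sketch below is an assumption for illustration only (it requires the requests library with SOCKS support installed); it simply rotates each request across the local SOCKS ports opened by Code Block 1.

import itertools
import requests

BASE_SOCKS_PORT = 9050   # assumed to match the base port used when launching the circuits
CIRCUITS = 30
ports = itertools.cycle(range(BASE_SOCKS_PORT, BASE_SOCKS_PORT + CIRCUITS))

def fetch_via_next_circuit(url):
    port = next(ports)                       # pick the next circuit, round robin
    proxies = {"http": "socks5://127.0.0.1:%d" % port,
               "https": "socks5://127.0.0.1:%d" % port}
    return requests.get(url, proxies=proxies, timeout=30)

print(fetch_via_next_circuit("https://check.torproject.org/").status_code)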
After much debate and trial and error, we have chosen to use the Python library tweetstream. It is possible to install this library from pip or easy_install, but it is important that we pull it manually [22].
Figure 5: Logical Capture System Network Flow
Code Block 2 is a sample Python script that connects to the Twitter Spritzer streaming API and appends each tweet's JSON string to a file as a new line.
Code Block 2: Basic Twitter API Subscription Script
import tweetstream

# Append each tweet's JSON string to the output file, one tweet per line
output = open("tweettextoutput.txt", "a")

with tweetstream.TweetStream("securemindorg", "XXXXX") as stream:
    for tweet in stream:
        line = str(tweet)
        output.write(line)
        output.write("\n")
There is a general problem with all of the Twitter API libraries for Python, including tweetstream:
none of them support running over a proxy. To solve this problem we apply an obscure patch which adds
rudimentary proxy support to tweetstream [23]. Once it is added, it is as simple as adding a few more
lines to our previous Python script, as shown in Code Block 3.
Code Block 3: Basic Twitter API Subscription Script With Proxy Support
import tweetstream

output = open("tweettextoutput.txt", "a")

# Route the stream connection through the local proxy front-end
tweetstream.proxy_server = "127.0.0.1:3128"
tweetstream.proxy_username = ""
tweetstream.proxy_password = ""

with tweetstream.TweetStream("securemindorg", "XXXXX") as stream:
    for tweet in stream:
        line = str(tweet)
        output.write(line)
        output.write("\n")
This script is now capable of collecting an average of 2,140,000 tweets a day. This is much
better than we were getting before, but still not the 10% of total tweets that we were initially
looking for.
Sampling Method
Twitter uses a fairly simplistic method for sampling the live stream. Twitter states:
The status id modulo 100 is taken on each public status, that is, from the Firehose. Modulus
value 0 is delivered to Spritzer, and values 0-10 are delivered to Gardenhose. Over a significant
period, a 1% and a 10% sample of public statuses is approached. This algorithm, in conjunction
with the status id assignment algorithm, will tend to produce a random selection. [24]
What this boils down to is that, for the Spritzer streaming API feed, Twitter starts a counter when
you connect. The counter starts at 1 and counts to 100 before starting over, and the first tweet of
each cycle is sampled and returned to the connecting client. After some basic experimentation we
discovered that putting a simple sleep 1 between the start-up of two or more subscriber scripts
such as Code Block 3 is enough for each account to receive a completely different set of tweets.
To prove this point we initially set up 5 Twitter accounts and manually started the collection script
with each set of login credentials, one after the other. After running for 24 hours these 5 subscriber
script / account combinations had collected 16,599,674 unique tweets. Following this logic we
utilized 30 unique accounts for a 147 day period in an effort to collect near 100% of
all tweets from the live stream.
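A minimal sketch of this staggered start-up is shown below. The credentials file accounts.txt and the wrapper script collector.py are placeholder names standing in for our account list and for a script along the lines of Code Block 3, not the exact files we use.

#!/bin/bash
# Start one subscriber per Twitter account, one second apart, so that each
# connection lands at a different point in Twitter's sampling cycle.
while read user pass; do
    python collector.py "$user" "$pass" &
    sleep 1
done < accounts.txt
wait   # keep the shell alive while the collectors run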
Code Block 4: Collection System Restart Script for Cron
#!/bin/bash
#
# Josh
#
# Quick Clean Up, Move, Restart Script
# is run as a cronjob every 12 hours

# killall python   # don't use this, it screws up some of the other scripts

# kill only the collector scripts
ps ux | awk '/twitter/ && !/awk/ {print $2}' | xargs -i kill {}

TIME=$(date +%s)
mv /home/odin/sma/twitter_data/ /home/odin/sma/$TIME
mkdir /home/odin/sma/twitter_data/    # recreate the capture directory
/home/odin/sma/twitter_miner_infrastructure1.0.5.sh -r
cd /home/odin/sma/twitter_data
# ls -lh
cd ..
2.3 Our Research Data-set
Our current dataset consists of 147 days of Twitter data. This data was captured using the system
described in the Capture System section. The data consists of approximately 30 Terabytes of gzip-
compressed, JSON-formatted data. A typical day's data compresses on average to around 70 GB,
although this varies significantly based on major events and on the length and type of posts made.
We currently have the data stored on three external USB RAID enclosures of approximately
11 TB of storage each. In addition, the primary analysis controller holds a copy of any data
currently being worked on within an 18 TB RAID array, in which we store the data in DDFS
(Disco Distributed File System) format. DDFS is a non-traditional, tag-based filesystem which was
specifically designed for use with the Disco map/reduce system. DDFS rides on top of traditional
Linux filesystems like ext4, adding horizontally distributed scaling capability [6].
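As an illustration of how a day's capture becomes visible to Disco, the commands below push one day of compressed files into DDFS under a tag and then list the matching tags. The tag naming scheme and paths are shown only as an example of the workflow, not as our exact layout.

# Tag one day's gzip'd JSON capture so Disco jobs can address it by name.
ddfs push data:twitter:2012-03-05 /mnt/raid/twitter_data/2012-03-05/*.json.gz
# List the tags that now exist under the data:twitter prefix.
ddfs ls data:twitter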
Figure 6: DISCO DDFS File System Infrastructure Diagram [6]
2.4 Data Analysis System
The introduction of Map/Reduce by Google has changed the way large data-sets are processed [25].
A number of projects have formed that support the implementation of map/reduce algorithms and
simplify its deployment, such as Disco and Hadoop [11, 26]. Still, developing algorithms for
map/reduce and deploying these systems is no simple task. Such systems are unable to support
heterogeneous data structures with a single algorithm, so in order to deal with this limitation we
must first standardize on an input format such as JSON.
The gathering of publicly accessible information into a single place for processing is a straight-
forward problem that becomes much more complex when we consider the volume of data that
currently exists for any given situation, and how useful or useless that information may be in a
decision-making process. Triaging large volumes of data and disposing of all but the most obvi-
ously relevant is no longer acceptable in a world where indicators of an imminent attack, natural
disaster, or other important event may be inside an innocuous 140-character tweet. What's more,
decisions about how to handle these events can be compromised in real time, based on the relative
ease of information sharing that occurs between average citizen reporters and the world [27].
2.4.1 Data Analysis
Throughout this work we've had to deal with varying data sizes and formats, which has led to
numerous unexpected problems, all of which we have attempted to address. When we started,
we had no idea what it would entail to store and analyze data on this massive a scale.
Initially our work started in the realm of small data, defined as data that is typically collected
manually, amounts to hundreds of records, or, in this instance, was collected by a slow scraper.
Small data analysis typically consists of analyzing so-called flat-file text files in memory. There
are three real advantages to this approach:
1. The data can be stored and easily moved.
2. We can use utilities like grep, awk, sed, and wc (see the short example after this list).
3. The data can be maintained in RAM which means that analysis is relatively fast.
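As a short example of the second advantage, assuming one day of captured tweets sits in a plain text file named tweets.txt (a hypothetical file name), basic questions can be answered with one-liners:

# How many tweets were captured, and how many mention a given hashtag?
wc -l tweets.txt
grep -ci "#kony2012" tweets.txt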
An additional benefit is found when working with small data because this type lends itself well
to graphing. The equation put forth in [4] for graph edge memory calculation is as follows:

O(n) = (n + n²)/2 = n(n + 1)/2

This equation describes node interconnectedness within a graph: a fully interconnected graph con-
taining 2,000 nodes has roughly 2 million edges, which in modern systems requires approximately
500 MB of RAM. This requirement grows quadratically with the number of nodes, which makes
graphing millions of nodes in this manner unfeasible.
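A quick back-of-the-envelope check of these figures is sketched below; the per-edge memory cost is inferred from the 500 MB and 2-million-edge numbers quoted above rather than measured directly.

# Estimate per-edge RAM cost from the quoted figures, then scale it up.
n = 2000
edges = n * (n + 1) / 2                      # about 2.0 million edges
bytes_per_edge = 500.0 * 1024 ** 2 / edges   # roughly 260 bytes of RAM per edge
print(edges, bytes_per_edge)

n = 1000000                                  # the same estimate for a million nodes
edges = n * (n + 1) / 2
print(edges * bytes_per_edge / 1024.0 ** 4)  # on the order of a hundred terabytes of RAM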
As our own work has progressed we've realized that our data couldn't be kept in memory
when doing calculations. During some early attempts we wrote a chunker that would handle small
amounts of the data at a time, writing the result to disk and then continuing where it left off. The
problem is that certain computations, such as frequency analysis, don't lend themselves well to this
method. In essence we were trying to create a key-value store where the key is the word being
counted and the value is the number of times it occurred. This required either that we make all too
frequent disk access requests, which was slow and cumbersome, or that we keep every word in
every language that these social networks support in memory, along with the counts of occurrences, at all
times. Neither method works well. In retrospect this was our foray into the world of Medium Data.
Various sources state that this medium data world is where most computational data analysis
exists today. Medium data is characterized by the use of relational databases such as MySQL, the
use of website-specific APIs for data collection, and the storage of millions of records. In researching
existing social media network analysis techniques and systems we found that nearly all the work
discussed has been completed using these types of databases [5]. The issue that we quickly ran
into when following suit was that these relational databases don't scale well without expensive
custom hardware. After our initial data-set reached approximately 40 GB worth of Twitter records,
stored in a MySQL database, the database became unstable and corrupted regularly. We also noticed
that access times became unbearably slow, at which point we realized that this method would not
work at all for our needs.
Finally we move into the world termed big data, which is classified by the extreme volume
of the data being collected and processed. These data-sets contain billions of records, and typical
analysis is done through a combination of map/reduce and distributed computation techniques.
More often than not, the map/reduce implementation coincides with a NoSQL document record
store like MongoDB or CouchDB. At this point in our system design we have decided to use the
Disco distributed map/reduce framework in conjunction with the Disco Distributed File System
(DDFS) [11].
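As a sketch of how the word-frequency problem above maps onto Disco, the script below counts word occurrences across tweets stored under a DDFS tag. It follows the standard Disco map/reduce pattern; the tag name is illustrative, and the job assumes one JSON-encoded tweet per input line rather than reflecting our exact production code.

from disco.core import Job, result_iterator

def map(line, params):
    # Each input line is assumed to hold one tweet as a JSON string.
    import json
    try:
        tweet = json.loads(line)
    except ValueError:
        return
    for word in tweet.get("text", "").lower().split():
        yield word, 1

def reduce(iter, params):
    from disco.util import kvgroup
    # Group the sorted (word, count) pairs and sum the counts per word.
    for word, counts in kvgroup(sorted(iter)):
        yield word, sum(counts)

if __name__ == "__main__":
    job = Job().run(input=["tag://data:twitter:sample"],
                    map=map,
                    reduce=reduce)
    for word, count in result_iterator(job.wait(show=False)):
        print("%s %d" % (word, count))

Because the intermediate key-value pairs live on the worker nodes rather than in a single process, the frequency table never has to fit into one machine's RAM.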
2.5 Active Research Area
2.5.1 Tracking the Rise and Fall of KONY2012
The KONY2012 Meme:
On January 20th, 2012, a short film entitled KONY 2012 was released by Invisible Children, Inc.
on Vimeo [59]. The video was released to YouTube on March 5th, 2012, where it quickly
gained millions of views [60]. The purpose of the film was to promote Invisible Children's
charitable mission to stop Joseph Kony, an African militia leader and internationally indicted
war criminal. The campaign, as stated by the Invisible Children foundation, was to make his crimes
globally known to the masses in an effort to have him arrested by the end of 2012 [61, 66]. The
film went viral almost immediately after it was introduced [62, 63]. As of January 1st, 2013, the
KONY2012 video had just over 96 million views on YouTube and just over 18 million views on
Vimeo. The video spread so fast that by March 7th, just two days after the YouTube version was
released, the KONY2012 website crashed due to the number of people trying to access it.
While all of this is interesting, a few facts in particular make this an ideal event for a social
network case study:
- A poll showed that 58% of all Americans 18-29 years old said they had heard about the event
by April 2012 [64].
- 66% of all conversations on Twitter between March 5th and March 12th supported the capture
of Joseph Kony [64].
- KONY2012 was considered by TIME magazine to be one of the most viral videos of all
time [65].
- The campaign and video went viral primarily through their spread on Twitter; this was due to a
number of unforeseen factors and the support of many popular individuals, such as celebrities
Justin Bieber and Lady Gaga along with entrepreneur and philanthropist Bill Gates [67, 68].
KONY2012 - Background on Event Significance:
The KONY2012 film describes the crimes that Joseph Kony has committed in conjunction with
his militia, known as the LRA (Lord's Resistance Army). These crimes include, according to the film,
the forced recruitment of child soldiers and sex slaves. The film claims that 66,000 children
have been abducted by the LRA and at least two million people have been displaced from their
homes. For his actions Joseph Kony was indicted for war crimes and crimes against humanity by
the International Criminal Court in 2005, but he has evaded capture ever since [69, 70]. The film
encourages the end of forced youth military service and the restoration of a normal social order. In
an attempt to convince others to join the campaign, the film's director Jason Russell features clips of
his own son's reaction to information about Joseph Kony. These clips are what will be discussed
later as an attempt at gaining what is known as the stickiness factor. The film concludes with a message
from United States President Barack Obama, filmed in October 2011, authorizing the deployment
of 100 Special Forces military personnel to Central Africa to help capture Joseph Kony.
Jason Russell, Invisible Children, Inc. and the Stickiness Factor:
Malcolm Gladwell coined the term the Stickiness Factor in his work entitled The Tipping Point
[1] to describe how the marketers of a message make it stick and resonate within an individual. Glad-
well makes the point that a message infects through an agent of infection. He gives the example
of Sesame Street using television as the agent of a positive infection: literacy. What is it about a
message that makes it memorable? In some cases the answer is irritation, in others it is repetition,
or music, or any of a large number of other factors. Gladwell points out that the stickiness factor
isn't something everyone is capable of achieving right away; many times it takes various
revisions and testing of a message before it sticks. Continuing with his example of Sesame Street,
Gladwell points out that each and every episode was beta tested using focus groups of children to
see how well it held their attention.
Following this same logic, Jason Russell and Invisible Children, Inc. employed a number of
techniques to get their message across and make it stick enough to become viral. Invisible
Children, Inc. and Russell screened the KONY2012 video for focus groups and then for larger
audiences before releasing it online [71, 72]. In this case those audiences were primarily college-
age Americans. Many of the screenings took place in parks in progressive neighborhoods, at
community centers, and at universities. They also ran teaser trailers on major video
sites, just like any other full-length feature movie [73]. The final release video was changed in
subtle ways from what was shown during the screenings.
Russell starts the movie off with compelling messages about humanity and the power of the
internet, with facts like "There are more people on Facebook than there were in the world 200 years
ago." He then goes on to show the birth of his son Gavin, around whom he bases much of the 27-minute
movie. While the point of the movie is to get the viewer to want to stop Kony, the message
is presented through a combination of shocking messages given to both the audience and Gavin, as
the young boy learns about Joseph Kony for the first time. It is the combination of violent imagery,
the response of Russell's son, and the intermingled interesting facts that give the video its stickiness
factor. The message "stop Kony" seems simple enough, but it is reinforced by a child's voice
repeating it.
The Cover the Night Campaign and Slacktivism:
The KONY 2012 film ends with encouragement from director Jason Russell for individuals to
become activists by taking part in the Cover the Night event around the world. This event was to
take place on April 20th, 2012. During that evening, those who were committed to the cause of
stopping Joseph Kony were to blanket cities around the world with KONY2012-related posters,
stickers, drawings, paintings, and messages. The Invisible Children, Inc. organization offered
posters and stickers that could be ordered or downloaded for free, as well as instructions [74, 75].
The actual event had relatively little turnout. The Invisible Children, Inc. website had received
a number of pledged supporters for the event, but the actual turnout was only a fraction of that
number. There are no records of any groups larger than 17 people amassing anywhere in the world
for this event, even though the Invisible Children, Inc. organization sent out 16 teams of people
across the US to help inspire others in major cities [76].
Slacktivism is a term that was coined by Dwight Ozard and Fred Clark in 1995 during the Cor-
nerstone Festival. The term is a combination of the words slacker and activist and refers to those
individuals who want to help but do so through minimal personal effort [77]. The Cover the Night
campaign was ultimately unsuccessful because of its limited turnout, which was a direct result of
slacktivism. In some instances, cities such as Vancouver, Canada had pledges from individuals to at-
tend the event in excess of 20,000 [76]. However, these pledges were made through sites like Facebook.
When receiving a message on a site such as Facebook, you simply Like the message through a sin-
gle mouse click. Cover the Night coordinators at Invisible Children, Inc. assumed that those
Likes meant that individuals would actually show up. Slacktivists themselves are often referred to
as those individuals who Like activism.
KONY2012 Event Data Collection:
For this work we needed to put some limitations on the amount of data we would process
given our current hardware setup. We decided to focus on 31 days of Twitter data which had been
previously collected using the methods described in our past work [78]. This data ranged from
February 20th to March 20th of 2012. Our first complete 24-hour capture of Twitter data was on
February 20th, which happened to coincide with the release of the KONY2012 video on Vimeo.
At the time, our collection system was running at near full capacity, capturing an average of 50
million tweets a day. In the 31 days of collection we experienced two days of significantly lower
collection rates: the first was February 24th, 2012, and coincides with an intermittent failure
that Twitter.com experienced throughout the day; the second was March 13th, 2012, when Twitter
experienced intermittent connection overload due to an increased number of requests per hour.
Most statistics show that, on average, there were 55 million tweets made each day during 2012 [79].
Our own dataset averages out to a collection rate of 55,678,645.61 tweets per
day over the 31 days selected for this study.
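One way a first pass over this data might look, using only standard command-line tools against the compressed capture, is sketched below; the paths and the hashtag pattern are illustrative rather than the exact filter used in the study.

# Pull every captured tweet from one day that mentions the campaign.
zcat /mnt/raid/twitter_data/2012-03-05/*.json.gz \
    | grep -iE "kony2012|stopkony" > kony_2012-03-05.json
wc -l kony_2012-03-05.json   # rough count of matching tweets for that day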
2.6 Schedule for completion of this research
Table 2 shows our plan for completion of the research.
Timeline | Work | Progress
-- | Topic Selection | completed
-- | Research Area Refined | completed
-- | Data Collection System Designed | completed
-- | Data Collection System Build-out and Refinement | completed
-- | Research Dataset Gathering | completed
-- | Identification of Major Evaluation Approaches | completed
-- | Coalmine - Prototype Social Media/Network Analytic Tool | completed
-- | Coalmine Paper Publication | completed
-- | Phishing - Analysis/Detection tool creation | completed
-- | Phishing Analysis/Detection Paper Publication | completed
Jan. 2013 | Malware collection system design/build using social media feeds | almost complete
Jan. 2013 | Tool/Process for Tracking the Rise and Fall of Major Events | ongoing
Jan. 2013 | Tool/Process for Determining Connectors, Mavens and Salesman | ongoing
Apr. 2013 | Malware Collection System Paper Submission | Goal
Jun. 2013 | Tracking the Rise and Fall of Major Events Paper Submission | Goal
Jun. 2013 | Connectors, Mavens and Salesman Paper Submission | Goal
Feb. 2013 | Thesis First Full Draft | Goal
Mar. 2013 | Thesis Final Copy Submission | Goal
Apr. 2013 | Thesis Defense | Goal
Table 2: Plan for completion of my research
3 Citations
References
[1] Gladwell, M. (2000). The tipping point. Boston: Little, Brown and Company.
[2] Bourdeau, Michel, Auguste Comte, The Stanford Encyclopedia of Philosophy (Summer 2011 Edition), Edward
N. Zalta (ed.), URL = http://plato.stanford.edu/archives/sum2011/entries/comte/ .
[3] Donald Triner, Publicly Available Social Media Monitoring and Situational Awareness Initiative, Office of Operations Coordination and Planning: Department of Homeland Security, June 22 2010.
[4] Maksim Tsvetovat, Alexander Kouznetsov, Social Network Analysis for Startups, O'Reilly Publishing, 2011
[5] Derek Hansen, Ben Shneiderman, and Marc A. Smith. Analyzing Social Media Networks with NodeXL: Insights
from a Connected World. Morgan Kaufmann, 1 edition, September 2010.
[6] Nokia Corporation. Disco Distributed Filesystem - Disco v0.4.4 documentation, Dec 05, 2012.
[7] M. Zimmer. But the data is already public: on the ethics of research in Facebook, Ethics and Information
Technology, December 2010
[8] Department of Justice. Obtaining and using evidence from social networking sites. April 14, 2010.
[9] Allen, H. T. (1976). Communication networks - The hidden organizational chart. The Personnel Administrator,
21(6), 31-35.
[10] Travers J., Milgram S., An Experimental Study of the Small World Problem, Sociometry, Vol. 32, No. 4. (1969),
pp. 425-443, doi:10.2307/2786545
[11] Prashanth Mundkur, Ville Tuulos, and Jared Flatow. 2011. Disco: a computing platform for large-scale data
analytics. In Proceedings of the 10th ACM SIGPLAN workshop on Erlang (Erlang 11). ACM, New York, NY,
USA, 84-89. DOI=10.1145/2034654.2034670
[12] Twitter, Streaming API Concepts: Sampling, Twitter.com, Dec 2012, URL =
https://dev.twitter.com/docs/streaming-api/concepts#sampling
[13] Christina Warren, Measuring Social Media: Who Has Access to the Firehose?, Mashable.com, March 13 2011,
URL = http://mashable.com/2011/03/13/sxsw-smaroi/
[14] Mike Melanson, Twitter Kills the API Whitelist: What it Means for Developers and Innovation, February 11
2011, URL = http://www.readwriteweb.com/archives/
[15] Joab Jackson, Twitter Now Using Oauth authentication for Third Party Apps, Computer World UK, September 1,
2010, URL = http://www.computerworlduk.com/news/security/3237659/twitter-now-using-oauth- authentication-
for-third-party-apps/
[16] Arne Roomann-Kurrik, Announcing gzip Compression for Streaming APIs, Twitter Developers Feed, Jan 20
2012, URL = https://dev.twitter.com/blog/announcing-gzip-compression-streaming-apis
[17] @Rsaver, Twitter + Gnip Partnership, Twitter API Announcements on Google Groups, November 17, 2010,
URL = https://groups.google.com/forum/?fromgroups=#!topic/twitter-api-announce/4KIAawaY-IA
[18] Hooks M.; Miles, J., Onion Routing and Online Anonymity, Department of Computer Science, Duke Univer-
sity, pp.1-24, April 2006
[19] Data Big Bang, Running Your own Anonymous Rotating Proxies, December 22, 2011, URL =
http://blog.databigbang.com/2011/12/
[20] Willy Tarreau, HAProxy Configuration Manual, Version 1.5, December 28, 2012, URL =
http://haproxy.1wt.eu/download/1.5/doc/configuration.txt
[21] National Institute of Advanced Industrial Science and Technology, The Reference Manual of DeleGate, De-
cember 12, 2011, URL = http://www.delegate.org/delegate/Manual.htm
[22] Runeh, The Tweetstream Python Library, September 25, 2011, URL = https://bitbucket.org/runeh/tweetstream/
[23] Runeh, The Tweetstream Python Library - Open Issue #8 Add Proxy Authentication, May 3, 2011, URL =
https://bitbucket.org/runeh/tweetstream/issue/8/add-proxy-authentication
[24] Twitter, REST API v1 Resources December, 2011, URL = https://dev.twitter.com/docs/streaming-
api/concepts#sampling
[25] Jeffrey Dean and Sanjay Ghemawat. 2004. MapReduce: simplified data processing on large clusters. In Proceed-
ings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI04),
Vol. 6. USENIX Association, Berkeley, CA, USA, 10-10.
[26] Apache Foundation, Hadoop: Open source implementation of MapReduce, March, 2011, URL =
http://lucene.apache.org/hadoop/
[27] Andrews, L. (2012). I know who you are and I saw what you did: social networks and the death of privacy,
Simon and Schuster.
[28] Rune Halvorsen, Christopher Schierkolk., Tweetstream: simple Twitter Streaming API Access, Version 1.1.1,
2011, URL = http://pypi.python.org/pypi/tweetstream
[29] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler., The Hadoop File System, IEEE 978-1-
4244-7153-9 2010
[30] Douglas Main, How Much of Twitter Is Spam?, Popular Mechanics, August 4 2011, URL =
http://www.popularmechanics.com/technology/how-much-of-twitter-is-spam
[31] Chao Yang, Robert Harkreader, Guofei Gu. Die Free or Live Hard? Empirical Evaluation and New Design for
Fighting Evolving Twitter Spammers. Proceedings of the 14th International Symposium on Recent Advances in
Intrusion Detection (RAID 2011), Menlo Park, California, September 2011.
[32] Jonghyuk Song, Sangho Lee, Jong Kim. Spam Filtering in Twitter using Sender-Receiver Relationship, Proceed-
ings of the 14th International Symposium on Recent Advances in Intrusion Detection (RAID 2011), Menlo Park,
California, September 2011.
[33] Adam Marcus, Michael S. Bernstein, Osama Badar, David R. Karger, Samuel Madden, and Robert C.
Miller. 2012. Processing and visualizing the data in tweets. SIGMOD Rec. 40, 4 (January 2012), 21-
27.DOI=10.1145/2094114.2094120
[34] Yonik Seeley, Full-Text Search with Lucene, ApacheCon, May 2, 2007
[35] Munmun De Choudhury, Scott Counts, and Mary Czerwinski. 2011. Identifying relevant social media content:
leveraging information diversity and user cognition. In Proceedings of the 22nd ACM conference on Hypertext
and hypermedia (HT 11). ACM, New York, NY, USA, 161-170. DOI=10.1145/1995966.1995990
[36] B. Saha and A, Gairola, Botnet: An overview, CERT-In White Paper CIWP-2005-05, 2005.
[37] M. Rajab, J. Zarfoss, F. Monrose, and A. Terzis, A multifaceted approach to understanding the botnet phe-
nomenon, in Proc. 6th ACM SIGCOMM Conference on Internet Measurement (IMC06), 2006, pp. 41-52.
[38] A. Pras, A. Sperotto, et. al., Attacks by Anonymous WikiLeaks Proponents not Anonymous, Design and
Analysis of Communication Systems Group (DACS) CTIT Technical Report, 2010, pp. 1-10
[39] Zhaosheng Zhu, Guohan Lu, Yan Chen, Z.J. Fu, P. Roberts, and Keesook Han. Botnet Research Survey, pages
967-972, July 28-Aug. 1, 2008.
[40] M. Bailey, E. Cooke, et al. A Survey of Botnet Technology and Defenses, Cyber Application and Technology
Conference for Homeland Security, Conference Proceedings, IEEE, 2009
[41] G. Gu, P. Porras, V. Yegneswaran, M. Fong, W. Lee. BotHunter: Detecting malware infection through IDS-driven
dialog correlation. In Proceedings of the 16th USENIX Security Symposium (Security07), Boston, MA, August
2007.
[42] G. Gu, R. Perdisci, Junjie Zhang, and W. Lee. BotMiner: Clustering analysis of network traffic for protocol and
structure-independent botnet detection, Proceedings, 17th USENIX Security Symposium (Security08), San Jose,
CA, 2008.
[43] Gu, G., Zhang, J., & Lee, W. BotSniffer: Detecting Botnet Command and Control Channels in Network Traffic.
Proceedings of the 15th Annual Network and Distributed System Security Symposium.
[44] The Anti-phishing Working Group, http://www.Antiphishing.org
[45] Report a Phishing website, http://www.phishtank.com
[46] Tyler Moore, Richard Clayton, and Henry Stern. 2009. Temporal correlations between spam and phishing web-
sites. In Proceedings of the 2nd USENIX conference on Large-scale exploits and emergent threats: botnets, spy-
ware, worms, and more (LEET09). USENIX Association, Berkeley, CA, USA, 5-5.
[47] Maher Aburrous, M. A. Hossain, Keshav Dahal, Fadi Thabtah, Predicting Phishing Websites Using Classifica-
tion Mining Techniques with Experimental Case Studies, Information Technology: New Generations, Third
International Conference on, pp. 176-181, 2010 Seventh International Conference on Information Technology,
2010
[48] Twitter to MySQL Tool, http://140dev.com/free-twitter-api-source-code-library/twitter-database-server/
[49] A. P. E. Rosiello, E. Kirda, C. Kruegel, F. Ferr, P. D. Milano, and P. D. Milano, A layout-similarity-based
approach for detecting phishing pages, in SecureComm 07: Proceedings of the 3rd IEEE International Conference
on Security and Privacy in Communication Networks, 2007.
[50] M. Khonji, Y. Iraqi, A. Jones, and A. Jones, Lexical url analysis for discriminating phishing and legitimate
websites, in CEAS 11: Proceedings of the 8th Annual Collaboration, Electronic messaging, Anti-Abuse and
Spam Conference, 2011, pp. 109-115.
[51] J. Ma, L. K. Saul, S. Savage, G. M. Voelker, and G. M. Voelker, Identifying suspicious urls: an application of
large-scale online learning, in ICML 09: Proceedings of the 26th International Conference on Machine Learning,
2009, p. 86.
[52] A. Blum, B. Wardman, T. Solorio, G. Warner, and G. Warner, Lexical feature based phishing url detection using
online learning, in AISec 11: Proceedings of the 3rd ACM workshop on Artificial intelligence and security, 2010,
pp. 54-60.
[53] S. Garera, N. Provos, M. Chew, and A. D. Rubin, A framework for detection and measurement of phishing
attacks, in WORM 07: Proceedings of the 5th ACM Workshop on Recurring Malcode, 2007.
[54] Manos Antonakakis, Roberto Perdisci, David Dagon, Wenke Lee, and Nick Feamster. 2010. Building a dynamic
reputation system for DNS. In Proceedings of the 19th USENIX conference on Security (USENIX Security10).
USENIX Association, Berkeley, CA, USA, 18-18.
[55] C. Whittaker, B. Ryner, M. Nazif, and M. Nazif, Large-scale automatic classification of phishing pages, in NDSS
10: Proceedings of the 17th Annual Network and Distributed System Security Symposium, 2010.
[56] C. Ludl, S. McAllister, E. Kirda, C. Kruegel, and C. Kruegel, On the effectiveness of techniques to detect phish-
ing sites, in DIMVA 07: Proceedings of the 4th International Conference on Detection of Intrusions, Malware,
and Vulnerability Assessment, 2007, pp. 20-39.
[57] E. Medvet, E. Kirda, and C. Kruegel, Visual-similarity-based phishing detection, in SecureComm 08: Proceed-
ings of the 4th IEEE International Conference on Security and Privacy in Communication Networks, 2008.
[58] A. Y. Fu, W. Liu, and X. Deng, Detecting phishing web pages with visual similarity assessment based on earth
mover's distance (EMD), 2006, pp. 301-311.
[59] Know Your Meme, Kony 2012, http://www.knowyourmeme.com/memes/events/kony-2012
[60] BBC News, Uganda rebel Joseph Kony target of viral campaign video, March 7, 2012
http://www.bbc.co.uk/news/world-africa-17295078
[61] Myers, J., A call for justice, The Kentucky Kernel, March 7, 2012 http://kykernel.com/2012/03/07/a-call-for-
justice/
[62] Nelson, S. C., Kony 2012: Invisible Children Documentary Sheds Light on Uganda Conflict, Huffington Post,
March 7, 2012.
[63] Neylon, S., Kony fever hits York!, The Yorker, March 7, 2012.
[64] Kanczula, A., Kony 2012 in numbers: Whether you loved Kony 2012 or hated it, the viral video started a
phenomenon. But how does it compare?, The Guardian, April 20, 2012.
[65] Carbone, N. Top 10 Viral Videos, Time Magazine Entertainment, December 4, 2012
[66] Goldberg, E. Kony 2012: Invisible Children Campaign Pressures U.S. Government to Capture Joseph Kony
(Take Action), The Huffington Post, March 7, 2012.
[67] McGrath, J., Celebs Help Stop Kony Trend on Twitter: Who is Kony?, WetPaint Entertainment, March 7, 2012,
http://www.wetpaint.com/network/articles/celebs-help-stop-kony-tweet-on-twitter-who-is-kony
[68] Gates, Bill, @BillGates Status Message: #stopkony, March 8, 2012, URL =
https://twitter.com/BillGates/status/177883491076284418
[69] Associated Press, Joseph Kony, The New York Times, April 30, 2012,
[70] BBC News, Ugandan Army kills senior rebel, The BBC News, August 13, 2006,
http://news.bbc.co.uk/2/hi/africa/4788657.stm
[71] The Post-Journal, Jackson Center To Show KONY 2012, The Jamestown New York Post Jour-
nal, February 14, 2012, http://post-journal.com/page/content.detail/id/599038/Jackson-Center-To-Show-KONY-
2012.html?nav=5004
[72] Invisible Children at Michigan State University, Meeting Agenda and
Notes, January 17, 2012, URL = https://docs.google.com/folder/d/0Bw6w-
RaV9zmyZTFlMWYxNzMtOWZlNy00Mjc5LWI1YmEtZjcwY2NjN2Q2OTY1/
[73] Invisible Children, Kony 2012 Teaser, ASCIIMEO, December 15, 2012: http://www.asciimeo.com/33709014
[74] Lees, P. Australian Support Amasses for Kony 2012, NINEMSN, March 7, 2012,
http://nekonyws.ninemsn.com.au/article.aspx?id=8431494
[75] Harris, P., Kony 2012 organisers plan massive day of action across US cities, The Guardian, March 13, 2012,
http://www.guardian.co.uk/world/2012/mar/13/kony-2012-invisble-children-day-of-action
[76] Hager, M., Kony 2012 campaign fails to go offline in Vancouver, The Vancouver Sun, April 21, 2012,
http://www.globaltvbc.com/kony+2012+campaign+fails+to+go+ofine+in+vancouver/6442625789/story.html
[77] Center for Social Impact Communication, Dynamics of Cause Engagement, Georgetown University, November
22, 2011, http://www.slideshare.net/georgetowncsic/dynamics-of-cause-engagement-nal-report
[78] White, J. S., Matthews, J. N., Stacy, J. L., Coalmine: an experience in building a system for social media
analytics, SPIE Defense, Security, and Sensing. Proceedings of, April 2012
[79] Statistics Brain, Twitter Statistics, Verified by Twitter, The Huffington Post, eMarketer, September 5, 2012,
http://www.statisticbrain.com/twitter-statistics/