
TM 2

Uploaded by

justextra0820
Copyright © All Rights Reserved
Mining Twitter

Why is Twitter all the Rage ?

 As humans, what are some things that we want that
technology might help us to get?
 • We want to be heard.
 • We want to satisfy our curiosity.
 • We want it easy.
 • We want it now.
Why is Twitter all the Rage ?

 We have a deeply rooted need to share our ideas and
experiences, which gives us the ability to connect with
other people, to be heard, and to feel a sense of worth
and importance.
 We are curious about the world around us and how to
organize and manipulate it, and we use
communication to share our observations, ask
questions, and engage with other people in
meaningful dialogues.
Why is Twitter all the Rage ?

 Ideally, we don’t want to have to work any harder
than is absolutely necessary to satisfy our curiosity or
get any particular job done; we’d rather be doing
“something else” or moving on to the next thing
because our time on this planet is so precious and
short.
 Along similar lines, we want things now and tend to
be impatient when actual progress doesn’t happen at
the speed of our own thought.
Why is Twitter all the Rage ?
 How would you define Twitter?
 One way to describe Twitter is as a microblogging service
that allows people to communicate with short, 140-
character messages that roughly correspond to thoughts
or ideas.
 Beyond that, there are the macro-level possibilities for marketing and advertising.
 While the communication bus that enables users to share
short quips at the speed of thought may be a necessary
condition for viral adoption and sustained engagement on
the Twitter platform, it’s not a sufficient condition. The
extra ingredient that makes it sufficient is that Twitter’s
asymmetric following model satisfies our curiosity.
Why is Twitter all the Rage ?

 In other words, whereas some social websites like
Facebook and LinkedIn require the mutual
acceptance of a connection between users (which
usually implies a real-world connection of some kind),
Twitter’s relationship model allows you to keep up
with the latest happenings of any other user, even
though that other user may not choose to follow you
back or even know that you exist. Twitter’s following
model is simple but exploits a fundamental aspect of
what makes us human: our curiosity.
Why is Twitter all the Rage ?
 Think of an interest graph as a way of modeling
connections between people and their arbitrary
interests. Interest graphs provide a profound number
of possibilities in the data mining realm that primarily
involve measuring correlations between things for
the objective of making intelligent recommendations
and other applications in machine learning.
 When you realize that Twitter enables you to create,
connect, and explore a community of interest for an
arbitrary topic of interest, the power of Twitter and
the insights you can gain from mining its data become
much more obvious.
Why is Twitter all the Rage ?

 Accounts carry few distinguishing marks beyond the badges
that identify celebrities and public figures as “verified
accounts” and the basic restrictions in Twitter’s Terms of
Service agreement, which users must accept to use the service.
Exploring Twitter’s API

 Fundamental Twitter Terminology
 Creating a Twitter API Connection
 Exploring Trending Topics
 Searching for Tweets
Fundamental Twitter Terminology

 Tweets are the essence of Twitter, and while they are
notionally thought of as the 140 characters of text content
associated with a user’s status update, there’s really quite
a bit more metadata there than meets the eye.
 In addition to the textual content of a tweet itself, tweets
come bundled with two additional pieces of metadata that
are of particular note: entities and places.
 Tweet entities are essentially the user mentions, hashtags,
URLs, and media that may be associated with a tweet, and
places are locations in the real world that may be attached
to a tweet.
Fundamental Twitter Terminology

 Consider a sample tweet with the following text:
 @ptwobrussell is writing @SocialWebMining, 2nd Ed. from his home
office in Franklin, TN. Be #social: http://on.fb.me/16WJAf9
 The tweet is 124 characters long.
 It contains four tweet entities: the user mentions @ptwobrussell and
@SocialWebMining, the hashtag #social, and the URL
http://on.fb.me/16WJAf9.
 Although there is a place called Franklin, Tennessee that’s explicitly
mentioned in the tweet, the places metadata associated with the
tweet might include the location in which the tweet was authored,
which may or may not be Franklin, Tennessee.
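The character count and the four entities of the sample tweet can be checked with a short sketch. The regular expressions below are rough, illustrative assumptions and not Twitter's own entity-extraction rules; in practice the API returns the entities already parsed.

```python
import re

# Sample tweet text from the slide above.
tweet = ("@ptwobrussell is writing @SocialWebMining, 2nd Ed. from his home "
         "office in Franklin, TN. Be #social: http://on.fb.me/16WJAf9")

# Rough patterns for illustration only (assumption: names and tags are
# plain word characters, URLs run until the next whitespace).
user_mentions = re.findall(r'@\w+', tweet)
hashtags = re.findall(r'#\w+', tweet)
urls = re.findall(r'https?://\S+', tweet)

print(len(tweet))
print(user_mentions, hashtags, urls)
```

Places are not recoverable from the text this way: they are separate metadata attached to the tweet.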
Fundamental Twitter Terminology

 Timelines are the chronologically sorted collections of
tweets.
 The home timeline is the view that you see when you
log into your account and look at all of the tweets
from users that you are following.
 A particular user timeline is a collection of tweets only
from a certain user.
Fundamental Twitter Terminology

 Streams are samples of public tweets flowing through
Twitter in real time.
 The public firehose of all tweets has been known to
peak at hundreds of thousands of tweets per minute
during events with particularly wide interest, such as
presidential debates.
Creating a Twitter API Connection

 Twitter has taken great care to craft an elegantly
simple RESTful API.
 There are great libraries available to further mitigate
the work involved in making API requests.
 A particularly beautiful Python package that wraps
the Twitter API and mimics the public API semantics
almost one-to-one is twitter.
 Like most other Python packages, you can install it
with pip by typing pip install twitter in a terminal.
 Before you can make any API requests to Twitter, you’ll need to
create an application at https://dev.twitter.com/apps.
 Why not just plug in your username and password to access the API?
 While that approach might work fine for you, a third party such as a
friend or colleague probably wouldn’t feel comfortable handing over
a username and password.
 Fortunately, some smart people recognized this problem years ago,
and now there’s a standardized protocol called OAuth (short for
Open Authorization) that works for these kinds of situations.
 For simplicity of development, the key pieces of information that
you’ll need to take away from your newly created application’s
settings are its consumer key, consumer secret, access token, and
access token secret.
 In tandem, these four credentials provide everything that an
application would ultimately obtain through the OAuth flow in which
the user grants it authorization.
Authorizing an application to access Twitter account data

import twitter

# Fill in the four credentials from your application's settings
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
OAUTH_TOKEN = ''
OAUTH_TOKEN_SECRET = ''

auth = twitter.oauth.OAuth(OAUTH_TOKEN, OAUTH_TOKEN_SECRET,
                           CONSUMER_KEY, CONSUMER_SECRET)

twitter_api = twitter.Twitter(auth=auth)
Exploring Trending Topics
 A word, phrase or topic that is mentioned at a greater rate than others is said to
be a "trending topic".
 Trending topics become popular either through a concerted effort by users, or
because of an event that prompts people to talk about a specific topic.
 With an authorized API connection in place, you can now issue a request to get a
list of trending topics.
 The example demonstrates how to ask Twitter for the topics that are currently
trending worldwide.
 The API can easily be parameterized to constrain the topics to more specific
locales.
 Locales are identified by Where On Earth (WOE) IDs, a system that aims to map
a unique identifier to any named place on Earth (or theoretically, even in a
virtual world). For example, the USA is 23424977.
 Twitter imposes rate limits on how many requests an application can make to
any given API resource within a given time window.
 Twitter’s rate limits are well documented, and each individual API resource also
states its particular limits for your convenience.
 For example, the API request that we just issued for trends limits applications
to 15 requests per 15-minute window.
Example: Retrieving trends
WORLD_WOE_ID = 1
US_WOE_ID = 23424977

# The leading underscore makes the twitter package send the ID as a
# query string parameter instead of appending it to the URL
world_trends = twitter_api.trends.place(_id=WORLD_WOE_ID)
us_trends = twitter_api.trends.place(_id=US_WOE_ID)

print(world_trends)
print('')
print(us_trends)

Output:
[{u'created_at': u'2013-03-27T11:50:40Z', u'trends':
[{u'url':
u'http://twitter.com/search?q=%23MentionSomeoneImportantForYou'...
Example: Computing the intersection of two sets of trends
world_trends_set = set([trend['name']
                        for trend in world_trends[0]['trends']])
us_trends_set = set([trend['name']
                     for trend in us_trends[0]['trends']])

# Trends that appear both worldwide and in the US
common_trends = world_trends_set.intersection(us_trends_set)
print(common_trends)
Searching for Tweets

 Twitter’s Search API returns results in batches, and
we can configure the number of results per batch up to a
maximum value using the count keyword parameter.
 The count is generally set to 200. More than 200 results
(or the maximum value that you specify for count) may be
available for any given query.
 In Twitter’s API, a cursor is used to navigate to
the next batch of results.
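Batch navigation can be sketched as follows. This is a minimal sketch, not a definitive implementation: it assumes that `twitter_api` is the connection created earlier, that a search is issued with `twitter_api.search.tweets`, and that each response carries its cursor as a query string under `search_metadata` / `next_results`. The helper name `collect_statuses` and its `max_batches` parameter are hypothetical.

```python
from urllib.parse import parse_qsl


def collect_statuses(twitter_api, q, count, max_batches=5):
    """Gather statuses across up to max_batches batches of search results."""
    results = twitter_api.search.tweets(q=q, count=count)
    statuses = list(results['statuses'])

    for _ in range(max_batches - 1):
        # Assumed cursor format: a query string such as
        # '?max_id=...&q=...&count=...', absent on the last batch.
        next_results = results.get('search_metadata', {}).get('next_results')
        if not next_results:
            break
        # Turn the query string back into keyword arguments.
        kwargs = dict(parse_qsl(next_results.lstrip('?')))
        results = twitter_api.search.tweets(**kwargs)
        statuses += results['statuses']

    return statuses
```

A call such as `collect_statuses(twitter_api, '#MentionSomeoneImportantForYou', 200)` would then return one flat list of statuses. Capping the number of batches keeps the number of requests well inside the rate limits.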
1.4 Analyzing the 140 Characters

 Tweets Analysis
 Extracting Tweets
 Text Cleaning
 Frequent Words and Word Cloud
 Word Associations
 Topic Modelling
 Sentiment Analysis
 The human-readable text of a tweet is available
through t['text'], for example: RT @hassanmusician:
#MentionSomeoneImportantForYou God.
 The entities in the text of a tweet are conveniently
processed for you and available through t['entities'].
 Clues as to the “interestingness” of a tweet are
available through t['favorite_count'] and
t['retweet_count'], which return the number of times
it’s been bookmarked or retweeted, respectively.
 If a tweet has been retweeted, the
t['retweeted_status'] field provides significant detail
about the original tweet itself and its author.
 The t['retweeted'] field denotes whether or not the
authenticated user (via an authorized application) has
retweeted this particular tweet.
 t['retweet_count'] reflects the total number of times
that the original tweet has been retweeted and
should reflect the same value in both the original
tweet and all subsequent retweets.
1.4.1. Extracting Tweet Entities
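A minimal sketch of pulling the text, screen names, hashtags, and words out of a list of statuses. The `statuses` list below is hand-made, hypothetical data that mimics the `text` and `entities` fields described above; real statuses would come from a search.

```python
# Hand-made statuses standing in for search results (hypothetical data).
statuses = [
    {'text': 'RT @alice: Be #social today',
     'entities': {'user_mentions': [{'screen_name': 'alice'}],
                  'hashtags': [{'text': 'social'}]}},
    {'text': 'Mining #Twitter with @bob and @alice',
     'entities': {'user_mentions': [{'screen_name': 'bob'},
                                    {'screen_name': 'alice'}],
                  'hashtags': [{'text': 'Twitter'}]}},
]

status_texts = [status['text'] for status in statuses]

screen_names = [mention['screen_name']
                for status in statuses
                for mention in status['entities']['user_mentions']]

hashtags = [hashtag['text']
            for status in statuses
            for hashtag in status['entities']['hashtags']]

# Naive whitespace tokenization of all tweet text.
words = [word for text in status_texts for word in text.split()]

print(screen_names)
print(hashtags)
print(len(words))
```

The resulting `words`, `screen_names`, and `hashtags` lists are the inputs used by the frequency analysis that follows.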
1.4.2. Analyzing Tweets and Tweet
Entities with Frequency Analysis
 Now take a closer look at what’s in the data by
computing a frequency distribution and looking at the
top 10 items in each list.
 As of Python 2.7, the collections module provides a
Counter class that makes computing a frequency
distribution straightforward.
 The next example demonstrates how to use a Counter to
compute frequency distributions as ranked lists of
terms.
Example: Creating a basic frequency
distribution from the words in tweets
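A minimal sketch of such a frequency distribution, using a short hypothetical word list in place of the words extracted from real tweets.

```python
from collections import Counter

# Hypothetical tokens; in the walkthrough these come from the tweets' text.
words = ['RT', 'God', 'RT', 'my', 'mom', 'RT', 'God', 'love']

counts = Counter(words)

# most_common(n) returns the n highest (item, count) pairs, highest first.
top_two = counts.most_common(2)
print(top_two)
```

The same pattern applies unchanged to lists of screen names or hashtags.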
Example: Using prettytable to display
tuples in a nice tabular format
from collections import Counter
from prettytable import PrettyTable

# words, screen_names, and hashtags are the lists extracted from the tweets
for label, data in (('Word', words),
                    ('Screen Name', screen_names),
                    ('Hashtag', hashtags)):
    pt = PrettyTable(field_names=[label, 'Count'])
    c = Counter(data)
    # Add the 10 most frequent items as rows
    for kv in c.most_common()[:10]:
        pt.add_row(kv)
    pt.align[label], pt.align['Count'] = 'l', 'r'  # Set column alignment
    print(pt)
1.4.3. Computing lexical diversity of tweets
Lexical Diversity
 What is it?
A slightly more advanced measurement that builds on
simple frequencies and can be applied to unstructured
text is a metric called lexical diversity.
 Mathematics?
The number of unique tokens in the text divided by
the total number of tokens in the text.
 Lexical diversity can be worth considering as a
primitive statistic for answering a number of
questions. How?
It indicates how broad or narrow the subject matter is
that an individual or group discusses.
 Breaking down the analysis to specific time periods
could yield additional insight.
 Comparing different groups or individuals can also be
informative.
 For example, compare the lexical diversity of
Coca-Cola and Pepsi.
Example: Calculating lexical diversity
for tweets
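A minimal sketch of the calculation, using two hypothetical status texts rather than real tweets. The helper names `lexical_diversity` and `average_words` are chosen here for illustration.

```python
def lexical_diversity(tokens):
    """Number of unique tokens divided by the total number of tokens."""
    return 1.0 * len(set(tokens)) / len(tokens)


def average_words(status_texts):
    """Mean number of whitespace-separated words per status text."""
    total_words = sum(len(text.split()) for text in status_texts)
    return 1.0 * total_words / len(status_texts)


# Hypothetical data standing in for the text of collected tweets.
status_texts = ['be social', 'be curious and be heard']
words = [word for text in status_texts for word in text.split()]

print(lexical_diversity(words))
print(average_words(status_texts))
```

The same `lexical_diversity` function can be applied to the lists of screen names and hashtags to compare the three measures.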
O/P:
Understanding the Example:
 Obs 1: 0.67: About two in three words are unique.
 Obs 2: 0.97: About 29 out of 30 screen names
mentioned are unique.
 Obs 3: 0.068: The lexical diversity of hashtags is very low.
 Obs 4: The average number of words per tweet is
very low at a value of just under 6, which makes sense
given the nature of the hashtag, which is designed to
solicit short responses consisting of just a few words.
1.4.4. Examining Patterns in Retweets
 Twitter’s retweet API populates status values such as
retweet_count and retweeted_status.
 A good exercise at this point would be to further
analyze the data to determine if there was a
particular tweet that was highly retweeted or if there
were just lots of “one-off” retweets.
 The approach we’ll take to find the most popular
retweets is to simply iterate over each status update
and, if it is a retweet, store its retweet count, the
originator of the retweet, and the text of the retweet.
Example: Finding the most popular
retweets
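A minimal sketch of that iteration. The `statuses` list is hypothetical, hand-made data; it assumes, as described earlier, that only retweets carry a `retweeted_status` field.

```python
# Hypothetical statuses; only retweets carry a 'retweeted_status' field.
statuses = [
    {'text': 'RT @alice: hello', 'retweet_count': 12,
     'retweeted_status': {'user': {'screen_name': 'alice'}}},
    {'text': 'an original tweet', 'retweet_count': 0},
    {'text': 'RT @bob: big news', 'retweet_count': 40,
     'retweeted_status': {'user': {'screen_name': 'bob'}}},
]

# Store (retweet count, originator, text) for every status that is a retweet.
retweets = [
    (status['retweet_count'],
     status['retweeted_status']['user']['screen_name'],
     status['text'])
    for status in statuses
    if 'retweeted_status' in status
]

# Sort by retweet count, highest first, and keep the top five.
top_retweets = sorted(retweets, reverse=True)[:5]
print(top_retweets)
```

Because the same original tweet can be retweeted many times in a result set, converting `retweets` to a set before sorting would remove duplicate tuples.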
Output:
Example: Looking up users who have
retweeted a status
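A hedged sketch of such a lookup. It assumes that `twitter_api` is the authorized connection created earlier and that the twitter package maps `twitter_api.statuses.retweets(id=...)` onto Twitter's statuses/retweets resource, which returns recent retweets of the given status. The helper name `retweeter_screen_names` is hypothetical.

```python
def retweeter_screen_names(twitter_api, status_id):
    """Return the screen names of users who retweeted the given status."""
    # Assumption: each returned retweet embeds its author under 'user'.
    retweets = twitter_api.statuses.retweets(id=status_id)
    return [retweet['user']['screen_name'] for retweet in retweets]
```

Calling `retweeter_screen_names(twitter_api, status_id)` with the ID of a popular status from the previous example would list who retweeted it, subject to the rate limits discussed earlier.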
1.4.5. Visualizing Frequency Data with
Histograms

 A nice feature of IPython Notebook is its ability to
generate and insert high-quality and customizable
plots of data as part of an interactive workflow.
 In particular, the matplotlib package and other
scientific computing tools that are available for
IPython Notebook are quite powerful and capable of
generating complex figures with very little effort.
A plot displaying the sorted
frequencies for the words computed
by Example 1-8
Example: Plotting frequencies of
words
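A minimal sketch of such a plot, assuming matplotlib is installed. The word list is hypothetical, the helper names are chosen for illustration, and the log-log axes are an assumption about how the sorted frequencies are best displayed.

```python
from collections import Counter


def sorted_frequencies(words):
    """Return word counts sorted from most to least frequent."""
    return sorted(Counter(words).values(), reverse=True)


def plot_frequencies(frequencies):
    """Plot frequency against rank on log-log axes."""
    import matplotlib.pyplot as plt  # imported lazily; requires matplotlib

    ranks = range(1, len(frequencies) + 1)  # ranks start at 1 for log axes
    plt.loglog(ranks, frequencies)
    plt.ylabel('Freq')
    plt.xlabel('Word Rank')
    plt.show()


# Hypothetical tokens standing in for the words extracted from tweets.
words = ['RT', 'God', 'RT', 'my', 'mom', 'RT', 'God', 'love']
frequencies = sorted_frequencies(words)
print(frequencies)
```

Calling `plot_frequencies(frequencies)` in a notebook renders the figure inline when inline plotting is enabled.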
