<h3>Dependencies</h3>
I used Python 3 for this project; if you do not have Python then I would recommend
installing it via the <a href="https://www.continuum.io/downloads"
target="_blank">Anaconda distribution</a>. Other dependencies are Tweepy 3.5.0 (a
library for accessing the <a href="https://dev.twitter.com/overview/api"
target="_blank">Twitter API</a>) and a personal <a href="https://apps.twitter.com/"
target="_blank">Twitter "data-mining" application</a>�(which is very easy to set
up). I used <a href="http://marcobonzanini.com/2015/03/02/mining-twitter-data-with-
python-part-1/#Register_Your_App" target="_blank">this guide</a> to register my
app. You will need to register your own in order to generate a consumer key,
consumer secret, access token, and access secret; these are required to
authenticate the script in order to access the Twitter API.
<h3>Running the script</h3>
You can download my Python tweet searching/saving script using Git Shell:
<blockquote>git clone https://github.com/agalea91/twitter_search</blockquote>
or directly from its <a href="https://github.com/agalea91/twitter_search" target="_blank">git repository</a>. Once downloaded, open <strong>twitter_search.py</strong> and enter your application's credentials in place of the ones below:
[code language="python"]
consumer_key = '189YcjF4IUzF156RGNGNucDD8'
consumer_secret = 'e4KPiY4pSh03HxjDg782HupUjmzdOOSDd98hd'
access_token = '2543812-cpaIuwndjvbdjaDDp5izzndhsD7figa9gb'
access_secret = '4hdyfnas7d988ddjf87sJdj3Dxn4d5CcNpwe'
[/code]
Before running the script, go to the <strong>main()</strong> function and edit the
search criteria. Namely, you should enter a search phrase, the maximum time limit
for the script to run, and the date range for the search (relative to today). For
example:
[code language="python"]
search_phrase = '#makedonalddrumpfagain'
time_limit = 1.0 # runtime limit in hours
min_days_old, max_days_old = 1, 2 # search limits
[/code]
To run the script, open the terminal/command line to the file location and type:
<blockquote>python twitter_search.py</blockquote>
The script will search for tweets and save them to a JSON file until they have all
been found or the time limit has been exceeded.
<h3>twitter_search.py functions</h3>
The main program is contained within the <strong>main()</strong> function, which is
called automatically when running the script from the command line. This part of
the code is not shown below; instead we only discuss the other functions. Before
we get started, I'll list the libraries used in the script:
[code language="python"]
import tweepy
from tweepy import OAuthHandler
import json
import datetime as dt
import time
import os
import sys
[/code]
The <strong>load_api()</strong> function handles the authorization step and returns an API object through which we can access Twitter:
[code language="python"]
def load_api():
    ''' Function that loads the twitter API after authorizing
        the user. '''
    consumer_key = '189YcjF4IUzF156RGNGNucDD8'
    consumer_secret = 'e4KPiY4pSh03HxjDg782HupUjmzdOOSDd98hd'
    access_token = '2543812-cpaIuwndjvbdjaDDp5izzndhsD7figa9gb'
    access_secret = '4hdyfnas7d988ddjf87sJdj3Dxn4d5CcNpwe'
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_secret)
    # load the twitter API via tweepy
    return tweepy.API(auth)
[/code]
Twitter limits the maximum number of tweets returned per search to 100. We use a
function [1] called <strong>tweet_search()</strong> that searches for up to
max_tweets=100 tweets:
[code language="python"]
def tweet_search(api, query, max_tweets, max_id, since_id, geocode):
    ''' Function that takes in a search string 'query', the maximum
        number of tweets 'max_tweets', and the minimum (i.e., starting)
        tweet id. It returns a list of tweepy.models.Status objects. '''

    searched_tweets = []
    while len(searched_tweets) < max_tweets:
        remaining_tweets = max_tweets - len(searched_tweets)
        try:
            new_tweets = api.search(q=query, count=remaining_tweets,
                                    since_id=str(since_id),
                                    max_id=str(max_id-1))
                                    # geocode=geocode)
            print('found', len(new_tweets), 'tweets')
            if not new_tweets:
                print('no tweets found')
                break
            searched_tweets.extend(new_tweets)
            max_id = new_tweets[-1].id
        except tweepy.TweepError:
            print('exception raised, waiting 15 minutes')
            print('(until:', dt.datetime.now()+dt.timedelta(minutes=15), ')')
            time.sleep(15*60)
            break  # stop the loop
    return searched_tweets, max_id
[/code]
This function wraps api.search() in a loop because a single call can return fewer
than the requested 100 tweets, so it keeps calling until max_tweets have been
collected (or no new tweets are found). In the main program we loop over this
function until the rate-limit exception is raised, at which point the script sleeps
for 15 minutes before continuing.
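To make that flow concrete, here is a minimal sketch of how the main program could loop over <strong>tweet_search()</strong>. This is not the actual <strong>main()</strong> from the script; the variables json_file, max_id, and since_id are assumed to have been defined already.
[code language="python"]
# illustrative sketch only, not the actual main() from twitter_search.py
api = load_api()
end_time = dt.datetime.now() + dt.timedelta(hours=time_limit)
while dt.datetime.now() < end_time:
    # json_file, max_id, and since_id are assumed to be defined already
    tweets, max_id = tweet_search(api, search_phrase, max_tweets=100,
                                  max_id=max_id, since_id=since_id,
                                  geocode=None)
    if not tweets:
        break  # no more tweets to find in this date range
    write_tweets(tweets, json_file)
[/code]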
The search can be limited to a specific radial area around longitude & latitude
coordinates by uncommenting the geocode line and defining the parameter
appropriately. For example, nearly all states in America are included in the geocode
'39.8,-95.583068847656,2500km'. The issue here is that the vast majority of tweets
are not geocoded and will therefore be excluded.
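For reference, enabling this filter amounts to uncommenting the geocode argument in the api.search() call inside <strong>tweet_search()</strong>; a sketch of the modified call, using the example geocode above, would look like:
[code language="python"]
# e.g., roughly the continental USA, as 'latitude,longitude,radius'
geocode = '39.8,-95.583068847656,2500km'
new_tweets = api.search(q=query, count=remaining_tweets,
                        since_id=str(since_id),
                        max_id=str(max_id-1),
                        geocode=geocode)
[/code]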
The api.search() function can start from a given tweet ID or date and will always
search back in time. If we are appending the tweet data to an already existing JSON
file, the "starting" tweet ID is defined based on the last tweet appended to the
file (this is done in the main program). Otherwise we run the function
<strong>get_tweet_id()</strong> to find the ID of a tweet that was posted at the
end of a given day and this is used as the starting point for the search.
[code language="python"]
def get_tweet_id(api, date='', days_ago=9, query='a'):
    ''' Function that gets the ID of a tweet. This ID can
        then be used as a 'starting point' from which to
        search. The query is required and has been set to
        a commonly used word by default. The variable
        'days_ago' has been initialized to the maximum amount
        we are able to search back in time (9).'''
    # body sketched from the docstring; the original may differ in detail
    if not date:
        td = dt.datetime.now() - dt.timedelta(days=days_ago)
        date = '{0}-{1:0>2}-{2:0>2}'.format(td.year, td.month, td.day)
    # return the ID of the most recent matching tweet posted up to 'date'
    tweet = api.search(q=query, count=1, until=date)
    return tweet[0].id
[/code]
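As a rough usage example (the value 2 below is arbitrary, just to illustrate starting the search from roughly two days ago):
[code language="python"]
# hypothetical usage: take a tweet from ~2 days ago as the upper search bound
max_id = get_tweet_id(api, days_ago=2)
[/code]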
After each call of tweet_search() in the main program, we append the new tweets to
a file in JSON format:
[code language="python"]
def write_tweets(tweets, filename):
    ''' Function that appends tweets to a file. '''
    with open(filename, 'a') as f:                    # sketched body
        for tweet in tweets:
            f.write(json.dumps(tweet._json) + '\n')   # one JSON object per line
[/code]
The resulting JSON file can easily (although not necessarily quickly) be read and
converted to a Pandas dataframe for analysis.
<h3>Reading in JSON files to a dataframe</h3>
The twitter_search.py file is only used for collecting tweets; I use IPython
notebooks for the analysis. First we'll need to read the JSON file(s):
[code language="python"]
import json
import pandas as pd
import matplotlib.pyplot as plt
[/code]
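Using the imports above, a minimal sketch of loading the line-delimited JSON written by <strong>write_tweets()</strong> might look like this (the filename is just a placeholder for your own file(s)):
[code language="python"]
# placeholder filename(s); substitute the file(s) produced by twitter_search.py
tweet_files = ['my_search_tweets.json']

tweets = []
for fname in tweet_files:
    with open(fname, 'r') as f:
        for line in f:
            tweets.append(json.loads(line))
[/code]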
We now have a list named "tweets", where each element is a dictionary of data for
one tweet. This can be accessed to create a dataframe with the required information.
We'll include the location information of the user who published the tweet as well
as the coordinates (if available) and the tweet text.
[code language="python"]
def populate_tweet_df(tweets):
    # sketched body: extract the text, user location, and coordinates
    df = pd.DataFrame()
    df['text'] = [t['text'] for t in tweets]
    df['location'] = [t['user']['location'] for t in tweets]
    df['long'] = [t['coordinates']['coordinates'][0] if t['coordinates'] else None for t in tweets]
    df['latt'] = [t['coordinates']['coordinates'][1] if t['coordinates'] else None for t in tweets]
    return df
[/code]
<h3>Data analysis: plotting tweet coordinates</h3>
We can now, for example, plot the locations from which the tweets were sent using
the Basemap library (which must be manually installed [2]).
[code language="python"]
from mpl_toolkits.basemap import Basemap
# sketched plot: a world map with one marker per geocoded tweet
geo = df.dropna(subset=['long', 'latt'])
m = Basemap(projection='mill')
m.drawcoastlines()
x, y = m(geo['long'].tolist(), geo['latt'].tolist())
m.plot(x, y, 'ro', markersize=3)
plt.show()
[/code]
In the next post we'll look at a politically inspired analysis of tweets posted
with the hashtag #MakeDonaldDrumpfAgain. The phrase was trending a couple weeks ago
in reaction to <a href="https://www.youtube.com/watch?
v=DnpO_RTSNmQ&feature=youtu.be&t=1204" target="_blank">an episode of HBO's
"Last Week Tonight" with John Oliver</a>. The phrase�represents a negative
sentiment towards Donald Trump - a�Republican candidate�for the upcoming American
election. I've collected every #MakeDonaldDrumpfAgain tweet since the video was
posted and was able to produce, using the plotting script above, this illustration
of tweet locations:
From the 550,000+ tweets I collected, only ~400 of them had longitude and latitude
coordinates and these locations are all plotted above. As can be seen, most
geocoded tweets about this topic have come from the eastern USA.
Thanks for reading! If you would like to discuss any of the plots or have any
questions or corrections, please write a comment. You are also welcome to email me
at [email protected] or tweet me @agalea91.