Twitter Python Assignment
Twitter represents a fundamentally new instrument for making social measurements. Millions of people
voluntarily express opinions on virtually any topic imaginable --- this data source is incredibly valuable for
both research and business.
For example, researchers have shown that the "mood" of communication on Twitter reflects
biological rhythms and can even be used to predict the stock market. A student here at UW used
geocoded tweets to plot a map of locations where "thunder" was mentioned in the context of
a storm system in Summer 2012.
Researchers from Northeastern University and Harvard University studying the characteristics and
dynamics of Twitter have an excellent resource for learning more about how Twitter can be used to
analyze moods at national scale.
In this assignment, you will
access the Twitter Application Programming Interface (API) using Python
estimate the public's perception (the sentiment) of a particular term or phrase
analyze the relationship between location and mood based on a sample of twitter data
Some points to keep in mind:
This assignment is open-ended in several ways. You'll need to make some decisions about how
best to solve the problem and implement them carefully.
It is perfectly acceptable to discuss your solution on the forum, but don't share code.
Each student must submit their own solution to the problem.
You will have an unlimited number of tries for each submission.
Your code will be run in a protected environment, so you should only use the Python standard
libraries unless you are specifically instructed otherwise. Your code should also not rely on any
external libraries or web services.
7/1/2014
You will need to establish a Python programming environment to complete this assignment. You can
install Python yourself by downloading it from the Python website, or you can use the class virtual machine.
Unicode strings
Strings in the Twitter data prefixed with the letter "u" are Unicode strings. For example:
u"This is a string"
Unicode is a standard for representing a much larger variety of characters beyond the Roman alphabet
(Greek, Russian, mathematical symbols, logograms from non-phonetic writing systems such as kanji,
etc.).
In most circumstances, you will be able to use a unicode object just like a string.
If you encounter an error involving printing Unicode, you can use the encode method to properly print
the international characters, like this:
unicode_string = u"caf\u00e9"
encoded_string = unicode_string.encode('utf-8')
print(encoded_string)
Getting Started
Once again: If you are new to Python, many students have recommended Google's Python class.
5. On the next page, click the "API Keys" tab along the top, then scroll all the way down until you see
the section "Your Access Token"
https://class.coursera.org/datasci-002/assignment/view?assignment_id=3
6. Click the button "Create My Access Token". You can read more about OAuth authorization on Twitter's developer site.
7. You will now copy four values into twitterstream.py. These values are your "API Key", your "API
secret", your "Access token" and your "Access token secret". All four should now be visible on the
API Keys page. (You may see "API Key" referred to as "Consumer key" in some places in the code
or on the web; they are synonyms.) Open twitterstream.py and set the variables corresponding to
the api key, api secret, access token, and access secret. You will see code like the following:
api_key = "<Enter api key>"
api_secret = "<Enter api secret>"
access_token_key = "<Enter your access token key here>"
access_token_secret = "<Enter your access token secret here>"
8. Run the following and make sure you see data flowing and that no errors occur.
$ python twitterstream.py > output.txt
This command redirects the output to a file. Let the program run for at least 3 minutes so data can
accumulate, then stop it with Ctrl-C. Keep the file output.txt for the duration of the assignment; we will be
reusing it in later problems. Don't use someone else's file; we will check for uniqueness in other
parts of the assignment.
9. If you wish, modify the file to use the twitter search API to search for specific terms. For example, to
search for the term "microsoft", you can pass the following url to the twitterreq function:
https://api.twitter.com/1.1/search/tweets.json?q=microsoft
What to turn in: The first 20 lines of the twitter data you downloaded from the web. You
should save the first 20 lines to a file problem_1_submission.txt by using the following
command:
$ head -n 20 output.txt > problem_1_submission.txt
The file AFINN-111.txt contains a list of pre-computed sentiment scores. Each line in the file contains a
word or phrase followed by a sentiment score. Each word or phrase that is found in a tweet but not
found in AFINN-111.txt should be given a sentiment score of 0. See the file AFINN-README.txt for
more information.
To use the data in the AFINN-111.txt file, you may find it useful to build a dictionary. Note that the
AFINN-111.txt file format is tab-delimited, meaning that the term and the score are separated by a tab
character. A tab character is written "\t". The following snippet may be useful:
afinnfile = open("AFINN-111.txt")
scores = {}  # initialize an empty dictionary
for line in afinnfile:
    term, score = line.split("\t")  # The file is tab-delimited; "\t" means "tab character"
    scores[term] = int(score)  # Convert the score to an integer.
print(scores.items())  # Print every (term, score) pair in the dictionary
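Once the scores dictionary is built, a tweet's sentiment can be computed by summing the scores of its terms. The sketch below is one possible approach, not a required one: splitting on whitespace is a deliberate simplification, and the tweet_sentiment helper name is ours, not part of the assignment skeleton.

```python
def tweet_sentiment(text, scores):
    """Sum the sentiment scores of every term in the tweet.

    Terms not found in the scores dictionary contribute 0, per the
    assignment. Whitespace splitting is a simplification; you may
    want to lowercase, strip punctuation, or handle phrases.
    """
    return sum(scores.get(term, 0) for term in text.split())

# A tiny hand-made scores dictionary for illustration:
scores = {"good": 3, "bad": -3}
print(tweet_sentiment("a good day", scores))     # 3
print(tweet_sentiment("good good bad", scores))  # 3
```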
The data in the tweet file you generated in Problem 1 is represented as JSON, which stands for
JavaScript Object Notation. It is a simple format for representing nested structures of data --- lists of
lists of dictionaries of lists of .... you get the idea.
Each line of output.txt represents a streaming message. Most, but not all, will be tweets. (The
skeleton program will tell you how many lines are in the file.)
It is straightforward to convert a JSON string into a Python data structure; there is a library to do so
called json.
To use this library, add the following to the top of tweet_sentiment.py
import json
Then, to parse the data in output.txt, you want to apply the function json.loads to every line in
the file.
This function will parse the JSON data and return a Python data structure; in this case, it returns a
dictionary. If needed, take a moment to read the documentation for Python dictionaries.
You can read the Twitter documentation to understand what information each tweet contains and how
to access it, but it's not too difficult to deduce the structure by direct inspection.
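A minimal sketch of that parsing loop, assuming (by direct inspection of the stream) that messages lacking a "text" field are not tweets and can be skipped:

```python
import json

def tweet_texts(lines):
    """Yield the text of each tweet, skipping non-tweet messages."""
    for line in lines:
        data = json.loads(line)
        if "text" in data:  # messages without "text" (e.g. deletions) are not tweets
            yield data["text"]

sample = [
    '{"text": "hello world", "lang": "en"}',
    '{"delete": {"status": {"id": 1}}}',  # a non-tweet streaming message
]
print(list(tweet_texts(sample)))  # ['hello world']
```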
Your script should print to stdout the sentiment of each tweet in the file, one numeric sentiment score
per line. The first score should correspond to the first tweet, the second score to the second tweet, and
so on. If you sort the scores, they won't match up. If you sort the tweets, they won't match up. If you put
the tweets into a dictionary, the order will not be preserved. Once again: the nth line of the file you
submit should contain only a single number that represents the score of the nth tweet.
Hint: This is real-world data, and it can be messy! Refer to the twitter documentation to understand
more about the data structure you are working with. Don't get discouraged, and ask for help on the
forums if you get stuck!
tweet_sentiment.py
Your script should print output to stdout. Each line of output should contain a term, followed by a
space, followed by the sentiment. That is, each line should be in the format <term:string>
<sentiment:float>
For example, if you have the pair ("foo", 103.256) in Python, it should appear in the output as:
foo 103.256
term_sentiment.py
How we will grade Part 3: We will run your script on a file that contains strongly positive and strongly
negative tweets and verify that the non-sentiment-carrying terms in the strongly positive tweets are
assigned a higher score than the non-sentiment-carrying terms in negative tweets. Your scores need
not (and likely will not) exactly match any specific solution.
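One scheme consistent with that grading description (a sketch only; the assignment leaves the exact method up to you, and derive_term_scores is our own name) is to assign each non-AFINN term the average sentiment of the tweets in which it appears:

```python
from collections import defaultdict

def derive_term_scores(tweets, afinn):
    """Give each term not in the AFINN dictionary the average
    sentiment of the tweets that contain it.  `tweets` is a list
    of token lists; `afinn` maps known terms to integer scores."""
    totals = defaultdict(float)  # sum of tweet sentiments per term
    counts = defaultdict(int)    # number of tweets containing the term
    for tokens in tweets:
        sentiment = sum(afinn.get(t, 0) for t in tokens)
        for term in set(tokens):
            if term not in afinn:
                totals[term] += sentiment
                counts[term] += 1
    return {term: totals[term] / counts[term] for term in totals}

afinn = {"great": 3, "awful": -3}
tweets = [["great", "coffee"], ["awful", "coffee"], ["great", "tea"]]
print(derive_term_scores(tweets, afinn))  # {'coffee': 0.0, 'tea': 3.0}
```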
If the grader returns "Formatting error:", make note of the line of text returned in the message.
This line corresponds to a line of your output. The grader will generate this error if line.split()
does not return exactly two items. One common source of this error is forgetting to remove the two calls
to the "lines" function in the solution template; this function prints the number of lines in each file. Make
sure to check the first two lines of your output!
Your script will be run from the command line like this:
$ python frequency.py <tweet_file>
You should assume the tweet file contains data formatted the same way as the livestream data.
Your script should print output to stdout. Each line of output should contain a term, followed by a
space, followed by the frequency of that term in the entire file. There should be one line per unique
term in the entire file. Even if 25 tweets contain the word lol, the term lol should appear only once
in your output (and its frequency will be at least 25!). Each line should be in the format <term:string>
<frequency:float>
For example, if you have the pair (bar, 0.1245) in Python it should appear in the output as:
bar 0.1245
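A sketch of one way to compute these frequencies; as before, whitespace tokenization and the helper name term_frequencies are our own simplifying choices:

```python
from collections import Counter

def term_frequencies(tweet_texts):
    """Map each term to (occurrences of the term) / (total terms)."""
    counts = Counter()
    for text in tweet_texts:
        counts.update(text.split())
    total = sum(counts.values())
    return {term: count / float(total) for term, count in counts.items()}

freqs = term_frequencies(["lol that was fun", "lol indeed"])
for term, freq in sorted(freqs.items()):
    print("%s %f" % (term, freq))
# fun 0.166667
# indeed 0.166667
# lol 0.333333
# that 0.166667
# was 0.166667
```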
If you wish, you may consider a term to be a multi-word phrase, but this is not required.
frequency.py
Your script should print the happiest state, as determined by the average sentiment of its tweets, to stdout.
Note that you may need a lot of tweets in order to get enough tweets with location data. Let the live
stream run for a while if you wish.
Your script will not have access to the Internet, so you cannot rely on third party services to
resolve geocoded locations!
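One workable offline approach (a sketch; the place field is from the Twitter documentation, but which fields to trust, and filling out the full fifty-state table, are decisions left to you) is to match the tweet's place.full_name against a hardcoded table of state names and abbreviations:

```python
# A hardcoded lookup table avoids any network geocoding service.
# Only three states are shown; a real solution needs all fifty.
STATES = {"California": "CA", "New York": "NY", "Washington": "WA"}

def tweet_state(tweet):
    """Return a two-letter state code for the tweet, or None.

    Relies on the streaming 'place' field; real data is messy, so
    you may also want to inspect user.location or coordinates.
    """
    place = tweet.get("place") or {}
    full_name = place.get("full_name", "")
    for name, abbrev in STATES.items():
        if name in full_name or full_name.endswith(", " + abbrev):
            return abbrev
    return None

print(tweet_state({"place": {"full_name": "Seattle, WA"}}))  # WA
print(tweet_state({"place": None}))                          # None
```

From there, accumulating each state's tweet sentiments and averaging them yields the happiest state.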
happiest_state.py
You should assume the tweet file contains data formatted the same way as the livestream data.
In the tweet file, each line is a Tweet object, as described in the twitter documentation. To find the
hashtags, you should not parse the text field; the hashtags have already been extracted by twitter.
Your script should print to stdout each hashtag-count pair, one per line. Each line of output should
contain a hashtag, followed by a space, followed by the frequency of that hashtag in the entire file.
There should be one line per unique hashtag in the entire file. Each line should be in the format
<hashtag:string> <frequency:float>
For example, if you have the pair (bar, 30) in Python it should appear in the output as:
bar 30
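A sketch of the hashtag count using the pre-extracted entities.hashtags field; the helper name and the use of collections.Counter are our own choices:

```python
import json
from collections import Counter

def top_hashtags(lines, n=10):
    """Count hashtags via entities.hashtags and return the n most
    frequent as (hashtag, count) pairs."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        if "entities" in tweet:  # non-tweet messages carry no entities
            for tag in tweet["entities"].get("hashtags", []):
                counts[tag["text"]] += 1
    return counts.most_common(n)

sample = [
    '{"entities": {"hashtags": [{"text": "python"}, {"text": "data"}]}}',
    '{"entities": {"hashtags": [{"text": "python"}]}}',
    '{"delete": {"status": {"id": 1}}}',
]
for tag, count in top_hashtags(sample, 2):
    print("%s %d" % (tag, count))  # python 2, then data 1
```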
top_ten.py