C1 W2 Assignment
Welcome to week two of this specialization. You will learn about Naive Bayes. Concretely, you will use Naive Bayes for sentiment analysis on tweets: given a tweet, you will decide if it has a positive sentiment or a negative one. Specifically, you will:
You may already be familiar with Naive Bayes and its justification in terms of conditional
probabilities and independence.
• In this week's lectures and assignments we used the ratio of probabilities between
positive and negative sentiment.
• This approach gives us simpler formulas for these 2-way classification tasks.
Before submitting your assignment, please make sure that:
1. You have not added any extra print statement(s) in the assignment.
2. You have not added any extra code cell(s) in the assignment.
3. You have not changed any of the function parameters.
4. You are not using any global variables inside your graded exercises. Unless specifically instructed to do so, please refrain from using them and use local variables instead.
5. You are not changing the assignment code where it is not required, for example by creating extra variables.
If you do any of the above, you will get something like a "Grader not found" (or similarly unexpected) error when submitting your assignment. Before asking for help or debugging the errors in your assignment, check for these first. If this is the case and you don't remember the changes you made, you can get a fresh copy of the assignment by following these instructions.
Run the cell below to import some packages. You may want to browse the documentation of unfamiliar libraries and functions.

import nltk
nltk.download('twitter_samples')
nltk.download('stopwords')
True
If you are running this notebook on your local computer, don't forget to download the twitter samples and stopwords from nltk.

nltk.download('stopwords')
nltk.download('twitter_samples')

from os import getcwd  # getcwd is used to build the local data path below
filePath = f"{getcwd()}/../tmp2/"
nltk.data.path.append(filePath)
from nltk.corpus import twitter_samples

# get the sets of positive and negative tweets from the nltk twitter_samples corpus
all_positive_tweets = twitter_samples.strings('positive_tweets.json')
all_negative_tweets = twitter_samples.strings('negative_tweets.json')

# split the data into two pieces, one for training and one for testing (validation set)
test_pos = all_positive_tweets[4000:]
train_pos = all_positive_tweets[:4000]
test_neg = all_negative_tweets[4000:]
train_neg = all_negative_tweets[:4000]
• Remove noise: You will first want to remove noise from your data -- that is, remove
words that don't tell you much about the content. These include all common words like
'I, you, are, is, etc...' that would not give us enough information on the sentiment.
• We'll also remove stock market tickers, retweet symbols, hyperlinks, and hashtags, because they do not tell you much about the sentiment.
• You also want to remove all the punctuation from a tweet. The reason for doing this is
because we want to treat words with or without the punctuation as the same word,
instead of treating "happy", "happy?", "happy!", "happy," and "happy." as different
words.
• Finally you want to use stemming to only keep track of one variation of each word. In
other words, we'll treat "motivation", "motivated", and "motivate" similarly by grouping
them within the same stem of "motiv-".
We have given you the function process_tweet that does this for you.
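For reference, here is a minimal sketch of what a cleaner along the lines of process_tweet might look like. Treat it as an illustration only: the actual process_tweet provided in utils may differ in details, and the regular expressions, tokenizer settings, and stemmer below are assumptions.

import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import TweetTokenizer

def clean_tweet(tweet):
    # illustrative cleaner: strip noise, tokenize, drop stopwords/punctuation, stem
    tweet = re.sub(r'\$\w*', '', tweet)                 # stock market tickers like $GE
    tweet = re.sub(r'^RT[\s]+', '', tweet)              # old-style retweet marker "RT"
    tweet = re.sub(r'https?://[^\s\n\r]+', '', tweet)   # hyperlinks
    tweet = re.sub(r'#', '', tweet)                     # the hash sign from hashtags
    tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
    stemmer = PorterStemmer()
    stop = set(stopwords.words('english'))
    return [stemmer.stem(tok) for tok in tokenizer.tokenize(tweet)
            if tok not in stop and tok not in string.punctuation]

With a cleaner like this, "happy", "happy!" and "happy." would all map to the same stem "happi", which is what lets the counting step below group them under one key.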
You will also implement a lookup helper function that takes in the freqs dictionary, a word, and
a label (1 or 0) and returns the number of times that word and label tuple appears in the
collection of tweets.
For example: given a list of tweets ["i am rather excited", "you are rather happy"] and the label 1, the function will return a dictionary that contains the following key-value pairs:

{('rather', 1): 2, ('excit', 1): 1, ('happi', 1): 1}
• Notice how for each word in the given string, the same label 1 is assigned to each word.
• Notice how the words "i" and "am" are not saved, since they are removed by process_tweet because they are stopwords.
• Notice how the word "rather" appears twice in the list of tweets, and so its count value is
2.
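A lookup helper matching the description above can be very small. Here is a hedged sketch (the version provided in utils may differ; the .get fallback to 0 for unseen pairs is an assumption):

def lookup(freqs, word, label):
    # number of times the (word, label) pair was seen in the training tweets; 0 if never seen
    return freqs.get((word, label), 0)

For instance, with the example dictionary above, lookup(freqs, 'rather', 1) would return 2.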
Instructions
Create a function count_tweets that takes a list of tweets as input, cleans all of them, and
returns a dictionary.
• The key in the dictionary is a tuple containing the stemmed word and its class label, e.g.
("happi",1).
• The value is the number of times this word appears in the given collection of tweets (an integer).
# UNQ_C1 GRADED FUNCTION: count_tweets
return result
result = {}
tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am tired']
ys = [1, 0, 0, 0, 0]
count_tweets(result, tweets, ys)
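If you want a mental model of what the graded function is supposed to do, here is one possible sketch, assuming process_tweet behaves as described earlier. It is not necessarily the reference solution:

def count_tweets(result, tweets, ys):
    for y, tweet in zip(ys, tweets):
        for word in process_tweet(tweet):
            pair = (word, y)                        # key: (stemmed word, class label)
            result[pair] = result.get(pair, 0) + 1  # value: how many times it was seen
    return result

On the small example above, a sketch like this should yield something like {('happi', 1): 1, ('trick', 0): 1, ('sad', 0): 1, ('tire', 0): 2}, although the exact stems depend on the stemmer used by process_tweet.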
The prior is the ratio of the probabilities $\frac{P(D_{pos})}{P(D_{neg})}$. We can take the log of the prior to rescale it, and we'll call this the logprior.

$$\text{logprior} = \log \left( \frac{P(D_{pos})}{P(D_{neg})} \right) = \log \left( \frac{D_{pos}}{D_{neg}} \right)$$
Note that $\log \left( \frac{A}{B} \right)$ is the same as $\log(A) - \log(B)$. So the logprior can also be calculated as the difference between two logs:

$$\text{logprior} = \log \left( P(D_{pos}) \right) - \log \left( P(D_{neg}) \right) = \log(D_{pos}) - \log(D_{neg})$$
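A quick numeric check of that identity (the 4000/4000 split mirrors the training data used later; the variable names here are only for illustration):

import numpy as np

D_pos, D_neg = 4000, 4000                   # number of positive / negative training tweets
logprior = np.log(D_pos / D_neg)            # log of the ratio
difference = np.log(D_pos) - np.log(D_neg)  # difference of the two logs
print(np.isclose(logprior, difference))     # True -- and both are 0.0 for a balanced split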
• $freq_{pos}$ and $freq_{neg}$ are the frequencies of that specific word in the positive or negative class. In other words, the positive frequency of a word is the number of times the word is counted with the label of 1.
• $N_{pos}$ and $N_{neg}$ are the total number of positive and negative words for all documents (for all tweets), respectively.
• $V$ is the number of unique words in the entire set of documents, for all classes, whether positive or negative.
We'll use these to compute the positive and negative probability for a specific word using this
formula:
$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}$$

$$P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}$$
Notice that we add the "+1" in the numerator for additive smoothing. This wiki article explains
more about additive smoothing.
Log likelihood
To compute the loglikelihood of that very same word, we can implement the following
equations:
$$\text{loglikelihood} = \log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$$
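To make these two formulas concrete, here is a tiny worked example with made-up counts (the numbers are purely illustrative):

import numpy as np

freq_pos, freq_neg = 3, 1   # hypothetical counts of one word in each class
N_pos, N_neg = 100, 80      # hypothetical total number of positive / negative words
V = 50                      # hypothetical vocabulary size

p_w_pos = (freq_pos + 1) / (N_pos + V)     # Laplace-smoothed P(W_pos)
p_w_neg = (freq_neg + 1) / (N_neg + V)     # Laplace-smoothed P(W_neg)
loglikelihood = np.log(p_w_pos / p_w_neg)  # positive value -> the word leans positive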
Create freqs dictionary
• Given your count_tweets function, you can compute a dictionary called freqs that
contains all the frequencies.
• In this freqs dictionary, the key is the tuple (word, label)
• The value is the number of times it has appeared.
Instructions
Given a freqs dictionary, train_x (a list of tweets) and a train_y (a list of labels for each tweet), implement a Naive Bayes classifier.
Calculate V
• You can then compute the number of unique words that appear in the freqs dictionary
to get your V (you can use the set function).
Calculate $freq_{pos}$ and $freq_{neg}$
• Using your freqs dictionary, you can compute the positive and negative frequency of each word, $freq_{pos}$ and $freq_{neg}$.
Calculate $N_{pos}$ and $N_{neg}$
• Using the freqs dictionary, you can also compute the total number of positive words and the total number of negative words, $N_{pos}$ and $N_{neg}$.
Calculate $D$, $D_{pos}$, $D_{neg}$
• Using the train_y input list of labels, calculate the number of documents (tweets) $D$, as well as the number of positive documents (tweets) $D_{pos}$ and the number of negative documents (tweets) $D_{neg}$.
• Calculate the probability that a document (tweet) is positive, $P(D_{pos})$, and the probability that a document (tweet) is negative, $P(D_{neg})$.
Calculate $P(W_{pos})$ and $P(W_{neg})$ using the formulas from above:

$$P(W_{pos}) = \frac{freq_{pos} + 1}{N_{pos} + V}$$

$$P(W_{neg}) = \frac{freq_{neg} + 1}{N_{neg} + V}$$
Note: We'll use a dictionary to store the log likelihoods for each word. The key is the word; the value is the log likelihood of that word.
• You can then compute the loglikelihood of each word: $\log \left( \frac{P(W_{pos})}{P(W_{neg})} \right)$.
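Putting the steps above together, here is a hedged sketch of how the training function could be organized. The signature train_naive_bayes(freqs, train_x, train_y) and the use of freqs.get instead of the lookup helper are assumptions; the graded version in the notebook may differ in details.

import numpy as np

def train_naive_bayes(freqs, train_x, train_y):
    loglikelihood = {}

    # V: number of unique words in the vocabulary
    vocab = set(word for word, label in freqs.keys())
    V = len(vocab)

    # N_pos, N_neg: total counts of positive and negative words
    N_pos = sum(count for (word, label), count in freqs.items() if label == 1)
    N_neg = sum(count for (word, label), count in freqs.items() if label == 0)

    # D, D_pos, D_neg: number of documents (tweets) overall and per class
    D = len(train_y)
    D_pos = sum(1 for y in train_y if y == 1)
    D_neg = D - D_pos

    # logprior: log ratio of positive to negative documents
    logprior = np.log(D_pos) - np.log(D_neg)

    # per-word loglikelihood with Laplace (+1) smoothing
    for word in vocab:
        freq_pos = freqs.get((word, 1), 0)
        freq_neg = freqs.get((word, 0), 0)
        p_w_pos = (freq_pos + 1) / (N_pos + V)
        p_w_neg = (freq_neg + 1) / (N_neg + V)
        loglikelihood[word] = np.log(p_w_pos / p_w_neg)

    return logprior, loglikelihood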
# Calculate logprior
logprior = np.log(D_pos/D_neg)
0.0
9165
Expected Output:
0.0
9165
Note
Note that we calculate the prior from the training data, and that the training data is evenly split between positive and negative labels (4000 positive and 4000 negative tweets). This means that the ratio of positive to negative tweets is 1, and the logprior is 0.
The value of 0.0 means that when we add the logprior to the log likelihood, we're just adding
zero to the log likelihood. However, please remember to include the logprior, because whenever
the data is not perfectly balanced, the logprior will be a non-zero value.
'''
### START CODE HERE ###
# process the tweet to get a list of words
word_l = process_tweet(tweet)
return p
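The body of the prediction function is left for you to complete in the notebook. As a hedged sketch only, one way to implement a predictor with the signature used below (tweet, logprior, loglikelihood) is to add the logprior to the sum of the loglikelihoods of the words in the processed tweet:

def naive_bayes_predict(tweet, logprior, loglikelihood):
    # process the tweet to get a list of cleaned, stemmed words
    word_l = process_tweet(tweet)

    # start from the logprior, then add the loglikelihood of every known word
    p = logprior
    for word in word_l:
        p += loglikelihood.get(word, 0)  # words not seen in training contribute nothing
    return p

A score greater than 0 means the model leans toward a positive sentiment. Notice in the expected outputs below how each extra "great" adds roughly the same amount (about 2.13) to the score, which is exactly what a sum of per-word loglikelihoods would do.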
Expected Output:
Implement test_naive_bayes
Instructions:
return accuracy
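The instructions for this cell are abridged here. As a rough sketch (the exact signature test_naive_bayes(test_x, test_y, logprior, loglikelihood) is an assumption), accuracy can be computed as the fraction of test tweets whose predicted label matches the true label:

import numpy as np

def test_naive_bayes(test_x, test_y, logprior, loglikelihood):
    # predict 1 (positive) when the score is greater than 0, otherwise 0 (negative)
    y_hats = [1 if naive_bayes_predict(tweet, logprior, loglikelihood) > 0 else 0
              for tweet in test_x]
    # accuracy = proportion of predictions that match the true labels
    accuracy = np.mean(np.asarray(y_hats) == np.asarray(test_y))
    return accuracy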
Expected Accuracy:
Naive Bayes accuracy = 0.9955
Expected Output:
• I am happy -> 2.14
• I am bad -> -1.31
• this movie should have been great. -> 2.12
• great -> 2.13
• great great -> 4.26
• great great great -> 6.39
• great great great great -> 8.52
# Feel free to check the sentiment of your own tweet below
my_tweet = 'you are bad :('
naive_bayes_predict(my_tweet, logprior, loglikelihood)
-8.837351738825648
Implement get_ratio
• Given the freqs dictionary of words and a particular word, use
lookup(freqs,word,1) to get the positive count of the word.
• Similarly, use the lookup function to get the negative count of that word.
• Calculate the ratio of positive divided by negative counts
$$\text{ratio} = \frac{\text{pos\_words} + 1}{\text{neg\_words} + 1}$$

Where pos_words and neg_words correspond to the frequency of the words in their respective classes.

Words    Positive word count    Negative word count
glad     41                     2
arriv    57                     4
:(       1                      3663
:-(      0                      378
get_ratio(freqs, 'happi')
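Here is a hedged sketch of get_ratio, using the lookup helper described earlier and the dictionary shape shown in the example further below (the 'positive', 'negative', and 'ratio' keys come from that example):

def get_ratio(freqs, word):
    pos_neg_ratio = {'positive': 0, 'negative': 0, 'ratio': 0.0}
    pos_neg_ratio['positive'] = lookup(freqs, word, 1)  # count of the word in positive tweets
    pos_neg_ratio['negative'] = lookup(freqs, word, 0)  # count of the word in negative tweets
    # smoothed ratio of positive to negative counts
    pos_neg_ratio['ratio'] = (pos_neg_ratio['positive'] + 1) / (pos_neg_ratio['negative'] + 1)
    return pos_neg_ratio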
Implement get_words_by_threshold(freqs,label,threshold)
• If we set the label to 1, then we'll look for all words whose positive-to-negative ratio is at least as high as the given threshold.
• If we set the label to 0, then we'll look for all words whose positive-to-negative ratio is at most as low as the given threshold.
• Use the get_ratio function to get a dictionary containing the positive count, negative
count, and the ratio of positive to negative counts.
• Store each get_ratio dictionary inside another dictionary, where the key is the word, and the value is the dictionary pos_neg_ratio that is returned by the get_ratio function. An example key-value pair would have this structure:
{'happi':
{'positive': 10, 'negative': 20, 'ratio': 0.524}
}
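A hedged sketch of get_words_by_threshold following those instructions (the graded version may filter slightly differently):

def get_words_by_threshold(freqs, label, threshold):
    word_list = {}
    for word, _ in freqs.keys():
        pos_neg_ratio = get_ratio(freqs, word)
        # keep very positive words (label 1) or very negative words (label 0)
        if label == 1 and pos_neg_ratio['ratio'] >= threshold:
            word_list[word] = pos_neg_ratio
        elif label == 0 and pos_neg_ratio['ratio'] <= threshold:
            word_list[word] = pos_neg_ratio
    return word_list

For example, a call like get_words_by_threshold(freqs, label=0, threshold=0.05) would pick out strongly negative tokens such as ':(' from the table above.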
Notice the difference between the positive and negative ratios. Emojis like :( and words like 'me' tend to have a negative connotation. Other words, like glad, community, and arrives, tend to be found in the positive tweets.