
How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK)

By Shaumik Daityari
Published on September 26, 2019

The author selected the Open Internet/Free Speech fund to receive a donation as part of the
Write for DOnations program.

Introduction
A large amount of the data generated today is unstructured and requires processing to
generate insights. Some examples of unstructured data are news articles, posts on social
media, and search history. The process of analyzing natural language and making sense
of it falls under the field of Natural Language Processing (NLP). Sentiment analysis is a
common NLP task, which involves classifying texts or parts of texts into a pre-defined
sentiment. You will use the Natural Language Toolkit (NLTK), a commonly used NLP library in
Python, to analyze textual data.

In this tutorial, you will prepare a dataset of sample tweets from the NLTK package for NLP
with different data cleaning methods. Once the dataset is ready for processing, you will train
a model on pre-classified tweets and use the model to classify the sample tweets into
negative and positive sentiments.

This article assumes that you are familiar with the basics of Python (see our How To Code in
Python 3 series), primarily the use of data structures, classes, and methods. The tutorial
assumes that you have no background in NLP and nltk, although some familiarity with them
is an added advantage.
Prerequisites

This tutorial is based on Python version 3.6.5. If you don't have Python 3 installed, here's a
guide to install and set up a local programming environment for Python 3.

Familiarity with working with language data is recommended. If you're new to using NLTK,
check out the How To Work with Language Data in Python 3 using the Natural Language
Toolkit (NLTK) guide.

Step 1 — Installing NLTK and Downloading the Data


You will use the NLTK package in Python for all NLP tasks in this tutorial. In this step you will
install NLTK and download the sample tweets that you will use to train and test your model.

First, install the NLTK package with the pip package manager:

$ pip install nltk==3.3

This tutorial will use sample tweets that are part of the NLTK package. First, start a Python
interactive session by running the following command:

$ python3

Then, import the nltk module in the Python interpreter:

>>> import nltk

Download the sample tweets from the NLTK package:

>>> nltk.download('twitter_samples')

Running this command from the Python interpreter downloads and stores the tweets locally.
Once the samples are downloaded, they are available for your use.

You will use the negative and positive tweets to train your model on sentiment analysis later
in the tutorial. The tweets with no sentiments will be used to test your model.

If you would like to use your own dataset, you can gather tweets from a specific time period,
user, or hashtag by using the Twitter API.

Now that you’ve imported NLTK and downloaded the sample tweets, exit the interactive
session by entering in exit() . You are ready to import the tweets and begin processing the
data.

Step 2 — Tokenizing the Data


Language in its original form cannot be accurately processed by a machine, so you need to
process the language to make it easier for the machine to understand. The first part of
making sense of the data is through a process called tokenization, or splitting strings into
smaller parts called tokens.

A token is a sequence of characters in text that serves as a unit. Based on how you create
the tokens, they may consist of words, emoticons, hashtags, links, or even individual
characters. A basic way of breaking language into tokens is by splitting the text based on
whitespace and punctuation.
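
For instance, Python's built-in str.split() performs the most naive form of whitespace tokenization. A minimal sketch with a made-up tweet shows its limits:

# Naive whitespace tokenization with the standard library, for comparison.
# The tweet text here is an invented example.
text = "NLTK makes NLP easy, fun :) #python"
print(text.split())
# ['NLTK', 'makes', 'NLP', 'easy,', 'fun', ':)', '#python']

Notice that the comma stays attached to easy, which is the kind of detail that NLTK's dedicated tokenizers handle for you.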

To get started, create a new .py file to hold your script. This tutorial will use nlp_test.py :

$ nano nlp_test.py

In this file, you will first import the twitter_samples so you can work with that data:

nlp_test.py

from nltk.corpus import twitter_samples

This will import three datasets from NLTK that contain various tweets to train and test the
model:

negative_tweets.json: 5,000 tweets with negative sentiments

positive_tweets.json: 5,000 tweets with positive sentiments

tweets.20150430-223406.json: 20,000 tweets with no sentiments

Next, create variables for positive_tweets , negative_tweets , and text :

nlp_test.py

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')

The strings() method of twitter_samples will return all of the tweets within a dataset as
strings. Setting the different tweet collections as variables will make processing and testing
easier.
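
As a quick sanity check, you can print the size of each collection, either in an interactive session or with a temporary line in the script; the counts match the dataset descriptions above:

# Temporary check: the labeled datasets hold 5,000 tweets each, the unlabeled one 20,000.
print(len(positive_tweets), len(negative_tweets), len(text))
# 5000 5000 20000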

Before using a tokenizer in NLTK, you need to download an additional resource, punkt . The
punkt module is a pre-trained model that helps you tokenize words and sentences. For
instance, this model knows that a name may contain a period (like “S. Daityari”) and the
presence of this period in a sentence does not necessarily end it. First, start a Python
interactive session:

$ python3

Run the following commands in the session to download the punkt resource:

>>> import nltk


>>> nltk.download('punkt')

Once the download is complete, you are ready to use NLTK’s tokenizers. NLTK provides a
default tokenizer for tweets with the .tokenized() method. Add a line to create an object
that tokenizes the positive_tweets.json dataset:

nlp_test.py

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

If you’d like to test the script to see the .tokenized method in action, add the highlighted
S C R O L L TO TO P
content to your nlp_test.py script. This will tokenize a single tweet from the

https://www.digitalocean.com/community/tutorials/how-to-perform-sentiment-analysis-in-python-3-using-the-natural-language-toolkit-nltk 4/29
4/5/2021 How To Perform Sentiment Analysis in Python 3 Using the Natural Language Toolkit (NLTK) | DigitalOcean

positive_tweets.json dataset:

nlp_test.py

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

print(tweet_tokens[0])

Save and close the file, and run the script:

$ python3 nlp_test.py

The process of tokenization takes some time because it's not a simple split on whitespace.
After a few moments of processing, you'll see the following:

Output
['#FollowFriday',
'@France_Inte',
'@PKuchly57',
'@Milipol_Paris',
'for',
'being',
'top',
'engaged',
'members',
'in',
'my',
'community',
'this',
'week',
':)']

Here, the .tokenized() method returns special characters such as @ and _ . These
characters will be removed through regular expressions later in this tutorial.

Now that you’ve seen how the .tokenized() method works, make sure to comment out or
remove the last line to print the tokenized tweet from the script by adding a # to the start of
the line:

nlp_test.py

from nltk.corpus import twitter_samples

positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
text = twitter_samples.strings('tweets.20150430-223406.json')
tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

#print(tweet_tokens[0])

Your script is now configured to tokenize data. In the next step you will update the script to
normalize the data.

Step 3 — Normalizing the Data


Words have different forms—for instance, “ran”, “runs”, and “running” are various forms of
the same verb, “run”. Depending on the requirement of your analysis, all of these versions
may need to be converted to the same form, “run”. Normalization in NLP is the process of
converting a word to its canonical form.

Normalization helps group together words with the same meaning but different forms.
Without normalization, “ran”, “runs”, and “running” would be treated as different words, even
though you may want them to be treated as the same word. In this section, you explore
stemming and lemmatization, which are two popular techniques of normalization.

Stemming is a process of removing affixes from a word. Stemming, working with only simple
verb forms, is a heuristic process that removes the ends of words.

In this tutorial you will use the process of lemmatization, which normalizes a word with the
context of vocabulary and morphological analysis of words in text. The lemmatization
algorithm analyzes the structure of the word and its context to convert it to a normalized
form. Therefore, it comes at a cost of speed. A comparison of stemming and lemmatization
ultimately comes down to a trade off between speed and accuracy.
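
As a quick illustration of the difference, here is a minimal sketch comparing NLTK's PorterStemmer with the WordNet lemmatizer. It assumes the wordnet resource, which you download next, is already available:

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["ran", "runs", "running"]:
    # The lemmatizer needs a part-of-speech hint; 'v' marks the word as a verb.
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos='v'))

The stemmer leaves the irregular form ran untouched, while the lemmatizer maps all three forms to run.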
Before you proceed to use lemmatization, download the necessary resources by entering
the following into a Python interactive session:

$ python3

Run the following commands in the session to download the resources:

>>> import nltk


>>> nltk.download('wordnet')
>>> nltk.download('averaged_perceptron_tagger')

wordnet is a lexical database for the English language that helps the script determine the
base word. You need the averaged_perceptron_tagger resource to determine the context of
a word in a sentence.

Once downloaded, you are almost ready to use the lemmatizer. Before running a lemmatizer,
you need to determine the context for each word in your text. This is achieved by a tagging
algorithm, which assesses the relative position of a word in a sentence. In a Python session,
import the pos_tag function and provide a list of tokens as an argument to get the tags. Let's
try this out in Python:

>>> from nltk.tag import pos_tag


>>> from nltk.corpus import twitter_samples
>>>
>>> tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
>>> print(pos_tag(tweet_tokens[0]))

Here is the output of the pos_tag function:

Output
[('#FollowFriday', 'JJ'),
('@France_Inte', 'NNP'),
('@PKuchly57', 'NNP'),
('@Milipol_Paris', 'NNP'),
('for', 'IN'),
('being', 'VBG'),
('top', 'JJ'),
('engaged', 'VBN'),
('members', 'NNS'),
('in', 'IN'),
('my', 'PRP$'),
('community', 'NN'),
('this', 'DT'),
('week', 'NN'),
(':)', 'NN')]

From the list of tags, here are the most common items and their meanings:

NNP: Noun, proper, singular

NN: Noun, common, singular or mass

IN: Preposition or conjunction, subordinating

VBG: Verb, gerund or present participle

VBN: Verb, past participle

Here is a full list of the tags.

In general, if a tag starts with NN, the word is a noun and if it starts with VB, the word is a
verb. After reviewing the tags, exit the Python session by entering exit().

To incorporate this into a function that normalizes a sentence, you should first generate the
tags for each token in the text, and then lemmatize each word using the tag.

Update the nlp_test.py file with the following function that lemmatizes a sentence:

nlp_test.py

...

from nltk.tag import pos_tag


from nltk.stem.wordnet import WordNetLemmatizer

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

print(lemmatize_sentence(tweet_tokens[0]))

This code imports the WordNetLemmatizer class and initializes it to a variable, lemmatizer .

The function lemmatize_sentence first gets the position tag of each token of a tweet. Within
the if statement, if the tag starts with NN, the token is assigned as a noun. Similarly, if the
tag starts with VB, the token is assigned as a verb. Otherwise, the token defaults to an
adjective, which the lemmatizer represents as 'a'.

Save and close the file, and run the script:

$ python3 nlp_test.py

Here is the output:

Output
['#FollowFriday',
'@France_Inte',
'@PKuchly57',
'@Milipol_Paris',
'for',
'be',
'top',
'engage',
'member',
'in',
'my',
'community',
'this',
'week',
':)']

You will notice that the verb being changes to its root form, be, and the noun members
changes to member. Before you proceed, comment out the last line that prints the sample
tweet from the script.


Now that you have successfully created a function to normalize words, you are ready to
move on to removing noise.

Step 4 — Removing Noise from the Data


In this step, you will remove noise from the dataset. Noise is any part of the text that does
not add meaning or information to data.

Noise is specific to each project, so what constitutes noise in one project may not be in a
different project. For instance, the most common words in a language are called stop words.
Some examples of stop words are “is”, “the”, and “a”. They are generally irrelevant when
processing language, unless a specific use case warrants their inclusion.

In this tutorial, you will use regular expressions in Python to search for and remove these
items:

Hyperlinks: All hyperlinks in Twitter are converted to the URL shortener t.co. Therefore,
keeping them in the text processing would not add any value to the analysis.

Twitter handles in replies: These Twitter usernames are preceded by a @ symbol, which
does not convey any meaning.

Punctuation and special characters: While these often provide context to textual data,
this context is often difficult to process. For simplicity, you will remove all punctuation and
special characters from tweets.

To remove hyperlinks, you need to first search for a substring that matches a URL starting
with http:// or https:// , followed by letters, numbers, or special characters. Once a
pattern is matched, the .sub() method replaces it with an empty string.
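
Here is a minimal sketch of that idea, using a simplified pattern and a made-up tweet; the remove_noise() function below uses a more thorough expression:

import re

tweet = "Loving the new docs https://t.co/abc123 :)"
# Replace anything that looks like a URL with an empty string.
print(re.sub(r'https?://\S+', '', tweet))
# Loving the new docs  :)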

Since we will normalize word forms within the remove_noise() function, you can comment
out the lemmatize_sentence() function from the script.

Add the following code to your nlp_test.py file to remove noise from the dataset:

nlp_test.py

...

import re, string
def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):

        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

This code creates a remove_noise() function that removes noise and incorporates the
normalization and lemmatization mentioned in the previous section. The code takes two
arguments: the tweet tokens and the tuple of stop words.

The code then uses a loop to remove the noise from the dataset. To remove hyperlinks, the
code first searches for a substring that matches a URL starting with http:// or https:// ,
followed by letters, numbers, or special characters. Once a pattern is matched, the .sub()
method replaces it with an empty string, or '' .

Similarly, to remove @ mentions, the code substitutes the relevant part of text using regular
expressions. The code uses the re library to search for @ symbols followed by numbers,
letters, or _, and replaces them with an empty string.

Finally, you can remove punctuation using the string library.

In addition to this, you will also remove stop words using a built-in set of stop words in
NLTK, which needs to be downloaded separately.


Execute the following command from a Python interactive session to download this
resource:

>>> nltk.download('stopwords')

Once the resource is downloaded, exit the interactive session.

You can use the .words() method to get a list of stop words in English. To test the function,
let us run it on our sample tweet. Add the following lines to the end of the nlp_test.py file:

nlp_test.py

...
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

print(remove_noise(tweet_tokens[0], stop_words))

After saving and closing the file, run the script again to receive output similar to the
following:

Output
['#followfriday', 'top', 'engage', 'member', 'community', 'week', ':)']

Notice that the function removes all @ mentions and stop words, and converts the words to
lowercase.

Before proceeding to the modeling exercise in the next step, use the remove_noise()
function to clean the positive and negative tweets. Comment out the line to print the output
of remove_noise() on the sample tweet and add the following to the nlp_test.py script:

nlp_test.py

...
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# print(remove_noise(tweet_tokens[0], stop_words))

positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

positive_cleaned_tokens_list = []
negative_cleaned_tokens_list = []

for tokens in positive_tweet_tokens:
    positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

for tokens in negative_tweet_tokens:
    negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

Now that you’ve added the code to clean the sample tweets, you may want to compare the
original tokens to the cleaned tokens for a sample tweet. If you’d like to test this, add the
following code to the file to compare both versions of the 500th tweet in the list:

nlp_test.py

...
print(positive_tweet_tokens[500])
print(positive_cleaned_tokens_list[500])

Save and close the file and run the script. From the output you will see that the punctuation
and links have been removed, and the words have been converted to lowercase.

Output
['Dang', 'that', 'is', 'some', 'rad', '@AbzuGame', '#fanart', '!', ':D', 'https://t.co/bI8k8tb
['dang', 'rad', '#fanart', ':d']

There are certain issues that might arise during the preprocessing of text. For instance,
words without spaces ("iLoveYou") will be treated as a single token and can be difficult to
separate. Furthermore, "Hi", "Hii", and "Hiiiii" will be treated differently by the script unless
you write something specific to tackle the issue. It's common to fine-tune the noise removal
process for your specific data.
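
For example, one common tweak, shown here as a sketch rather than as part of this tutorial's pipeline, is to collapse characters repeated three or more times so that "Hii" and "Hiiiii" normalize to the same token:

import re

def squeeze_repeats(token):
    # Replace any run of three or more identical characters with exactly two.
    return re.sub(r'(.)\1{2,}', r'\1\1', token)

print(squeeze_repeats("Hiiiii"), squeeze_repeats("Hii"))
# Hii Hii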

Now that you’ve seen the remove_noise() function in action, be sure to comment out or
remove the last two lines from the script so you can add more to it:

nlp_test.py

...
# print(positive_tweet_tokens[500])
# print(positive_cleaned_tokens_list[500])

In this step you removed noise from the data to make the analysis more effective. In the next
step you will analyze the data to find the most common words in your sample dataset.

Step 5 — Determining Word Density


The most basic form of analysis on textual data is to look at word frequency. A single
tweet is too small an entity to reveal the distribution of words, so the frequency analysis is
performed on all positive tweets.

The following snippet defines a generator function, named get_all_words, that takes a list
of cleaned token lists as an argument and yields every token across all of the tweets. Add the
following code to your nlp_test.py file:

nlp_test.py

...

def get_all_words(cleaned_tokens_list):
for tokens in cleaned_tokens_list:
for token in tokens:
yield token

all_pos_words = get_all_words(positive_cleaned_tokens_list)

Now that you have compiled all words in the sample of tweets, you can find out which are
the most common words using the FreqDist class of NLTK. Add the following code to
the nlp_test.py file:

nlp_test.py

from nltk import FreqDist

freq_dist_pos = FreqDist(all_pos_words)
print(freq_dist_pos.most_common(10))

The .most_common() method lists the words which occur most frequently in the data. Save
and close the file after making these changes.

When you run the file now, you will find the most common terms in the data:

Output
[(':)', 3691),
(':-)', 701),
(':d', 658),
('thanks', 388),
('follow', 357),
('love', 333),
('...', 290),
('good', 283),
('get', 263),
('thank', 253)]

From this data, you can see that emoticon entities form some of the most common parts of
positive tweets. Before proceeding to the next step, make sure you comment out the last
line of the script that prints the top ten tokens.
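
If you would like a visual summary while you experiment, FreqDist also provides a .plot() method; this is optional and assumes the matplotlib package is installed:

# Plot the frequency of the ten most common tokens in the positive tweets.
freq_dist_pos.plot(10)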

To summarize, you extracted the tweets from nltk, then tokenized, normalized, and cleaned up
the tweets for use in the model. Finally, you also looked at the frequencies of tokens in the
data and checked the frequencies of the top ten tokens.

In the next step you will prepare data for sentiment analysis.

Step 6 — Preparing Data for the Model


Sentiment analysis is a process of identifying the attitude of the author on a topic that is
being written about. You will create a training dataset to train a model. It is a supervised
machine learning process, which requires you to associate each piece of data with a
“sentiment” for training. In this tutorial, your model will use the “positive” and “negative”
sentiments.

Sentiment analysis can be used to categorize text into a variety of sentiments. For simplicity
and availability of the training dataset, this tutorial helps you train your model in only two
categories, positive and negative.

A model is a description of a system using rules and equations. It may be as simple as an
equation which predicts the weight of a person, given their height. A sentiment analysis
model that you will build would associate tweets with a positive or a negative sentiment. You
will need to split your dataset into two parts. The purpose of the first part is to build the
model, whereas the next part tests the performance of the model.

In the data preparation step, you will prepare the data for sentiment analysis by converting
tokens to the dictionary form and then splitting the data for training and testing purposes.

Converting Tokens to a Dictionary


First, you will prepare the data to be fed into the model. You will use the Naive Bayes
classifier in NLTK to perform the modeling exercise. Notice that the model requires not just a
list of words in a tweet, but a Python dictionary with words as keys and True as values. The
following function is a generator that changes the format of the cleaned data.

Add the following code to convert the tweets from a list of cleaned tokens to dictionaries
with keys as the tokens and True as values. The corresponding dictionaries are stored in
positive_tokens_for_model and negative_tokens_for_model .

nlp_test.py

...
def get_tweets_for_model(cleaned_tokens_list):
for tweet_tokens in cleaned_tokens_list:
yield dict([token, True] for token in tweet_tokens)

positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)
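
If you want to see what this format looks like, you can build a throwaway generator over a single cleaned tweet; building it separately avoids consuming the generators reserved for training. For the sample tweet cleaned earlier, the result looks like this:

preview = get_tweets_for_model([positive_cleaned_tokens_list[0]])
print(next(preview))
# {'#followfriday': True, 'top': True, 'engage': True, 'member': True,
#  'community': True, 'week': True, ':)': True}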

Splitting the Dataset for Training and Testing the Model


Next, you need to prepare the data for training the NaiveBayesClassifier class. Add the
following code to the file to prepare the data:

nlp_test.py

...
import random

positive_dataset = [(tweet_dict, "Positive")
                     for tweet_dict in positive_tokens_for_model]

negative_dataset = [(tweet_dict, "Negative")
                     for tweet_dict in negative_tokens_for_model]

dataset = positive_dataset + negative_dataset

random.shuffle(dataset)

train_data = dataset[:7000]
test_data = dataset[7000:]

This code attaches a Positive or Negative label to each tweet. It then creates a dataset
by joining the positive and negative tweets.

By default, the data contains all positive tweets followed by all negative tweets in sequence.
When training the model, you should provide a sample of your data that does not contain
any bias. To avoid bias, you’ve added code to randomly arrange the data using the
.shuffle() method of random .

Finally, the code splits the shuffled data into a ratio of 70:30 for training and testing,
respectively. Since there are 10,000 tweets, you can use the first 7,000 tweets from
the shuffled dataset for training the model and the final 3,000 for testing it.

In this step, you converted the cleaned tokens to a dictionary form, randomly shuffled the
dataset, and split it into training and testing data.

Step 7 — Building and Testing the Model


Finally, you can use the NaiveBayesClassifier class to build the model. Use the .train()
method to train the model and the .accuracy() method to test the model on the testing
data.

nlp_test.py

...
from nltk import classify
from nltk import NaiveBayesClassifier
classifier = NaiveBayesClassifier.train(train_data)

print("Accuracy is:", classify.accuracy(classifier, test_data))

# show_most_informative_features() prints its table directly, so it does not need print().
classifier.show_most_informative_features(10)

Save, close, and execute the file after adding the code. The output of the code will be as
follows:

Output
Accuracy is: 0.9956666666666667

Most Informative Features


:( = True Negati : Positi = 2085.6 : 1.0
:) = True Positi : Negati = 986.0 : 1.0
welcome = True Positi : Negati = 37.2 : 1.0
arrive = True Positi : Negati = 31.3 : 1.0
sad = True Negati : Positi = 25.9 : 1.0
follower = True Positi : Negati = 21.1 : 1.0
bam = True Positi : Negati = 20.7 : 1.0
glad = True Positi : Negati = 18.1 : 1.0
x15 = True Negati : Positi = 15.9 : 1.0
community = True Positi : Negati = 14.1 : 1.0

Accuracy is defined as the percentage of tweets in the testing dataset for which the model
was correctly able to predict the sentiment. A 99.5% accuracy on the test set is pretty good.

In the table that shows the most informative features, every row in the output shows the
ratio of occurrence of a token in positive and negative tagged tweets in the training dataset.
The first row in the data signifies that in all tweets containing the token :(, the ratio of
negative to positive tweets was 2085.6 to 1. Interestingly, it seems that there was one
token with :( in the positive datasets. You can see that the top two discriminating items in
the text are the emoticons. Further, words such as sad lead to negative sentiments,
whereas welcome and glad are associated with positive sentiments.
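
Beyond the hard Positive or Negative label, the classifier can also report how confident it is in a prediction. Here is a short sketch using prob_classify(), which returns a probability distribution over the labels, applied to one held-out example from the test data:

# Inspect the true label, the predicted label, and the model's confidence.
features, actual_label = test_data[0]
dist = classifier.prob_classify(features)
print(actual_label, dist.max(), round(dist.prob(dist.max()), 4))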

Next, you can check how the model performs on random tweets from Twitter. Add this code
to the file:

nlp_test.py

...
from nltk.tokenize import word_tokenize

custom_tweet = "I ordered just once from TerribleCo, they screwed up, neverS C ROLL
used TOapp
the TO Paga


custom_tokens = remove_noise(word_tokenize(custom_tweet))

print(classifier.classify(dict([token, True] for token in custom_tokens)))

This code will allow you to test custom tweets by updating the string associated with the
custom_tweet variable. Save and close the file after making these changes.

Run the script to analyze the custom text. Here is the output for the custom text in the
example:

Output
'Negative'

You can also check if it characterizes positive tweets correctly:

nlp_test.py

...
custom_tweet = ' Congrats #SportStar on your 7th best goal from last season winning goal of t

Here is the output:

Output
'Positive'

Now that you’ve tested both positive and negative sentiments, update the variable to test a
more complex sentiment like sarcasm.

nlp_test.py

...
custom_tweet = ' Thank you for sending my baggage to CityX and flying me to CityY at the same

Here is the output:

Output
'Positive'

The model classified this example as positive. This is because the training data wasn't
comprehensive enough to classify sarcastic tweets as negative. If you want your
model to predict sarcasm, you would need to provide a sufficient amount of training data to
train it accordingly.

In this step you built and tested the model. You also explored some of its limitations, such as
not detecting sarcasm in particular examples. Your completed code still has artifacts left
over from following the tutorial, so the next step will guide you through aligning the code
with Python's best practices.

Step 8 — Cleaning Up the Code (Optional)


Though you have completed the tutorial, it is recommended that you reorganize the code in the
nlp_test.py file to follow best programming practices. Per best practice, your code should
meet these criteria:

All imports should be at the top of the file. Imports from the same library should be grouped
together in a single statement.

All functions should be defined after the imports.

All the statements in the file should be housed under an if __name__ == "__main__":
condition. This ensures that the statements are not executed if you are importing the
functions of the file in another file.
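
For example, with the guard in place another script could reuse the cleaning function without triggering the training run. A minimal sketch, assuming it is saved next to nlp_test.py (the file name reuse_example.py is hypothetical):

# reuse_example.py
from nltk.tokenize import word_tokenize

from nlp_test import remove_noise

print(remove_noise(word_tokenize("I really enjoyed this tutorial :)")))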

We will also remove the code that was commented out while following the tutorial, along with
the lemmatize_sentence function, as the lemmatization is completed by the new
remove_noise function.

Here is the cleaned version of nlp_test.py :

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import twitter_samples, stopwords
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk import FreqDist, classify, NaiveBayesClassifier

import re, string, random

def remove_noise(tweet_tokens, stop_words = ()):

    cleaned_tokens = []

    for token, tag in pos_tag(tweet_tokens):

        token = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+#]|[!*\(\),]|'\
                       '(?:%[0-9a-fA-F][0-9a-fA-F]))+','', token)
        token = re.sub("(@[A-Za-z0-9_]+)","", token)

        if tag.startswith("NN"):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'

        lemmatizer = WordNetLemmatizer()
        token = lemmatizer.lemmatize(token, pos)

        if len(token) > 0 and token not in string.punctuation and token.lower() not in stop_words:
            cleaned_tokens.append(token.lower())
    return cleaned_tokens

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_tweets_for_model(cleaned_tokens_list):
    for tweet_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in tweet_tokens)

if __name__ == "__main__":

    positive_tweets = twitter_samples.strings('positive_tweets.json')
    negative_tweets = twitter_samples.strings('negative_tweets.json')
    text = twitter_samples.strings('tweets.20150430-223406.json')
    tweet_tokens = twitter_samples.tokenized('positive_tweets.json')

    stop_words = stopwords.words('english')

    positive_tweet_tokens = twitter_samples.tokenized('positive_tweets.json')
    negative_tweet_tokens = twitter_samples.tokenized('negative_tweets.json')

    positive_cleaned_tokens_list = []
    negative_cleaned_tokens_list = []

    for tokens in positive_tweet_tokens:
        positive_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    for tokens in negative_tweet_tokens:
        negative_cleaned_tokens_list.append(remove_noise(tokens, stop_words))

    all_pos_words = get_all_words(positive_cleaned_tokens_list)

    freq_dist_pos = FreqDist(all_pos_words)
    print(freq_dist_pos.most_common(10))

    positive_tokens_for_model = get_tweets_for_model(positive_cleaned_tokens_list)
    negative_tokens_for_model = get_tweets_for_model(negative_cleaned_tokens_list)

    positive_dataset = [(tweet_dict, "Positive")
                         for tweet_dict in positive_tokens_for_model]

    negative_dataset = [(tweet_dict, "Negative")
                         for tweet_dict in negative_tokens_for_model]

    dataset = positive_dataset + negative_dataset

    random.shuffle(dataset)

    train_data = dataset[:7000]
    test_data = dataset[7000:]

    classifier = NaiveBayesClassifier.train(train_data)

    print("Accuracy is:", classify.accuracy(classifier, test_data))

    classifier.show_most_informative_features(10)

    custom_tweet = "I ordered just once from TerribleCo, they screwed up, never used the app again."

    custom_tokens = remove_noise(word_tokenize(custom_tweet))

    print(custom_tweet, classifier.classify(dict([token, True] for token in custom_tokens)))
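
The script retrains the classifier on every run. If you plan to classify many custom tweets over time, one option, sketched here with the standard library's pickle module rather than anything from the tutorial itself, is to save the trained classifier once and reload it later:

import pickle

# After training, save the classifier to disk...
with open('sentiment_classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

# ...and in a later session, load it back instead of retraining.
with open('sentiment_classifier.pickle', 'rb') as f:
    classifier = pickle.load(f)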

Conclusion
This tutorial introduced you to a basic sentiment analysis model using the nltk library in
Python 3. First, you performed pre-processing on tweets by tokenizing a tweet, normalizing
the words, and removing noise. Next, you visualized frequently occurring items in the data.
Finally, you built a model to associate tweets with a particular sentiment.

A supervised learning model is only as good as its training data. To further strengthen the
model, you could consider adding more categories like excitement and anger. In this
tutorial, you have only scratched the surface by building a rudimentary model. Here's a
detailed guide on various considerations that one must take care of while performing
sentiment analysis.
