Removing stop words with NLTK in Python

When computers process natural language, some extremely common words add little value in helping select documents that match a user's need, so they are often excluded from the vocabulary entirely. These words are called stop words.

For example, suppose you give the following input sentence −

John is a person who takes care of the people around him.

After stop word removal, you'll get the output −

['John', 'person', 'takes', 'care', 'people', 'around', '.']
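At its core, stop word removal is a membership test: each token is kept only if it is not in a known set of stop words. The following is a minimal sketch of that idea, not a full implementation. The `STOP_WORDS` set and the `remove_stop_words` helper are hypothetical names introduced here for illustration, and the tokens are split by hand so the sketch needs no external library.

```python
# Hypothetical, hand-picked stop word set for illustration only;
# real stop word lists are much longer.
STOP_WORDS = {"is", "a", "who", "of", "the", "him"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens whose lowercase form is not a stop word."""
    return [t for t in tokens if t.lower() not in stop_words]


# Tokens of the example sentence, split by hand (an assumption of this sketch).
tokens = ["John", "is", "a", "person", "who", "takes", "care",
          "of", "the", "people", "around", "him", "."]

print(remove_stop_words(tokens))
# ['John', 'person', 'takes', 'care', 'people', 'around', '.']
```

Storing the stop words in a set keeps each lookup fast, which matters once the list grows to a realistic size.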

NLTK ships with a collection of stop words for several languages, which we can use to remove them from any given sentence. The collection lives in the nltk.corpus module. We can use it to filter the stop words out of our sentence. For example,

Example

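import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads: the stop word lists and the tokenizer models
nltk.download('stopwords')
nltk.download('punkt')  # recent NLTK releases may ask for 'punkt_tab' instead

my_sent = "John is a person who takes care of the people around him."
tokens = word_tokenize(my_sent)

# Use the English list only: words() with no argument returns every language
stop_words = set(stopwords.words('english'))
# The NLTK list is lowercase, so compare tokens in lowercase
filtered_sentence = [w for w in tokens if w.lower() not in stop_words]

print(filtered_sentence)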

Output

This will give the output −

['John', 'person', 'takes', 'care', 'people', 'around', '.']
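Note that the '.' token is still in the result, because punctuation is not part of the stop word list. If you also want to drop punctuation, one possible approach is a second filter that keeps only tokens containing at least one letter or digit. This is a sketch under that assumption; `drop_punctuation` is a hypothetical helper name and not part of NLTK.

```python
def drop_punctuation(tokens):
    """Keep only tokens that contain at least one letter or digit."""
    return [t for t in tokens if any(ch.isalnum() for ch in t)]


# The filtered list from the output above, copied here so the sketch runs on its own.
filtered_sentence = ['John', 'person', 'takes', 'care', 'people', 'around', '.']

print(drop_punctuation(filtered_sentence))
# ['John', 'person', 'takes', 'care', 'people', 'around']
```

Checking for any alphanumeric character, rather than requiring the whole token to be alphabetic, keeps tokens such as "don't" or "3.5" that mix letters or digits with punctuation.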