Lab Prgms Week1 - Output

This document provides a guide on installing the Natural Language Toolkit (NLTK) in Python, including commands for installation and downloading necessary packages. It outlines text preprocessing techniques such as tokenization and stopword removal, demonstrating how to implement these using NLTK functions. The tutorial emphasizes the importance of these preprocessing steps for preparing text data for machine learning tasks.

Installing Python NLTK

To get started, you need to install NLTK on your computer. Run the
following command:

!pip install nltk
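
Note that the leading ! runs the command from inside a notebook environment such as Jupyter or Google Colab; if you are installing from a regular terminal, use pip install nltk without the exclamation mark.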

After installation, you need to import NLTK and download the necessary packages.

import nltk

nltk.download('punkt')

nltk.download('wordnet')

nltk.download('averaged_perceptron_tagger')

nltk.download('stopwords')

nltk.download('maxent_ne_chunker')

nltk.download('words')

Here’s the output that you should expect:

[nltk_data] Downloading package punkt to /home/user/nltk_data...

[nltk_data] Unzipping tokenizers/punkt.zip.

[nltk_data] Downloading package wordnet to /home/user/nltk_data...

[nltk_data] Unzipping corpora/wordnet.zip.

[nltk_data] Downloading package averaged_perceptron_tagger to

...

The above commands download several NLTK packages using nltk.download().
You will need these to perform tasks such as part-of-speech tagging,
stopword removal, and lemmatization.
With the Natural Language Toolkit installed, we are now ready to
explore the next steps of preprocessing.
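
As a quick check that the downloads succeeded, you can tokenize a short sentence and run the part-of-speech tagger over it. This is a minimal sketch (the sample sentence is arbitrary and not part of the lab output):

import nltk
from nltk.tokenize import word_tokenize

# Tokenize an arbitrary sentence and tag each token with its part of speech.
# This exercises the 'punkt' models and the 'averaged_perceptron_tagger'
# downloaded above.
tokens = word_tokenize("NLTK makes text preprocessing easy.")
print(nltk.pos_tag(tokens))

# Expected output is a list of (token, tag) pairs, for example:
# [('NLTK', 'NNP'), ('makes', 'VBZ'), ('text', 'NN'), ...]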

Text Preprocessing
Text preprocessing is the practice of cleaning and preparing text data
for machine learning algorithms. The primary steps include tokenizing,
removing stop words, stemming, lemmatizing, and more.

These steps help reduce the complexity of the data and extract
meaningful information from it.
In the coming sections of this tutorial, we’ll walk you through each of
these steps using NLTK.
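
Stemming and lemmatization are listed among these steps but are not demonstrated in this excerpt, so here is a minimal preview sketch using NLTK's PorterStemmer and WordNetLemmatizer (the sample words are arbitrary):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["studies", "running", "better"]

# Stemming strips word endings with heuristic rules; the result may not be
# a dictionary word (for example, 'studies' becomes 'studi').
print([stemmer.stem(w) for w in words])

# Lemmatization maps each word to a dictionary form using WordNet; the pos
# argument tells it to treat the words as verbs.
print([lemmatizer.lemmatize(w, pos='v') for w in words])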

Sentence and word tokenization


Tokenization is the process of breaking down text into words,
phrases, symbols, or other meaningful elements called tokens. The
input to the tokenizer is a unicode text, and the output is a list of
sentences or words.

In NLTK, we have two types of tokenizers: the word tokenizer and the sentence tokenizer.
Let’s see an example:

from nltk.tokenize import sent_tokenize, word_tokenize

text = "Natural language processing is fascinating. It involves

many tasks such as text classification, sentiment analysis, and

more."

sentences = sent_tokenize(text)

print(sentences)
words = word_tokenize(text)

print(words)

Output:

['Natural language processing is fascinating.', 'It involves many tasks such as text classification, sentiment analysis, and more.']

['Natural', 'language', 'processing', 'is', 'fascinating', '.', 'It', 'involves', 'many', 'tasks', 'such', 'as', 'text', 'classification', ',', 'sentiment', 'analysis', ',', 'and', 'more', '.']

The sent_tokenize function splits the text into sentences, and the word_tokenize function splits the text into words. As you can see, punctuation is also treated as a separate token.


Stopwords removal
In natural language processing, stopwords are words that you want to
ignore, so you filter them out when you’re processing your text.

These are usually words that occur very frequently in any text and do
not convey much meaning, such as “is”, “an”, “the”, “in”, etc.
NLTK comes with a predefined list of stopwords in several languages,
including English.
Let’s use NLTK to filter out stopwords from our list of tokenized words:
from nltk.corpus import stopwords

from nltk.tokenize import word_tokenize

text = "Natural language processing is fascinating. It involves

many tasks such as text classification, sentiment analysis, and

more."

stop_words = set(stopwords.words('english'))

words = word_tokenize(text)

filtered_words = [word for word in words if word.casefold() not in stop_words]

print(filtered_words)

Output:

['Natural', 'language', 'processing', 'fascinating', '.', 'involves', 'many', 'tasks', 'text', 'classification', ',', 'sentiment', 'analysis', ',', '.']

In this piece of code, we first import the stopwords from NLTK, tokenize the text, and then filter out the stopwords.
The casefold() method is used to ignore the case while comparing words to the stop words list.
