Write a Python Program for the Following Preprocessing of Text in NLP: Tokenization, Filtration, Script Validation, Stop Word Removal, Stemming

This document gives a Python program for text preprocessing in NLP covering tokenization, filtration, script validation, stop word removal, and stemming. It uses the NLTK library for tokenization, stop word removal, and Porter stemming, and the langdetect library to check that the input is English. A sample sentence demonstrates the result of each step.

Uploaded by

Nidhi Rao

1. Write a Python program for the following preprocessing of text in NLP:

● Tokenization
● Filtration
● Script Validation
● Stop Word Removal
● Stemming
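
Before the NLTK version, the first, second, and fourth steps can be illustrated with the standard library alone. This is a minimal sketch, not the program asked for: the function name toy_preprocess, the regular expression, and the tiny stop word set are assumptions made for illustration, and script validation and stemming are left out.

```python
import re

# Assumption: a tiny hand-picked stop word set, for illustration only.
TOY_STOP_WORDS = {"this", "is", "an", "in", "the", "a", "of"}


def toy_preprocess(text):
    """Illustrate tokenization, filtration, and stop word removal."""
    # Tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    # Filtration: keep only purely alphabetic tokens.
    alphabetic = [t for t in tokens if t.isalpha()]
    # Stop word removal: case-insensitive lookup in the toy set.
    kept = [t for t in alphabetic if t.lower() not in TOY_STOP_WORDS]
    return tokens, alphabetic, kept


tokens, alphabetic, kept = toy_preprocess("This is an example, in 2 parts!")
print(tokens)
print(alphabetic)
print(kept)
```

Each stage only removes tokens from the previous one, which is why the NLTK program below prints the list after every step.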

pip install nltk langdetect

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
from langdetect import detect

# Download necessary resources
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text)
    print("Tokens:", tokens)

    # Filtration (removing non-alphabetic tokens such as punctuation and numbers)
    filtered_tokens = [word for word in tokens if word.isalpha()]
    print("Filtered Tokens:", filtered_tokens)

    # Script Validation (checking if text is in English)
    try:
        if detect(text) != 'en':
            return "Text is not in English, skipping preprocessing."
    except Exception:
        # detect() raises when the text has no usable features (e.g. empty input)
        return "Language detection failed."

    # Stop Word Removal (case-insensitive comparison against NLTK's English list)
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in filtered_tokens if word.lower() not in stop_words]
    print("After Stop Word Removal:", filtered_tokens)

    # Stemming (PorterStemmer also lowercases each token)
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(word) for word in filtered_tokens]
    print("Stemmed Tokens:", stemmed_tokens)

    return ' '.join(stemmed_tokens)

# Example Usage
text = "This is an example sentence demonstrating text preprocessing in NLP!"
processed_text = preprocess_text(text)
print("Processed Text:", processed_text)

OUTPUT

Tokens: ['This', 'is', 'an', 'example', 'sentence', 'demonstrating', 'text', 'preprocessing', 'in', 'NLP', '!']
Filtered Tokens: ['This', 'is', 'an', 'example', 'sentence', 'demonstrating', 'text', 'preprocessing', 'in', 'NLP']
After Stop Word Removal: ['example', 'sentence', 'demonstrating', 'text', 'preprocessing', 'NLP']
Stemmed Tokens: ['exampl', 'sentenc', 'demonstr', 'text', 'preprocess', 'nlp']
Processed Text: exampl sentenc demonstr text preprocess nlp
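
The program treats script validation as language detection with langdetect. A narrower reading of the step is to check which writing system the characters belong to. The following is a minimal sketch of that idea using only the standard library; the function name is_latin_script and the 0.9 threshold are assumptions made for illustration and are not part of the original program.

```python
import unicodedata


def is_latin_script(text, threshold=0.9):
    """Return True if at least `threshold` of the letters in `text` are Latin."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        # No letters at all (digits, punctuation, empty string): nothing to validate.
        return False
    # Unicode character names start with the script name, e.g. "LATIN SMALL LETTER A".
    latin = sum(1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN"))
    return latin / len(letters) >= threshold


print(is_latin_script("This is an example sentence demonstrating text preprocessing in NLP!"))
print(is_latin_script("Привет мир"))
```

A check like this accepts any Latin-script language (French or German text would pass), so it complements a language detector and does not replace one.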
