
1. Write a Python program to perform the following tasks on text

a) Tokenization

# Sample text
text = "This is a simple example of tokenization using split."

# Tokenizing the text using the split() function
tokens = text.split()

# Displaying the tokens
print("Tokens:", tokens)

b) Stop Word Removal

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the necessary resources for tokenization and stop words
nltk.download('punkt')
nltk.download('stopwords')

# Sample text
text = "This is an example sentence demonstrating stopword removal using NLTK."

# Tokenizing the text using word_tokenize()
tokens = word_tokenize(text)

# Get the set of stopwords in English
stop_words = set(stopwords.words('english'))

# Filter out stopwords from the tokens
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Displaying the filtered tokens
print("Filtered Tokens:", filtered_tokens)


2. Write a Python program to implement the Porter stemmer algorithm for stemming

import re

class PorterStemmer:
    # A simplified, rule-based approximation of the Porter stemmer
    def stem(self, word):
        # Ordered (suffix pattern, replacement) rules, applied in sequence
        suffixes = [
            (r"(sses|ss)$", ""),
            (r"(ies|ied)$", "i"),
            (r"(ing|ed)$", ""),
            (r"(es|s)$", ""),
            (r"(ly|ness)$", ""),
            (r"(er|ful)$", ""),
        ]
        word = word.lower()
        for pattern, replacement in suffixes:
            word = re.sub(pattern, replacement, word)
        return word

# Test the Porter Stemmer
porter_stemmer = PorterStemmer()
words = ["running", "better", "happiness", "jumps", "faster", "running", "beauty", "kindness"]
stemmed_words = [porter_stemmer.stem(word) for word in words]
print("Original Words:", words)
print("Stemmed Words:", stemmed_words)


3. Write Python programs for

a) Word Analysis

import string

class WordAnalyzer:
    def __init__(self):
        self.vowels = "aeiou"

    def analyze_word(self, word):
        word = word.lower()
        word_length = len(word)
        vowels_count = sum(1 for char in word if char in self.vowels)
        consonants_count = sum(1 for char in word
                               if char in string.ascii_lowercase and char not in self.vowels)
        unique_chars = len(set(word))
        return {
            "Word": word,
            "Length": word_length,
            "Vowels": vowels_count,
            "Consonants": consonants_count,
            "Unique Characters": unique_chars
        }

# Usage example for Word Analysis:
word_analyzer = WordAnalyzer()
word = input("Enter a word for analysis: ")
analysis = word_analyzer.analyze_word(word)
print("\nWord Analysis:")
for key, value in analysis.items():
    print(f"{key}: {value}")

b) Word Generation

import random
import string

def generate_word(length):
    # Randomly selects letters to form a word of the specified length
    return ''.join(random.choice(string.ascii_lowercase) for _ in range(length))

def generate_words(num_words, word_length):
    words = [generate_word(word_length) for _ in range(num_words)]
    return words

# Example usage
num_words = 5    # Number of words to generate
word_length = 8  # Length of each word
generated_words = generate_words(num_words, word_length)
print("Generated Words:", generated_words)


4. Create a sample list of at least 5 words with ambiguous senses and write a Python program to implement WSD

from collections import Counter

# Ambiguous words and their possible senses
word_senses = {
    'bank': ['financial institution', 'side of a river'],
    'bark': ['sound a dog makes', 'outer covering of a tree'],
    'bat': ['flying mammal', 'sports equipment'],
    'lead': ['metal', 'to guide someone'],
    'spring': ['season', 'coiled object used for bouncing']
}

# Sample sentences with ambiguous words
sentences = [
    "I went to the bank to deposit some money.",
    "The dog started to bark loudly.",
    "He hit the ball with a bat.",
    "She will lead the team to victory.",
    "The flowers bloom every spring."
]

# Function to determine the sense of a word based on context
def wsd(word, sentence):
    senses = word_senses.get(word, [])
    sense_counter = Counter()
    # Count how many context words overlap with each sense description
    for sense in senses:
        for word_in_context in sentence.split():
            if word_in_context.lower() in sense.lower():
                sense_counter[sense] += 1
    return sense_counter.most_common(1)[0][0] if sense_counter else "No clear sense"

# Implement WSD for each sentence
for sentence in sentences:
    for word in word_senses:
        if word in sentence.lower():
            print(f"Sentence: {sentence}")
            print(f"Word: {word} -> Predicted Sense: {wsd(word, sentence)}")
            print("-" * 50)
5. Install the NLTK toolkit and perform stemming

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the tokenizer data
nltk.download('punkt')

# Initialize the PorterStemmer
stemmer = PorterStemmer()

# Sample text
text = "The cats were playing with the scratched balls, and they enjoyed the games."

# Tokenize the sentence into words
words = word_tokenize(text)

# Perform stemming
stemmed_words = [stemmer.stem(word) for word in words]

# Display the stemmed words
print("Original Text: ", text)
print("Stemmed Words: ", stemmed_words)


6. Create a sample list of at least 10 words, perform POS tagging, and find the POS for any given word

import nltk
from nltk import pos_tag

# Download the tagger model
nltk.download('averaged_perceptron_tagger')

# Sample list of words
words = ["run", "quickly", "dog", "happily", "under", "the", "sky", "ate", "jump", "beautiful"]

# Perform POS tagging
tagged_words = pos_tag(words)

# Print tagged words
print("Tagged Words:", tagged_words)

# Function to find POS for a given word
def find_pos(word):
    for w, tag in tagged_words:
        if w.lower() == word.lower():
            return f"POS for '{word}': {tag}"
    return f"'{word}' not found."

# Example usage
print(find_pos("dog"))
print(find_pos("run"))
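
The tags follow the Penn Treebank convention (NN = singular noun, VB = base-form verb, RB = adverb, and so on). NLTK can print the full definition of any tag after downloading its tag documentation:

nltk.download('tagsets')
nltk.help.upenn_tagset('NN')  # noun, common, singular or mass
nltk.help.upenn_tagset('RB')  # adverb
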
7. Write a Python program to

a) Perform Morphological Analysis using the NLTK library

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')

# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Sample text
text = "hi this is me."

# Tokenize text and perform morphological analysis
words = word_tokenize(text)
stemmed = [stemmer.stem(word) for word in words]
lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in words]

# Display results
print("Stemmed:", stemmed)
print("Lemmatized:", lemmatized)
b) Generate n-grams using the NLTK n-grams library

import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Download necessary NLTK data
nltk.download('punkt')

# Sample text
text = "hi this is me."

# Tokenize and generate bigrams (n=2)
bigrams = list(ngrams(word_tokenize(text), 2))

# Display the bigrams
print("Bigrams:", bigrams)

# Generate trigrams (n=3)
trigrams = list(ngrams(word_tokenize(text), 3))

# Display the trigrams
print("Trigrams:", trigrams)
c) Implement N-Gram Smoothing

import nltk
from nltk.util import ngrams
from nltk.tokenize import word_tokenize
from collections import Counter

# Download necessary NLTK data
nltk.download('punkt')

# Sample text
text = "The quick brown fox jumps over the lazy dog."

# Tokenize the text
tokens = word_tokenize(text)

# Define n (for bigrams, n=2)
n = 2

# Generate n-grams
ngram_list = list(ngrams(tokens, n))

# Count the occurrences of n-grams
ngram_counts = Counter(ngram_list)

# Calculate total n-grams
total_ngrams = len(ngram_list)

# Laplace (add-one) smoothing
vocab_size = len(set(tokens))  # Number of unique words

# Function to calculate the smoothed probability of an n-gram
def laplace_smoothing(ngram):
    ngram_count = ngram_counts[ngram] + 1  # Add-one smoothing
    return ngram_count / (total_ngrams + vocab_size)

# Test with a bigram
bigram = ('quick', 'brown')
smoothed_prob = laplace_smoothing(bigram)
print(f"Smoothed probability of {bigram}: {smoothed_prob:.4f}")
