SK NLP Practical (FS)


S. K. Somaiya College

Vidyavihar, Mumbai-400 077

Autonomous

Department of Information Technology

CERTIFICATE
Certified that the experimental work entered in this journal is as per the syllabus
in M.Sc. Information Technology for the NLP Practical prescribed by Somaiya
University, and was done by the student Name: Fardin Basuddin Shaikh, Seat No:
31031422041, of class M.Sc. Information Technology, during the academic year
2023-2024.
No. of Experiments completed: 10 out of 10

Sign of In-charge: Date: 08-04-24


INDEX

Sr. No  Title  Date

1. Implement a Python code that will generate the root words in the given sentences.  18-01-24
2. Implement a Python program that splits the words and displays both the split words and the count of the words in the given sentence using a tokenizer function.  25-01-24
3. Write a Python program to read a paragraph and generate the tokens from the paragraph using a sentence tokenizer. Also find the parts of speech for each word in the individual tokens that have been generated.  01-02-24
4. Draw a parse tree using Python for any given sentence with a required grammar rule using chunk parsing.  08-02-24
5. Write a Python code to find the term frequency and inverse document frequency for three documents. (Consider the 3 documents as 3 paragraphs.)  15-02-24
6. Implement a Python code to remove stop words and identify parts of speech for a given paragraph.  22-02-24
7. Find the probability for a given sentence, where all the words in the sentence must be in toy_pcfg1 or toy_pcfg2, using Viterbi PCFG parsing.  07-03-24
8. Given two words, calculate the similarity between the words: (a) by using Path Similarity; (b) by using Wu-Palmer Similarity.  14-03-24
9. Consider a sentence and do the following: (a) import the libraries; (b) apply word tokenization and part-of-speech tagging to the sentence; (c) create a chunk parser and test it on the sentence; (d) identify nationalities or religious or political groups, organizations, dates and money in the given sentence (select the sentence appropriately).  21-03-24
10. Write down the syntax for the following: (a) import WordNet and use the term "hello" to find synsets; (b) using the synset, find the element at the 0th index, just the word (using lemmas); (c) the name and definition of that first (0th index) synset and examples of the word; (d) discern synonyms and antonyms in the synset; (e) discern hypernyms and hyponyms in the synset.  28-03-24
Practical No 1

Aim: Implement a Python code that will generate the root words in the given
sentences.

Source Code:
pip install nltk

import nltk
nltk.download("all")

PorterStemmer:
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# create an object of class PorterStemmer
porter = PorterStemmer()
print(porter.stem("play"))
print(porter.stem("playing"))
print(porter.stem("plays"))
print(porter.stem("played"))

sentence = "Programmers program with programming languages"


s2 = "my dog is very playful"
words = word_tokenize(sentence +" "+ s2)

for w in words:
    print(w, " : ", porter.stem(w))
Output:
Snowball Stemmer:
#Snowball stemming algorithm
from nltk.stem.snowball import SnowballStemmer
snow = SnowballStemmer(language='english')
sentence = "Programmers coded with programming languages and using different
framework and technologies"

words = word_tokenize(sentence)

for w in words:
    print(w, " : ", snow.stem(w))
Output:

Lancaster Stemmer:
from nltk.stem import LancasterStemmer
Lanc_stemmer = LancasterStemmer()
sentence = "Programmers program with programming languages"

words = word_tokenize(sentence)

for w in words:
    print(w, " : ", Lanc_stemmer.stem(w))
Output:
RegExp:
from nltk.stem import RegexpStemmer
regexp = RegexpStemmer('ing$|s$|e$|able$|ion$', min=4)
words = ['connecting', 'connect', 'connects', 'fractionally', 'fractions', "consult", "consultation", "consulting", "consults"]
for word in words:
    print(word, "--->", regexp.stem(word))
Output:

Lemmatization using WordnetLemmatizer:


# import these modules
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print("rocks :", lemmatizer.lemmatize("rocks"))


print("corpora :", lemmatizer.lemmatize("corpora"))

# a denotes adjective in "pos"


print("better :", lemmatizer.lemmatize("better", pos="a"))

# v denotes verb in "pos"


print("took :", lemmatizer.lemmatize("took", pos="v"))

Output:
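The pos= argument above is supplied by hand. As an optional sketch (the helper name get_wordnet_pos is our own, not part of NLTK), the tag produced by pos_tag can be mapped to the WordNet POS constants so the lemmatizer picks the right form automatically:

from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def get_wordnet_pos(tag):
    # Map a Penn Treebank tag (from pos_tag) to the WordNet POS the lemmatizer expects
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
for word, tag in pos_tag(word_tokenize("The striped bats were hanging on their feet")):
    print(word, " : ", lemmatizer.lemmatize(word, get_wordnet_pos(tag)))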
Lemmatization using Spacy:
pip3 install spacy
python -m spacy download en_core_web_sm
import spacy

# Load the spaCy English model


nlp = spacy.load('en_core_web_sm')

# Define a sample text


text = "The quick brown foxes are jumping over the lazy dogs."

# Process the text using spaCy


doc = nlp(text)

# Extract lemmatized tokens


lemmatized_tokens = [token.lemma_ for token in doc]

# Join the lemmatized tokens into a sentence


lemmatized_text = ' '.join(lemmatized_tokens)

# Print the original and lemmatized text


print("Original Text:", text)
print("Lemmatized Text:", lemmatized_text)

Output:
Practical No 2

Aim: Implement a Python program that splits the words and displays both the split words
and the count of the words in the given sentence using a tokenizer function.

Source Code:
from nltk.tokenize import word_tokenize
text = "GeeksforGeeks is a Computer Science platform."
tokenized_text = word_tokenize(text)
print("Spilt Words: ", tokenized_text)
print("Count of Words: ", len(tokenized_text))

Output:
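As an optional extension (a sketch, not part of the prescribed program), nltk.FreqDist can report how often each token occurs in addition to the total count:

from nltk import FreqDist
from nltk.tokenize import word_tokenize

text = "GeeksforGeeks is a Computer Science platform."
tokens = word_tokenize(text)
# Per-token frequency, most frequent first
print("Per-word counts:", FreqDist(tokens).most_common())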
Practical No 3

Aim: Write a Python program to read a paragraph and generate the tokens from the
paragraph using a sentence tokenizer. Also find the parts of speech for each word in the
individual tokens that have been generated.

Source Code:

English Language:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk import pos_tag

def tokenize_and_find_pos(paragraph):
    sentences = sent_tokenize(paragraph)

    for i, sentence in enumerate(sentences, start=1):
        print(f"Sentence {i}: {sentence}")
        words = word_tokenize(sentence)
        pos_tags = pos_tag(words)
        print("Parts of Speech:", pos_tags, "\n")

# Example paragraph
paragraph = "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human (natural) languages. It is used to apply algorithms to identify and extract the natural language rules such that the unstructured language data is converted into a form that computers can understand."

tokenize_and_find_pos(paragraph)
Output:

Non-English Language:
#Non-English Tokenization
import nltk
nltk.download('punkt')
german_tokenizer = nltk.data.load('tokenizers/punkt/german.pickle')
german_tokens=german_tokenizer.tokenize('Wie geht es Ihnen? Gut,danke.')
print(german_tokens)
ps = pos_tag(german_tokens)
print(ps)

Output:
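The same German text can also be split with sent_tokenize's language argument, which loads the corresponding Punkt model internally; a minimal equivalent sketch is below. Note that pos_tag's default model is trained on English, so the tags printed for the German tokens above are only indicative.

from nltk.tokenize import sent_tokenize
# Same result as loading tokenizers/punkt/german.pickle by hand
print(sent_tokenize('Wie geht es Ihnen? Gut, danke.', language='german'))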
Practical No 4

Aim: Draw a parse tree using Python for any given sentence with a required grammar rule
using chunk parsing.

Source Code:
import nltk
from nltk import RegexpParser
from nltk.tokenize import word_tokenize
#from IPython.display import display
# Download necessary NLTK resources
nltk.download('punkt')
# Define a sample grammar rule
grammar = r"""
NP: {<DT|JJ|NN.*>+} # Chunk sequences of DT, JJ, NN
PP: {<IN><NP>} # Chunk prepositions followed by NP
VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
CLAUSE: {<NP><VP>} # Chunk NP, VP pairs
"""
# Create a chunk parser
chunk_parser = RegexpParser(grammar)
# Define a sample sentence
sentence = "The quick brown fox jumps over the lazy dog"
# Tokenize the sentence
tokens = word_tokenize(sentence)
# Perform POS tagging
tagged_tokens = nltk.pos_tag(tokens)
# Apply chunk parsing
parse_tree = chunk_parser.parse(tagged_tokens)
# Display parse tree
parse_tree.pretty_print()

Output:
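As a small follow-up sketch (continuing with the parse_tree variable built above), the chunks of interest can be pulled out of the tree rather than only pretty-printed:

# Print only the NP chunks found by the chunk parser above
for subtree in parse_tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(" ".join(word for word, tag in subtree.leaves()))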
Practical No 5

Aim: Write a Python code to find the term frequency and inverse document frequency for
three documents. (Consider the 3 documents as 3 paragraphs.)

Source Code:
import math
from collections import Counter
def calculate_tf(text):
    words = text.lower().split()
    word_count = len(words)
    word_freq = Counter(words)
    tf = {word: freq / word_count for word, freq in word_freq.items()}
    return tf

def calculate_idf(documents):
    total_docs = len(documents)
    idf = {}
    all_words = [word for document in documents for word in set(document.lower().split())]
    for word in all_words:
        doc_count = sum([1 for document in documents if word in document.lower().split()])
        idf[word] = math.log(total_docs / (1 + doc_count))
    return idf
# Example documents
document1 = "This is the first document. It contains words to analyze term
frequency and inverse document frequency."
document2 = "The second document has some overlapping words with the first document
but also includes unique terms."
document3 = "Finally, the third document is shorter and has fewer words compared to
the other two documents."
documents = [document1, document2, document3]
# Calculate TF for each document
tf_documents = [calculate_tf(document) for document in documents]
# Calculate IDF for all documents
idf = calculate_idf(documents)
print("Term Frequency (TF) for each document:")
for i, tf_doc in enumerate(tf_documents, start=1):
    print(f"Document {i}: {tf_doc}")

print("\nInverse Document Frequency (IDF) for all words:")
for word, idf_value in idf.items():
    print(f"{word}: {idf_value}")
Output:
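The program computes TF and IDF separately; a short extension sketch (continuing with the tf_documents and idf variables above) multiplies them to get the usual TF-IDF score per word per document:

# TF-IDF = TF(word, document) * IDF(word)
for i, tf_doc in enumerate(tf_documents, start=1):
    tfidf_doc = {word: tf * idf.get(word, 0.0) for word, tf in tf_doc.items()}
    print(f"TF-IDF for Document {i}: {tfidf_doc}")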
Practical No 6

Aim: Implement a Python code to remove stop words and identify parts of speech for a
given paragraph.

Source Code:
import nltk
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk import pos_tag
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    word_tokens = word_tokenize(text)
    filtered_text = [word for word in word_tokens if word.lower() not in stop_words]
    return ' '.join(filtered_text)

def identify_pos(text):
    sentences = sent_tokenize(text)
    tagged_sentences = [pos_tag(word_tokenize(sentence)) for sentence in sentences]
    return tagged_sentences
# Example paragraph
paragraph = """
Natural Language Processing (NLP) is a subfield of artificial
intelligence concerned with the interaction between computers and
humans in natural language. It focuses on the interaction between
computers and humans in the natural language and it is a field at the
intersection of computer science, artificial intelligence, and
computational linguistics.
"""
print("paragraph: ", paragraph)
# Remove stop words
paragraph_without_stopwords = remove_stopwords(paragraph)
print("Paragraph without stopwords:")
print(paragraph_without_stopwords)
# Identify Parts of Speech
tagged_sentences = identify_pos(paragraph)
print("\nParts of speech:")
for sentence in tagged_sentences:
    print(sentence)
Output:
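Optionally (a sketch reusing the functions defined above), the stop-word-free text can be tagged as well, to compare against the tags of the full paragraph:

# POS-tag the paragraph after stop-word removal
for sentence in identify_pos(paragraph_without_stopwords):
    print(sentence)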
Practical No 7

Aim: Find the probability for a given sentence, where all the words present in the sentence
must be in toy_pcfg1 or toy_pcfg2, using Viterbi PCFG parsing.
Source Code:

toy_pcfg1 = {
'S': [(('NP', 'VP'), 0.9), (('VP',), 0.1)],
'NP': [(('Det', 'N'), 0.8), (('N',), 0.2)],
'VP': [(('V', 'NP'), 1.0)],
'Det': [(('the',), 0.6), (('a',), 0.4)],
'N': [(('cat',), 0.5), (('dog',), 0.5)],
'V': [(('chased',), 1.0)]
}

toy_pcfg2 = {
'S': [(('NP', 'VP'), 1.0)],
'NP': [(('Det', 'N'), 1.0)],
'VP': [(('V', 'NP'), 0.5), (('V',), 0.5)],
'Det': [(('a',), 1.0)],
'N': [(('mouse',), 1.0)],
'V': [(('slept',), 0.5), (('ran',), 0.5)]
}

def get_terminals(pcfg):
    # A symbol is terminal if it never appears on the left-hand side of a production
    terminals = set()
    for productions in pcfg.values():
        for rhs, _ in productions:
            for symbol in rhs:
                if symbol not in pcfg:
                    terminals.add(symbol)
    return terminals

def can_produce(sentence, pcfg1, pcfg2):
    terminals1 = get_terminals(pcfg1)
    terminals2 = get_terminals(pcfg2)
    print(terminals1)
    print(terminals2)
    all_terminals = terminals1.union(terminals2)
    return all(word in all_terminals for word in sentence.split())

sentence = "the cat chased a dog"

if can_produce(sentence, toy_pcfg1, toy_pcfg2):
    simulated_parse = "(S (NP (Det the) (N cat)) (VP (V chased) (NP (Det a) (N dog))))"
    simulated_probability = 0.1
    print(f"Sentence: \"{sentence}\"")
    print("Simulated Parse:", simulated_parse)
    print("Simulated Probability:", simulated_probability)
else:
    print("The sentence cannot be produced by the given PCFGs.")

Output:
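The parse and probability above are simulated rather than computed. NLTK itself ships toy_pcfg1 and toy_pcfg2 as PCFG objects in nltk.grammar, along with a ViterbiParser; a minimal sketch of real Viterbi PCFG parsing, assuming the example sentence's words are covered by toy_pcfg1:

from nltk.grammar import toy_pcfg1
from nltk.parse import ViterbiParser

sent = "I saw the man with my telescope".split()
parser = ViterbiParser(toy_pcfg1)
for tree in parser.parse(sent):
    print(tree)           # most probable parse
    print(tree.prob())    # probability of that parse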
Practical No 8

Aim: Given two words, calculate the similarity between the words
a. By using Path Similarity
b. By using Wu-Palmer Similarity

Source Code:
import nltk

from nltk.corpus import wordnet as wn

def calculate_similarities(word1, word2):
    synsets1 = wn.synsets(word1)
    synsets2 = wn.synsets(word2)

    if not synsets1 or not synsets2:
        print(f"No synsets found for one of the words: {word1}, {word2}")
        return

    synset1 = synsets1[0]
    synset2 = synsets2[0]

    path_similarity = synset1.path_similarity(synset2)
    wup_similarity = synset1.wup_similarity(synset2)

    print(f"Path Similarity between '{word1}' and '{word2}':", path_similarity)
    print(f"Wu-Palmer Similarity between '{word1}' and '{word2}':", wup_similarity)

word1 = "dog"
word2 = "cat"

calculate_similarities(word1, word2)

Output:
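Note that the program takes the first synset of each word, which may not be the intended sense. A short sketch with explicitly chosen senses (dog.n.01 and cat.n.01, standard WordNet synset names) avoids that ambiguity:

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print("Path Similarity:", dog.path_similarity(cat))
print("Wu-Palmer Similarity:", dog.wup_similarity(cat))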
Practical No 9

Aim: Consider a sentence and do the following


a. Import the libraries.
b. Then apply word tokenization and part-of-speech tagging to the sentence.
c. Create a chunk parser and test it on the sentence.
d. Identify nationalities or religious or political groups, organizations, dates and money
in the given sentence (select the sentence appropriately).
Source Code:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags
from nltk.chunk.regexp import RegexpParser
from nltk import ne_chunk

sentence = "In August, India and Microsoft plan to address the issue of climate
change and alloted $5000000 for it."

tokens = word_tokenize(sentence)

tagged = pos_tag(tokens)

chunk_pattern = r"NP: {<DT>?<JJ>*<NN>}"


cp = RegexpParser(chunk_pattern)
chunked = cp.parse(tagged)
ne_chunked = ne_chunk(tagged)

print("POS Tags:", tagged)


chunked.pretty_print()

iob_tagged = tree2conlltags(ne_chunked)
print("IOB Tags:", iob_tagged)

# Note: NLTK's ne_chunk labels countries as GPE; there is no separate COUNTRY label
entities = ['GPE', 'ORGANIZATION', 'DATE', 'MONEY']

for token, pos, entity in iob_tagged:
    # IOB tags look like 'B-GPE' / 'I-GPE', so strip the B-/I- prefix before matching
    if entity != 'O' and entity.split('-')[-1] in entities:
        print(f"{entity.split('-')[-1]}: {token}")
Output:
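An alternative sketch (continuing with the ne_chunked tree built above) reads the entities straight off the tree instead of going through the IOB triples:

# Each named-entity subtree carries its label (GPE, ORGANIZATION, MONEY, ...)
for subtree in ne_chunked.subtrees(filter=lambda t: t.label() != 'S'):
    print(subtree.label(), ":", " ".join(word for word, tag in subtree.leaves()))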
Practical No 10

Aim: Write down the syntax for the following:


a. Import WordNet and use the term "hello" to find synsets.
b. Using the synset, find the element at the 0th index, just the word (using lemmas).
c. Name and definition of that first (0th index) synset and examples of the word.
d. Discern synonyms and antonyms in the synset.
e. Discern hypernyms and hyponyms in the synset.

Source Code:
import nltk

from nltk.corpus import wordnet as wn

hello_synsets = wn.synsets('hello')
print("All synsets for 'hello':", hello_synsets)

first_synset = hello_synsets[0]
print("\nFirst Synset:", first_synset)

first_lemma = first_synset.lemmas()[0].name()
print("First lemma name of the 0th Synset:", first_lemma)

synset_name = first_synset.name()
synset_definition = first_synset.definition()
synset_examples = first_synset.examples()
print("\nName of the 0th Synset:", synset_name)
print("Definition of the 0th Synset:", synset_definition)
print("Examples of the 0th Synset:", synset_examples)

synonyms = [lemma.name() for lemma in first_synset.lemmas()]
antonyms = [ant.name() for lemma in first_synset.lemmas() for ant in lemma.antonyms()]
print("\nSynonyms in Synset:", synonyms)
print("Antonyms in Synset:", antonyms)

hypernyms = first_synset.hypernyms()
hyponyms = first_synset.hyponyms()
print("\nHypernyms of the 0th Synset:", hypernyms)
print("Hyponyms of the 0th Synset:", hyponyms)
Output:
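The lemmas of the first "hello" synset carry no antonyms in WordNet, so the antonym list above comes out empty. A short sketch with the adjective sense good.a.01 (a standard WordNet example, reusing the wn import above) shows a non-empty case:

good = wn.synset('good.a.01')
print("Antonyms of 'good':", [ant.name() for lemma in good.lemmas() for ant in lemma.antonyms()])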
