All Practicals

Practicals of the AIML branch

Practical 1 : C Program Code

Task:
Write a C program to:

Read a sample .TXT file and print it as-is on the screen

Count the number of words / tokens in the file

Count the number of unique words / tokens in the file

Count the occurrence frequency of a specific word token (e.g. "AAB")

Count the occurrence frequency of all the unique words / tokens in the file

Submitted By: AI4116 Ayesha Shaikh

Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUFFER_SIZE 1000

int countOccurrences(FILE *fptr, const char *word);

int main()
{
    FILE *fptr;
    char path[1000];
    char word[1000];
    int wCount;

    printf("Enter file path: ");
    scanf("%s", path);

    printf("Enter word to search in file: ");
    scanf("%s", word);

    fptr = fopen(path, "r");
    if (fptr == NULL)
    {
        printf("Unable to open file.\n");
        printf("Please check you have read/write privileges.\n");
        exit(EXIT_FAILURE);
    }

    wCount = countOccurrences(fptr, word);
    printf("'%s' is found %d times in file.", word, wCount);

    fclose(fptr);
    return 0;
}

/* Count how many times 'word' occurs as a substring in the file. */
int countOccurrences(FILE *fptr, const char *word)
{
    char str[BUFFER_SIZE];
    char *pos;
    int index, count;

    count = 0;
    while (fgets(str, BUFFER_SIZE, fptr) != NULL)
    {
        index = 0;
        while ((pos = strstr(str + index, word)) != NULL)
        {
            index = (pos - str) + 1;
            count++;
        }
    }
    return count;
}

#include <stdio.h>
#include <string.h>

int main()
{
    FILE *filePointer;
    char dataToBeRead[1000];

    filePointer = fopen("cyber.txt", "r");
    if (filePointer == NULL) {
        printf("cyber file failed to open.");
    }
    else {
        printf("The file is now opened.\n");
        printf("---------------------------\n");

        /* Print the file contents line by line. */
        while (fgets(dataToBeRead, 1000, filePointer) != NULL) {
            printf("%s", dataToBeRead);
        }
        fclose(filePointer);
    }
    return 0;
}

Conclusion: The program successfully displayed the content of the sample .TXT file, provided
the counts of total words and unique words, reported the frequency of the specified word
token ("AAB"), and presented the occurrence frequency of all unique words in the file.

Date of Performance Date of Assessment Remark and Sign

08/08/23 22/08/23
Practical 2 : Text Preprocessing
Task:
Take any arbitrary string and perform the following tasks on it:

Identify the language of the string

Count the length of the string

Count the number of tokens in the string (using the split function and a word tokenizer)

Count the unique tokens in the string

Take any news corpus and pre-process it (all needed functionality: case normalization, contraction expansion, punctuation removal, stop-word removal)

Calculate the term frequency for each term in the news corpus. (Does it point to the topic of the corpus?)

Show any one example illustrating the lexical richness of the text.

Submitted By: AI4116 Ayesha Shaikh

Code
# pip install langdetect nltk
import nltk

nltk.download('punkt')
nltk.download('stopwords')
!pip install langdetect

from langdetect import detect


from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string
import nltk

text = "Hi Darshan How are you ?"


text2 = "हाय आप कैसे हैं?"

# 1. Identify the language of it.


def identify_language(text):
lang = detect(text)
return lang

print("Language:", identify_language(text))
print("Language:", identify_language(text2))

Language: en
Language: hi

# 2. Count the length of the string


def count_length(text):
return len(text)

print(count_length(text))

24

# 3. Count the number of tokens in the string (using split function and word tokenizer).

def count_tokens(text):
tokens = word_tokenize(text)
return len(tokens)
print(count_tokens(text))
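The task asks for the token count using both the split function and a word tokenizer; a minimal comparison sketch (assuming the same text variable defined above; the helper names are illustrative additions, not part of the original code) is:

def count_tokens_split(text):
    # whitespace-based tokenization
    return len(text.split())

def count_tokens_word_tokenize(text):
    # punctuation-aware tokenization from NLTK
    return len(word_tokenize(text))

print("split():", count_tokens_split(text))
print("word_tokenize():", count_tokens_word_tokenize(text))

For the sample text both counts are 6 because the '?' is already space-separated; on text such as "Hi, how are you?" the two counts would differ.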

# 4. Count the unique tokens in the string.


def count_unique_tokens(text):
tokens = word_tokenize(text)
unique_tokens = set(tokens)
return len(unique_tokens)

print(count_unique_tokens(text))

def preprocess_corpus(corpus):
# Tokenize and remove stopwords and punctuation
tokens = word_tokenize(corpus)
tokens = [token.lower() for token in tokens if token.isalpha()]
tokens = [token for token in tokens if token not in
stopwords.words('english')]
return tokens

def calculate_term_frequency(tokens):
freq_dist = FreqDist(tokens)
term_freq = {word: freq for word, freq in freq_dist.items()}
return term_freq

# Step 5: Preprocess news corpus


news_corpus = "This is a sample news article. It contains some words that
we will preprocess."
preprocessed_tokens = preprocess_corpus(news_corpus)

# Step 6: Calculate term frequency


term_frequency = calculate_term_frequency(preprocessed_tokens)
print("Term Frequency:", term_frequency)

# Step 7: Lexical richness example


lexical_richness_example = "The quick brown fox jumps over the lazy dog. This is a simple sentence."
tokens_lr = word_tokenize(lexical_richness_example)
lexical_richness = len(set(tokens_lr)) / len(tokens_lr)
print("Lexical Richness:", lexical_richness)

Term Frequency: {'sample': 1, 'news': 1, 'article': 1, 'contains': 1, 'words': 1, 'preprocess': 1}
Lexical Richness: 0.9375

contractions = {
"don't": "do not",
"doesn't": "does not",
"can't": "cannot",
"won't": "will not",
"haven't": "have not",
"hasn't": "has not",
"couldn't": "could not",
"shouldn't": "should not",
"wouldn't": "would not",
"it's": "it is",
"I'm": "I am",
"you're": "you are",
"they're": "they are",
"we're": "we are"
}

def expand_contractions(text):
words = text.split()
expanded_words = []
for word in words:
if word.lower() in contractions:
expanded_words.extend(contractions[word.lower()].split())
else:
expanded_words.append(word)
expanded_text = ' '.join(expanded_words)
return expanded_text

contraction_text = "I can't believe they're here. It's a nice day."

expanded_text = expand_contractions(contraction_text)
print("Expanded Text:", expanded_text)

Expanded Text: I cannot believe they are here. it is a nice day.

Conclusion: In conclusion, through comprehensive text preprocessing and analysis, including


language identification, length and token counting, unique token identification, and term
frequency calculation in a news corpus, the practical demonstrates a systematic approach to
understanding and extracting valuable insights from textual data, showcasing the importance of
effective preprocessing for subsequent tasks such as topic identification and lexical richness
assessment.

Date of Performance Date of Assessment Remark and Sign

22/08/23 29/08/23
Practical 3 : WordNet for Synonym and Antonym Detection
Task:
Find the synonym/antonym of a word using WordNet.

Illustrate the difference between stemming and lemmatizing

Submitted By: AI4116 Ayesha Shaikh

Code
!pip install nltk spacy
!python -m spacy download en_core_web_sm

import nltk
nltk.download('wordnet')

from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import spacy

words = ["running", "flies", "better","Unhappiness ","Teacher ",


"happier","slowness","friendly", "jumps","Cats","Swimming "]

# NLTK Stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")

# NLTK Lemmatizer
lemmatizer = WordNetLemmatizer()

# spaCy Lemmatizer
nlp = spacy.load("en_core_web_sm")

print("Actual words:" , words)


# Stemming
print("Porter Stemmer:", [porter.stem(word) for word in words])
print("Snowball Stemmer:", [snowball.stem(word) for word in words])
# Lemmatization
print("WordNet Lemmatizer:", [lemmatizer.lemmatize(word) for word in
words])
# spaCy Lemmatization
lemmatized_words_spacy = [token.lemma_ for token in nlp(" ".join(words))]
print("spaCy Lemmatization:",lemmatized_words_spacy)
from nltk.corpus import wordnet
word = "happy"
# word = "खुश"
# Get synsets (sets of synonyms) for the word
synsets = wordnet.synsets(word)

if synsets:
print("Synonyms:")
for synset in synsets:
synonyms = [lemma.name() for lemma in synset.lemmas()]
print(", ".join(synonyms))

# Get antonyms from the first synset


antonyms = [lemma.antonyms()[0].name() for lemma in
synsets[0].lemmas() if lemma.antonyms()]
if antonyms:
print("Antonyms:", ", ".join(antonyms))
else:
print("No antonyms found.")
else:
print("No synsets found for the word.")

Synonyms:
happy
felicitous, happy
glad, happy
happy, well-chosen
Antonyms: unhappy

Conclusion: In conclusion, WordNet proves to be a valuable resource for synonym and


antonym detection, offering a rich lexical database, while the difference between stemming
and lemmatizing lies in their approaches to reducing words to their base or root forms, with
stemming being more aggressive in its truncation.
Date of Performance Date of Assessment Remark and Sign

29/08/23 26/09/23

Practical 4 : N-gram Model


Task:
N Gram Model: Identify probability of next word occurrence using Bi-Gram Model

Submitted By: AI4116 Ayesha Shaikh

Code
def readData():
data = ['This is a dog', 'This is a cat', 'I love my cat', 'This is my name ']
dat=[]
for i in range(len(data)):
for word in data[i].split():
dat.append(word)
print(dat)
return dat

def createBigram(data):
listOfBigrams = []
bigramCounts = {}
unigramCounts = {}
for i in range(len(data)-1):
if i < len(data) - 1 and data[i+1].islower():

listOfBigrams.append((data[i], data[i + 1]))

if (data[i], data[i+1]) in bigramCounts:


bigramCounts[(data[i], data[i + 1])] += 1
else:
bigramCounts[(data[i], data[i + 1])] = 1
if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1
return listOfBigrams, unigramCounts, bigramCounts

def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):


listOfProb = {}
for bigram in listOfBigrams:
word1 = bigram[0]
word2 = bigram[1]
listOfProb[bigram] = bigramCounts.get(bigram) / unigramCounts.get(word1)
return listOfProb

if __name__ == '__main__':
data = readData()
#data = ['this','is','my','cat']
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)

print("\n All the possible Bigrams are ")


print(listOfBigrams)

print("\n Bigrams along with their frequency ")


print(bigramCounts)

print("\n Unigrams along with their frequency ")


print(unigramCounts)

bigramProb = calcBigramProb(listOfBigrams, unigramCounts,


bigramCounts)

print("\n Bigrams along with their probability ")


print(bigramProb)
inputList="This is a cat"
splt=inputList.split()
outputProb1 = 1
bilist=[]
bigrm=[]

for i in range(len(splt) - 1):


if i < len(splt) - 1:

bilist.append((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")


print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:

outputProb1 *= bigramProb[bilist[i]]
else:

outputProb1 *= 0
print('\n' + 'Probability of sentence "' + inputList + '" = ' + str(outputProb1))

def readData():
data = ['there is a big garden', 'children play in a garden', 'they play inside beautiful garden']
dat=[]
for i in range(len(data)):
for word in data[i].split():
dat.append(word)
print(dat)
return dat

if __name__ == '__main__':
data = readData()
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)

print("\n All the possible Bigrams are ")


print(listOfBigrams)

print("\n Bigrams along with their frequency ")


print(bigramCounts)

print("\n Unigrams along with their frequency ")


print(unigramCounts)
bigramProb = calcBigramProb(listOfBigrams, unigramCounts,
bigramCounts)

print("\n Bigrams along with their probability ")


print(bigramProb)
inputList="they play in a big garden"
splt=inputList.split()
outputProb1 = 1
bilist=[]
bigrm=[]

for i in range(len(splt) - 1):


if i < len(splt) - 1:

bilist.append((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")


print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:

outputProb1 *= bigramProb[bilist[i]]
else:

outputProb1 *= 0
print('\n' + 'Probability of sentence "' + inputList + '" = ' + str(outputProb1))

Task - Create any simple application using n-gram


import nltk
nltk.download('punkt')

def createTrigram(data):
listOfTrigrams = []
trigramCounts = {}
bigramCounts = {}
unigramCounts = {}

for i in range(len(data) - 2):


if i < len(data) - 2 and data[i + 1].islower():
listOfTrigrams.append((data[i], data[i + 1], data[i + 2]))
if (data[i], data[i + 1], data[i + 2]) in trigramCounts:
trigramCounts[(data[i], data[i + 1], data[i + 2])] += 1
else:
trigramCounts[(data[i], data[i + 1], data[i + 2])] = 1

bigram = (data[i], data[i + 1])


if bigram in bigramCounts:
bigramCounts[bigram] += 1
else:
bigramCounts[bigram] = 1

if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1

return listOfTrigrams, unigramCounts, bigramCounts, trigramCounts

# Example data
data = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
"dog"]

# Call the function


listOfTrigrams, unigramCounts, bigramCounts, trigramCounts = createTrigram(data)

print("List of Trigrams:", listOfTrigrams)


print("Unigram Counts:", unigramCounts)
print("Bigram Counts:", bigramCounts)
print("Trigram Counts:", trigramCounts)

List of Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'),
('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over',
'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Unigram Counts: {'The': 1, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1,
'over': 1, 'the': 1}
Bigram Counts: {('The', 'quick'): 1, ('quick', 'brown'): 1, ('brown',
'fox'): 1, ('fox', 'jumps'): 1, ('jumps', 'over'): 1, ('over', 'the'): 1,
('the', 'lazy'): 1}
Trigram Counts: {('The', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'):
1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps',
'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1}
Conclusion: The Bi-Gram Model effectively estimates the probability of the next word
occurrence based on the preceding word, providing a simple yet practical approach to language
modeling.

Date of Performance Date of Assessment Remark and Sign

26/09/23 03/10/23

Practical 5 : Word Semantics: One-hot Encoding, TD Matrix, TF-IDF, Word2Vec, PPMI
Task:
Write a Python code for the following task:

Take any text corpus and calculate one hot encoded vector, calculate TD matrix, TF-IDF for some token
terms, PPMI for finding corresponding word of a given word, use Word2Vec for word embedding.

Submitted By: AI4116 Ayesha Shaikh

Code
from sklearn.feature_extraction.text import TfidfVectorizer

d1= "data science is one of the most important fields of science"


d2= "this is one of the best data science courses"
d3="data scientists analyze data"

doc_corpus=[d1,d2,d3]
print(doc_corpus)
vec=TfidfVectorizer(stop_words='english')
matrix=vec.fit_transform(doc_corpus)
print("Feature Names n",vec.get_feature_names_out())
print("Sparse Matrix n",matrix.shape,"n",matrix.toarray())
import pandas as pd
import numpy as np
corpus = ['data science is one of the most important fields of science',
'this is one of the best data science courses',
'data scientists analyze data' ]
# create a word set for the corpus
words_set = set()
for doc in corpus:
words = doc.split(' ')
words_set = words_set.union(set(words))

print('Number of words in the corpus:',len(words_set))


print('The words in the corpus: \n', words_set)
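One-hot encoding is listed in the task but not shown in the original code; a minimal sketch built on the words_set just created (the vector layout and helper name are illustrative assumptions) is:

# fix an ordering of the vocabulary, then emit a 0/1 vector per word
words_list = sorted(words_set)
word_to_index = {w: i for i, w in enumerate(words_list)}

def one_hot(word):
    vec = [0] * len(words_list)
    vec[word_to_index[word]] = 1
    return vec

print("One-hot vector for 'data':", one_hot('data'))
print("One-hot vector for 'science':", one_hot('science'))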

Computing Term Frequency

from sklearn.feature_extraction.text import TfidfVectorizer


# assign documents
d0= "data science is one of the most important fields of science"
d1= "this is one of the best data science courses"
d2="data scientists analyze data"
# merge documents into a single corpus
string = [d0, d1, d2]
# create object
tfidf = TfidfVectorizer()
# get tf-idf values
result = tfidf.fit_transform(string)

# get idf values


print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values


print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())
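For the Word2Vec part of the task, a small sketch using gensim is given below. The library choice and hyperparameters are assumptions added for illustration (they are not part of the original code), and older gensim releases use size instead of vector_size.

# pip install gensim
from gensim.models import Word2Vec

# tokenize each document of the toy corpus into lowercase words
sentences = [doc.lower().split() for doc in [d0, d1, d2]]

# train a tiny skip-gram model; the corpus is far too small for meaningful embeddings,
# but it shows the API: per-word vectors and nearest-neighbour queries
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

print("Embedding for 'data' (first 10 dims):", model.wv['data'][:10])
print("Words most similar to 'data':", model.wv.most_similar('data', topn=3))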
Conclusion: In conclusion, the implemented Python code successfully demonstrated various
word vectorization techniques, including one-hot encoding, term-document matrix (TD), term
frequency-inverse document frequency (TF-IDF), positive pointwise mutual information (PPMI),
and Word2Vec, showcasing the versatility of these methods in capturing semantic relationships
and contextual information within a given text corpus.

Date of Performance Date of Assessment Remark and Sign

3/10/23 17/10/23

Practical 6 : Sentiment Detection of Sentence


Task:
Write a Python code to identify the sentiment of a sentence. Implement the task using at least TWO DIFFERENT APPROACHES and compare the performance of both.

Submitted By: AI4116 Ayesha Shaikh

Code
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

def classify_sentiment(sentence):
analyzer = SentimentIntensityAnalyzer()
sentiment_scores = analyzer.polarity_scores(sentence)

# Determine sentiment based on compound score


compound_score = sentiment_scores['compound']

if compound_score >= 0.05:


return "Positive"
elif compound_score <= -0.05:
return "Negative"
else:
return "Neutral"
# Example usage:
sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a terrible experience."


sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a neutral sentence."


sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is not a good movie."


sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

import nltk
nltk.download('opinion_lexicon')

# Define positive and negative word dictionaries from NLTK


from nltk.corpus import opinion_lexicon

positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def preprocess_sentence(sentence):
# Convert the sentence to lowercase and split it into words
words = sentence.lower().split()
return words

def classify_sentiment(sentence):
words = preprocess_sentence(sentence)

# Initialize sentiment score


sentiment_score = 0
for word in words:
if word in positive_words:
sentiment_score += 1
elif word in negative_words:
sentiment_score -= 1
print("Current word :: ", word , " Sentiment score :: ",
sentiment_score)
# Classify the sentiment based on the score
if sentiment_score > 0:
return "Positive"
elif sentiment_score < 0:
return "Negative"
else:
return "Neutral"

# Example usage:
'''sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a terrible experience."


sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a neutral sentence."


sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
'''
sentence = "This is not a good movie."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

Current word :: this Sentiment score :: 0


Current word :: is Sentiment score :: 0
Current word :: not Sentiment score :: 0
Current word :: a Sentiment score :: 0
Current word :: good Sentiment score :: 1
Current word :: movie. Sentiment score :: 1
Sentence sentiment: Positive
[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data] Package opinion_lexicon is already up-to-date!
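To compare the two approaches quantitatively, a small sketch is given below. The labelled test sentences are illustrative assumptions, and the two classifiers are compact restatements of the VADER-based and opinion-lexicon-based functions above, renamed so they can coexist in one cell.

# a tiny hand-labelled test set (illustrative only)
labelled = [
    ("I love this product, it's amazing!", "Positive"),
    ("This is a terrible experience.", "Negative"),
    ("This is a neutral sentence.", "Neutral"),
    ("This is not a good movie.", "Negative"),
]

def vader_sentiment(sentence):
    score = SentimentIntensityAnalyzer().polarity_scores(sentence)['compound']
    return "Positive" if score >= 0.05 else "Negative" if score <= -0.05 else "Neutral"

def lexicon_sentiment(sentence):
    score = sum((w in positive_words) - (w in negative_words) for w in sentence.lower().split())
    return "Positive" if score > 0 else "Negative" if score < 0 else "Neutral"

for name, clf in [("VADER", vader_sentiment), ("Opinion lexicon", lexicon_sentiment)]:
    correct = sum(clf(s) == label for s, label in labelled)
    print(name, "accuracy:", correct, "/", len(labelled))

On negation-heavy examples VADER tends to score higher, since the plain word-counting approach misses "not a good movie" (as the trace above shows).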

Conclusion: The sentiment detection task was implemented using two different
approaches, and their performance was compared, revealing insights into the
effectiveness of each method for accurately identifying the sentiment of a given
sentence.

Date of Performance Date of Assessment Remark and Sign

03/10/23 17/10/23

Practical 7 : POS Tagging (HMM) + (NLTK)


Task:
Task 1: Write a code in Python (using a ready-made function) to input some text from the user and identify the POS tag of each token in it.

Task 2: How can an HMM be used for POS tagging? Illustrate Python code for Transition Probability and Emission Probability calculation.

Submitted By: AI4116 Ayesha Shaikh

Code

# Use a simple NLTK POS tagger (any ready-made function) to identify the POS tags of the input sentence.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

stop_words = set(stopwords.words('english'))
# Dummy text

txt1 = "She is writing code."

# sent_tokenize is an instance of PunktSentenceTokenizer from the nltk.tokenize.punkt module

tokenized = sent_tokenize(txt1)
for i in tokenized:

# Word tokenizers is used to find the words


# and punctuation in a string
wordsList = nltk.word_tokenize(i)

# removing stop words from wordList


wordsList = [w for w in wordsList if not w in stop_words]

# Using a Tagger. Which is part-of-speech


# tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)

print(tagged)

# Importing libraries
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
import pprint, time

#download the treebank corpus from nltk


nltk.download('treebank')

#download the universal tagset from nltk


nltk.download('universal_tagset')

# reading the Treebank tagged sentences


nltk_data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))

#print the first two sentences along with tags


print(nltk_data[:2])
#print each word with its respective tag for first two sentences
for sent in nltk_data[:2]:
for tuple in sent:
print(tuple)

# split data into training and validation set in the ratio 80:20
train_set,test_set
=train_test_split(nltk_data,train_size=0.80,test_size=0.20,random_state =
101)

# create list of train and test tagged words


train_tagged_words = [ tup for sent in train_set for tup in sent ]
test_tagged_words = [ tup for sent in test_set for tup in sent ]
print(len(train_tagged_words))
print(len(test_tagged_words))

# check some of the tagged words.


train_tagged_words[:5]

# use the set datatype to check how many unique tags are present in the training data
tags = {tag for word,tag in train_tagged_words}
print(len(tags))
print(tags)

# check total words in vocabulary


vocab = {word for word,tag in train_tagged_words}

# compute Emission Probability


def word_given_tag(word, tag, train_bag=train_tagged_words):
    tag_list = [pair for pair in train_bag if pair[1] == tag]
    count_tag = len(tag_list)  # total number of times the passed tag occurred in train_bag
    # now calculate the total number of times the passed word occurred as the passed tag
    w_given_tag_list = [pair[0] for pair in tag_list if pair[0] == word]
    count_w_given_tag = len(w_given_tag_list)

    return (count_w_given_tag, count_tag)
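As a quick sanity check (an added illustration; the exact numbers depend on the random train/test split), the helper can be called directly to estimate an emission probability:

count_w, count_t = word_given_tag('the', 'DET')
print("P('the' | DET) is approximately", count_w / count_t)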

# compute Transition Probability


def t2_given_t1(t2, t1, train_bag = train_tagged_words):
tags = [pair[1] for pair in train_bag]
count_t1 = len([t for t in tags if t==t1])
count_t2_t1 = 0
for index in range(len(tags)-1):
if tags[index]==t1 and tags[index+1] == t2:
count_t2_t1 += 1
return (count_t2_t1, count_t1)

# creating t x t transition matrix of tags, t= no of tags


# Matrix(i, j) represents P(jth tag after the ith tag)

tags_matrix = np.zeros((len(tags), len(tags)), dtype='float32')


for i, t1 in enumerate(list(tags)):
for j, t2 in enumerate(list(tags)):
tags_matrix[i, j] = t2_given_t1(t2, t1)[0]/t2_given_t1(t2, t1)[1]

print(tags_matrix)

# convert the matrix to a df for better readability


#the table is same as the transition table shown in section 3 of article
tags_df = pd.DataFrame(tags_matrix, columns = list(tags),
index=list(tags))
display(tags_df)

Sample output (truncated): tags_df is a 12 x 12 transition-probability matrix over the universal tagset (DET, ADV, ADJ, CONJ, NUM, PRON, NOUN, VERB, ADP, PRT, '.', X), where cell (i, j) holds P(tag j follows tag i); the notebook output displays the first few rows (DET, ADV, ADJ, CONJ).
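To show how these transition and emission probabilities are actually used to tag a sentence, the sketch below adds a simplified (greedy) Viterbi-style decoder. It is an illustration rather than part of the original submission: it assumes word_given_tag, tags_df, tags and train_tagged_words are already in memory, and it approximates the start-of-sentence transition with the row for the '.' tag.

def viterbi_greedy(words, train_bag=train_tagged_words):
    state = []
    T = list(tags)
    for key, word in enumerate(words):
        p = []
        for tag in T:
            # transition probability from the previously chosen tag (or '.' for the first word)
            if key == 0:
                transition_p = tags_df.loc['.', tag]
            else:
                transition_p = tags_df.loc[state[-1], tag]
            # emission probability P(word | tag) from the training counts
            count_w_tag, count_tag = word_given_tag(word, tag)
            emission_p = count_w_tag / count_tag
            p.append(emission_p * transition_p)
        state.append(T[p.index(max(p))])
    return list(zip(words, state))

print(viterbi_greedy(['She', 'is', 'writing', 'code', '.']))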
Conclusion: In conclusion, the provided Python code utilizing NLTK allows users to input text
and obtain the corresponding Part-of-Speech (POS) tags for each token, while Hidden Markov
Models (HMM) can be employed for POS tagging through the calculation of Transition
Probability and Emission Probability in a systematic and illustrative manner.

Date of Performance Date of Assessment Remark and Sign

17/10/23 21/11/23

Practical 8 : CRF POS Tagging + LSTM POS Tagging


Task:
Write a Python code to assign a POS tag to the input stream of words using CRF as well as LSTM. Compare the performance of both taggers.

Submitted By: AI4116 Ayesha Shaikh

Code:
pip install python-crfsuite

import nltk
nltk.download('treebank')

import pycrfsuite
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split

# Load the Penn Treebank dataset


tagged_sentences = treebank.tagged_sents()

# Feature extraction function


def word2features(sent, i):
word = sent[i][0]
features = {
'word': word,
'is_first': i == 0,
'is_last': i == len(sent) - 1,
'prev_word': '' if i == 0 else sent[i - 1][0],
'next_word': '' if i == len(sent) - 1 else sent[i + 1][0]
}
return features

def sent2features(sent):
return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
return [label for word, label in sent]

# Prepare the data


X = [sent2features(sent) for sent in tagged_sentences]
y = [sent2labels(sent) for sent in tagged_sentences]

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train the CRF model


trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
trainer.append(xseq, yseq)

trainer.set_params({
'c1': 1.0,
'c2': 1e-3,
'max_iterations': 50,
'feature.possible_transitions': True
})

model_filename = 'pos_tagger_model.crfsuite'
trainer.train(model_filename)

# Test the model


tagger = pycrfsuite.Tagger()
tagger.open(model_filename)

# Tag a sentence
example_sentence = treebank.sents()[0]
features = sent2features(example_sentence)
tags = tagger.tag(features)

for word, tag in zip(example_sentence, tags):


print(f'{word}/{tag}', end=' ')

# Evaluate the model


from sklearn.metrics import classification_report

y_pred = [tagger.tag(xseq) for xseq in X_test]


y_test_flat = [label for label_seq in y_test for label in label_seq]
y_pred_flat = [label for label_seq in y_pred for label in label_seq]

print("\nClassification Report:")
print(classification_report(y_test_flat, y_pred_flat))
This/DT is/VBZ a/DT sample/NNP sentence/NNP for/IN POS/NNP tagging./NNP

import pycrfsuite

# Load the CRF model


model_filename = 'pos_tagger_model.crfsuite'
tagger = pycrfsuite.Tagger()
tagger.open(model_filename)

# Sample sentence
sample_sentence = "This is a sample sentence for POS tagging."

# Tokenize the sample sentence


sample_tokens = sample_sentence.split()

# Extract features for the sample sentence


sample_features = [word2features([(word, '')], 0) for word in
sample_tokens]

# Predict POS tags for the sample sentence


predicted_tags = tagger.tag(sample_features)

# Combine words and predicted tags


tagged_sentence = ' '.join([f'{word}/{tag}' for word, tag in
zip(sample_tokens, predicted_tags)])

# Print the tagged sentence


print(tagged_sentence)
Write Python code for POS tagging using LSTM

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize

# Load the Penn Treebank dataset


tagged_sentences = treebank.tagged_sents()

# Create the vocabulary and encode labels


words = set(word.lower() for sentence in tagged_sentences for word, tag in
sentence)
words = list(words)
words.append("ENDPAD")
n_words = len(words)

tags = set(tag for sentence in tagged_sentences for word, tag in sentence)


n_tags = len(tags)

word2idx = {w: i for i, w in enumerate(words)}


tag2idx = {t: i for i, t in enumerate(tags)}

# Convert sentences and labels to numerical format


X = [[word2idx[word.lower()] for word, tag in sent] for sent in
tagged_sentences]
y = [[tag2idx[tag] for word, tag in sent] for sent in tagged_sentences]

# Padding sequences to a fixed length


max_len = 100
X = pad_sequences(X, maxlen=max_len, padding="post")
y = pad_sequences(y, maxlen=max_len, padding="post")

# Split the data into training and testing sets


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Create and compile the LSTM model


model = Sequential()
model.add(Embedding(input_dim=n_words, output_dim=50,
input_length=max_len))
model.add(LSTM(units=100, return_sequences=True))
model.add(Dense(n_tags, activation="softmax"))

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
metrics=["accuracy"])

# Train the model


model.fit(X_train, y_train, batch_size=32, epochs=5, validation_split=0.1)

# Evaluate the model


loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")

sample_sentence = "This is a sample sentence for POS tagging.".split()


sample_sequence = [word2idx.get(word.lower(), word2idx["ENDPAD"]) for word
in sample_sentence]

# Pad the sequence to match the model's input shape


sample_sequence = pad_sequences([sample_sequence], maxlen=max_len,
padding="post")

predicted_tags = model.predict(sample_sequence)

predicted_tags = [np.argmax(tag) for tag in predicted_tags[0]]


predicted_tags = [list(tag2idx.keys())[list(tag2idx.values()).index(tag)]
for tag in predicted_tags]

for word, tag in zip(sample_sentence, predicted_tags):


print(f"{word}/{tag}", end=" ")

1/1 [==============================] - 0s 444ms/step


This/DT is/DT a/DT sample/DT sentence/NN for/IN POS/NN tagging./NN
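For a rough side-by-side comparison with the CRF classification report, the sketch below computes token-level accuracy of the LSTM on the test split with padded positions excluded. It is an added illustration under two assumptions: that re-running train_test_split on the sentence lengths with the same random_state reproduces the same split, and that model, X_test, y_test, max_len and tagged_sentences are still in memory.

import numpy as np
from sklearn.model_selection import train_test_split

# recover the unpadded sentence lengths for the same test split
lengths = [len(sent) for sent in tagged_sentences]
_, len_test = train_test_split(lengths, test_size=0.2, random_state=42)

# predicted tag index for every position of every test sentence
pred = np.argmax(model.predict(X_test), axis=-1)

correct, total = 0, 0
for i, n in enumerate(len_test):
    n = min(n, max_len)  # sentences longer than max_len were truncated during padding
    correct += int(np.sum(pred[i, :n] == y_test[i, :n]))
    total += n

print("LSTM token-level accuracy (padding excluded):", round(correct / total, 4))

This number can then be set against the weighted-average scores in the CRF classification report above; padded positions are left out because they would otherwise inflate the accuracy reported by model.evaluate.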
Conclusion: The comparison of CRF and LSTM POS tagging reveals variations in performance,
with CRF demonstrating robust sequential labeling, while LSTM exhibits the ability to capture
intricate patterns in the input stream.

Date of Performance Date of Assessment Remark and Sign

17/10/23 21/11/23

Practical 9 : NER (Named Entity Recognition)


Task:
Write a Python code to identify the Named Entities in the input text.

Submitted By: AI4116 Ayesha Shaikh

Code:
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
import spacy
from spacy import displacy
from spacy import tokenizer
nlp = spacy.load('en_core_web_sm')

text =('''Python is an interpreted, high-level and general-purpose


programming language.
Pythons design philosophy emphasizes code readability with
its notable use of significant indentation.
Its language constructs and object-oriented approach aim to
help programmers write clear and
logical code for small and large-scale projects''')
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)

# tokenization
for token in doc:
print(token.text)
# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# now we use the displacy function on doc
displacy.render(doc, style='ent', jupyter=True)
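The entity tuples can also be grouped by label, which is handy when a caller only needs, say, the people or locations. This small sketch is an addition for illustration, reusing the doc object created above:

from collections import defaultdict

# map each entity label (PERSON, GPE, ORG, ...) to the list of entity texts found
entities_by_label = defaultdict(list)
for ent in doc.ents:
    entities_by_label[ent.label_].append(ent.text)

for label, texts in entities_by_label.items():
    print(label, "->", texts)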

# import modules and download packages


import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# process the text and print Named entities


# tokenization
train_text = state_union.raw()
sample_text = state_union.raw("/content/2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

# function
def get_named_entity():
try:
for i in tokenized:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
namedEnt = nltk.ne_chunk(tagged, binary=False)
namedEnt.draw()
except:
pass

get_named_entity()

text =('''HI I am Atharva, I am from Aurangabad, Maharashtra, India.


Currently I am persuing B-tech degree from Deogiri college''')
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)

# tokenization
for token in doc:
print(token.text)
# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# now we use the displacy function on doc
displacy.render(doc, style='ent', jupyter=True)

HI
I
am
Atharva
,
I
am
from
Aurangabad
,
Maharashtra
,
India
.
Currently
I
am
persuing
B
-
tech
degree
from
Deogiri
college
[('Atharva', 8, 15, 'PERSON'), ('Aurangabad', 27, 37, 'GPE'), ('Maharashtra',
39, 50, 'GPE'), ('India', 52, 57, 'GPE'), ('Deogiri', 102, 109, 'ORG')]
HI I am Atharva PERSON , I am from Aurangabad GPE , Maharashtra GPE , India GPE . Currently I am

persuing B-tech degree from Deogiri ORG college

Conclusion: In conclusion, the provided Python code successfully performs Named Entity
Recognition (NER) on input text, accurately identifying and extracting entities such as names,
locations, and organizations.

Date of Performance Date of Assessment Remark and Sign

21/11/23
