0% found this document useful (0 votes)

21 views33 pages

All Practicals

Practicals of AIML branch

Uploaded by

Ayesha Shaikh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

21 views33 pages

All Practicals

Practicals of AIML branch

Uploaded by

Ayesha Shaikh

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

You are on page 1/ 33

Practical 1 : C Program Code

Task:
Write a C Program Code to,

Read a sample .TXT file and print as it is on screen

Count the number of words / tokens in the file.

Count the number of unique words / tokens in the file.

Count the occurrence frequency of any specific word token (e.g. “AAB”)

Count the occurrence frequency of all the unique words / tokens in the file.

Submitted By: AI4116 Ayesha Shaikh

Code:
#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#define BUFFER_SIZE 1000

int countOccurrences(FILE fptr, const char word);

int main()

FILE *fptr;

char path[1000];

char word[1000];

int wCount;
printf("Enter file path: ");

scanf("%s", path);

printf("Enter word to search in file: ");

scanf("%s", word);

fptr = fopen(path, "r");

if (fptr == NULL)

printf("Unable to open file.\n");

printf("Please check you have read/write previleges.\n");

exit(EXIT_FAILURE);

wCount = countOccurrences(fptr, word);

printf("'%s' is found %d times in file.", word, wCount);

fclose(fptr);

return 0;

int countOccurrences(FILE fptr, const char word)

char str[BUFFER_SIZE];
char *pos;

int index, count;

count = 0;

while ((fgets(str, BUFFER_SIZE, fptr)) != NULL)

index = 0;

while ((pos = strstr(str + index, word)) != NULL)

index = (pos - str) + 1;

count++;

return count;

#include <stdio.h>

#include <string.h>

int main()

{
FILE* filePointer;

char dataToBeRead[1000];

filePointer = fopen("cyber.txt", "r");

if (filePointer == NULL) {

printf("cyber file failed to open.");

else {

printf("The file is now opened.\n");

printf("---------------------------\n");

while (fgets(dataToBeRead, 1000, filePointer)

!= NULL) {

printf("%s", dataToBeRead);

fclose(filePointer);

return 0;

Conclusion: The program successfully displayed the content of the sample .TXT file, provided
the counts of total words and unique words, reported the frequency of the specified word
token ("AAB"), and presented the occurrence frequency of all unique words in the file.

Date of Performance Date of Assessment Remark and Sign

08/08/23 22/08/23
Practical 2 : Text Preprocessing
Task:
Take any arbitrary string and perform the following task on it:

Identify the language of it

Count the length of the string

Count the number of tokens in the string (using split function and word tokenizer)

Count the unique tokens in the string

Take any news corpus, Pre-process it (All functionality Needed capitalization, contraction expansion,
punctuation removal, stop words)

Calculate Term Frequency for each term in the news corpus. (Is it pointing to the Topic of the corpus?)

Show any one example illustrating lexical richness of the text.

Submitted By: AI4116 Ayesha Shaikh

Code
# pip install langdetect nltk
import nltk

nltk.download('punkt')
nltk.download('stopwords')
!pip install langdetect

from langdetect import detect

from nltk.tokenize import word_tokenize
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import string
import nltk

text = "Hi Darshan How are you ?"

text2 = "हाय आप कैसे हैं?"

# 1. Identify langauge of it.

def identify_language(text):
lang = detect(text)
return lang

print("Language:", identify_language(text))
print("Language:", identify_language(text2))

Language: en
Language: hi

# 2. Count the length of the string

def count_length(text):
return len(text)

print(count_length(text))

# 3. Count the number of tokens in the string(using split function and

word tokenizer).

def count_tokens(text):
tokens = word_tokenize(text)
return len(tokens)
print(count_tokens(text))

# 4. Count the unqiue tokens in the string.

def count_unique_tokens(text):
tokens = word_tokenize(text)
unique_tokens = set(tokens)
return len(unique_tokens)

print(count_unique_tokens(text))

def preprocess_corpus(corpus):
# Tokenize and remove stopwords and punctuation
tokens = word_tokenize(corpus)
tokens = [token.lower() for token in tokens if token.isalpha()]
tokens = [token for token in tokens if token not in
stopwords.words('english')]
return tokens

def calculate_term_frequency(tokens):
freq_dist = FreqDist(tokens)
term_freq = {word: freq for word, freq in freq_dist.items()}
return term_freq

# Step 5: Preprocess news corpus

news_corpus = "This is a sample news article. It contains some words that
we will preprocess."
preprocessed_tokens = preprocess_corpus(news_corpus)

# Step 6: Calculate term frequency

term_frequency = calculate_term_frequency(preprocessed_tokens)
print("Term Frequency:", term_frequency)

# Step 7: Lexical richness example

lexical_richness_example = "The quick brown fox jumps over the lazy dog.
This is a simple sentence."
tokens_lr = word_tokenize(lexical_richness_example)
lexical_richness = len(set(tokens_lr)) / len(tokens_lr)
print("Lexical Richness:", lexical_richness)

Term Frequency: {'sample': 1, 'news': 1, 'article': 1, 'contains': 1,

'words': 1, 'preprocess': 1}
Lexical Richness: 0.9375

contractions = {
"don't": "do not",
"doesn't": "does not",
"can't": "cannot",
"won't": "will not",
"haven't": "have not",
"hasn't": "has not",
"couldn't": "could not",
"shouldn't": "should not",
"wouldn't": "would not",
"it's": "it is",
"I'm": "I am",
"you're": "you are",
"they're": "they are",
"we're": "we are"
}

def expand_contractions(text):
words = text.split()
expanded_words = []
for word in words:
if word.lower() in contractions:
expanded_words.extend(contractions[word.lower()].split())
else:
expanded_words.append(word)
expanded_text = ' '.join(expanded_words)
return expanded_text

contraction_text = "I can't believe they're here. It's a nice day."

expanded_text = expand_contractions(contraction_text)
print("Expanded Text:", expanded_text)

Expanded Text: I cannot believe they are here. it is a nice day.

Conclusion: In conclusion, through comprehensive text preprocessing and analysis, including

language identification, length and token counting, unique token identification, and term
frequency calculation in a news corpus, the practical demonstrates a systematic approach to
understanding and extracting valuable insights from textual data, showcasing the importance of
effective preprocessing for subsequent tasks such as topic identification and lexical richness
assessment.

Date of Performance Date of Assessment Remark and Sign

22/08/23 29/08/23
Practical 3 : WordNet for Synonym and Antonym Detection
Task:
Find the synonym /antonym of a word using WordNet.

Illustrate the difference between stemming and lemmatizing

Submitted By: AI4116 Ayesha Shaikh

Code
!pip install nltk spacy
!python -m spacy download en
nltk.download('wordnet')

import nltk
from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
import spacy

words = ["running", "flies", "better","Unhappiness ","Teacher ",

"happier","slowness","friendly", "jumps","Cats","Swimming "]

# NLTK Stemmers
porter = PorterStemmer()
snowball = SnowballStemmer("english")

# NLTK Lemmatizer
lemmatizer = WordNetLemmatizer()

# spaCy Lemmatizer
nlp = spacy.load("en_core_web_sm")

print("Actual words:" , words)

# Stemming
print("Porter Stemmer:", [porter.stem(word) for word in words])
print("Snowball Stemmer:", [snowball.stem(word) for word in words])
# Lemmatization
print("WordNet Lemmatizer:", [lemmatizer.lemmatize(word) for word in
words])
# spaCy Lemmatization
lemmatized_words_spacy = [token.lemma_ for token in nlp(" ".join(words))]
print("spaCy Lemmatization:",lemmatized_words_spacy)
from nltk.corpus import wordnet
word = "happy"
# word = "खुश"
# Get synsets (sets of synonyms) for the word
synsets = wordnet.synsets(word)

if synsets:
print("Synonyms:")
for synset in synsets:
synonyms = [lemma.name() for lemma in synset.lemmas()]
print(", ".join(synonyms))

# Get antonyms from the first synset

antonyms = [lemma.antonyms()[0].name() for lemma in
synsets[0].lemmas() if lemma.antonyms()]
if antonyms:
print("Antonyms:", ", ".join(antonyms))
else:
print("No antonyms found.")
else:
print("No synsets found for the word.")

Synonyms:
happy
felicitous, happy
glad, happy
happy, well-chosen
Antonyms: unhappy

Conclusion: In conclusion, WordNet proves to be a valuable resource for synonym and

antonym detection, offering a rich lexical database, while the difference between stemming
and lemmatizing lies in their approaches to reducing words to their base or root forms, with
stemming being more aggressive in its truncation.
Date of Performance Date of Assessment Remark and Sign

29/08/23 26/09/23

Practical 4 : N-gram Model

Task:
N Gram Model: Identify probability of next word occurrence using Bi-Gram Model

Submitted By: AI4116 Ayesha Shaikh

Code
def readData():
data = ['This is a dog','This is a cat','I love my cat','This is my
name ']
dat=[]
for i in range(len(data)):
for word in data[i].split():
dat.append(word)
print(dat)
return dat

def createBigram(data):
listOfBigrams = []
bigramCounts = {}
unigramCounts = {}
for i in range(len(data)-1):
if i < len(data) - 1 and data[i+1].islower():

listOfBigrams.append((data[i], data[i + 1]))

if (data[i], data[i+1]) in bigramCounts:

bigramCounts[(data[i], data[i + 1])] += 1
else:
bigramCounts[(data[i], data[i + 1])] = 1
if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1
return listOfBigrams, unigramCounts, bigramCounts

def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):

listOfProb = {}
for bigram in listOfBigrams:
word1 = bigram[0]
word2 = bigram[1]
listOfProb[bigram] =
(bigramCounts.get(bigram))/(unigramCounts.get(word1))
return listOfProb

if __name__ == '__main__':
data = readData()
#data = ['this','is','my','cat']
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)

print("\n All the possible Bigrams are ")

print(listOfBigrams)

print("\n Bigrams along with their frequency ")

print(bigramCounts)

print("\n Unigrams along with their frequency ")

print(unigramCounts)

bigramProb = calcBigramProb(listOfBigrams, unigramCounts,

bigramCounts)

print("\n Bigrams along with their probability ")

print(bigramProb)
inputList="This is a cat"
splt=inputList.split()
outputProb1 = 1
bilist=[]
bigrm=[]

for i in range(len(splt) - 1):

if i < len(splt) - 1:

bilist.append((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:

outputProb1 *= bigramProb[bilist[i]]
else:

outputProb1 *= 0
print('\n' + 'Probablility of sentence = ' +inputList +
str(outputProb1))

def readData():
data = ['there is a big garden' ,'children play in a garden','they
play inside beautiful garden']
dat=[]
for i in range(len(data)):
for word in data[i].split():
dat.append(word)
print(dat)
return dat

if __name__ == '__main__':
data = readData()
listOfBigrams, unigramCounts, bigramCounts = createBigram(data)

print("\n All the possible Bigrams are ")

print(listOfBigrams)

print("\n Bigrams along with their frequency ")

print(bigramCounts)

print("\n Unigrams along with their frequency ")

print(unigramCounts)
bigramProb = calcBigramProb(listOfBigrams, unigramCounts,
bigramCounts)

print("\n Bigrams along with their probability ")

print(bigramProb)
inputList="they play in a big garden"
splt=inputList.split()
outputProb1 = 1
bilist=[]
bigrm=[]

for i in range(len(splt) - 1):

if i < len(splt) - 1:

bilist.append((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

print(bilist)
for i in range(len(bilist)):
if bilist[i] in bigramProb:

outputProb1 *= bigramProb[bilist[i]]
else:

outputProb1 *= 0
print('\n' + 'Probablility of sentence \"\" = ' + str(outputProb1))

Task - Create any simple application using n-gram

import nltk
nltk.download('punkt')

def createTrigram(data):
listOfTrigrams = []
trigramCounts = {}
bigramCounts = {}
unigramCounts = {}

for i in range(len(data) - 2):

if i < len(data) - 2 and data[i + 1].islower():
listOfTrigrams.append((data[i], data[i + 1], data[i + 2]))
if (data[i], data[i + 1], data[i + 2]) in trigramCounts:
trigramCounts[(data[i], data[i + 1], data[i + 2])] += 1
else:
trigramCounts[(data[i], data[i + 1], data[i + 2])] = 1

bigram = (data[i], data[i + 1])

if bigram in bigramCounts:
bigramCounts[bigram] += 1
else:
bigramCounts[bigram] = 1

if data[i] in unigramCounts:
unigramCounts[data[i]] += 1
else:
unigramCounts[data[i]] = 1

return listOfTrigrams, unigramCounts, bigramCounts, trigramCounts

# Example data
data = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
"dog"]

# Call the function

listOfTrigrams, unigramCounts, bigramCounts, trigramCounts =
createTrigram(data)

print("List of Trigrams:", listOfTrigrams)

print("Unigram Counts:", unigramCounts)
print("Bigram Counts:", bigramCounts)
print("Trigram Counts:", trigramCounts)

List of Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'),

('brown', 'fox', 'jumps'), ('fox', 'jumps', 'over'), ('jumps', 'over',
'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
Unigram Counts: {'The': 1, 'quick': 1, 'brown': 1, 'fox': 1, 'jumps': 1,
'over': 1, 'the': 1}
Bigram Counts: {('The', 'quick'): 1, ('quick', 'brown'): 1, ('brown',
'fox'): 1, ('fox', 'jumps'): 1, ('jumps', 'over'): 1, ('over', 'the'): 1,
('the', 'lazy'): 1}
Trigram Counts: {('The', 'quick', 'brown'): 1, ('quick', 'brown', 'fox'):
1, ('brown', 'fox', 'jumps'): 1, ('fox', 'jumps', 'over'): 1, ('jumps',
'over', 'the'): 1, ('over', 'the', 'lazy'): 1, ('the', 'lazy', 'dog'): 1}
Conclusion: The Bi-Gram Model effectively estimates the probability of the next word
occurrence based on the preceding word, providing a simple yet practical approach to language
modeling.

Date of Performance Date of Assessment Remark and Sign

26/09/23 03/10/23

Practical 5 : Word Semantics: One hot encoding, TD Matrix, TF-IDF, Word2Vec,

PPMI
Task:
Write a python code for the following task:

Take any text corpus and calculate one hot encoded vector, calculate TD matrix, TF-IDF for some token
terms, PPMI for finding corresponding word of a given word, use Word2Vec for word embedding.

Submitted By: AI4116 Ayesha Shaikh

Code
from sklearn.feature_extraction.text import TfidfVectorizer

d1= "data science is one of the most important fields of science"

d2= "this is one of the best data science courses"
d3="data scientists analyze data"

doc_corpus=[d1,d2,d3]
print(doc_corpus)
vec=TfidfVectorizer(stop_words='english')
matrix=vec.fit_transform(doc_corpus)
print("Feature Names n",vec.get_feature_names_out())
print("Sparse Matrix n",matrix.shape,"n",matrix.toarray())
import pandas as pd
import numpy as np
corpus = ['data science is one of the most important fields of science',
'this is one of the best data science courses',
'data scientists analyze data' ]
create a word set for corpus
words_set = set()
for doc in corpus:
words = doc.split(' ')
words_set = words_set.union(set(words))

print('Number of words in the corpus:',len(words_set))

print('The words in the corpus: \n', words_set)

Computing Term Frequency

from sklearn.feature_extraction.text import TfidfVectorizer

# assign documents
d0= "data science is one of the most important fields of science"
d1= "this is one of the best data science courses"
d2="data scientists analyze data"
# merge documents into a single corpus
string = [d0, d1, d2]
# create object
tfidf = TfidfVectorizer()
# get tf-df values
result = tfidf.fit_transform(string)

# get idf values

print('\nidf values:')
for ele1, ele2 in zip(tfidf.get_feature_names_out(), tfidf.idf_):
print(ele1, ':', ele2)

# get indexing
print('\nWord indexes:')
print(tfidf.vocabulary_)

# display tf-idf values

print('\ntf-idf value:')
print(result)

# in matrix form
print('\ntf-idf values in matrix form:')
print(result.toarray())
Conclusion: In conclusion, the implemented Python code successfully demonstrated various
word vectorization techniques, including one-hot encoding, term-document matrix (TD), term
frequency-inverse document frequency (TF-IDF), positive pointwise mutual information (PPMI),
and Word2Vec, showcasing the versatility of these methods in capturing semantic relationships
and contextual information within a given text corpus.

Date of Performance Date of Assessment Remark and Sign

3/10/23 17/10/23

Practical 6 : Sentiment Detection of Sentence

Task:
Task: Write a python code to identify sentiment of the sentence. Implement task using at least TWO
DIFFERNT APPROACHES Compre the performance of both.

Submitted By: AI4116 Ayesha Shaikh

Code
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

def classify_sentiment(sentence):
analyzer = SentimentIntensityAnalyzer()
sentiment_scores = analyzer.polarity_scores(sentence)

# Determine sentiment based on compound score

compound_score = sentiment_scores['compound']

if compound_score >= 0.05:

return "Positive"
elif compound_score <= -0.05:
return "Negative"
else:
return "Neutral"
# Example usage:
sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a terrible experience."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a neutral sentence."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is not a good movie."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

import nltk
nltk.download('opinion_lexicon')

# Define positive and negative word dictionaries from NLTK

from nltk.corpus import opinion_lexicon

positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())

def preprocess_sentence(sentence):
# Convert the sentence to lowercase and split it into words
words = sentence.lower().split()
return words

def classify_sentiment(sentence):
words = preprocess_sentence(sentence)

# Initialize sentiment score

sentiment_score = 0
for word in words:
if word in positive_words:
sentiment_score += 1
elif word in negative_words:
sentiment_score -= 1
print("Current word :: ", word , " Sentiment score :: ",
sentiment_score)
# Classify the sentiment based on the score
if sentiment_score > 0:
return "Positive"
elif sentiment_score < 0:
return "Negative"
else:
return "Neutral"

# Example usage:
'''sentence = "I love this product, it's amazing!"
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a terrible experience."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

sentence = "This is a neutral sentence."

sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")
'''
sentence = "This is not a good movie."
sentiment = classify_sentiment(sentence)
print(f"Sentence sentiment: {sentiment}")

Current word :: this Sentiment score :: 0

Current word :: is Sentiment score :: 0
Current word :: not Sentiment score :: 0
Current word :: a Sentiment score :: 0
Current word :: good Sentiment score :: 1
Current word :: movie. Sentiment score :: 1
Sentence sentiment: Positive
[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data] Package opinion_lexicon is already up-to-date!

Conclusion: The sentiment detection task was implemented using two different
approaches, and their performance was compared, revealing insights into the
effectiveness of each method for accurately identifying the sentiment of a given
sentence.

Date of Performance Date of Assessment Remark and Sign

03/10/23 17/10/23

Practical 7 : POS Tagging (HMM) + (NLTK)

Task:
Task 1: Write a code in python (using a ready function) to input some text from user and identify POS tag
of each token in it.

Task 2: How HMM can be used for POS Tagging? Illustratepython code for Transition Probability and
Emmision Probability Calculation.

Submitted By: AI4116 Ayesha Shaikh

Code

# Use simple NLTK POS tagger(Any readymade function) for identifying POS
tag of the input sentence.

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize
stop_words = set(stopwords.words('english'))
# Dummy text

txt1 = "She is writing code."

# sent_tokenize is one of instances of

# PunktSentenceTokenizer from the nltk.tokenize.punkt module

tokenized = sent_tokenize(txt1)
for i in tokenized:

# Word tokenizers is used to find the words

# and punctuation in a string
wordsList = nltk.word_tokenize(i)

# removing stop words from wordList

wordsList = [w for w in wordsList if not w in stop_words]

# Using a Tagger. Which is part-of-speech

# tagger or POS-tagger.
tagged = nltk.pos_tag(wordsList)

print(tagged)

# Importing libraries
import nltk
import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
import pprint, time

#download the treebank corpus from nltk

nltk.download('treebank')

#download the universal tagset from nltk

nltk.download('universal_tagset')

# reading the Treebank tagged sentences

nltk_data = list(nltk.corpus.treebank.tagged_sents(tagset='universal'))

#print the first two sentences along with tags

print(nltk_data[:2])
#print each word with its respective tag for first two sentences
for sent in nltk_data[:2]:
for tuple in sent:
print(tuple)

# split data into training and validation set in the ratio 80:20
train_set,test_set
=train_test_split(nltk_data,train_size=0.80,test_size=0.20,random_state =
101)

# create list of train and test tagged words

train_tagged_words = [ tup for sent in train_set for tup in sent ]
test_tagged_words = [ tup for sent in test_set for tup in sent ]
print(len(train_tagged_words))
print(len(test_tagged_words))

# check some of the tagged words.

train_tagged_words[:5]

#use set datatype to check how many unique tags are present in training
data
tags = {tag for word,tag in train_tagged_words}
print(len(tags))
print(tags)

# check total words in vocabulary

vocab = {word for word,tag in train_tagged_words}

# compute Emission Probability

def word_given_tag(word, tag, train_bag = train_tagged_words):
tag_list = [pair for pair in train_bag if pair[1]==tag]
count_tag = len(tag_list)#total number of times the passed tag
occurred in train_bag
w_given_tag_list = [pair[0] for pair in tag_list if pair[0]==word]
#now calculate the total number of times the passed word occurred as the
passed tag.
count_w_given_tag = len(w_given_tag_list)

return (count_w_given_tag, count_tag)

# compute Transition Probability

def t2_given_t1(t2, t1, train_bag = train_tagged_words):
tags = [pair[1] for pair in train_bag]
count_t1 = len([t for t in tags if t==t1])
count_t2_t1 = 0
for index in range(len(tags)-1):
if tags[index]==t1 and tags[index+1] == t2:
count_t2_t1 += 1
return (count_t2_t1, count_t1)

# creating t x t transition matrix of tags, t= no of tags

# Matrix(i, j) represents P(jth tag after the ith tag)

tags_matrix = np.zeros((len(tags), len(tags)), dtype='float32')

for i, t1 in enumerate(list(tags)):
for j, t2 in enumerate(list(tags)):
tags_matrix[i, j] = t2_given_t1(t2, t1)[0]/t2_given_t1(t2, t1)[1]

print(tags_matrix)

# convert the matrix to a df for better readability

#the table is same as the transition table shown in section 3 of article
tags_df = pd.DataFrame(tags_matrix, columns = list(tags),
index=list(tags))
display(tags_df)

AD CO NU PR NO VE AD
DE ADJ PRT . X
V NJ M ON UN RB P
T

DE 0.00 0.01 0.20 0.00 0.02 0.00 0.01 0.00 0.63 0.04 0.00 0.04
T 6037 2074 6411 0431 2855 0287 7393 3306 5906 0247 9918 5134

AD 0.07 0.08 0.13 0.00 0.02 0.01 0.13 0.01 0.03 0.33 0.11 0.02
V 1373 1458 0721 6982 9868 4740 9255 2025 2196 9022 9472 2886

AD 0.00 0.00 0.06 0.01 0.02 0.01 0.06 0.00 0.69 0.01 0.08 0.02
J 5243 5243 3301 6893 1748 1456 6019 0194 6893 1456 0583 0971

C
0.12 0.05 0.11 0.00 0.04 0.00 0.03 0.06 0.34 0.15 0.05 0.00
O
3491 7080 3611 0549 0615 4391 5126 0373 9067 0384 5982 9330
NJ
Conclusion: In conclusion, the provided Python code utilizing NLTK allows users to input text
and obtain the corresponding Part-of-Speech (POS) tags for each token, while Hidden Markov
Models (HMM) can be employed for POS tagging through the calculation of Transition
Probability and Emission Probability in a systematic and illustrative manner.

Date of Performance Date of Assessment Remark and Sign

17/10/23 21/11/23

Practical 8 : CRF POS Tagging + LSTM POS Tagging

Task:
Task: Write a python code to assign POS tag to the input stream of words using CRF as well as LSTM.
Compare the performance of both the taggers.

Submitted By: AI4116 Ayesha Shaikh

Code:
pip install python-crfsuite

import nltk
nltk.download('treebank')

import pycrfsuite
from nltk.corpus import treebank
from sklearn.model_selection import train_test_split

# Load the Penn Treebank dataset

tagged_sentences = treebank.tagged_sents()

# Feature extraction function

def word2features(sent, i):
word = sent[i][0]
features = {
'word': word,
'is_first': i == 0,
'is_last': i == len(sent) - 1,
'prev_word': '' if i == 0 else sent[i - 1][0],
'next_word': '' if i == len(sent) - 1 else sent[i + 1][0]
}
return features

def sent2features(sent):
return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
return [label for word, label in sent]

# Prepare the data

X = [sent2features(sent) for sent in tagged_sentences]
y = [sent2labels(sent) for sent in tagged_sentences]

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Train the CRF model

trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
trainer.append(xseq, yseq)

trainer.set_params({
'c1': 1.0,
'c2': 1e-3,
'max_iterations': 50,
'feature.possible_transitions': True
})

model_filename = 'pos_tagger_model.crfsuite'
trainer.train(model_filename)

# Test the model

tagger = pycrfsuite.Tagger()
tagger.open(model_filename)

# Tag a sentence
example_sentence = treebank.sents()[0]
features = sent2features(example_sentence)
tags = tagger.tag(features)

for word, tag in zip(example_sentence, tags):

print(f'{word}/{tag}', end=' ')

# Evaluate the model

from sklearn.metrics import classification_report

y_pred = [tagger.tag(xseq) for xseq in X_test]

y_test_flat = [label for label_seq in y_test for label in label_seq]
y_pred_flat = [label for label_seq in y_pred for label in label_seq]

print("\nClassification Report:")
print(classification_report(y_test_flat, y_pred_flat))
This/DT is/VBZ a/DT sample/NNP sentence/NNP for/IN POS/NNP tagging./NNP

import pycrfsuite

# Load the CRF model

model_filename = 'pos_tagger_model.crfsuite'
tagger = pycrfsuite.Tagger()
tagger.open(model_filename)

# Sample sentence
sample_sentence = "This is a sample sentence for POS tagging."

# Tokenize the sample sentence

sample_tokens = sample_sentence.split()

# Extract features for the sample sentence

sample_features = [word2features([(word, '')], 0) for word in
sample_tokens]

# Predict POS tags for the sample sentence

predicted_tags = tagger.tag(sample_features)

# Combine words and predicted tags

tagged_sentence = ' '.join([f'{word}/{tag}' for word, tag in
zip(sample_tokens, predicted_tags)])

# Print the tagged sentence

print(tagged_sentence)
Write python code for POS tagging using LSTM

import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize

# Load the Penn Treebank dataset

tagged_sentences = treebank.tagged_sents()

# Create the vocabulary and encode labels

words = set(word.lower() for sentence in tagged_sentences for word, tag in
sentence)
words = list(words)
words.append("ENDPAD")
n_words = len(words)

tags = set(tag for sentence in tagged_sentences for word, tag in sentence)

n_tags = len(tags)

word2idx = {w: i for i, w in enumerate(words)}

tag2idx = {t: i for i, t in enumerate(tags)}

# Convert sentences and labels to numerical format

X = [[word2idx[word.lower()] for word, tag in sent] for sent in
tagged_sentences]
y = [[tag2idx[tag] for word, tag in sent] for sent in tagged_sentences]

# Padding sequences to a fixed length

max_len = 100
X = pad_sequences(X, maxlen=max_len, padding="post")
y = pad_sequences(y, maxlen=max_len, padding="post")

# Split the data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)

# Create and compile the LSTM model

model = Sequential()
model.add(Embedding(input_dim=n_words, output_dim=50,
input_length=max_len))
model.add(LSTM(units=100, return_sequences=True))
model.add(Dense(n_tags, activation="softmax"))

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
metrics=["accuracy"])

# Train the model

model.fit(X_train, y_train, batch_size=32, epochs=5, validation_split=0.1)

# Evaluate the model

loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Loss: {loss:.4f}, Test Accuracy: {accuracy:.4f}")

sample_sentence = "This is a sample sentence for POS tagging.".split()

sample_sequence = [word2idx.get(word.lower(), word2idx["ENDPAD"]) for word
in sample_sentence]

# Pad the sequence to match the model's input shape

sample_sequence = pad_sequences([sample_sequence], maxlen=max_len,
padding="post")

predicted_tags = model.predict(sample_sequence)

predicted_tags = [np.argmax(tag) for tag in predicted_tags[0]]

predicted_tags = [list(tag2idx.keys())[list(tag2idx.values()).index(tag)]
for tag in predicted_tags]

for word, tag in zip(sample_sentence, predicted_tags):

print(f"{word}/{tag}", end=" ")

1/1 [==============================] - 0s 444ms/step

This/DT is/DT a/DT sample/DT sentence/NN for/IN POS/NN tagging./NN
Conclusion: The comparison of CRF and LSTM POS tagging reveals variations in performance,
with CRF demonstrating robust sequential labeling, while LSTM exhibits the ability to capture
intricate patterns in the input stream.

Date of Performance Date of Assessment Remark and Sign

17/10/23 21/11/23

Practical 9 : NER (Named Entity Recognition)

Task:
Task: Write a python code to identify the Named Entities from the input text.

Submitted By: AI4116 Ayesha Shaikh

Code:
! pip install spacy
! pip install nltk
! python -m spacy download en_core_web_sm
import spacy
from spacy import displacy
from spacy import tokenizer
nlp = spacy.load('en_core_web_sm')

text =('''Python is an interpreted, high-level and general-purpose

programming language.
Pythons design philosophy emphasizes code readability with
its notable use of significant indentation.
Its language constructs and object-oriented approach aim to
help programmers write clear and
logical code for small and large-scale projects''')
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)

# tokenization
for token in doc:
print(token.text)
# print entities
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print(ents)
# now we use displaycy function on doc2
displacy.render(doc, style='ent', jupyter=True)

# import modules and download packages

import nltk
nltk.download('words')
nltk.download('punkt')
nltk.download('maxent_ne_chunker')
nltk.download('averaged_perceptron_tagger')
nltk.download('state_union')
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

# process the text and print Named entities

# tokenization
train_text = state_union.raw()
sample_text = state_union.raw("/content/2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)
tokenized = custom_sent_tokenizer.tokenize(sample_text)

# function
def get_named_entity():
try:
for i in tokenized:
words = nltk.word_tokenize(i)
tagged = nltk.pos_tag(words)
namedEnt = nltk.ne_chunk(tagged, binary=False)
namedEnt.draw()
except:
pass

get_named_entity()

text =('''HI I am Atharva, I am from Aurangabad, Maharashtra, India.

Currently I am persuing B-tech degree from Deogiri college''')
# text2 = # copy the paragraphs from https://www.python.org/doc/essays/
doc = nlp(text)
#doc2 = nlp(text2)
sentences = list(doc.sents)
print(sentences)

HI
I
am
Atharva
,
I
am
from
Aurangabad
,
Maharashtra
,
India
.
Currently
I
am
persuing
B
-
tech
degree
from
Deogiri
college
[('Atharva', 8, 15, 'PERSON'), ('Aurangabad', 27, 37, 'GPE'), ('Maharashtra',
39, 50, 'GPE'), ('India', 52, 57, 'GPE'), ('Deogiri', 102, 109, 'ORG')]
HI I am Atharva PERSON , I am from Aurangabad GPE , Maharashtra GPE , India GPE . Currently I am

persuing B-tech degree from Deogiri ORG college

Conclusion: In conclusion, the provided Python code successfully performs Named Entity
Recognition (NER) on input text, accurately identifying and extracting entities such as names,
locations, and organizations.

Date of Performance Date of Assessment Remark and Sign

21/11/23

The Ultimate C Handbook
75% (8)
The Ultimate C Handbook
60 pages
CPP MCQ
No ratings yet
CPP MCQ
75 pages
Text File Question and Answers
No ratings yet
Text File Question and Answers
5 pages
Access Specifiers in C++
No ratings yet
Access Specifiers in C++
8 pages
Placement Guide
100% (1)
Placement Guide
207 pages
Ccs369 - Text and Speech Analysis - Lab Manual
100% (1)
Ccs369 - Text and Speech Analysis - Lab Manual
23 pages
CCS369 - Text and Speech Analysis
No ratings yet
CCS369 - Text and Speech Analysis
31 pages
NLP Lab Manual (R20)
50% (2)
NLP Lab Manual (R20)
24 pages
MCA (Master of Computer Application)
No ratings yet
MCA (Master of Computer Application)
39 pages
C PROGRAMMING Interview Questions
100% (1)
C PROGRAMMING Interview Questions
10 pages
Part 4: Implementing The Solution in Python
No ratings yet
Part 4: Implementing The Solution in Python
5 pages
20BCP112 - NLP Lab - LAB - Manual
No ratings yet
20BCP112 - NLP Lab - LAB - Manual
65 pages
Mmmut Syllabus 2024
No ratings yet
Mmmut Syllabus 2024
25 pages
"Basic of C and C++ ": Summer Training Report
No ratings yet
"Basic of C and C++ ": Summer Training Report
48 pages
NLP Lab Manual
No ratings yet
NLP Lab Manual
25 pages
Ccs339 Text and Speech Analysis Lab Manual
No ratings yet
Ccs339 Text and Speech Analysis Lab Manual
51 pages
20BCP123 - NLP Lab Manual
No ratings yet
20BCP123 - NLP Lab Manual
45 pages
Natural Language Processing Lab Manual
No ratings yet
Natural Language Processing Lab Manual
24 pages
BSC Multimedia PDF
100% (1)
BSC Multimedia PDF
17 pages
21CSS101J - Unit1 - Updated 2
No ratings yet
21CSS101J - Unit1 - Updated 2
185 pages
CCS369-Text and Speech Analysis Lab (1-9)
No ratings yet
CCS369-Text and Speech Analysis Lab (1-9)
37 pages
NLP - Record (Weeks 1-12)
No ratings yet
NLP - Record (Weeks 1-12)
41 pages
SPR 05 NLTK
No ratings yet
SPR 05 NLTK
18 pages
2 TextProc 2023
No ratings yet
2 TextProc 2023
35 pages
Ai&Ml Bai601 NLP Lab Manual
No ratings yet
Ai&Ml Bai601 NLP Lab Manual
48 pages
Text Analysis With NLTK Cheatsheet PDF
No ratings yet
Text Analysis With NLTK Cheatsheet PDF
3 pages
Tsa Labmanual
No ratings yet
Tsa Labmanual
26 pages
Language Engineering - Section
No ratings yet
Language Engineering - Section
20 pages
NLP Op
No ratings yet
NLP Op
16 pages
Corpora
No ratings yet
Corpora
48 pages
Apex Institute of Technology Natural Language Processing (20CST354)
No ratings yet
Apex Institute of Technology Natural Language Processing (20CST354)
43 pages
NLP Record
No ratings yet
NLP Record
23 pages
Unit 2 - C Flow Control
No ratings yet
Unit 2 - C Flow Control
57 pages
Compiler Record
No ratings yet
Compiler Record
50 pages
Text Analysis With NLTK Cheatsheet PDF
No ratings yet
Text Analysis With NLTK Cheatsheet PDF
3 pages
Tsarecord
No ratings yet
Tsarecord
22 pages
Lab1 IR
No ratings yet
Lab1 IR
14 pages
Soundarya 256 NLP Practs
No ratings yet
Soundarya 256 NLP Practs
14 pages
NLP Lecture2 Text Pre Processing
No ratings yet
NLP Lecture2 Text Pre Processing
54 pages
Storage Classes in C Programming
No ratings yet
Storage Classes in C Programming
7 pages
7 Exp
No ratings yet
7 Exp
6 pages
Programming For Problem Solving Using C Lab r20 Lab Manual
No ratings yet
Programming For Problem Solving Using C Lab r20 Lab Manual
43 pages
R22 NLP Python Programs
No ratings yet
R22 NLP Python Programs
15 pages
Lab 1
No ratings yet
Lab 1
5 pages
5.arithmetic Expression - Reviewed
No ratings yet
5.arithmetic Expression - Reviewed
44 pages
Module03 Embeddings
No ratings yet
Module03 Embeddings
102 pages
NLP FinAL
No ratings yet
NLP FinAL
27 pages
NLP Lab - Manual
No ratings yet
NLP Lab - Manual
33 pages
Final NLP Lab File
No ratings yet
Final NLP Lab File
28 pages
TSA Student
No ratings yet
TSA Student
20 pages
NLP Lab Programs
No ratings yet
NLP Lab Programs
18 pages
NLP Manual (1-12)
No ratings yet
NLP Manual (1-12)
54 pages
NLPNOTES
No ratings yet
NLPNOTES
26 pages
C DPP-2
No ratings yet
C DPP-2
4 pages
Experiment No. 8 Lexical Diversity: 1 Objective
No ratings yet
Experiment No. 8 Lexical Diversity: 1 Objective
3 pages
SK NLP Practical (FS)
No ratings yet
SK NLP Practical (FS)
22 pages
3.word Level Analysis-Tokenization Stemming
No ratings yet
3.word Level Analysis-Tokenization Stemming
8 pages
Text Processing
No ratings yet
Text Processing
16 pages
Sree017 NLP
No ratings yet
Sree017 NLP
3 pages
x0 Process
No ratings yet
x0 Process
4 pages
Rajeev Mishra 20 SCSE1180087
No ratings yet
Rajeev Mishra 20 SCSE1180087
29 pages
NLP Notes-1
No ratings yet
NLP Notes-1
11 pages
Array, Structure and Pointers in C - For DSA
No ratings yet
Array, Structure and Pointers in C - For DSA
10 pages
Midterm 1
No ratings yet
Midterm 1
5 pages
NLP Final Review
No ratings yet
NLP Final Review
32 pages
DSBD 7 Ass
No ratings yet
DSBD 7 Ass
9 pages
03 Word Tokenization 14-26
No ratings yet
03 Word Tokenization 14-26
6 pages
7 TextAnalysis
No ratings yet
7 TextAnalysis
3 pages
Batch 2
No ratings yet
Batch 2
13 pages
Semantic Analysis Theory1
No ratings yet
Semantic Analysis Theory1
16 pages
5 Basic Text Processing
No ratings yet
5 Basic Text Processing
6 pages
C Programming
No ratings yet
C Programming
186 pages
Shubham Jade MSC It 31031420010 NLP Practical Journal
No ratings yet
Shubham Jade MSC It 31031420010 NLP Practical Journal
17 pages
Program C
No ratings yet
Program C
24 pages
C++ Exercise 2
No ratings yet
C++ Exercise 2
23 pages
1 - Write A Python Program To Perform Following Tasks On Text A) Tokenization
No ratings yet
1 - Write A Python Program To Perform Following Tasks On Text A) Tokenization
13 pages
CSE Assignment - 2
No ratings yet
CSE Assignment - 2
10 pages
Text Analysis With NLTK Cheatsheet
No ratings yet
Text Analysis With NLTK Cheatsheet
3 pages
C Arrays and Pointers: - in Java, Pointers Are Easy To Deal With
No ratings yet
C Arrays and Pointers: - in Java, Pointers Are Easy To Deal With
38 pages
SRS Final
No ratings yet
SRS Final
30 pages
Lab 07 - Programming Threads
No ratings yet
Lab 07 - Programming Threads
9 pages
C Programming - Functions
No ratings yet
C Programming - Functions
9 pages
CSE1071 Remedial Jan2014 Assignment-3 Answers
No ratings yet
CSE1071 Remedial Jan2014 Assignment-3 Answers
17 pages
3716 Marketing-Strategy
No ratings yet
3716 Marketing-Strategy
15 pages
BJ 7
No ratings yet
BJ 7
4 pages
Cse 3103
No ratings yet
Cse 3103
2 pages
3716 - A2 - Ayesha Shaikh - AML CA2
No ratings yet
3716 - A2 - Ayesha Shaikh - AML CA2
4 pages
The Transport Layer Services
No ratings yet
The Transport Layer Services
4 pages
Porting A Program To Dynamic C: 022-0044 Rev. J
No ratings yet
Porting A Program To Dynamic C: 022-0044 Rev. J
6 pages
National Institute of Technology, Jamshedpur: Department of Computer Science and Engineering
No ratings yet
National Institute of Technology, Jamshedpur: Department of Computer Science and Engineering
2 pages
Cpp2 Functions
No ratings yet
Cpp2 Functions
25 pages
Simplifying Data Science With Python
From Everand
Simplifying Data Science With Python
Billy David millican
No ratings yet
Lisp Interpreter in Rust
From Everand
Lisp Interpreter in Rust
Vishal Patil
1/5 (1)
Computer Engineering Laboratory Solution Primer
From Everand
Computer Engineering Laboratory Solution Primer
Karan Bhandari
No ratings yet

All Practicals

Uploaded by

All Practicals

Uploaded by

Practical 1 : C Program Code

Read a sample .TXT file and print as it is on screen

Count the number of words / tokens in the file.

Count the number of unique words / tokens in the file.

Submitted By: AI4116 Ayesha Shaikh

#define BUFFER_SIZE 1000

int countOccurrences(FILE *fptr, const char *word);

printf("Enter word to search in file: ");

fptr = fopen(path, "r");

printf("Unable to open file.\n");

printf("Please check you have read/write previleges.\n");

wCount = countOccurrences(fptr, word);

printf("'%s' is found %d times in file.", word, wCount);

int countOccurrences(FILE *fptr, const char *word)

int index, count;

while ((fgets(str, BUFFER_SIZE, fptr)) != NULL)

while ((pos = strstr(str + index, word)) != NULL)

index = (pos - str) + 1;

filePointer = fopen("cyber.txt", "r");

printf("cyber file failed to open.");

printf("The file is now opened.\n");

while (fgets(dataToBeRead, 1000, filePointer)

Date of Performance Date of Assessment Remark and Sign

Identify the language of it

Count the length of the string

Count the unique tokens in the string

Show any one example illustrating lexical richness of the text.

Submitted By: AI4116 Ayesha Shaikh

from langdetect import detect

text = "Hi Darshan How are you ?"

# 1. Identify langauge of it.

# 2. Count the length of the string

# 3. Count the number of tokens in the string(using split function and

# 4. Count the unqiue tokens in the string.

# Step 5: Preprocess news corpus

# Step 6: Calculate term frequency

# Step 7: Lexical richness example

Term Frequency: {'sample': 1, 'news': 1, 'article': 1, 'contains': 1,

contraction_text = "I can't believe they're here. It's a nice day."

Expanded Text: I cannot believe they are here. it is a nice day.

Conclusion: In conclusion, through comprehensive text preprocessing and analysis, including

Date of Performance Date of Assessment Remark and Sign

Illustrate the difference between stemming and lemmatizing

Submitted By: AI4116 Ayesha Shaikh

words = ["running", "flies", "better","Unhappiness ","Teacher ",

print("Actual words:" , words)

# Get antonyms from the first synset

Conclusion: In conclusion, WordNet proves to be a valuable resource for synonym and

Practical 4 : N-gram Model

Submitted By: AI4116 Ayesha Shaikh

listOfBigrams.append((data[i], data[i + 1]))

if (data[i], data[i+1]) in bigramCounts:

def calcBigramProb(listOfBigrams, unigramCounts, bigramCounts):

print("\n All the possible Bigrams are ")

print("\n Bigrams along with their frequency ")

print("\n Unigrams along with their frequency ")

bigramProb = calcBigramProb(listOfBigrams, unigramCounts,

print("\n Bigrams along with their probability ")

for i in range(len(splt) - 1):

bilist.append((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

print("\n All the possible Bigrams are ")

print("\n Bigrams along with their frequency ")

print("\n Unigrams along with their frequency ")

print("\n Bigrams along with their probability ")

for i in range(len(splt) - 1):

bilist.append((splt[i], splt[i + 1]))

print("\n The bigrams in given sentence are ")

Task - Create any simple application using n-gram

for i in range(len(data) - 2):

bigram = (data[i], data[i + 1])

return listOfTrigrams, unigramCounts, bigramCounts, trigramCounts

# Call the function

print("List of Trigrams:", listOfTrigrams)

List of Trigrams: [('The', 'quick', 'brown'), ('quick', 'brown', 'fox'),

Date of Performance Date of Assessment Remark and Sign

int countOccurrences(FILE fptr, const char word);

int countOccurrences(FILE fptr, const char word)