NLP Manual
import nltk
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())
    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

def lemmatize(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmas

def stem(tokens):
    stemmer = PorterStemmer()
    stems = [stemmer.stem(token) for token in tokens]
    return stems
def main():
    # Sample text
    text = ("Tokenization is the process of breaking down text into words and phrases. "
            "Stemming and Lemmatization are techniques used to reduce words to their base form.")
    # Preprocess text
    tokens = preprocess_text(text)
    # Lemmatization
    lemmas = lemmatize(tokens)
    print("Lemmatization:")
    print(lemmas)
    # Stemming
    stems = stem(tokens)
    print("\nStemming:")
    print(stems)

if __name__ == "__main__":
    main()
This program tokenizes the input text, removes punctuation and stopwords, and then performs lemmatization and stemming using NLTK. Make sure you have NLTK installed (pip install nltk) and download the required resources using nltk.download().
Output:
Lemmatization:
Stemming:
['token', 'process', 'break', 'text', 'word', 'phrase', 'stem', 'lemmat', 'techniqu', 'use', 'reduc', 'word',
'base', 'form']
This program uses the scikit-learn library for implementing the N-gram language model. It
utilizes the Multinomial Naive Bayes classifier for sentiment analysis. You can adjust the n-gram
range in the CountVectorizer to experiment with different N-gram models. Finally, it evaluates
the model's accuracy and displays the classification report.
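The listing for this program is not included in this excerpt. A minimal sketch of the approach just described, using a small hypothetical labelled dataset (the reviews and labels below are placeholders, not the manual's data), is:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Hypothetical labelled data (placeholders; substitute your own corpus)
reviews = ["I loved this movie", "Terrible plot and bad acting",
           "Great performances and a moving story", "Worst film I have seen",
           "Absolutely wonderful", "Boring and predictable"] * 10
labels = ["positive", "negative", "positive", "negative", "positive", "negative"] * 10

X_train, X_test, y_train, y_test = train_test_split(reviews, labels, test_size=0.2, random_state=42)

# N-gram features: adjust ngram_range to experiment with different N-gram models
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Multinomial Naive Bayes classifier on the n-gram counts
clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
y_pred = clf.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:")
print(classification_report(y_test, y_pred))

Changing ngram_range to (1, 3), for example, would add trigram features to the model.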
Output:
Accuracy: 0.85
Classification Report:
This output demonstrates an accuracy score of 0.85 and provides a classification report showing
precision, recall, F1-score, and support for both the "negative" and "positive" sentiment classes.
3. Write a program to extract features from text by using the following techniques:
i) One Hot Encoding
ii) Bag of Words (BoW)
iii) n-grams
iv) Tf-Idf
v) Custom features
vi) Word2Vec (Word Embedding)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

# Sample corpus (placeholder; the manual's original text_data is not shown here)
text_data = ["the cat sat on the mat",
             "the dog sat on the log",
             "cats and dogs are friends",
             "the mat was on the floor"]

# i) One Hot Encoding: binary presence/absence of each vocabulary term
def one_hot_encoding(corpus):
    vectorizer = CountVectorizer(binary=True)
    return vectorizer.fit_transform(corpus).toarray()

one_hot_encoded = one_hot_encoding(text_data)
print("One Hot Encoding:")
print(one_hot_encoded)

# iii) n-grams
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_features = ngram_vectorizer.fit_transform(text_data)
print("\nn-grams:")
print(ngram_features.toarray())

# iv) Tf-Idf
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(text_data)
print("\nTf-Idf:")
print(tfidf_features.toarray())
This program demonstrates how to extract features from text using One Hot Encoding, Bag of Words (BoW), n-grams, Tf-Idf, custom features (document length in this case), and Word2Vec (Word Embedding). Adjust the text_data variable according to your dataset. The Bag of Words, custom-feature, and Word2Vec steps are sketched below.
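A minimal sketch of those three steps, reusing the imports and text_data from the listing above (the document-length feature and the Word2Vec hyperparameters are illustrative assumptions):

# ii) Bag of Words (BoW): raw term counts per document
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(text_data)
print("\nBag of Words:")
print(bow_features.toarray())

# v) Custom features: here simply the number of tokens in each document
custom_features = np.array([[len(doc.split())] for doc in text_data])
print("\nCustom Features:")
print(custom_features)

# vi) Word2Vec: train a small embedding model on the tokenized corpus
tokenized = [doc.split() for doc in text_data]
w2v_model = Word2Vec(sentences=tokenized, vector_size=50, window=3, min_count=1)
print("\nWord2Vec vector for 'the':")
print(w2v_model.wv['the'])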
Output:
n-grams:
[[0 1 1 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1]
[0 2 1 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0]
[1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0]
[0 1 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0]]
Tf-Idf:
[[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.43877674 0. 0. 0.43877674]
[0. 0.60534851 0.30267425 0.30267425 0. 0.60534851
0.30267425 0. 0.60534851 0. ]
[0.55280532 0.38537163 0.38537163 0.38537163 0.55280532 0.
0.38537163 0.55280532 0. 0. ]
[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.43877674 0. 0. 0.43877674]]
Custom Features:
[[5]
[6]
[5]
[5]]
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
This program performs Topic Modeling using Latent Semantic Analysis (LSA) on the
sample text data provided. It converts the text data into a matrix of token counts using
CountVectorizer and then applies LSA using TruncatedSVD. Finally, it prints the top terms for
each topic extracted by LSA.
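The body of the program is not reproduced above. A minimal sketch of the LSA pipeline described, continuing from those imports (the placeholder corpus and the choice of two topics are assumptions), is:

# Placeholder corpus (substitute the manual's sample text data)
documents = ["the cat sat on the mat",
             "dogs and cats are popular pets",
             "stock markets fell sharply today",
             "investors worry about market volatility"]

# Convert the documents into a matrix of token counts
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# Apply LSA (truncated SVD) to extract latent topics
n_topics = 2
lsa = TruncatedSVD(n_components=n_topics, random_state=42)
lsa.fit(X)

# Print the top terms for each topic
terms = vectorizer.get_feature_names_out()
for topic_idx, component in enumerate(lsa.components_):
    top_terms = [terms[i] for i in component.argsort()[::-1][:5]]
    print(f"Topic {topic_idx + 1}: {', '.join(top_terms)}")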
Output:
5. Develop a Python program to classify text using Naïve Bayes and SVM.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
This program uses the 20 newsgroups dataset available in scikit-learn. It splits the
dataset into training and testing sets, vectorizes the text data using TF-IDF representation,
trains Naïve Bayes and SVM classifiers, predicts the labels for the testing set using both
classifiers, and finally evaluates the performance of both classifiers using classification
reports.
The output of the code will consist of two classification reports, one for the Naïve
Bayes classifier and the other for the Support Vector Machine (SVM) classifier. These
reports provide precision, recall, F1-score, and support for each class in the dataset.
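Only the imports are reproduced above. A minimal sketch of the steps just described, continuing from those imports (the category subset and vectorizer settings are assumptions), is:

# Load a subset of the 20 newsgroups dataset (a few categories keep it fast)
categories = ['rec.sport.baseball', 'sci.space', 'talk.politics.misc']
data = fetch_20newsgroups(subset='all', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english')
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

# Naive Bayes classifier
nb = MultinomialNB()
nb.fit(X_train_vec, y_train)
nb_pred = nb.predict(X_test_vec)
print("Naive Bayes classification report:")
print(classification_report(y_test, nb_pred, target_names=data.target_names))

# SVM classifier (linear kernel)
svm = SVC(kernel='linear')
svm.fit(X_train_vec, y_train)
svm_pred = svm.predict(X_test_vec)
print("SVM classification report:")
print(classification_report(y_test, svm_pred, target_names=data.target_names))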
Output:
Each classification report provides metrics such as precision, recall, F1-score, and support for each
class, as well as average metrics across all classes. These metrics indicate the performance of the
classifiers in classifying the text data.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
This program performs K-means clustering on the 20 newsgroups dataset using TF-IDF
vectorization. It then prints the top terms for each cluster based on their TF-IDF scores.
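Only the imports are reproduced above. A minimal sketch of the clustering step described, continuing from those imports (the category subset, max_features, and number of clusters are assumptions), is:

# Load a subset of the 20 newsgroups dataset
categories = ['rec.sport.baseball', 'sci.space', 'comp.graphics']
data = fetch_20newsgroups(subset='all', categories=categories,
                          remove=('headers', 'footers', 'quotes'))

# TF-IDF vectorization
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X = vectorizer.fit_transform(data.data)

# K-means clustering
k = 3
km = KMeans(n_clusters=k, random_state=42, n_init=10)
km.fit(X)

# Top terms per cluster: sort each cluster centroid by TF-IDF weight
terms = vectorizer.get_feature_names_out()
order_centroids = km.cluster_centers_.argsort()[:, ::-1]
for i in range(k):
    top_terms = [terms[ind] for ind in order_centroids[i, :10]]
    print(f"Cluster {i}: {', '.join(top_terms)}")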
Output:
7. Develop a Python program to implement rule based and statistical PoS Tagging on text.
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import stopwords
# Sample text
text = "John likes to play football with his friends."
Output:
Rule-based PoS tagging: [('John', 'NNP'), ('likes', 'VB'), ('to', 'TO'), ('play', 'VB'), ('football', 'NN'),
('with', 'TO'), ('his', 'NNP'), ('friends', 'NN')]
Statistical PoS tagging: [('John', 'NNP'), ('likes', 'VBZ'), ('play', 'NN'), ('football', 'NN'), ('friends',
'NNS')]
In the output:
For rule-based PoS tagging, each word is paired with a PoS tag assigned by the predefined rules.
For statistical PoS tagging, each word is paired with a PoS tag assigned by NLTK's built-in PoS tagger, which is based on a statistical model trained on large corpora.
import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import LambdaCallback
import random
import sys

# text = ...  (the training corpus used in the manual is not reproduced here)

# Map every distinct character to an integer index and back
chars = sorted(list(set(text)))
char_indices = {char: i for i, char in enumerate(chars)}
indices_char = {i: char for i, char in enumerate(chars)}

# Cut the corpus into overlapping windows of max_len characters
max_len = 40
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - max_len, step):
    sentences.append(text[i:i + max_len])
    next_chars.append(text[i + max_len])

# One-hot vectorization, the LSTM model definition, and the temperature-based
# sampling function are omitted in this excerpt. Generation extends the seed
# one character at a time (the function signature below is assumed):
def generate_text(seed_text, length, temperature=1.0):
    generated_text = seed_text
    for _ in range(length):
        # ... predict next_char from seed_text with the trained model ...
        generated_text += next_char
        seed_text = seed_text[1:] + next_char
    return generated_text

# on_epoch_end (omitted above) prints generated samples after every epoch
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
Output:
In the output:
After each epoch, the generated text is displayed with varying temperatures. Lower
temperatures lead to more deterministic output, while higher temperatures lead to more
diverse and creative output.
The model is gradually learning and improving its ability to generate text as the training
progresses through epochs.
Epoch 1/30
100/100 [==============================] - 2s 16ms/step - loss: 2.9398
Epoch 2/30
100/100 [==============================] - 2s 16ms/step - loss: 2.4381
----- Generating text after Epoch: 1
----- Temperature: 0.2
the the the the the the the the the the the the the the the the the the the the the the the the
the the the
----- Temperature: 0.5
is the the the the the the the the the the the the the the the the the the the the the the the the
the the
----- Temperature: 1.0
nd the the the the the the the the the the the the the the the the the the the the the the the
the the the
...
Epoch 30/30
100/100 [==============================] - 2s 16ms/step - loss: 1.2847
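The sampling function that produces this temperature behaviour is not shown in the excerpt above. A self-contained sketch of temperature-based sampling from a predicted character distribution (the probabilities below are made up for illustration):

import numpy as np

def sample(preds, temperature=1.0):
    # Re-weight the predicted distribution by temperature: low temperature
    # sharpens it (more deterministic), high temperature flattens it (more diverse)
    preds = np.asarray(preds, dtype='float64')
    preds = np.log(preds + 1e-8) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))

# Made-up distribution over four "characters" for illustration
probs = [0.6, 0.2, 0.15, 0.05]
for t in [0.2, 0.5, 1.0]:
    draws = [sample(probs, t) for _ in range(1000)]
    print(f"temperature={t}: share of most likely char = {draws.count(0) / 1000:.2f}")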
9. Develop a Python program to implement HMM and CRF on sequence tagging task.
First, make sure you have both libraries installed: pip install hmmlearn and
pip install sklearn-crfsuite
import numpy as np
from hmmlearn import hmm
from sklearn_crfsuite import CRF
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Evaluation
print("HMM Results:")
hmm_pred = hmm_model.predict(np.concatenate(X_test), [len(seq) for seq in X_test])
print(classification_report([item for sublist in y_test for item in sublist],
                            [item for sublist in hmm_pred for item in sublist]))

print("\nCRF Results:")
crf_pred = crf_model.predict(X_test)
print(classification_report([item for sublist in y_test for item in sublist],
                            [item for sublist in crf_pred for item in sublist]))
In this example (a complete, self-contained sketch follows this list):
We define a toy dataset X with sequences of words and y with corresponding sequence
labels.
We split the dataset into train and test sets.
We train both HMM and CRF models on the training data.
We evaluate the models on the test data using the classification_report function from
sklearn.metrics.
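The toy dataset, feature extraction, and training code are not reproduced in the excerpt above. A minimal self-contained sketch of the pipeline just described follows; the dataset, the count-based estimation of the HMM parameters, the CategoricalHMM class, the CRF feature template, and the simple slice used in place of train_test_split are illustrative assumptions rather than the manual's exact code.

import numpy as np
from hmmlearn import hmm
from sklearn_crfsuite import CRF
from sklearn.metrics import classification_report

# Hypothetical toy dataset: word sequences with their tag sequences
X = [["the", "dog", "barks"], ["a", "cat", "sleeps"],
     ["the", "cat", "barks"], ["a", "dog", "sleeps"],
     ["the", "dog", "sleeps"], ["a", "cat", "barks"]]
y = [["DET", "NOUN", "VERB"]] * 6
X_train, X_test, y_train, y_test = X[:4], X[4:], y[:4], y[4:]

words = sorted({w for seq in X for w in seq})
tags = sorted({t for seq in y for t in seq})
word2idx = {w: i for i, w in enumerate(words)}
tag2idx = {t: i for i, t in enumerate(tags)}

# --- HMM: estimate start/transition/emission probabilities by counting ---
start = np.zeros(len(tags))
trans = np.zeros((len(tags), len(tags)))
emit = np.zeros((len(tags), len(words)))
for seq, tag_seq in zip(X_train, y_train):
    start[tag2idx[tag_seq[0]]] += 1
    for i, (w, t) in enumerate(zip(seq, tag_seq)):
        emit[tag2idx[t], word2idx[w]] += 1
        if i > 0:
            trans[tag2idx[tag_seq[i - 1]], tag2idx[t]] += 1
# Normalise with add-one smoothing
start = (start + 1) / (start + 1).sum()
trans = (trans + 1) / (trans + 1).sum(axis=1, keepdims=True)
emit = (emit + 1) / (emit + 1).sum(axis=1, keepdims=True)

hmm_model = hmm.CategoricalHMM(n_components=len(tags))
hmm_model.startprob_, hmm_model.transmat_, hmm_model.emissionprob_ = start, trans, emit

print("HMM Results:")
hmm_pred = []
for seq in X_test:
    obs = np.array([[word2idx[w]] for w in seq])   # encode words as symbol ids
    states = hmm_model.predict(obs)                # Viterbi decoding
    hmm_pred.append([tags[s] for s in states])     # map state ids back to tags
print(classification_report([t for s in y_test for t in s],
                            [t for s in hmm_pred for t in s]))

# --- CRF: simple per-token feature dictionaries ---
def token_features(seq, i):
    return {"word": seq[i].lower(), "is_first": i == 0, "is_last": i == len(seq) - 1}

X_train_feats = [[token_features(seq, i) for i in range(len(seq))] for seq in X_train]
X_test_feats = [[token_features(seq, i) for i in range(len(seq))] for seq in X_test]

crf_model = CRF(algorithm="lbfgs", max_iterations=100)
crf_model.fit(X_train_feats, y_train)

print("\nCRF Results:")
crf_pred = crf_model.predict(X_test_feats)
print(classification_report([t for s in y_test for t in s],
                            [t for s in crf_pred for t in s]))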
Output:
HMM Results:
precision recall f1-score support
accuracy 0.67 7
macro avg 0.62 0.58 0.54 7
weighted avg 0.81 0.67 0.64 7
CRF Results:
precision recall f1-score support
accuracy 0.86 7
macro avg 0.88 0.92 0.87 7
weighted avg 0.93 0.86 0.86 7
Viva Questions