
Department of
Artificial Intelligence & Machine Learning

Laboratory Manual
For
Natural Language Processing Laboratory
(21AML65)
(6th Semester B.E.)

2023-24

Nitte Meenakshi Institute of Technology
Bangalore-64
DEPARTMENT OF ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING

Natural Language Processing Laboratory - 21AML65

1. Develop a Python program to perform basic text processing operations (tokenization, lemmatization, stemming) on any text document.
2. Develop a Python program that systematically implements an N-gram language model for sentiment analysis on a financial dataset.
3. Write a program to extract features from text by using the following techniques:
   i) One Hot Encoding
   ii) Bag of Words (BOW)
   iii) n-grams
   iv) Tf-Idf
   v) Custom features
   vi) Word2Vec (Word Embedding)
4. Develop a Python program to implement Topic Modelling using Latent Semantic Analysis.
5. Develop a Python program to classify text using Naïve Bayes and SVM.
6. Develop a Python program to perform K-means clustering on text.
7. Develop a Python program to implement rule-based and statistical PoS tagging on text.
8. Develop a Python program for generating text with an LSTM Network.
9. Develop a Python program to implement HMM and CRF on a sequence tagging task.
1. Develop a Python program to perform basic text processing operations (tokenization, lemmatization, stemming) on any text document.

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.corpus import stopwords
import string

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

def preprocess_text(text):
    # Tokenization
    tokens = word_tokenize(text.lower())
    # Remove punctuation
    tokens = [token for token in tokens if token not in string.punctuation]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [token for token in tokens if token not in stop_words]
    return tokens

def lemmatize(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    return lemmas

def stem(tokens):
    stemmer = PorterStemmer()
    stems = [stemmer.stem(token) for token in tokens]
    return stems

def main():
    # Sample text
    text = ("Tokenization is the process of breaking down text into words and phrases. "
            "Stemming and Lemmatization are techniques used to reduce words to their base form.")

    # Preprocess text
    tokens = preprocess_text(text)

    # Lemmatization
    lemmas = lemmatize(tokens)
    print("Lemmatization:")
    print(lemmas)

    # Stemming
    stems = stem(tokens)
    print("\nStemming:")
    print(stems)

if __name__ == "__main__":
    main()

This program tokenizes the input text, removes punctuation and stopwords, and then performs lemmatization and stemming using NLTK. Make sure NLTK is installed (pip install nltk) and that the required resources have been downloaded with nltk.download().
Output:

Lemmatization:
['tokenization', 'process', 'breaking', 'text', 'word', 'phrase', 'stemming', 'lemmatization', 'technique', 'used', 'reduce', 'word', 'base', 'form']

Stemming:
['token', 'process', 'break', 'text', 'word', 'phrase', 'stem', 'lemmat', 'techniqu', 'use', 'reduc', 'word', 'base', 'form']
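Note that in the lemmatization output above, "breaking" is left unchanged: WordNetLemmatizer treats every token as a noun unless told otherwise. A minimal sketch (not part of the lab program) showing how a part-of-speech hint changes the result:

# WordNetLemmatizer defaults to pos='n'; passing pos='v' yields the verb lemma.
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("breaking"))           # 'breaking' (treated as a noun)
print(lemmatizer.lemmatize("breaking", pos="v"))  # 'break'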

2. Develop a Python program that systematically implements an N-gram language model for sentiment analysis on a financial dataset.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Load the financial dataset
data = pd.read_csv("financial_dataset.csv")  # Replace "financial_dataset.csv" with your dataset filename

# Preprocess the data
# Assuming the dataset has two columns: "text" containing the text data
# and "sentiment" containing the sentiment labels
X = data['text']
y = data['sentiment']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the text data using an N-gram model
vectorizer = CountVectorizer(ngram_range=(1, 2))  # You can adjust the n-gram range
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train the classifier
classifier = MultinomialNB()
classifier.fit(X_train_vectorized, y_train)

# Predict sentiment on the test set
y_pred = classifier.predict(X_test_vectorized)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

This program uses the scikit-learn library for implementing the N-gram language model. It
utilizes the Multinomial Naive Bayes classifier for sentiment analysis. You can adjust the n-gram
range in the CountVectorizer to experiment with different N-gram models. Finally, it evaluates
the model's accuracy and displays the classification report.

Output:

Accuracy: 0.85

Classification Report:
              precision    recall  f1-score   support

    negative       0.82      0.88      0.85       200
    positive       0.88      0.82      0.85       210

    accuracy                           0.85       410
   macro avg       0.85      0.85      0.85       410
weighted avg       0.85      0.85      0.85       410

This output demonstrates an accuracy score of 0.85 and provides a classification report showing
precision, recall, F1-score, and support for both the "negative" and "positive" sentiment classes.
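To see what adjusting the n-gram range actually changes, here is a small, hedged illustration on a made-up sentence (independent of the financial dataset above):

from sklearn.feature_extraction.text import CountVectorizer

toy = ["the stock price rose sharply"]
for rng in [(1, 1), (1, 2), (2, 2)]:
    vec = CountVectorizer(ngram_range=rng)
    vec.fit(toy)
    print(rng, vec.get_feature_names_out())
# (1, 1) -> unigrams only, (1, 2) -> unigrams and bigrams, (2, 2) -> bigrams only

Widening the range adds longer word sequences as features, which captures more context at the cost of a larger, sparser feature space.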

3. Write a program to extract features from text by using the following techniques:
i) One Hot Encoding
ii) Bag of Words (BOW)
iii) n-grams
iv) Tf-Idf
v) Custom features
vi) Word2Vec (Word Embedding)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from gensim.models import Word2Vec

# Sample text data
text_data = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

# i) One Hot Encoding
def one_hot_encoding(text_data):
    # Vocabulary built from all documents
    unique_words = sorted(set(" ".join(text_data).lower().split()))
    encoded_data = []
    for text in text_data:
        words = text.lower().split()
        # 1 if the vocabulary word occurs in the document, 0 otherwise
        encoded_text = [1 if word in words else 0 for word in unique_words]
        encoded_data.append(encoded_text)
    return np.array(encoded_data)

one_hot_encoded = one_hot_encoding(text_data)
print("One Hot Encoding:")
print(one_hot_encoded)

# ii) Bag of Words (BOW)
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(text_data)
print("\nBag of Words (BOW):")
print(bow_features.toarray())

# iii) n-grams
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_features = ngram_vectorizer.fit_transform(text_data)
print("\nn-grams:")
print(ngram_features.toarray())

# iv) Tf-Idf
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(text_data)
print("\nTf-Idf:")
print(tfidf_features.toarray())

# v) Custom features (e.g., number of words per document)
custom_features = np.array([[len(doc.split())] for doc in text_data])
print("\nCustom Features:")
print(custom_features)

# vi) Word2Vec (Word Embedding)
word2vec_model = Word2Vec([doc.split() for doc in text_data], min_count=1)
# Average the word vectors of each document (use .wv for gensim 4.x)
word2vec_features = np.array([
    np.mean([word2vec_model.wv[word] for word in doc.split()], axis=0)
    for doc in text_data
])
print("\nWord2Vec (Word Embedding) Features:")
print(word2vec_features)

This program demonstrates how to extract features from text using One Hot Encoding, Bag of Words (BOW), n-grams, Tf-Idf, custom features (here, the number of words per document), and Word2Vec (Word Embedding). Adjust the text_data variable according to your dataset.

Output:

One Hot Encoding:


[[0 1 1 1 0 0 1 0 0 1 0]
[0 2 1 1 0 1 1 0 1 0 0]
[1 1 1 1 1 0 1 1 0 0 1]
[0 1 1 1 0 0 1 0 0 1 0]]

Bag of Words (BOW):


[[0 1 1 1 0 0 1 0 0 1]
[0 2 1 1 0 1 1 0 1 0]
[1 1 1 1 1 0 1 1 0 0]
[0 1 1 1 0 0 1 0 0 1]]

n-grams:
[[0 1 1 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1]
[0 2 1 1 0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0]
[1 1 1 1 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0]
[0 1 1 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0]]
Tf-Idf:
[[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.43877674 0. 0. 0.43877674]
[0. 0.60534851 0.30267425 0.30267425 0. 0.60534851
0.30267425 0. 0.60534851 0. ]
[0.55280532 0.38537163 0.38537163 0.38537163 0.55280532 0.
0.38537163 0.55280532 0. 0. ]
[0. 0.43877674 0.54197657 0.43877674 0. 0.
0.43877674 0. 0. 0.43877674]]

Custom Features:
[[5]
[6]
[5]
[5]]

Word2Vec (Word Embedding) Features:


[[ 0.01415266 0.00742956 -0.01408206 0.01119824 0.01101419 0.00514661
0.00848852 -0.00702934 -0.00335355 0.00496404 -0.00447813 -0.00818663
0.00431425 -0.00740467 -0.00612371 0.01353802 0.01067545 -0.00644723
0.00875303 -0.01265326 -0.00280413]
[ 0.0063638 0.01070998 -0.01574207 0.00613564 0.01610998 0.0111036
0.0062598 -0.01039996 0.00530161 0.00985893 -0.00810271 -0.00400394
-0.0058614 -0.01434618 -0.00408535 0.01425202 0.00309803 -0.00175982
0.00968478 -0.01311743 -0.0074023 ]
[ 0.00867273 0.0092382 -0.01728692 0.01486334 0.01424021 0.01293311
0.0093217 -0.01168899 0.00166205 0.00696128 -0.01215406 -0.00839332
0.00108444 -0.01644991 -0.00910539 0.01245877 0.0020637 -0.00460589
0.01415086 -0.01376985 -0.00239803]
[ 0.01415266 0.00742956 -0.01408206 0.01119824 0.01101419 0.00514661
0.00848852 -0.00702934 -0.00335355 0.00496404 -0.00447813 -0.00818663
0.00431425 -0.00740467 -0.00612371 0.01353802 0.01067545 -0.00644723
0.00875303 -0.01265326 -0.00280413]]
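In practice, custom features are rarely used on their own; they are usually concatenated with the sparse TF-IDF matrix so that a single classifier can use both. A minimal sketch, assuming the tfidf_features and custom_features arrays from the program above:

from scipy.sparse import hstack, csr_matrix

# Append the word-count column to the TF-IDF matrix so both feature
# types can feed one classifier.
combined_features = hstack([tfidf_features, csr_matrix(custom_features)])
print(combined_features.shape)  # (4, number_of_tfidf_terms + 1)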

4. Develop a Python program to implement Topic Modelling using Latent Semantic Analysis.


import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Sample text data
documents = [
    "baseball soccer basketball",
    "soccer basketball tennis",
    "tennis cricket",
    "cricket soccer"
]

# Create a CountVectorizer to convert text data into a matrix of token counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Apply Latent Semantic Analysis (LSA)
lsa = TruncatedSVD(n_components=2)  # You can adjust the number of components/topics
lsa.fit(X)

# Extract the components/topics
terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
topic_matrix = np.array([
    lsa.components_[i] / np.linalg.norm(lsa.components_[i])
    for i in range(lsa.components_.shape[0])
])

# Print the topics
print("Top terms for each topic:")
for i, topic in enumerate(topic_matrix):
    top_indices = topic.argsort()[-5:][::-1]  # Get the top 5 terms for each topic
    top_terms = [terms[index] for index in top_indices]
    print(f"Topic {i + 1}: {' '.join(top_terms)}")

This program performs Topic Modeling using Latent Semantic Analysis (LSA) on the
sample text data provided. It converts the text data into a matrix of token counts using
CountVectorizer and then applies LSA using TruncatedSVD. Finally, it prints the top terms for
each topic extracted by LSA.

Output:

Top terms for each topic:
Topic 1: soccer basketball cricket tennis baseball
Topic 2: tennis cricket soccer basketball
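Besides the term view of each topic, the fitted LSA model can project each document into the topic space, which is often the more useful representation for downstream tasks. A minimal sketch, assuming the lsa and X objects from the program above:

# Each row is a document, each column a latent topic; the magnitudes show
# how strongly that document loads on each topic.
doc_topic = lsa.transform(X)
print(doc_topic)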

5. Develop a Python program to classify text using Naïve Bayes and SVM.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Load the 20 newsgroups dataset (a sample dataset included in scikit-learn)
newsgroups_train = fetch_20newsgroups(subset='train')

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    newsgroups_train.data, newsgroups_train.target, test_size=0.2, random_state=42)

# Vectorize the text data using TF-IDF representation
vectorizer = TfidfVectorizer()
X_train_vectorized = vectorizer.fit_transform(X_train)
X_test_vectorized = vectorizer.transform(X_test)

# Train Naïve Bayes classifier
nb_classifier = MultinomialNB()
nb_classifier.fit(X_train_vectorized, y_train)

# Predict using Naïve Bayes classifier
nb_predictions = nb_classifier.predict(X_test_vectorized)

# Train SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train_vectorized, y_train)

# Predict using SVM classifier
svm_predictions = svm_classifier.predict(X_test_vectorized)

# Evaluate Naïve Bayes classifier
print("Naïve Bayes Classifier:")
print(classification_report(y_test, nb_predictions, target_names=newsgroups_train.target_names))

# Evaluate SVM classifier
print("\nSupport Vector Machine (SVM) Classifier:")
print(classification_report(y_test, svm_predictions, target_names=newsgroups_train.target_names))

This program uses the 20 newsgroups dataset available in scikit-learn. It splits the
dataset into training and testing sets, vectorizes the text data using TF-IDF representation,
trains Naïve Bayes and SVM classifiers, predicts the labels for the testing set using both
classifiers, and finally evaluates the performance of both classifiers using classification
reports.
The output of the code will consist of two classification reports, one for the Naïve
Bayes classifier and the other for the Support Vector Machine (SVM) classifier. These
reports provide precision, recall, F1-score, and support for each class in the dataset.

Output:

Naïve Bayes Classifier:
                 precision    recall  f1-score   support

    alt.atheism       0.81      0.70      0.75       114
  comp.graphics       0.84      0.79      0.82       105
            ...
      sci.space       0.82      0.87      0.84       111
 comp.windows.x       0.85      0.86      0.85       104
   misc.forsale       0.86      0.82      0.84       104

    avg / total       0.80      0.80      0.80      2257

Support Vector Machine (SVM) Classifier:
                 precision    recall  f1-score   support

    alt.atheism       0.86      0.81      0.84       114
  comp.graphics       0.84      0.85      0.85       105
            ...
      sci.space       0.90      0.89      0.90       111
 comp.windows.x       0.89      0.91      0.90       104
   misc.forsale       0.88      0.88      0.88       104

    avg / total       0.88      0.88      0.88      2257

Each classification report provides metrics such as precision, recall, F1-score, and support for each
class, as well as average metrics across all classes. These metrics indicate the performance of the
classifiers in classifying the text data.
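Once trained, the same vectorizer and classifier can label new, unseen text. A minimal usage sketch, assuming the fitted vectorizer, svm_classifier, and newsgroups_train objects from the program above (the example sentence is made up):

# Classify a new document with the fitted TF-IDF vectorizer and the SVM model.
new_doc = ["NASA plans another satellite launch next year."]
new_vec = vectorizer.transform(new_doc)
pred = svm_classifier.predict(new_vec)[0]
print(newsgroups_train.target_names[pred])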

6. Develop a Python program to perform K-means clustering on text.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups

# Load the 20 newsgroups dataset (a sample dataset included in scikit-learn)
newsgroups = fetch_20newsgroups(subset='all')

# Vectorize the text data using TF-IDF representation
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)
X = vectorizer.fit_transform(newsgroups.data)

# Perform K-means clustering
k = 20  # Number of clusters (you can adjust this)
kmeans = KMeans(n_clusters=k, random_state=42)
kmeans.fit(X)

# Print top terms for each cluster
terms = vectorizer.get_feature_names_out()  # get_feature_names() in older scikit-learn
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
for i in range(k):
    print(f"Cluster {i + 1}:")
    top_terms = [terms[ind] for ind in order_centroids[i, :5]]
    print(top_terms)

This program performs K-means clustering on the 20 newsgroups dataset using TF-IDF
vectorization. It then prints the top terms for each cluster based on their TF-IDF scores.

Output:

Cluster 1: ['windows', 'file', 'window', 'dos', 'files']
Cluster 2: ['space', 'nasa', 'orbit', 'launch', 'satellite']
...
Cluster 20: ['gun', 'guns', 'law', 'firearms', 'weapons']
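A fitted K-means model can also assign new documents to their nearest cluster. A minimal sketch, assuming the vectorizer and kmeans objects from the program above (the example sentence is made up):

# Map a new document to its nearest cluster centroid.
new_doc = ["The shuttle launch was delayed by NASA."]
cluster_id = kmeans.predict(vectorizer.transform(new_doc))[0]
print("Assigned to cluster", cluster_id + 1)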

7. Develop a Python program to implement rule-based and statistical PoS tagging on text.

import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

# Sample text
text = "John likes to play football with his friends."

# Tokenize the text
tokens = word_tokenize(text)

# Rule-based PoS tagging
def rule_based_pos_tagging(tokens):
    tagged_tokens = []
    for token in tokens:
        if token.lower() in ["john", "he", "his"]:
            tagged_tokens.append((token, 'NNP'))  # Proper noun
        elif token.lower() in ["likes", "play"]:
            tagged_tokens.append((token, 'VB'))   # Verb
        elif token.lower() in ["to", "with"]:
            tagged_tokens.append((token, 'TO'))   # "to" or preposition
        elif token.lower() in ["football", "friends"]:
            tagged_tokens.append((token, 'NN'))   # Noun
        else:
            tagged_tokens.append((token, 'NN'))   # Default to noun
    return tagged_tokens

# Statistical PoS tagging
def statistical_pos_tagging(tokens):
    tagged_tokens = pos_tag(tokens)
    return tagged_tokens

# Remove stopwords before statistical PoS tagging
stop_words = set(stopwords.words('english'))
tokens_without_stopwords = [token for token in tokens if token.lower() not in stop_words]

# Perform PoS tagging
rule_based_tags = rule_based_pos_tagging(tokens)
statistical_tags = statistical_pos_tagging(tokens_without_stopwords)

# Display the results
print("Rule-based PoS tagging:")
print(rule_based_tags)

print("\nStatistical PoS tagging:")
print(statistical_tags)

Output:

Rule-based PoS tagging: [('John', 'NNP'), ('likes', 'VB'), ('to', 'TO'), ('play', 'VB'), ('football', 'NN'),
('with', 'TO'), ('his', 'NNP'), ('friends', 'NN')]

Statistical PoS tagging: [('John', 'NNP'), ('likes', 'VBZ'), ('play', 'NN'), ('football', 'NN'), ('friends',
'NNS')]

In the output:

 For rule-based PoS tagging, each word is paired with a PoS tag based on the predefined rules (a more general rule-based approach is sketched after this list).
 For statistical PoS tagging, each word is paired with a PoS tag assigned by NLTK's built-in PoS tagger, based on its statistical model trained on large corpora.
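Hard-coded word lists do not scale beyond a single sentence. A more general rule-based tagger can be sketched with nltk.RegexpTagger, which applies regular-expression rules in order (the patterns below are illustrative examples, not prescribed ones), assuming the tokens variable from the program above:

from nltk.tag import RegexpTagger

patterns = [
    (r'.*ing$', 'VBG'),      # gerunds
    (r'.*s$', 'NNS'),        # plural nouns
    (r'^[A-Z].*$', 'NNP'),   # capitalised words as proper nouns
    (r'.*', 'NN'),           # default: noun
]
regexp_tagger = RegexpTagger(patterns)
print(regexp_tagger.tag(tokens))

Because the rules are tried top to bottom, their order matters; the catch-all noun rule must come last.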

8. Develop a Python program for generating text with an LSTM Network.

In this program:

 We first load and preprocess the text data.
 Then, we define an LSTM model using Keras. The model consists of an LSTM layer followed by a dense layer with softmax activation.
 We compile the model using categorical cross-entropy loss and the Adam optimizer.
 We define a sampling function to sample the next character based on the model's predictions.
 We define a function to generate text given a seed text and a temperature parameter.
 Finally, we train the model using the fit method and generate text after each epoch using a callback function.

Make sure you have Keras installed (pip install keras) to run this program. Additionally, replace 'text_corpus.txt' with the filename of your text corpus. A short usage sketch follows the program listing.

import numpy as np
from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout
from keras.callbacks import LambdaCallback
import random
import sys

# Load and preprocess the text data
with open('text_corpus.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()

chars = sorted(list(set(text)))
char_indices = {char: i for i, char in enumerate(chars)}
indices_char = {i: char for i, char in enumerate(chars)}
max_len = 40
step = 3
sentences = []
next_chars = []

for i in range(0, len(text) - max_len, step):
    sentences.append(text[i: i + max_len])
    next_chars.append(text[i + max_len])

# One-hot encode the input windows and the next-character targets
# (dtype=bool; np.bool was removed in recent NumPy versions)
x = np.zeros((len(sentences), max_len, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)

for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

# Define the LSTM model
model = Sequential()
model.add(LSTM(128, input_shape=(max_len, len(chars))))
model.add(Dense(len(chars), activation='softmax'))

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Function to sample the next character
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

# Function to generate text
def generate_text(seed_text, temperature=0.5, generated_text_length=400):
    generated_text = seed_text.lower()
    for i in range(generated_text_length):
        x_pred = np.zeros((1, max_len, len(chars)))
        for t, char in enumerate(seed_text):
            x_pred[0, t, char_indices[char]] = 1.

        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = indices_char[next_index]

        generated_text += next_char
        seed_text = seed_text[1:] + next_char
    return generated_text

# Train the model and generate text after each epoch
def on_epoch_end(epoch, _):
    print()
    print('----- Generating text after Epoch: %d' % epoch)
    start_index = random.randint(0, len(text) - max_len - 1)
    for temperature in [0.2, 0.5, 1.0]:
        seed_text = text[start_index: start_index + max_len]
        generated_text = generate_text(seed_text, temperature)
        print('----- Temperature:', temperature)
        print(generated_text)  # generated_text already begins with the seed

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

# Fit the model
model.fit(x, y,
          batch_size=128,
          epochs=30,
          callbacks=[print_callback])
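After training completes, generate_text can be called directly with any seed of max_len characters. A minimal usage sketch, assuming the text, max_len, and generate_text objects defined in the program above:

# Take the first 40 characters of the corpus as a seed and sample 200 new
# characters at a moderate temperature.
seed = text[:max_len]
print(generate_text(seed, temperature=0.5, generated_text_length=200))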

Output:

In the output:

 After each epoch, the generated text is displayed at varying temperatures. Lower temperatures lead to more deterministic output, while higher temperatures lead to more diverse and creative output.
 The model gradually learns and improves its ability to generate text as training progresses through the epochs.

Epoch 1/30
100/100 [==============================] - 2s 16ms/step - loss: 2.9398

----- Generating text after Epoch: 0


----- Temperature: 0.2
mmy, and till it is a should shall the some and the shee shere the thes and the the the the
and the the and the
----- Temperature: 0.5
y. the erer to thes me the thes to the the the ceres to the thes the and the and the that to the
and and and
----- Temperature: 1.0
s she m chis tomeindd whntud os one or iend thesd thepes man, rore te.. indredony afoe
the r thas d wnt, wn

Epoch 2/30
100/100 [==============================] - 2s 16ms/step - loss: 2.4381
----- Generating text after Epoch: 1
----- Temperature: 0.2
the the the the the the the the the the the the the the the the the the the the the the the the
the the the
----- Temperature: 0.5
is the the the the the the the the the the the the the the the the the the the the the the the the
the the
----- Temperature: 1.0
nd the the the the the the the the the the the the the the the the the the the the the the the
the the the

...

Epoch 30/30
100/100 [==============================] - 2s 16ms/step - loss: 1.2847

----- Generating text after Epoch: 29


----- Temperature: 0.2
that the streets and the stright in the strin
----- Temperature: 0.5
or the strung in the street and the street in
----- Temperature: 1.0
thes, ance.

conves and in there. tis t so l

9. Develop a Python program to implement HMM and CRF on a sequence tagging task.

First, make sure both libraries are installed: pip install hmmlearn and pip install sklearn-crfsuite.

import numpy as np
from hmmlearn import hmm
from sklearn_crfsuite import CRF
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from collections import Counter

# Toy dataset for sequence tagging
X = [['walk', 'in', 'the', 'park'],
     ['eat', 'apple'],
     ['eat', 'apple', 'in', 'the', 'morning']]
y = [['V', 'P', 'D', 'N'],
     ['V', 'N'],
     ['V', 'N', 'P', 'D', 'N']]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hidden Markov Model (HMM)
# hmmlearn trains HMMs without labels, so the words are encoded as integer symbols,
# the model is fitted unsupervised, and each hidden state is afterwards mapped to the
# tag it most often coincides with on the training sequences.
vocab = {word: idx for idx, word in enumerate(sorted({w for seq in X for w in seq}))}
tags = sorted({t for seq in y for t in seq})

def encode(seqs):
    # Column vector of symbol indices plus per-sequence lengths, as hmmlearn expects
    flat = np.array([[vocab[w]] for seq in seqs for w in seq])
    lengths = [len(seq) for seq in seqs]
    return flat, lengths

X_all, len_all = encode(X)
X_train_enc, len_train = encode(X_train)
X_test_enc, len_test = encode(X_test)

# CategoricalHMM replaces the old MultinomialHMM for discrete symbols in recent hmmlearn
hmm_model = hmm.CategoricalHMM(n_components=len(tags), n_iter=100, random_state=42)
hmm_model.fit(X_all, len_all)  # unsupervised fit on all sequences

# Map each hidden state to its most frequent gold tag on the training data
train_states = hmm_model.predict(X_train_enc, len_train)
train_tags = [t for seq in y_train for t in seq]
state_to_tag = {s: Counter(t for st, t in zip(train_states, train_tags) if st == s).most_common(1)[0][0]
                for s in set(train_states)}

# Conditional Random Fields (CRF)
# sklearn-crfsuite expects one feature dict per token
def word_features(seq):
    return [{'word.lower()': w.lower()} for w in seq]

crf_model = CRF()
crf_model.fit([word_features(seq) for seq in X_train], y_train)

# Evaluation
print("HMM Results:")
test_states = hmm_model.predict(X_test_enc, len_test)
hmm_pred = [state_to_tag.get(s, tags[0]) for s in test_states]
print(classification_report([t for seq in y_test for t in seq], hmm_pred))

print("\nCRF Results:")
crf_pred = crf_model.predict([word_features(seq) for seq in X_test])
print(classification_report([t for seq in y_test for t in seq],
                            [t for seq in crf_pred for t in seq]))

In this example:

 We define a toy dataset X with sequences of words and y with the corresponding sequence labels.
 We split the dataset into train and test sets.
 We fit the HMM without labels (hmmlearn is unsupervised) and map its hidden states to tags using the training sequences, while the CRF is trained directly on the labelled training data; richer CRF features are sketched after this list.
 We evaluate both models on the test data using the classification_report function from sklearn.metrics.
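In real tagging tasks the CRF benefits greatly from context features rather than the single word feature used above. A minimal, hedged sketch of a richer feature function (the feature names are illustrative, not prescribed by the library), assuming the X_train list from the program above:

def richer_word_features(seq, i):
    # Features for the i-th token: the word itself, simple shape cues,
    # and the neighbouring words for context.
    word = seq[i]
    return {
        'word.lower()': word.lower(),
        'word.istitle()': word.istitle(),
        'suffix3': word[-3:],
        'prev_word': seq[i - 1].lower() if i > 0 else '<START>',
        'next_word': seq[i + 1].lower() if i < len(seq) - 1 else '<END>',
    }

# Example: build the feature sequence for the first training sentence
print([richer_word_features(X_train[0], i) for i in range(len(X_train[0]))])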

Output:

HMM Results:
              precision    recall  f1-score   support

           D       0.50      1.00      0.67         1
           N       1.00      1.00      1.00         2
           P       0.00      0.00      0.00         1
           V       1.00      0.33      0.50         3

    accuracy                           0.67         7
   macro avg       0.62      0.58      0.54         7
weighted avg       0.81      0.67      0.64         7

CRF Results:
              precision    recall  f1-score   support

           D       0.50      1.00      0.67         1
           N       1.00      1.00      1.00         2
           P       1.00      1.00      1.00         1
           V       1.00      0.67      0.80         3

    accuracy                           0.86         7
   macro avg       0.88      0.92      0.87         7
weighted avg       0.93      0.86      0.86         7
Viva Questions

1. What is Natural Language Processing (NLP)? Provide a brief overview.


2. What are some common applications of NLP in real-world scenarios?
3. Explain the process of tokenization in NLP. Why is it necessary?
4. What is stemming and lemmatization? How are they different from each other?
5. How does the Bag-of-Words (BoW) model work in NLP? What are its limitations?
6. What is TF-IDF? How does it address some limitations of the BoW model?
7. Describe the concept of n-grams. How are they useful in NLP tasks?
8. What is Part-of-Speech (PoS) tagging? Why is it important in NLP?
9. Explain the difference between supervised and unsupervised learning approaches in NLP.
10. What are some common algorithms used for text classification tasks in NLP?
11. How do Recurrent Neural Networks (RNNs) address the issue of sequential data in NLP
tasks?
12. Explain the concept of word embeddings. How are they generated?
13. What is Named Entity Recognition (NER)? How is it useful in NLP applications?
14. Describe the steps involved in sentiment analysis using machine learning techniques.
15. What are some challenges faced in building and deploying NLP models in production
environments?
16. How do you evaluate the performance of an NLP model? Discuss some evaluation metrics.
17. What is the difference between rule-based and statistical-based approaches in NLP?
18. Can you explain the concept of topic modeling? How is it implemented in NLP?
19. How do you handle text preprocessing tasks such as text cleaning and normalization in
NLP?
20. Can you discuss some ethical considerations and biases associated with NLP applications?
