EXPERIMENT NO: 6
Objective:
To implement a Named Entity Recognizer that identifies entities such as people, organizations,
locations, and dates in a text using NLP techniques.
Theory:
Named Entity Recognition (NER) is an NLP task that involves locating and classifying named
entities (such as person names, locations, and organizations) in a given text. NER is crucial for
information retrieval, question answering, and machine translation.
NER is commonly implemented using supervised learning models such as Conditional Random
Fields (CRF), Hidden Markov Models (HMM), or neural network-based models like LSTM,
BiLSTM, and Transformers.
Steps:
1. Import Libraries: Import spacy and load a pretrained English pipeline.
2. Process Text: Run the pipeline on the input text to obtain a Doc object.
3. Extract Entities: Iterate over doc.ents and print each entity's text and label.
Program:
import spacy
# Load a pretrained English pipeline (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Input text
text = "Apple is looking at buying U.K. startup for $1 billion."
# Process text
doc = nlp(text)
# Display named entities with their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
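Expected Output:
With the pretrained en_core_web_sm pipeline, the entities typically detected are (exact spans
and labels may vary with the model version):
Apple ORG
U.K. GPE
$1 billion MONEY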
Theory:
Semantic Role Labeling (SRL) is the process of determining the role that words or phrases play in a
sentence with respect to a predicate. SRL answers "Who did what to whom, when, where, and how?" It can
also be used to identify named entities by understanding their roles in sentences.
Steps:
1. Import Libraries: Import the AllenNLP Predictor class.
2. Load Model: Load a pretrained SRL model.
3. Predict: Run the predictor on the input sentence.
4. Display: Print the labeled description for each detected verb.
Program:
from allennlp.predictors.predictor import Predictor
# Load a pretrained AllenNLP SRL model (URL may change between releases)
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz")
# Input text
sentence = "John sold his car to Mike last week."
# Predict SRL
results = predictor.predict(sentence=sentence)
# Display results
for verb in results['verbs']:
    print(verb['description'])
Expected Output:
[ARG0: John] [V: sold] [ARG1: his car] [ARG2: to Mike] [ARGM-TMP: last week]
Conclusion:
Semantic Role Labeling helps in understanding the roles of different entities and provides a deeper
understanding of sentence structure.
Assessment Questions:
1. What is Named Entity Recognition, and where is it used?
2. How does Semantic Role Labeling differ from Named Entity Recognition?
3. Name some common models used to implement NER.
Objective:
To build a text classification model using a Logistic Regression algorithm to classify text data
into different categories.
Theory:
Text classification is the process of assigning categories to text. Logistic Regression is a linear
model that is widely used for binary and multi-class classification problems. In NLP, Logistic
Regression can be used to classify documents into categories based on word frequencies or
TF-IDF scores.
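To see what TF-IDF features look like, here is a minimal sketch on two toy documents (separate
from the dataset used in the program below):
from sklearn.feature_extraction.text import TfidfVectorizer
# Each word is weighted by term frequency x inverse document frequency,
# so words common across the corpus receive lower weights
docs = ["free prize call now", "meeting at noon today"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # learned vocabulary
print(X.toarray())                  # one TF-IDF row per document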
Steps:
1. Import Libraries: Import pandas and the required scikit-learn modules.
2. Load Data: Read a labeled message dataset (e.g., the SMS Spam Collection).
3. Preprocessing: Lowercase the message text.
4. Feature Extraction: Convert messages to TF-IDF vectors.
5. Model Training: Split the data and fit a Logistic Regression model.
6. Evaluation: Report precision, recall, and F1-score on the test set.
Program:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load dataset (assumed CSV with 'label' and 'message' columns,
# e.g., the SMS Spam Collection)
df = pd.read_csv('spam.csv')
# Preprocessing
df['message'] = df['message'].str.lower()
# Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['message'])
y = df['label']
# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model Training
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output:
              precision    recall  f1-score   support

         ham       0.97      0.98      0.98       965
        spam       0.91      0.88      0.89       150

    accuracy                           0.96      1115
   macro avg       0.94      0.93      0.93      1115
weighted avg       0.96      0.96      0.96      1115
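Once trained, any new text must be transformed with the same fitted vectorizer before prediction;
a minimal usage sketch (the message below is illustrative):
# Classify a new, unseen message
new_msg = ["congratulations! you have won a free prize, call now"]
print(model.predict(vectorizer.transform(new_msg)))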
Conclusion:
Logistic Regression is effective for text classification tasks and provides good performance with
simple implementation.
Assessment Questions:
1. What is TF-IDF, and why is it preferred over raw word counts?
2. How does Logistic Regression handle multi-class classification?
3. Which metrics are used to evaluate a text classifier, and what do they measure?
Objective:
To implement a sentiment classifier for movie reviews using Natural Language Processing
(NLP) techniques.
Theory:
Sentiment analysis is the task of determining whether a piece of text (e.g., a movie review)
expresses a positive, negative, or neutral sentiment. Commonly used machine learning models
for sentiment classification include Naive Bayes, Logistic Regression, and Support Vector
Machines, as well as deep learning models like LSTM.
Steps:
1. Import Libraries: Import the required scikit-learn modules.
2. Load Data: Load the movie review dataset with load_files.
3. Preprocessing: Decode and lowercase the reviews.
4. Feature Extraction: Convert reviews to TF-IDF vectors.
5. Model Training: Split the data and fit a Logistic Regression model.
6. Evaluation: Report precision, recall, and F1-score on the test set.
Program:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load dataset (movie review polarity dataset: a 'txt_sentoken'
# directory with 'neg' and 'pos' subfolders)
reviews = load_files('txt_sentoken')
X, y = reviews.data, reviews.target
# Preprocessing
X = [doc.decode('utf-8').lower() for doc in X]
# Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_vec = vectorizer.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_vec, y, test_size=0.25, random_state=42)
# Model Training
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output:
              precision    recall  f1-score   support

           0       0.89      0.86      0.88       250
           1       0.87      0.89      0.88       250

    accuracy                           0.88       500
   macro avg       0.88      0.88      0.88       500
weighted avg       0.88      0.88      0.88       500
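Because load_files assigns targets from the sorted folder names, class 0 corresponds to 'neg' and
class 1 to 'pos'. A new review can then be classified as follows (the review text is illustrative):
# Predict the sentiment of a new review (0 = negative, 1 = positive)
review = ["a wonderful film with brilliant performances"]
print(model.predict(vectorizer.transform(review)))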
Conclusion:
The sentiment classifier provides a reasonably accurate prediction for movie reviews
using Logistic Regression and TF-IDF features.
Assessment Questions:
1. What is sentiment analysis, and why is it important?
2. What techniques can be used for feature extraction in sentiment analysis?
3. Explain the use of Logistic Regression in sentiment classification.
Objective:
To implement a Recurrent Neural Network (RNN) for sequence labeling tasks such as Named Entity
Recognition (NER) or Part-of-Speech (POS) tagging.
Theory:
Sequence labeling involves assigning a categorical label to each element in a sequence. RNNs are particularly
suitable for this task as they can capture dependencies in sequential data. Variants like LSTM and GRU can
handle long-term dependencies more effectively.
Steps:
1. Import Libraries: Import TensorFlow or PyTorch.
2. Load Data: Use a dataset like the CoNLL 2003 NER dataset.
3. Preprocessing: Convert words to indices, handle padding, and convert labels to a suitable format.
4. Model Definition: Define an RNN model (e.g., LSTM).
5. Model Training: Compile the model and train it on the dataset.
6. Evaluation: Evaluate the model's performance on the test set.
Program:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed, Bidirectional
from tensorflow.keras.models import Sequential
# Toy data (illustrative): 2 sentences of length 4 encoded as word
# indices (< 10), with one integer tag (0 or 1) per token
sentences = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
labels = np.array([[0, 1, 0, 0], [1, 0, 0, 1]])
# Model definition
model = Sequential([
    Embedding(input_dim=10, output_dim=8, input_length=4),
    Bidirectional(LSTM(units=64, return_sequences=True)),
    TimeDistributed(Dense(2, activation='softmax'))
])
# Model compilation
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Model training
model.fit(sentences, labels, epochs=10)
# Model summary
model.summary()
Expected Output:
The expected output is the training logs with loss and accuracy metrics for each epoch.
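After training, per-token labels are recovered by taking the argmax of the softmax output at each
timestep; a minimal sketch using the toy data defined above:
# Predicted probabilities have shape (batch, timesteps, n_tags)
pred = model.predict(sentences)
# Most likely tag index for each token
print(pred.argmax(axis=-1))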
Conclusion:
RNNs are effective for sequence labeling tasks, and their variants LSTM and GRU handle long-range
dependencies better than standard RNNs.
Assessment Questions:
1. What are the advantages of using RNNs for sequence labeling tasks?
2. What is the difference between RNN, LSTM, and GRU?
3. How can the vanishing gradient problem be mitigated in RNNs?
Objective:
To implement Part-of-Speech (POS) tagging using an LSTM model to assign grammatical tags to each word in a
given text.
Theory:
POS tagging assigns a part of speech to each word in a sentence. LSTM networks are highly effective for
sequential data like text, where the context provided by preceding words is essential for determining the POS
tags.
Steps:
1. Import Libraries: Import TensorFlow and the required Keras layers.
2. Prepare Data: Encode words and POS tags as integer indices.
3. Model Definition: Define an Embedding + LSTM model with a TimeDistributed output layer.
4. Model Training: Compile the model and train it on the tagged sentences.
5. Evaluation: Inspect the training logs and model summary.
Program:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Sequential
# Toy data (illustrative): 2 sentences of length 3 encoded as word
# indices (< 10), with one POS tag index (0-2) per token
sentences = np.array([[1, 2, 3], [4, 5, 6]])
pos_tags = np.array([[0, 1, 2], [2, 1, 0]])
# Model definition
model = Sequential([
    Embedding(input_dim=10, output_dim=8, input_length=3),
    LSTM(units=64, return_sequences=True),
    TimeDistributed(Dense(3, activation='softmax'))
])
# Model compilation
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Model training
model.fit(sentences, pos_tags, epochs=10)
# Model summary
model.summary()
Expected Output:
The expected output is the training logs showing loss and accuracy metrics for each epoch.
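Predicted tag indices can be mapped back to tag names with an index-to-tag dictionary; the tag set
below is hypothetical, chosen only to match the model's 3 output classes:
# Hypothetical mapping from class index to POS tag
idx2tag = {0: 'NOUN', 1: 'VERB', 2: 'DET'}
pred = model.predict(sentences).argmax(axis=-1)
print([[idx2tag[i] for i in seq] for seq in pred])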
Conclusion:
LSTM-based models provide an efficient way to perform POS tagging by capturing dependencies in sequences.
Assessment Questions:
1. What are the benefits of using LSTM over simple RNNs for POS tagging?
2. Explain the architecture of an LSTM-based POS tagger.
3. What is the role of the TimeDistributed layer in Keras?
Objective:
To implement a Word Sense Disambiguation (WSD) model using LSTM or GRU networks to determine the
correct sense of a word based on its context.
Theory:
Word Sense Disambiguation (WSD) is the process of identifying which sense of a word is used in a sentence
when the word has multiple meanings. LSTM and GRU models are effective for WSD tasks as they can capture
context within sequences.
Steps:
1. Import Libraries: Import the required Keras layers.
2. Prepare Data: Encode context windows around the ambiguous word as integer indices.
3. Model Definition: Define an Embedding + GRU (or LSTM) model with a softmax output over the senses.
4. Model Training: Compile the model and train it on labeled contexts.
5. Evaluation: Inspect the training logs and model summary.
Program:
import numpy as np
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.models import Sequential
# Toy data (illustrative): 5-word context windows encoded as word
# indices (< 10), one sense label (0 or 1) per window
contexts = np.array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])
senses = np.array([0, 1])
# Model definition
model = Sequential([
    Embedding(input_dim=10, output_dim=8, input_length=5),
    GRU(units=64),
    Dense(2, activation='softmax')
])
# Model compilation
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Model training
model.fit(contexts, senses, epochs=10)
# Model summary
model.summary()
Expected Output:
The expected output is the training logs with loss and accuracy metrics for each epoch.
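A new context window can then be disambiguated by picking the sense with the highest predicted
probability (the indices below are illustrative):
# Disambiguate a new context window
new_context = np.array([[3, 4, 5, 6, 7]])
print(model.predict(new_context).argmax(axis=-1))  # predicted sense index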
Conclusion:
LSTM and GRU models can be effectively used for WSD tasks by leveraging their ability to capture long-term
dependencies in sequences.
Assessment Questions:
1. What is Word Sense Disambiguation, and why is it a difficult problem?
2. How do LSTM and GRU models capture the context needed for WSD?
3. How would you represent the context of an ambiguous word as input to the model?