EXPERIMENT NO: 6
Objective:
To implement a Named Entity Recognizer that identifies entities such as people, organizations,
locations, and dates in a text using NLP techniques.
Theory:
Named Entity Recognition (NER) is an NLP task that involves locating and classifying named
entities (such as person names, locations, and organizations) in a given text. NER is crucial for
information retrieval, question answering, and machine translation.
NER is commonly implemented using supervised learning models such as Conditional Random
Fields (CRF), Hidden Markov Models (HMM), or neural network-based models like LSTM,
BiLSTM, and Transformers.
Steps:
1. Import Libraries: Import spacy and load a pretrained English pipeline.
2. Process Text: Run the pipeline on the input text to obtain a Doc object.
3. Extract Entities: Iterate over doc.ents and print each entity's text and label.
Program:
import spacy
# Load a pretrained English pipeline (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")
# Input text
text = "Apple is looking at buying U.K. startup for $1 billion."
# Process text
doc = nlp(text)
# Display named entities with their labels
for ent in doc.ents:
    print(ent.text, ent.label_)
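Expected Output:
With the pretrained en_core_web_sm pipeline, the entities typically detected are (exact spans
and labels may vary with the model version):
Apple ORG
U.K. GPE
$1 billion MONEY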
Theory:
Semantic Role Labeling (SRL) is the process of determining the role that words or phrases play in a
sentence with respect to a predicate. SRL answers "Who did what to whom, when, where, and how?" It can
also be used to identify named entities by understanding their roles in sentences.
Steps:
1. Import Libraries: Import the AllenNLP Predictor class.
2. Load Model: Load a pretrained SRL model.
3. Predict: Run the predictor on the input sentence.
4. Display: Print the labeled description for each detected verb.
Program:
from allennlp.predictors.predictor import Predictor
# Load a pretrained AllenNLP SRL model (URL may change between releases)
predictor = Predictor.from_path(
    "https://storage.googleapis.com/allennlp-public-models/structured-prediction-srl-bert.2020.12.15.tar.gz")
# Input text
sentence = "John sold his car to Mike last week."
# Predict SRL
results = predictor.predict(sentence=sentence)
# Display results
for verb in results['verbs']:
    print(verb['description'])
Expected Output:
[ARG0: John] [V: sold] [ARG1: his car] [ARG2: to Mike] [ARGM-TMP: last week]
Conclusion:
Semantic Role Labeling helps in understanding the roles of different entities and provides a deeper
understanding of sentence structure.
Assessment Questions:
1. What is Named Entity Recognition, and where is it used?
2. How does Semantic Role Labeling differ from Named Entity Recognition?
3. Name some common models used to implement NER.
Objective:
To build a text classification model using a Logistic Regression algorithm to classify text data
into different categories.
Theory:
Text classification is the process of assigning categories to text. Logistic Regression is a linear
model that is widely used for binary and multi-class classification problems. In NLP, Logistic
Regression can be used to classify documents into categories based on word frequencies or
TF-IDF scores.
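To see what TF-IDF features look like, here is a minimal sketch on two toy documents (separate
from the dataset used in the program below):
from sklearn.feature_extraction.text import TfidfVectorizer
# Each word is weighted by term frequency x inverse document frequency,
# so words common across the corpus receive lower weights
docs = ["free prize call now", "meeting at noon today"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())  # learned vocabulary
print(X.toarray())                  # one TF-IDF row per document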
Steps:
1. Import Libraries: Import pandas and the required scikit-learn modules.
2. Load Data: Read a labeled message dataset (e.g., the SMS Spam Collection).
3. Preprocessing: Lowercase the message text.
4. Feature Extraction: Convert messages to TF-IDF vectors.
5. Model Training: Split the data and fit a Logistic Regression model.
6. Evaluation: Report precision, recall, and F1-score on the test set.
Program:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
# Load dataset (assumed CSV with 'label' and 'message' columns,
# e.g., the SMS Spam Collection)
df = pd.read_csv('spam.csv')
# Preprocessing
df['message'] = df['message'].str.lower()
# Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df['message'])
y = df['label']
# Train-test split (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Model Training
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output:
              precision    recall  f1-score   support

         ham       0.97      0.98      0.98       965
        spam       0.91      0.88      0.89       150

    accuracy                           0.96      1115
   macro avg       0.94      0.93      0.93      1115
weighted avg       0.96      0.96      0.96      1115
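Once trained, any new text must be transformed with the same fitted vectorizer before prediction;
a minimal usage sketch (the message below is illustrative):
# Classify a new, unseen message
new_msg = ["congratulations! you have won a free prize, call now"]
print(model.predict(vectorizer.transform(new_msg)))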
Conclusion:
Logistic Regression is effective for text classification tasks and provides good performance with
simple implementation.
Assessment Questions:
1. What is TF-IDF, and why is it preferred over raw word counts?
2. How does Logistic Regression handle multi-class classification?
3. Which metrics are used to evaluate a text classifier, and what do they measure?
Objective:
To implement a sentiment classifier for movie reviews using Natural Language Processing
(NLP) techniques.
Theory:
Sentiment analysis is the task of determining whether a piece of text (e.g., a movie review)
expresses a positive, negative, or neutral sentiment. Commonly used machine learning models
for sentiment classification include Naive Bayes, Logistic Regression, and Support Vector
Machines, as well as deep learning models like LSTM.
Steps:
1. Import Libraries: Import the required scikit-learn modules.
2. Load Data: Load the movie review dataset with load_files.
3. Preprocessing: Decode and lowercase the reviews.
4. Feature Extraction: Convert reviews to TF-IDF vectors.
5. Model Training: Split the data and fit a Logistic Regression model.
6. Evaluation: Report precision, recall, and F1-score on the test set.
Program:
from sklearn.datasets import load_files
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
# Load dataset (movie review polarity dataset: a 'txt_sentoken'
# directory with 'neg' and 'pos' subfolders)
reviews = load_files('txt_sentoken')
X, y = reviews.data, reviews.target
# Preprocessing
X = [doc.decode('utf-8').lower() for doc in X]
# Feature Extraction
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_vec = vectorizer.fit_transform(X)
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X_vec, y, test_size=0.25, random_state=42)
# Model Training
model = LogisticRegression()
model.fit(X_train, y_train)
# Evaluation
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Expected Output:
              precision    recall  f1-score   support

           0       0.89      0.86      0.88       250
           1       0.87      0.89      0.88       250

    accuracy                           0.88       500
   macro avg       0.88      0.88      0.88       500
weighted avg       0.88      0.88      0.88       500
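Because load_files assigns targets from the sorted folder names, class 0 corresponds to 'neg' and
class 1 to 'pos'. A new review can then be classified as follows (the review text is illustrative):
# Predict the sentiment of a new review (0 = negative, 1 = positive)
review = ["a wonderful film with brilliant performances"]
print(model.predict(vectorizer.transform(review)))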
Conclusion:
The sentiment classifier provides a reasonably accurate prediction for movie reviews
using Logistic Regression and TF-IDF features.
Assessment Questions:
1. What is sentiment analysis, and why is it important?
2. What techniques can be used for feature extraction in sentiment analysis?
3. Explain the use of Logistic Regression in sentiment classification.
Objective:
To implement a Recurrent Neural Network (RNN) for sequence labeling tasks such as Named Entity
Recognition (NER) or Part-of-Speech (POS) tagging.
Theory:
Sequence labeling involves assigning a categorical label to each element in a sequence. RNNs are particularly
suitable for this task as they can capture dependencies in sequential data. Variants like LSTM and GRU can
handle long-term dependencies more effectively.
Steps:
1. Import Libraries: Import TensorFlow or PyTorch.
2. Load Data: Use a dataset like the CoNLL 2003 NER dataset.
3. Preprocessing: Convert words to indices, handle padding, and convert labels to a suitable format.
4. Model Definition: Define an RNN model (e.g., LSTM).
5. Model Training: Compile the model and train it on the dataset.
6. Evaluation: Evaluate the model's performance on the test set.
Program:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed, Bidirectional
from tensorflow.keras.models import Sequential
# Toy data (illustrative): 2 sentences of length 4 encoded as word
# indices (< 10), with one integer tag (0 or 1) per token
sentences = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])
labels = np.array([[0, 1, 0, 0], [1, 0, 0, 1]])
# Model definition
model = Sequential([
    Embedding(input_dim=10, output_dim=8, input_length=4),
    Bidirectional(LSTM(units=64, return_sequences=True)),
    TimeDistributed(Dense(2, activation='softmax'))
])
# Model compilation
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Model training
model.fit(sentences, labels, epochs=10)
# Model summary
model.summary()
Expected Output:
The expected output is the training logs with loss and accuracy metrics for each epoch.
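After training, per-token labels are recovered by taking the argmax of the softmax output at each
timestep; a minimal sketch using the toy data defined above:
# Predicted probabilities have shape (batch, timesteps, n_tags)
pred = model.predict(sentences)
# Most likely tag index for each token
print(pred.argmax(axis=-1))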
Conclusion:
RNNs are effective for sequence labeling tasks, and their variants LSTM and GRU handle long-range
dependencies better than standard RNNs.
Assessment Questions:
1. What are the advantages of using RNNs for sequence labeling tasks?
2. What is the difference between RNN, LSTM, and GRU?
3. How can the vanishing gradient problem be mitigated in RNNs?
Objective:
To implement Part-of-Speech (POS) tagging using an LSTM model to assign grammatical tags to each word in a
given text.
Theory:
POS tagging assigns a part of speech to each word in a sentence. LSTM networks are highly effective for
sequential data like text, where the context provided by preceding words is essential for determining the POS
tags.
Steps:
1. Import Libraries: Import TensorFlow and the required Keras layers.
2. Prepare Data: Encode words and POS tags as integer indices.
3. Model Definition: Define an Embedding + LSTM model with a TimeDistributed output layer.
4. Model Training: Compile the model and train it on the tagged sentences.
5. Evaluation: Inspect the training logs and model summary.
Program:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding, LSTM, Dense, TimeDistributed
from tensorflow.keras.models import Sequential
# Toy data (illustrative): 2 sentences of length 3 encoded as word
# indices (< 10), with one POS tag index (0-2) per token
sentences = np.array([[1, 2, 3], [4, 5, 6]])
pos_tags = np.array([[0, 1, 2], [2, 1, 0]])
# Model definition
model = Sequential([
    Embedding(input_dim=10, output_dim=8, input_length=3),
    LSTM(units=64, return_sequences=True),
    TimeDistributed(Dense(3, activation='softmax'))
])
# Model compilation
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Model training
model.fit(sentences, pos_tags, epochs=10)
# Model summary
model.summary()
Expected Output:
The expected output is the training logs showing loss and accuracy metrics for each epoch.
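Predicted tag indices can be mapped back to tag names with an index-to-tag dictionary; the tag set
below is hypothetical, chosen only to match the model's 3 output classes:
# Hypothetical mapping from class index to POS tag
idx2tag = {0: 'NOUN', 1: 'VERB', 2: 'DET'}
pred = model.predict(sentences).argmax(axis=-1)
print([[idx2tag[i] for i in seq] for seq in pred])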
Conclusion:
LSTM-based models provide an efficient way to perform POS tagging by capturing dependencies in sequences.
Assessment Questions:
1. What are the benefits of using LSTM over simple RNNs for POS tagging?
2. Explain the architecture of an LSTM-based POS tagger.
3. What is the role of the TimeDistributed layer in Keras?
Objective:
To implement a Word Sense Disambiguation (WSD) model using LSTM or GRU networks to determine the
correct sense of a word based on its context.
Theory:
Word Sense Disambiguation (WSD) is the process of identifying which sense of a word is used in a sentence
when the word has multiple meanings. LSTM and GRU models are effective for WSD tasks as they can capture
context within sequences.
Steps:
1. Import Libraries: Import the required Keras layers.
2. Prepare Data: Encode context windows around the ambiguous word as integer indices.
3. Model Definition: Define an Embedding + GRU (or LSTM) model with a softmax output over the senses.
4. Model Training: Compile the model and train it on labeled contexts.
5. Evaluation: Inspect the training logs and model summary.
Program:
import numpy as np
from tensorflow.keras.layers import Embedding, GRU, Dense
from tensorflow.keras.models import Sequential
# Toy data (illustrative): 5-word context windows encoded as word
# indices (< 10), one sense label (0 or 1) per window
contexts = np.array([[1, 2, 3, 4, 5], [2, 3, 4, 5, 6]])
senses = np.array([0, 1])
# Model definition
model = Sequential([
    Embedding(input_dim=10, output_dim=8, input_length=5),
    GRU(units=64),
    Dense(2, activation='softmax')
])
# Model compilation
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# Model training
model.fit(contexts, senses, epochs=10)
# Model summary
model.summary()
Expected Output:
The expected output is the training logs with loss and accuracy metrics for each epoch.
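A new context window can then be disambiguated by picking the sense with the highest predicted
probability (the indices below are illustrative):
# Disambiguate a new context window
new_context = np.array([[3, 4, 5, 6, 7]])
print(model.predict(new_context).argmax(axis=-1))  # predicted sense index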
Conclusion:
LSTM and GRU models can be effectively used for WSD tasks by leveraging their ability to capture long-term
dependencies in sequences.
Assessment Questions:
1. What is Word Sense Disambiguation, and why is it a difficult problem?
2. How do LSTM and GRU models capture the context needed for WSD?
3. How would you represent the context of an ambiguous word as input to the model?