
Oriental Institute of Science & Technology, Bhopal

DEPARTMENT OF CSE-AIML

LAB MANUAL (AL 506 (B))

"NATURAL LANGUAGE PROCESSING"

BACHELOR OF TECHNOLOGY (B.TECH) COURSE

SEMESTER – V
Experiment 1: Tokenization

Objective:

To understand and implement tokenization, which splits a sentence or document into words
or subwords.

Introduction:

Tokenization is the process of breaking text into individual units, such as words, subwords, or
sentences. It is a fundamental preprocessing step in NLP tasks like text classification,
sentiment analysis, and machine translation.

Tools:

● Python
● NLTK Library or SpaCy

Steps:

1. Install necessary libraries:

pip install nltk spacy

2. Write the Python program:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt') # Required for tokenization

text = "Natural Language Processing is exciting. We learn how


machines understand text."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:", sentences)

# Word Tokenization
words = word_tokenize(text)
print("Word Tokenization:", words)

3. Run the program and observe the outputs.

Sample Output:

Sentence Tokenization: ['Natural Language Processing is exciting.', 'We learn how machines understand text.']
Word Tokenization: ['Natural', 'Language', 'Processing', 'is', 'exciting', '.', 'We', 'learn', 'how', 'machines', 'understand', 'text', '.']
Experiment 2: Stopword Removal

Objective:

To implement stopword removal to eliminate unnecessary words from text data.

Introduction:

Stopwords are commonly used words (e.g., "is," "the," "a") that carry little semantic meaning
and are often removed in NLP preprocessing.

Tools:

● Python
● NLTK Library

Steps:

1. Install necessary libraries:

pip install nltk

2. Write the Python program:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

text = "This is an example of stopword removal in natural language


processing."
words = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in
stop_words]

print("Original Words:", words)


print("Filtered Words (Stopwords Removed):", filtered_words)

3. Run the program and analyze the output.

Sample Output:

Original Words: ['This', 'is', 'an', 'example', 'of', 'stopword', 'removal', 'in', 'natural', 'language', 'processing', '.']
Filtered Words (Stopwords Removed): ['example', 'stopword', 'removal', 'natural', 'language', 'processing', '.']
Experiment 3: Lemmatization and Stemming

Objective:

To apply lemmatization and stemming to normalize text into its root form.

Introduction:

● Stemming: Reduces words to their root form by chopping off prefixes or suffixes
(e.g., running → run).
● Lemmatization: Produces the root word using dictionary-based methods (e.g., better
→ good).

Steps:

1. Install libraries:

pip install nltk

2. Write the Python program:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

text = "The striped bats are flying and better-running in the dark."
words = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("Lemmatized Words:", lemmatized_words)

3. Run the program and compare stemming vs lemmatization outputs.

Sample Output:

Stemmed Words: ['the', 'stripe', 'bat', 'are', 'fly', 'and', 'better-run', 'in', 'the', 'dark', '.']
Lemmatized Words: ['The', 'striped', 'bat', 'be', 'fly', 'and', 'better-run', 'in', 'the', 'dark', '.']
Experiment 4: POS Tagging

Objective:

To perform Part-of-Speech (POS) tagging to classify words into their grammatical categories.

Introduction:

POS tagging assigns labels like noun, verb, adjective, etc., to each word in a sentence. It is
essential for understanding sentence structure.

Tools:

● Python
● NLTK Library

Steps:

1. Install required libraries:

pip install nltk

2. Write the Python program:

import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag

nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

text = "John plays football and enjoys coding in Python."


words = word_tokenize(text)

# POS Tagging
pos_tags = pos_tag(words)
print("POS Tags:", pos_tags)

3. Run the program and observe POS tags.

Sample Output:

POS Tags: [('John', 'NNP'), ('plays', 'VBZ'), ('football', 'NN'), ('and',
'CC'), ('enjoys', 'VBZ'), ('coding', 'VBG'), ('in', 'IN'), ('Python',
'NNP'), ('.', '.')]
Experiment 5: Named Entity Recognition (NER)

Objective:

To extract named entities like names, organizations, and locations from a text.

Tools:

● Python
● SpaCy

Steps:

1. Install SpaCy and download the language model:

pip install spacy
python -m spacy download en_core_web_sm

2. Write the Python program:

import spacy

nlp = spacy.load("en_core_web_sm")
text = "Elon Musk founded SpaceX in California and launched Starlink
satellites."

doc = nlp(text)
print("Named Entities:")
for entity in doc.ents:
print(entity.text, "-", entity.label_)

3. Run the program and analyze the output.

Sample Output:

Named Entities:
Elon Musk - PERSON
SpaceX - ORG
California - GPE
Starlink - ORG
Experiment 6: Bag of Words (BoW) Model

Objective:

To implement the Bag of Words model to convert text data into numerical vectors.

Introduction:

The Bag of Words model represents text data as a collection of word frequencies, ignoring
grammar and word order. It is widely used for text classification and information retrieval.

Tools:

● Python
● Scikit-learn

Steps:

1. Install the required library:

pip install scikit-learn

2. Write the Python program:

from sklearn.feature_extraction.text import CountVectorizer

# Input text data
documents = [
    "Natural Language Processing is fun.",
    "Machine Learning and NLP are closely related fields.",
    "NLP is a subfield of AI."
]

# Convert text data to a Bag of Words representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Display the results
print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Representation:\n", X.toarray())

3. Run the program and analyze the output.

Sample Output:

Vocabulary: {'natural': 10, 'language': 7, 'processing': 13, 'is': 6, 'fun': 5, 'machine': 9, 'learning': 8, 'and': 1, 'nlp': 11, 'are': 2, 'closely': 3, 'related': 14, 'fields': 4, 'subfield': 15, 'of': 12, 'ai': 0}
BoW Representation:
[[0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0]
 [0 1 1 1 1 0 0 0 1 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1]]
Experiment 7: TF-IDF (Term Frequency-Inverse Document Frequency)

Objective:

To implement the TF-IDF model for text representation.

Introduction:

TF-IDF is an advanced form of text vectorization that reflects the importance of words in a
corpus by considering both their frequency and rarity.

Tools:

● Python
● Scikit-learn

Steps:

1. Install the required library:

pip install scikit-learn

2. Write the Python program:

from sklearn.feature_extraction.text import TfidfVectorizer

# Input text data
documents = [
    "NLP is a part of AI.",
    "AI and NLP are growing fields.",
    "The application of NLP is vast."
]

# Apply TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

# Display the results
print("Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Representation:\n", X.toarray())

3. Run the program and analyze the TF-IDF values.

Sample Output:
Vocabulary: {'nlp': 7, 'is': 6, 'part': 9, 'of': 8, 'ai': 0, 'and': 1, 'are': 3, 'growing': 5, 'fields': 4, 'the': 10, 'application': 2, 'vast': 11}
TF-IDF Representation (values rounded to two decimals):
[[0.43 0.   0.   0.   0.   0.   0.43 0.34 0.43 0.57 0.   0.  ]
 [0.34 0.45 0.   0.45 0.45 0.45 0.   0.27 0.   0.   0.   0.  ]
 [0.   0.   0.47 0.   0.   0.   0.36 0.28 0.36 0.   0.47 0.47]]
Experiment 8: Word Embeddings (Word2Vec)

Objective:

To implement Word2Vec to create word embeddings for a given text corpus.

Introduction:

Word embeddings map words into dense vector representations based on their semantic
meanings. Word2Vec uses neural networks to capture contextual information.

Tools:

● Python
● Gensim

Steps:

1. Install the Gensim library:

pip install gensim

2. Write the Python program:

import nltk
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # Required for word_tokenize

# Sample text data
sentences = [
    "NLP is fun and exciting.",
    "AI and machine learning are revolutionizing industries.",
    "Text data requires preprocessing before analysis."
]

# Tokenize the sentences
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# Train Word2Vec model
model = Word2Vec(tokenized_sentences, vector_size=50, window=3, min_count=1)

# Test the word embeddings
print("Vector for 'nlp':\n", model.wv['nlp'])
print("Most similar words to 'ai':\n", model.wv.most_similar('ai'))

3. Run the program and analyze the word embeddings.

Sample Output (embedding values and similarity scores vary between runs):
Vector for 'nlp':
[ 0.036273 -0.018734 0.093470 ... (50 dimensions)]
Most similar words to 'ai':
[('machine', 0.76), ('industries', 0.65), ('learning', 0.60)]
Experiment 9: Text Classification using Naïve Bayes

Objective:

To classify text into categories using the Naïve Bayes algorithm.

Tools:

● Python
● Scikit-learn

Steps:

1. Install required libraries:

pip install scikit-learn

2. Write the Python program:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Sample data
data = [
    "I love programming in Python.",
    "Python is an amazing language.",
    "I enjoy solving AI problems.",
    "I dislike debugging errors.",
    "Debugging is the worst part of programming."
]
labels = [1, 1, 1, 0, 0]  # 1: Positive, 0: Negative

# Convert text to BoW representation
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=42)

# Train Naïve Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)

# Test the model
predictions = model.predict(X_test)
print("Predicted Labels:", predictions)
Sample Output:
Predicted Labels: [1 0]
Experiment 10: Named Entity Recognition (NER) using SpaCy

Objective:

To perform Named Entity Recognition to identify entities like people, organizations, dates,
and locations.

Introduction:

NER identifies named entities in text and classifies them into predefined categories (e.g.,
PERSON, ORG, DATE, GPE). It is widely used in information extraction systems.

Tools:

● Python
● SpaCy Library

Steps:

1. Install SpaCy and load a pre-trained model:

pip install spacy
python -m spacy download en_core_web_sm

2. Write the Python program:

import spacy

# Load pre-trained SpaCy model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "Elon Musk founded SpaceX in 2002, and its headquarters are in California."

# Process the text
doc = nlp(text)

# Print named entities and their labels
print("Named Entities, Phrases, and Concepts:")
for ent in doc.ents:
    print(f"{ent.text} - {ent.label_}")

3. Run the program and observe the named entities extracted.

Sample Output:
Named Entities, Phrases, and Concepts:
Elon Musk - PERSON
SpaceX - ORG
2002 - DATE
California - GPE
Basic Viva Questions:
1. What is Natural Language Processing (NLP)?

● Answer: NLP is a subfield of AI that enables machines to understand, interpret, and generate human language.

2. What are the major components of NLP?

● Answer: NLP has two components:
o Natural Language Understanding (NLU): Deals with understanding and interpreting language.
o Natural Language Generation (NLG): Deals with generating human-like text.

3. What is Tokenization in NLP?

● Answer: Tokenization is the process of splitting a text into smaller units called
tokens, such as words, phrases, or sub-words.

4. What is a Corpus?

● Answer: A corpus is a large collection of text data used for training or evaluating
NLP models.

5. What is Stop-Word Removal?

● Answer: It is the process of removing commonly used words (e.g., "the," "a," "is")
that do not contribute to the meaning of a sentence.

6. What is Lemmatization and Stemming?

● Answer:
o Lemmatization: Reduces words to their dictionary base form (e.g., "better" → "good").
o Stemming: Reduces words to their root form by chopping off prefixes or suffixes (e.g., "running" → "run").

7. What is the difference between NLP and NLU?

● Answer: NLP encompasses both understanding and generating language, while NLU
focuses specifically on understanding human language.

8. What are N-Grams?

● Answer: N-Grams are contiguous sequences of n words from a given text.
o Example:
▪ Unigram: "NLP"
▪ Bigram: "Natural Language"
▪ Trigram: "Natural Language Processing"
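
For a quick illustration, NLTK's ngrams utility can generate these directly (a minimal sketch; the sentence is an arbitrary example and assumes the punkt data from Experiment 1 is installed):

from nltk.tokenize import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("Natural Language Processing is fun")
print(list(ngrams(tokens, 2)))  # bigrams, e.g. [('Natural', 'Language'), ('Language', 'Processing'), ...]
print(list(ngrams(tokens, 3)))  # trigrams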

9. What is Bag of Words (BoW)?

● Answer: BoW is a representation of text where the order of words is ignored, and
only the word frequencies are considered.

Intermediate Viva Questions:

10. What is TF-IDF?

● Answer: TF-IDF (Term Frequency-Inverse Document Frequency) measures word importance in a document relative to the entire corpus.
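
In its textbook form the score is tf-idf(t, d) = tf(t, d) × log(N / df(t)), where tf(t, d) is the frequency of term t in document d, N is the number of documents, and df(t) is the number of documents containing t. (Scikit-learn, as used in Experiment 7, applies a smoothed variant of this formula.)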

11. What is Named Entity Recognition (NER)?

● Answer: NER identifies and classifies named entities (e.g., names, dates, locations) in
text into predefined categories.

12. What are Word Embeddings?

● Answer: Word embeddings are dense vector representations of words that capture
their semantic relationships (e.g., Word2Vec, GloVe).

13. Explain Word2Vec.

● Answer: Word2Vec is a neural network-based word embedding model that uses the
CBOW (Continuous Bag of Words) and Skip-gram methods to map words to vector
space.
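
In Gensim (used in Experiment 8) the choice between the two is controlled by the sg parameter; a minimal sketch with made-up token lists:

from gensim.models import Word2Vec

sentences = [["nlp", "is", "fun"], ["ai", "and", "nlp", "are", "related"]]
cbow_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=0)      # CBOW (default)
skipgram_model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1)  # Skip-gram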

14. What is the difference between Bag of Words and TF-IDF?

● Answer:
o Bag of Words: Considers word frequency only.
o TF-IDF: Weighs word frequency by its importance across multiple
documents.

15. What is Sentiment Analysis?

● Answer: Sentiment analysis determines the polarity of text, such as whether it is positive, negative, or neutral.
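
As a small lexicon-based illustration, NLTK ships the VADER analyzer (a minimal sketch; the sentence is an arbitrary example):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I really enjoy learning NLP!"))  # neg/neu/pos/compound scores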

16. What is POS Tagging?

● Answer: Part-of-Speech tagging assigns parts of speech (e.g., noun, verb, adjective) to each word in a sentence.

17. What are Stop Words, and why are they removed?

● Answer: Stop words are common words that carry little meaning (e.g., "and," "the").
They are removed to focus on important words.

18. What is Latent Dirichlet Allocation (LDA)?

● Answer: LDA is an unsupervised machine learning algorithm used for topic modeling to discover hidden topics in a document collection.
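
A minimal topic-modeling sketch using scikit-learn's LatentDirichletAllocation (the four toy documents and the choice of two topics are illustrative assumptions):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "stocks and bonds are common investments",
    "investors buy stocks when markets fall"
]
X = CountVectorizer(stop_words='english').fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
print(lda.transform(X))  # per-document topic proportions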

19. Explain Cosine Similarity in NLP.

● Answer: Cosine similarity measures the similarity between two vectors by calculating
the cosine of the angle between them. It is commonly used for text similarity tasks.
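
A small sketch using scikit-learn (the two sentences are arbitrary examples):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["NLP is fun", "NLP is exciting"]
X = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(X[0], X[1]))  # similarity between the two sentence vectors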

Advanced Viva Questions:

20. What are Transformers in NLP?

● Answer: Transformers are deep learning models that use self-attention mechanisms to process sequential data, enabling state-of-the-art results in NLP tasks (e.g., BERT, GPT).

21. What is the role of Attention Mechanism in Transformers?

● Answer: Attention mechanisms allow models to focus on relevant parts of the input
sequence when generating an output sequence.

22. What is BERT?

● Answer: BERT (Bidirectional Encoder Representations from Transformers) is a pre-trained transformer-based language model that processes text bidirectionally to understand context.

23. What is GPT?

● Answer: GPT (Generative Pretrained Transformer) is a transformer-based model for text generation that processes input data unidirectionally.

24. What is Fine-Tuning in NLP?

● Answer: Fine-tuning involves training a pre-trained model on a specific task with a small amount of task-specific data.
25. What is BLEU Score?

● Answer: BLEU (Bilingual Evaluation Understudy) is a metric used to evaluate the quality of machine-translated text compared to a reference translation.
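
NLTK provides sentence_bleu for this; a minimal sketch with a made-up reference and candidate (short sentences like these normally need a smoothing function to avoid zero n-gram counts):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = [["the", "cat", "is", "on", "the", "mat"]]
candidate = ["the", "cat", "sat", "on", "the", "mat"]
score = sentence_bleu(reference, candidate, smoothing_function=SmoothingFunction().method1)
print("BLEU score:", score)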

26. What are Sequence-to-Sequence Models?

● Answer: Seq2Seq models are used for tasks like machine translation, where input
sequences (e.g., sentences) are mapped to output sequences.

27. What is Tokenization in BERT?

● Answer: BERT uses WordPiece Tokenization, which breaks words into subwords
to handle out-of-vocabulary words.
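
A quick way to see WordPiece in action is the Hugging Face tokenizer (assumes pip install transformers and an internet connection to fetch the pretrained vocabulary; the word is an arbitrary example):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("tokenization"))  # e.g. ['token', '##ization']; '##' marks a subword continuation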

28. What is the difference between Extractive and Abstractive Summarization?

● Answer:
o Extractive Summarization: Selects sentences directly from the text.
o Abstractive Summarization: Generates new sentences to convey the essence
of the text.

29. What is Perplexity in Language Models?

● Answer: Perplexity is a measure of how well a language model predicts a sample. Lower perplexity indicates better performance.
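
As a tiny worked example with made-up token probabilities, perplexity is the exponential of the average negative log-probability the model assigns to the tokens:

import math

probs = [0.2, 0.1, 0.4, 0.25]  # hypothetical probabilities assigned to each token
perplexity = math.exp(-sum(math.log(p) for p in probs) / len(probs))
print(perplexity)  # lower is better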

30. Explain the concept of Self-Attention.

● Answer: Self-attention allows models to weigh the importance of each word in a sequence relative to others, enabling them to capture context effectively.
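
A toy sketch of scaled dot-product self-attention with NumPy (the random Q, K, V matrices stand in for learned projections; this is illustrative, not a full Transformer layer):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

Q = np.random.rand(3, 4)  # 3 tokens, dimension 4
K = np.random.rand(3, 4)
V = np.random.rand(3, 4)

scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled dot-product attention scores
weights = softmax(scores)                # each row sums to 1
output = weights @ V                     # context-aware mix of value vectors
print(weights)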

31. What is Zero-Shot and Few-Shot Learning in NLP?

● Answer:
o Zero-Shot Learning: The model performs a task without having seen
examples during training.
o Few-Shot Learning: The model learns to perform a task with only a few
labeled examples.

32. What is the difference between RNNs, LSTMs, and GRUs?

● Answer:
o RNN: Basic sequential model with vanishing gradient issues.
o LSTM (Long Short-Term Memory): Overcomes vanishing gradients using
gates.
o GRU (Gated Recurrent Unit): Simplified LSTM with fewer parameters.
33. What is Transfer Learning in NLP?

● Answer: Transfer learning involves applying a pre-trained model (like BERT or GPT) to new tasks with minimal task-specific training.

34. What is a Language Model?

● Answer: A language model predicts the probability of a sequence of words. Examples include GPT and BERT.

35. What are common challenges in NLP?

● Answer:
o Ambiguity in language
o Handling synonyms and polysemy
o Data sparsity and vocabulary limitations
o Multilingual processing
o Context understanding.

Basic Descriptive Questions


1. Explain the role of Natural Language Processing (NLP) in Artificial Intelligence.
o Discuss the relationship between NLP and AI, highlighting its significance in
enabling human-computer interaction through natural language.
2. What are the primary steps involved in the NLP pipeline? Explain each step with
examples.
o Include steps like Tokenization, Stop-Word Removal, Stemming,
Lemmatization, POS Tagging, Named Entity Recognition, and Parsing.
3. What is the difference between Stemming and Lemmatization? Provide
examples.
o Define both terms and compare them with examples to show how words like
"running" and "flies" are handled.
4. What are Stop Words in NLP? Why are they removed?
o Define stop words and explain why they are removed in text processing with
examples.
5. What is Tokenization? Describe its importance in text preprocessing.
o Explain the concept and importance of breaking text into tokens (words,
phrases, or sentences).
6. Explain the concept of Part-of-Speech (POS) tagging with an example.
o Discuss how words are assigned parts of speech (noun, verb, adjective) based
on their roles in a sentence.
7. What are N-Grams? Explain their usage in NLP with examples.
o Define n-grams (unigram, bigram, trigram) and provide examples to
demonstrate their applications.
8. What is a Bag of Words (BoW) model? What are its advantages and limitations?
o Explain how BoW represents text and highlight its drawbacks, such as losing
word order and semantics.
Intermediate Descriptive Questions
9. Explain Term Frequency-Inverse Document Frequency (TF-IDF). How is it
calculated?
o Provide the formula for TF and IDF and explain how TF-IDF works to
determine word importance in documents.
10. What is Named Entity Recognition (NER)? How is it useful in real-world
applications?
o Define NER and give examples of named entities like names, dates, and
locations. Discuss applications like information extraction and chatbots.
11. What are Word Embeddings? Compare Word2Vec, GloVe, and FastText.
o Define word embeddings and compare the three techniques in terms of their
training methods and outcomes.
12. Explain the architecture of the Word2Vec model. What is the difference between
CBOW and Skip-gram?
o Describe the Continuous Bag of Words (CBOW) and Skip-gram approaches
with examples.
13. What is Latent Dirichlet Allocation (LDA)? Explain its working and
applications.
o Describe how LDA works for topic modeling and provide examples of its real-
world use cases.
14. How does Sentiment Analysis work? Explain lexicon-based and machine
learning-based approaches.
o Compare lexicon-based methods (like Vader) with machine learning models
(like Naive Bayes or deep learning).
15. Explain the concept of Cosine Similarity and its role in text analysis.
o Provide the formula and describe how it measures the similarity between two
text vectors.
16. What is Parsing in NLP? Differentiate between Dependency Parsing and
Constituency Parsing.
o Define parsing and explain the differences with examples.
17. What is Topic Modeling? How is it different from text classification?
o Define topic modeling (unsupervised) and contrast it with supervised text
classification.
18. What is an n-gram language model? Explain its limitations.
o Define n-gram models and describe problems like data sparsity and inability to
capture long-term dependencies.

Advanced Descriptive Questions


19. Explain the architecture of the Transformer model. How does it differ from
RNNs and LSTMs?
o Describe the components of a Transformer (Encoder, Decoder, Self-Attention)
and explain how it overcomes the limitations of sequential models.
20. What is the Attention Mechanism in Transformers? How does self-attention
work?
o Provide a detailed explanation of the self-attention mechanism and its role in
capturing relationships between words in a sequence.
21. Describe BERT (Bidirectional Encoder Representations from Transformers).
What makes it bidirectional?
o Explain the architecture, training process (Masked Language Model and Next
Sentence Prediction), and bidirectionality.
22. What is GPT? How does it differ from BERT?
o Compare GPT (Generative Pretrained Transformer) and BERT in terms of
directionality, use cases, and architecture.
23. What is Sequence-to-Sequence Modeling? Explain its application in Machine
Translation.
o Describe Seq2Seq models with an example of how they are used for
translation tasks.
24. What is BLEU Score? How is it used to evaluate machine translation systems?
o Define BLEU score and explain its role in evaluating translations by
comparing outputs with reference texts.
25. What are the challenges in Natural Language Understanding (NLU)? How can
they be addressed?
o Discuss challenges like ambiguity, polysemy, and context understanding.
Suggest approaches like deep learning and large language models.
26. What is Transfer Learning in NLP? Explain with an example.
o Define transfer learning and explain how pre-trained models like BERT or
GPT are fine-tuned for specific tasks.
27. Explain the concept of Fine-Tuning in NLP. Why is it important?
o Describe how fine-tuning adapts a pre-trained model to a specific NLP task
(e.g., sentiment analysis).
28. What is the role of pre-trained language models in NLP? Discuss their
advantages.
o Highlight the benefits of using pre-trained models like reduced training time,
better performance, and minimal data requirements.
29. What are the differences between Extractive and Abstractive Summarization?
Give examples.
o Define both techniques and explain their differences with clear examples.
30. Explain the concept of Perplexity in NLP. How is it used to evaluate language
models?
o Define perplexity as a measure of a model's ability to predict text and discuss
its significance.
31. How does machine translation work using Transformer models? Explain the
process.
o Describe how the Transformer encoder-decoder architecture enables language
translation tasks.
32. What are the limitations of traditional NLP techniques? How do deep learning
models address them?
o Highlight issues like handling ambiguity and long dependencies in traditional
models, and explain how neural networks (LSTMs, Transformers) improve
performance.
33. What is Few-Shot Learning and Zero-Shot Learning in NLP? Provide real-
world applications.
o Define both concepts and explain how they enable models to generalize with
minimal labeled data.
34. How does Hugging Face Transformers simplify NLP model implementation?
o Explain Hugging Face libraries like pipeline() and their usage for tasks like
sentiment analysis, NER, and machine translation.
35. Explain the concept of Self-Attention Score and its calculation.
o Define self-attention and explain how attention weights are calculated.
