NLP Lab Manual
Technology, Bhopal
DEPARTMENT OF CSE-AIML
Experiment 1: Tokenization
Objective:
To understand and implement tokenization, which splits a sentence or document into words
or subwords.
Introduction:
Tokenization is the process of breaking text into individual units, such as words, subwords, or
sentences. It is a fundamental preprocessing step in NLP tasks like text classification,
sentiment analysis, and machine translation.
Tools:
● Python
● NLTK Library or SpaCy
Steps:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
nltk.download('punkt')  # Required for tokenization

# Sample input text (illustrative)
text = "Natural Language Processing is fun. It helps computers understand human language."

# Sentence Tokenization
sentences = sent_tokenize(text)
print("Sentence Tokenization:", sentences)

# Word Tokenization
words = word_tokenize(text)
print("Word Tokenization:", words)
Sample Output:
Experiment 2: Stopword Removal
Objective:
To remove stopwords from text so that only meaningful words are retained.
Introduction:
Stopwords are commonly used words (e.g., "is," "the," "a") that carry little semantic meaning
and are often removed in NLP preprocessing.
Tools:
● Python
● NLTK Library
Steps:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')

# Sample input text (illustrative)
text = "This is a simple example showing the removal of stopwords."
words = word_tokenize(text)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word.lower() not in stop_words]
print("Filtered Words:", filtered_words)
Sample Output:
Experiment 3: Stemming and Lemmatization
Objective:
To apply lemmatization and stemming to normalize text into its root form.
Introduction:
● Stemming: Reduces words to their root form by chopping off prefixes or suffixes
(e.g., running → run).
● Lemmatization: Produces the root word using dictionary-based methods (e.g., better
→ good).
Steps:
1. Download the required NLTK resources and run the following code:
import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')

text = "The striped bats are flying and better-running in the dark."
words = word_tokenize(text)

# Stemming
stemmer = PorterStemmer()
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed Words:", stemmed_words)

# Lemmatization (treating each word as a verb)
lemmatizer = WordNetLemmatizer()
lemmatized_words = [lemmatizer.lemmatize(word, pos="v") for word in words]
print("Lemmatized Words:", lemmatized_words)
Sample Output:
Experiment 4: Part-of-Speech (POS) Tagging
Objective:
To perform Part-of-Speech (POS) tagging to classify words into their grammatical categories.
Introduction:
POS tagging assigns labels like noun, verb, adjective, etc., to each word in a sentence. It is
essential for understanding sentence structure.
Tools:
● Python
● NLTK Library
Steps:
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

# Input text (as used in the sample output below)
text = "John plays football and enjoys coding in Python."
words = word_tokenize(text)

# POS Tagging
pos_tags = pos_tag(words)
print("POS Tags:", pos_tags)
Sample Output:
POS Tags: [('John', 'NNP'), ('plays', 'VBZ'), ('football', 'NN'), ('and', 'CC'), ('enjoys', 'VBZ'), ('coding', 'VBG'), ('in', 'IN'), ('Python', 'NNP'), ('.', '.')]
Experiment 5: Named Entity Recognition (NER)
Objective:
To extract named entities like names, organizations, and locations from a text.
Tools:
● Python
● SpaCy
Steps:
import spacy
nlp = spacy.load("en_core_web_sm")

text = "Elon Musk founded SpaceX in California and launched Starlink satellites."
doc = nlp(text)

print("Named Entities:")
for entity in doc.ents:
    print(entity.text, "-", entity.label_)
Sample Output:
Named Entities:
Elon Musk - PERSON
SpaceX - ORG
California - GPE
Starlink - ORG
Experiment 6: Bag of Words (BoW) Model
Objective:
To implement the Bag of Words model to convert text data into numerical vectors.
Introduction:
The Bag of Words model represents text data as a collection of word frequencies, ignoring
grammar and word order. It is widely used for text classification and information retrieval.
Tools:
● Python
● Scikit-learn
Steps:
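A minimal sketch of this step using scikit-learn's CountVectorizer; the example documents below are illustrative (chosen to roughly resemble the vocabulary in the sample output) and can be replaced with any corpus:
from sklearn.feature_extraction.text import CountVectorizer

# Example documents (illustrative)
documents = [
    "Natural language processing is fun",
    "Machine learning and NLP are closely related fields",
    "NLP is a subfield of AI"
]

# Build the Bag of Words model
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Representation:")
print(X.toarray())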
Sample Output:
Vocabulary: {'natural': 6, 'language': 4, 'processing': 7, 'is': 3, 'fun':
2, 'machine': 5, 'learning': 5, 'and': 0, 'nlp': 8, 'are': 1, 'closely': 9,
'related': 10, 'fields': 11, 'subfield': 12, 'ai': 13}
BoW Representation:
[[1 0 1 1 1 0 1 1 0 0 0 0 0 0]
[0 1 0 1 0 1 0 0 1 1 1 1 0 0]
[0 0 0 1 0 0 0 0 1 0 0 0 1 1]]
Experiment 7: TF-IDF (Term Frequency-Inverse Document Frequency)
Objective:
To implement TF-IDF vectorization to represent text documents as weighted numerical vectors.
Introduction:
TF-IDF is an advanced form of text vectorization that reflects the importance of words in a
corpus by considering both their frequency and rarity.
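With scikit-learn's default (smoothed) settings, the score is computed as tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + N) / (1 + df(t))) + 1, N is the total number of documents, and df(t) is the number of documents containing term t; each document vector is then L2-normalized.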
Tools:
● Python
● Scikit-learn
Steps:
from sklearn.feature_extraction.text import TfidfVectorizer

# Example corpus (inferred from the vocabulary in the sample output)
documents = ["NLP is part of AI", "AI and NLP are growing fields", "The application of NLP is vast"]

# Apply TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)
print("Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Representation:")
print(X.toarray())
Sample Output:
Vocabulary: {'nlp': 4, 'is': 2, 'part': 5, 'of': 3, 'ai': 0, 'and': 1,
'are': 6, 'growing': 7, 'fields': 8, 'the': 9, 'application': 10, 'vast':
11}
TF-IDF Representation:
[[0.5 0. 0.5 0.5 0.5 0.5 0. 0. 0. 0. 0. 0. ]
[0. 0.5 0. 0. 0.5 0. 0.5 0.5 0.5 0. 0. 0. ]
[0. 0. 0.4 0.4 0.4 0. 0. 0. 0. 0.4 0.4 0.4]]
Experiment 8: Word Embeddings (Word2Vec)
Objective:
To generate word embeddings using the Word2Vec model in Gensim.
Introduction:
Word embeddings map words into dense vector representations based on their semantic
meanings. Word2Vec uses neural networks to capture contextual information.
Tools:
● Python
● Gensim
Steps:
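A minimal sketch using Gensim's Word2Vec, assuming a small tokenized example corpus (the sentences below are illustrative; vector_size=50 matches the 50-dimensional vectors shown in the sample output):
from gensim.models import Word2Vec

# Illustrative tokenized corpus: each sentence is a list of lowercase tokens
sentences = [
    ["nlp", "is", "a", "branch", "of", "ai"],
    ["machine", "learning", "drives", "many", "industries"],
    ["ai", "and", "machine", "learning", "power", "nlp"]
]

# Train a small Word2Vec model with 50-dimensional vectors
model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=4)

print("Vector for 'nlp':")
print(model.wv["nlp"])

print("Most similar words to 'ai':")
print(model.wv.most_similar("ai", topn=3))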
Sample Output:
Vector for 'nlp':
[ 0.036273 -0.018734 0.093470 ... (50 dimensions)]
Most similar words to 'ai':
[('machine', 0.76), ('industries', 0.65), ('learning', 0.60)]
Experiment 9: Text Classification using Naïve Bayes
Objective:
To build a simple text classifier using the Naïve Bayes algorithm.
Tools:
● Python
● Scikit-learn
Steps:
# Sample data
data = [
"I love programming in Python.",
"Python is an amazing language.",
"I enjoy solving AI problems.",
"I dislike debugging errors.",
"Debugging is the worst part of programming."
]
labels = [1, 1, 1, 0, 0] # 1: Positive, 0: Negative
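A minimal sketch of the remaining steps using scikit-learn's CountVectorizer and MultinomialNB; the train/test split parameters are illustrative:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Convert the sentences into Bag of Words vectors
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data)

# Split into training and test sets (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=42)

# Train the Naïve Bayes classifier and evaluate it
classifier = MultinomialNB()
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))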
Experiment 10: Named Entity Recognition with SpaCy
Objective:
To perform Named Entity Recognition to identify entities like people, organizations, dates,
and locations.
Introduction:
NER identifies named entities in text and classifies them into predefined categories (e.g.,
PERSON, ORG, DATE, GPE). It is widely used in information extraction systems.
Tools:
● Python
● SpaCy Library
Steps:
import spacy
nlp = spacy.load("en_core_web_sm")

# Input text
text = "Elon Musk founded SpaceX in 2002, and its headquarters are in California."
doc = nlp(text)

print("Named Entities, Phrases, and Concepts:")
for entity in doc.ents:
    print(entity.text, "-", entity.label_)
Sample Output:
Named Entities, Phrases, and Concepts:
Elon Musk - PERSON
SpaceX - ORG
2002 - DATE
California - GPE
Basic Viva Questions:
1. What is Natural Language Processing (NLP)?
● Answer: NLP is a branch of artificial intelligence that enables computers to understand, interpret, and generate human language.
2. What is Tokenization?
● Answer: Tokenization is the process of splitting a text into smaller units called tokens, such as words, phrases, or sub-words.
3. What is a Corpus?
● Answer: A corpus is a large collection of text data used for training or evaluating NLP models.
4. What is Stopword Removal?
● Answer: It is the process of removing commonly used words (e.g., "the," "a," "is") that do not contribute to the meaning of a sentence.
5. What is the difference between Lemmatization and Stemming?
● Answer:
o Lemmatization: Reduces words to their dictionary base form (e.g., "running" → "run").
o Stemming: Reduces words to their root form by chopping off prefixes or suffixes (e.g., "running" → "run").
6. What is the difference between NLP and NLU?
● Answer: NLP encompasses both understanding and generating language, while NLU focuses specifically on understanding human language.
7. What is the Bag of Words (BoW) model?
● Answer: BoW is a representation of text in which the order of words is ignored and only the word frequencies are considered.
8. What is Named Entity Recognition (NER)?
● Answer: NER identifies and classifies named entities (e.g., names, dates, locations) in text into predefined categories.
9. What are Word Embeddings?
● Answer: Word embeddings are dense vector representations of words that capture their semantic relationships (e.g., Word2Vec, GloVe).
10. What is Word2Vec?
● Answer: Word2Vec is a neural network-based word embedding model that uses the CBOW (Continuous Bag of Words) and Skip-gram methods to map words to vector space.
11. What is the difference between Bag of Words and TF-IDF?
● Answer:
o Bag of Words: Considers word frequency only.
o TF-IDF: Weighs word frequency by its importance across multiple documents.
12. What are Stop Words, and why are they removed?
● Answer: Stop words are common words that carry little meaning (e.g., "and," "the"). They are removed to focus on important words.
13. What is Cosine Similarity?
● Answer: Cosine similarity measures the similarity between two vectors by calculating the cosine of the angle between them (cos θ = (A · B) / (‖A‖ ‖B‖)). It is commonly used for text similarity tasks.
14. What are Transformers?
● Answer: Transformers are deep learning models that use self-attention mechanisms to process sequential data, enabling state-of-the-art results in NLP tasks (e.g., BERT, GPT).
15. What is the Attention Mechanism?
● Answer: Attention mechanisms allow models to focus on relevant parts of the input sequence when generating an output sequence.
16. What are Sequence-to-Sequence (Seq2Seq) models used for?
● Answer: Seq2Seq models are used for tasks like machine translation, where input sequences (e.g., sentences) are mapped to output sequences.
17. What kind of tokenization does BERT use?
● Answer: BERT uses WordPiece tokenization, which breaks words into subwords to handle out-of-vocabulary words.
18. What is the difference between Extractive and Abstractive Summarization?
● Answer:
o Extractive Summarization: Selects sentences directly from the text.
o Abstractive Summarization: Generates new sentences to convey the essence of the text.
19. What is the difference between Zero-Shot and Few-Shot Learning?
● Answer:
o Zero-Shot Learning: The model performs a task without having seen examples during training.
o Few-Shot Learning: The model learns to perform a task with only a few labeled examples.
20. What is the difference between RNN, LSTM, and GRU?
● Answer:
o RNN: Basic sequential model with vanishing gradient issues.
o LSTM (Long Short-Term Memory): Overcomes vanishing gradients using gates.
o GRU (Gated Recurrent Unit): Simplified LSTM with fewer parameters.
21. What is Transfer Learning in NLP?
● Answer: Transfer learning reuses a model pre-trained on a large corpus and fine-tunes it for a specific downstream task (e.g., BERT or GPT fine-tuned for classification or question answering).
22. What are the major challenges in NLP?
● Answer:
o Ambiguity in language
o Handling synonyms and polysemy
o Data sparsity and vocabulary limitations
o Multilingual processing
o Context understanding