
NLP Lab Manual

Practical 1
Aim:
To understand and implement the basic text preprocessing techniques in Natural
Language Processing (NLP) such as tokenization, filtration, script validation,
stop word removal, and stemming, which are essential for preparing textual data
for further analysis.

Objective:
• To learn how to break down text into individual units (tokens).
• To filter out unwanted characters or irrelevant parts of the text.
• To validate and process the text according to a specific script (e.g.,
English, Hindi).
• To remove common stop words that do not contribute meaningfully to
analysis.
• To apply stemming to reduce words to their root forms.

Theoretical Background:
Text preprocessing is a crucial step in NLP to transform raw text into a format
suitable for machine learning models. It involves multiple techniques:
1. Tokenization:
o Tokenization is the process of splitting the text into smaller units,
such as words or sentences. These units (tokens) are the building
blocks for text analysis. For example, the sentence "Natural
language processing is fun" can be tokenized into ["Natural",
"language", "processing", "is", "fun"].
2. Filtration:
o Filtration refers to the process of removing unwanted characters,
punctuation marks, or non-alphabetic symbols from the text. It
helps clean the text for better accuracy during further processing.
3. Script Validation:
o Script validation ensures that the input text conforms to the
required script (e.g., only English letters for English text). This is
important for multilingual datasets where text from different
languages may need to be separated or processed differently.
4. Stop Word Removal:
o Stop words are common words like "the", "is", "in", etc., which do
not contribute significant meaning in NLP tasks. Removing stop
words reduces the dimensionality of the data and focuses on more
meaningful words.
5. Stemming:
o Stemming is the process of reducing words to their base or root
form. For example, "running", "runs", and "runner" can all be
reduced to "run". This helps in reducing complexity and achieving
uniformity in the text.
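
Sample Code:
The following is a minimal illustrative sketch of these five steps, assuming Python with NLTK installed and its tokenizer and stop-word resources downloaded; the sample sentence and the regular expression used for script validation are example choices, not prescribed by this manual.

import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Assumes the required NLTK resources have been downloaded beforehand, e.g.:
# nltk.download('punkt'); nltk.download('stopwords')

text = "Natural language processing is fun, isn't it? 123 #NLP"

# 1. Tokenization: split the raw text into word tokens
tokens = word_tokenize(text)

# 2. Filtration: keep only alphabetic tokens (drops punctuation, numbers, symbols)
filtered = [t for t in tokens if t.isalpha()]

# 3. Script validation: keep tokens written in the English (Latin) script only
english_only = [t for t in filtered if re.fullmatch(r"[A-Za-z]+", t)]

# 4. Stop word removal: drop common English function words
stop_words = set(stopwords.words('english'))
content_words = [t for t in english_only if t.lower() not in stop_words]

# 5. Stemming: reduce each remaining word to its root form
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in content_words]

print("Tokens:        ", tokens)
print("After cleanup: ", content_words)
print("Stems:         ", stems)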

Conclusion:
Text preprocessing techniques like tokenization, filtration, script validation, stop
word removal, and stemming are essential steps in preparing textual data for
NLP tasks. These steps help reduce noise, ensure consistency, and improve the
performance of machine learning models by focusing on meaningful content in
the text.
Practical 2
Aim:
To perform and understand the process of morphological analysis in Natural
Language Processing (NLP), focusing on analyzing and identifying the structure
of words, including their root forms, affixes, parts of speech, and grammatical
information.

Objective:
• To study how words are formed and the internal structure of words.
• To break down words into morphemes, the smallest units of meaning.
• To explore how morphological analysis helps in determining the meaning
and syntactic role of words in NLP.
• To understand how different languages use morphological rules, such as
inflection and derivation.

Theoretical Background:
Morphological analysis is a core component of NLP that involves studying the
structure of words. It deals with how words are formed from morphemes, the
smallest meaning-bearing units of language. The two main types of morphology
are:
1. Inflectional Morphology:
o Inflectional morphology deals with how a word changes to express
different grammatical categories, such as tense, number, or case,
without changing the word's core meaning. For example, adding "-
ed" to "walk" forms "walked" (past tense), but the meaning of the
verb remains the same.
2. Derivational Morphology:
o Derivational morphology involves creating a new word with a
different meaning or part of speech by adding affixes (prefixes or
suffixes). For example, "happy" becomes "unhappy" (negation), or
"develop" becomes "development" (noun form).
Key components of morphological analysis:
• Morpheme: The smallest unit of meaning in a word, such as "un-",
"happy", or "-ed".
• Root: The base form of a word, to which affixes are added.
• Affix: A morpheme added to the root word, which can be a prefix, suffix,
infix, or circumfix.
Morphological analysis in NLP helps in:
• Lemmatization: Reducing words to their base form (lemma) using
linguistic rules (e.g., "better" → "good").
• Stemming: Reducing words to their root forms by stripping affixes
without considering linguistic rules (e.g., "running" → "run").
Morphological analysis is crucial in languages with rich morphology (e.g.,
Turkish, Finnish), where a single word can convey complex grammatical
information.
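
Sample Code:
A small illustrative sketch contrasting stemming and lemmatization, assuming Python with NLTK and its WordNet resources available; the word list and part-of-speech labels are examples chosen for this sketch.

from nltk.stem import PorterStemmer, WordNetLemmatizer

# Assumes nltk.download('wordnet') (and 'omw-1.4') has been run beforehand.

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# (word, WordNet part of speech: 'n' noun, 'v' verb, 'a' adjective)
words = [("running", "v"), ("studies", "n"), ("better", "a"), ("unhappiness", "n")]

for word, pos in words:
    stem = stemmer.stem(word)                 # crude affix stripping, no linguistic rules
    lemma = lemmatizer.lemmatize(word, pos)   # dictionary-based reduction to the lemma
    print(f"{word:12s} stem={stem:10s} lemma={lemma}")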

Conclusion:
Morphological analysis is an essential process in NLP that enables the
decomposition of words into morphemes, revealing their structure, meaning,
and grammatical features. By understanding and applying inflectional and
derivational rules, we can better analyze and process language for NLP tasks,
particularly in languages with complex word structures. This analysis improves
tasks like text classification, machine translation, and information retrieval by
recognizing the relationships between different word forms.
Practical 3
Aim:
To understand and implement the N-Gram model for analyzing the sequence of
words in a text, and to learn how to estimate the probability of word sequences
for various NLP tasks.

Objective:
• To learn how to represent text data as sequences of words or characters
using N-grams.
• To understand the concept of unigram, bigram, trigram, and higher-order
N-grams.
• To calculate the probability of word sequences using the N-Gram model.
• To explore how the N-Gram model can be used in predictive text
generation, machine translation, and other NLP applications.

Theoretical Background:
The N-Gram model is a probabilistic model used in NLP to predict the next
word in a sequence by considering the preceding N-1 words. It helps capture
word dependencies and context in text, which is crucial for tasks like speech
recognition, machine translation, and text prediction.
1. N-Gram Definition:
o An N-Gram is a contiguous sequence of N items (words or
characters) from a given text. For example, in the sentence "I love
natural language processing":
▪ Unigram (1-gram): ["I", "love", "natural", "language",
"processing"]
▪ Bigram (2-gram): ["I love", "love natural", "natural
language", "language processing"]
▪ Trigram (3-gram): ["I love natural", "love natural language",
"natural language processing"]
2. Types of N-Grams:
o Unigram Model (N=1): Assumes each word is independent of
others. The probability of a word is simply its occurrence in the
text.
o Bigram Model (N=2): Considers the probability of a word given
the previous word.
o Trigram Model (N=3): Extends this to two previous words. As N
increases, the model captures more context but becomes
computationally more expensive.
3. Mathematical Representation: The probability of a sequence of words
w_1, w_2, …, w_n is given by the chain rule:
P(w_1, w_2, …, w_n) = P(w_1) × P(w_2 | w_1) × P(w_3 | w_1, w_2) × ⋯ × P(w_n | w_1, …, w_{n-1})
An N-Gram model approximates each conditional term using only the
preceding N−1 words; for example, a bigram model replaces
P(w_n | w_1, …, w_{n-1}) with P(w_n | w_{n-1}).
4. Applications of N-Gram Model:
o Text Prediction: Predicting the next word based on previous
words.
o Speech Recognition: Understanding the context of words in
spoken language.
o Machine Translation: Helping translate text by considering word
pairs or triples.
o Spelling and Grammar Checkers: Identifying probable word
sequences to detect mistakes.
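
Sample Code:
A minimal sketch using only the Python standard library: it builds unigram and bigram counts from a tiny illustrative corpus and scores a sentence with the bigram approximation of the chain rule shown above; the corpus is an assumption made purely for demonstration.

from collections import Counter

# Tiny illustrative corpus (assumed for this sketch)
corpus = [
    "i love natural language processing",
    "i love machine learning",
    "natural language processing is fun",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    # Maximum-likelihood estimate: P(word | prev) = count(prev, word) / count(prev)
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sentence):
    # Bigram approximation of the chain rule: P(w1..wn) ≈ product of P(wi | w(i-1)),
    # with <s> as the sentence-start symbol
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

print(sentence_prob("i love natural language processing"))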

Conclusion:
The N-Gram model provides a simple yet effective way of modeling the
probability of sequences in text. By capturing word dependencies, it forms the
foundation of many NLP applications. Though limited by its local context, the
N-Gram model is widely used for its ease of implementation and computational
efficiency.
Practical 4
Aim:
To understand and implement Part of Speech (POS) tagging on text, identifying
and labeling each word with its respective grammatical category such as noun,
verb, adjective, etc., which aids in understanding the syntactic structure of
sentences.

Objective:
• To become familiar with the process of POS tagging in NLP.
• To learn how to assign grammatical labels (tags) to words in a sentence.
• To understand how POS tagging helps in syntactic and semantic analysis.
• To implement POS tagging using a POS tagging algorithm or NLP
library.

Theoretical Background:
Part of Speech (POS) tagging is a fundamental technique in NLP where each
word in a text is labeled with its corresponding grammatical category. The
grammatical categories, also known as "tags," include nouns, verbs, adjectives,
adverbs, pronouns, conjunctions, and others. POS tagging plays a vital role in
text analysis, syntactic parsing, machine translation, and other NLP tasks.
1. POS Tagging:
o POS tagging assigns a tag to each word in a sentence based on its
role in the sentence. For example, in the sentence "She is reading a
book," the POS tags are:
▪ She: Pronoun (PRP)
▪ is: Verb (VBZ)
▪ reading: Verb (VBG)
▪ a: Determiner (DT)
▪ book: Noun (NN)
2. Types of POS Taggers:
o Rule-based POS Tagger: Uses predefined grammatical rules to
tag words.
o Statistical POS Tagger: Uses probabilistic models (such as
Hidden Markov Models) to predict tags based on the context of the
word in the sentence.
o Machine Learning-based POS Tagger: Utilizes supervised
learning algorithms trained on labeled datasets for POS tagging.
3. POS Tagging Applications:
o Syntactic Parsing: POS tagging is the first step in understanding
sentence structure.
o Named Entity Recognition (NER): Helps in identifying names,
locations, dates, etc.
o Machine Translation: Accurate POS tagging improves translation
by maintaining syntactic integrity.
4. Common POS Tags:
o Noun (NN), Verb (VB), Adjective (JJ), Adverb (RB), Pronoun
(PRP), Preposition (IN), Determiner (DT), etc.
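
Sample Code:
A brief sketch, assuming Python with NLTK and its tokenizer and averaged-perceptron tagger resources downloaded; the example sentence is the one used above.

import nltk

# Assumes the resources have been downloaded beforehand, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

sentence = "She is reading a book"
tokens = nltk.word_tokenize(sentence)

# Assign a Penn Treebank POS tag to each token
tagged = nltk.pos_tag(tokens)
print(tagged)
# Expected to resemble: [('She', 'PRP'), ('is', 'VBZ'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN')]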

Conclusion:
POS tagging is an essential step in NLP that provides syntactic and semantic
meaning to text by categorizing words into their respective grammatical
categories. By understanding the structure of sentences through POS tags,
complex NLP tasks such as parsing, information extraction, and machine
translation can be effectively performed.
Practical 5
Aim:
To understand and implement chunking, a technique used in Natural Language
Processing (NLP) to group words or tokens into meaningful chunks such as
noun phrases, verb phrases, etc., which are useful for syntactic parsing and
information extraction.

Objective:
• To learn how chunking helps in identifying meaningful phrases from a
sentence.
• To use Part-of-Speech (POS) tags for creating chunks of words.
• To understand how chunking can be applied to extract entities and
improve text analysis tasks.

Theoretical Background:
Chunking, also known as shallow parsing, is a technique in NLP that groups
tokens into larger units called chunks. These chunks represent meaningful
phrases like noun phrases (NP) or verb phrases (VP). Unlike full syntactic
parsing, chunking does not focus on detailed grammatical relationships but
rather on finding base-level structure.
1. POS Tagging in Chunking:
o Chunking relies heavily on Part-of-Speech (POS) tagging. Each
word in a sentence is tagged with its corresponding part of speech
(e.g., noun, verb, adjective, etc.). Based on these POS tags, specific
patterns are identified to extract phrases. For example, a noun
phrase could be a sequence of adjectives followed by nouns.
o Example: In the sentence, "The quick brown fox jumps over the
lazy dog," the chunk "The quick brown fox" could be identified as
a noun phrase (NP).
2. Chunk Patterns:
o Chunking uses regular expression patterns based on POS tags to
identify groups of words. Common patterns include:
▪ Noun Phrase (NP): A sequence of adjectives and nouns (e.g.,
"quick brown fox").
▪ Verb Phrase (VP): A verb followed by related words (e.g.,
"jumps over").
3. Applications of Chunking:
o Chunking is useful for information extraction tasks, such as
identifying named entities (names, locations, dates).
o It simplifies complex sentences into meaningful structures for
downstream tasks such as machine translation, sentiment analysis,
or entity recognition.
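
Sample Code:
A small sketch using NLTK's RegexpParser, assuming the tokenizer and tagger resources are installed; the noun-phrase grammar shown (optional determiner, any adjectives, then one or more nouns) is one common illustrative pattern, not the only possible choice.

import nltk

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)

# Chunk grammar: an NP is an optional determiner, any number of adjectives, then nouns
grammar = "NP: {<DT>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)

tree = chunker.parse(tagged)
print(tree)

# Print just the noun-phrase chunks
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))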

Conclusion:
Chunking is a powerful NLP technique that helps to extract structured,
meaningful phrases from text using POS tags. By grouping words into noun or
verb phrases, chunking aids in simplifying text and is a key step in many NLP
pipelines, especially for tasks like syntactic analysis and information extraction.
Practical 6
Aim:
To understand and implement Named Entity Recognition (NER), a key Natural
Language Processing (NLP) technique that identifies and categorizes named
entities such as people, organizations, locations, dates, etc., from a given text.

Objective:
• To identify and classify named entities in a text into predefined categories
(e.g., person, organization, location, date).
• To learn how NER contributes to understanding the meaning and context
of a text.
• To explore the different algorithms and tools used for Named Entity
Recognition.
• To implement NER using an NLP library and analyze the results.

Theoretical Background:
Named Entity Recognition (NER):
NER is a sub-task of information extraction that seeks to locate and classify
named entities in text into predefined categories such as:
• Person: Names of individuals (e.g., "Albert Einstein").
• Organization: Names of companies, agencies, institutions (e.g.,
"Google", "United Nations").
• Location: Geographical locations, cities, countries (e.g., "Paris",
"India").
• Date/Time: Temporal expressions like dates and times (e.g., "October
13", "2024").
NER Techniques:
NER is generally implemented using one of the following approaches:
1. Rule-based Approaches:
o Using predefined rules and dictionaries to identify entities. While
easy to implement, they lack flexibility and fail in complex
contexts.
2. Machine Learning Approaches:
o Using supervised learning models where labeled datasets train
models to identify entities. Common algorithms include Hidden
Markov Models (HMM), Conditional Random Fields (CRF), and
more recently, neural networks (BiLSTM, Transformers).
3. Deep Learning Approaches:
o Neural network models, especially with advancements like
Bidirectional LSTMs and Transformer-based models like BERT,
have significantly improved NER performance. These models
consider the context of the entity within the sentence to make more
accurate predictions.
Applications of NER:
• Search engines: Enhance query understanding by identifying important
entities in the query.
• Information retrieval: Extract important names, dates, and organizations
from documents.
• Content summarization: Automatically detect key entities to summarize
articles or papers.
• Sentiment analysis: Determine public opinion about specific individuals
or organizations.
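
Sample Code:
A short sketch using spaCy, assuming the library and its small English pipeline en_core_web_sm are installed; the sample sentence is illustrative.

import spacy

# Assumes: pip install spacy  and  python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Albert Einstein joined the United Nations panel in Paris on October 13, 2024."
doc = nlp(text)

# Each recognized entity carries its text span and a predicted category label
for ent in doc.ents:
    print(f"{ent.text:25s} {ent.label_}")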

Conclusion:
Named Entity Recognition (NER) is a fundamental technique in NLP that
enables the identification of crucial entities within text. By categorizing these
entities, NER adds structure to unstructured data, making it easier to analyze
and derive insights. Implementing NER using both rule-based and machine
learning approaches demonstrates the power of automating information
extraction in various real-world applications.
Practical 7
Aim:
To simulate and understand word generation techniques in NLP using a virtual
lab environment, exploring how machines can generate coherent words and text
sequences based on given input.

Objective:
• To understand the concept of word generation in Natural Language
Processing.
• To explore various algorithms and models (e.g., n-grams, RNN, LSTM)
used for generating words.
• To implement word generation techniques using a virtual lab platform.
• To evaluate the quality and coherence of generated text.

Theoretical Background:
Word generation in NLP is a task where models are trained to predict and
generate words or sequences of words based on learned patterns from training
data. This process is essential for applications like text generation, chatbots, and
language modeling.
1. N-grams:
o N-grams are contiguous sequences of 'n' words used to predict the
next word in a sequence. For example, in a bigram model (n=2) the
next word is predicted from the single preceding word, so after "am"
the model might suggest "learning" if "am learning" occurred often in
the training data. N-grams capture local dependencies but are limited
by their fixed context size.
2. Recurrent Neural Networks (RNN):
o RNNs are a class of neural networks that have loops allowing
information to persist. They are effective for sequential data, such
as text, as they maintain memory of previous words in the
sequence. However, RNNs can struggle with long-term
dependencies.
3. Long Short-Term Memory (LSTM):
o LSTMs are an improvement over RNNs, designed to handle long-
term dependencies more effectively. They use gating mechanisms
to control the flow of information and decide what to keep or
discard from previous word sequences.
4. Word Embeddings:
o Before generating words, NLP models often represent words as
vectors using word embeddings (e.g., Word2Vec, GloVe). These
embeddings capture semantic relationships between words,
allowing the model to generate contextually appropriate words.
5. Language Models:
o Pretrained models like GPT (Generative Pretrained Transformer)
are powerful for word generation, leveraging large-scale datasets
and complex architectures to generate coherent and meaningful
sequences of text.
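
Sample Code:
A minimal word-generation sketch using only the Python standard library: it collects bigram counts from a tiny illustrative corpus and repeatedly samples the next word in proportion to those counts; real systems would use RNN/LSTM or Transformer models as described above, and the corpus here is an assumption made for demonstration.

import random
from collections import defaultdict, Counter

# Tiny illustrative training corpus (assumed for this sketch)
corpus = (
    "i am learning natural language processing . "
    "i am learning to generate text . "
    "language models generate text ."
).split()

# Build bigram counts: for each word, how often each next word follows it
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def generate(start, length=8):
    word, output = start, [start]
    for _ in range(length):
        candidates = following.get(word)
        if not candidates:
            break
        # Sample the next word in proportion to its bigram frequency
        words, counts = zip(*candidates.items())
        word = random.choices(words, weights=counts, k=1)[0]
        output.append(word)
    return " ".join(output)

print(generate("i"))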

Conclusion:
The virtual lab on word generation provides hands-on experience in
understanding and implementing various word generation models, including n-
grams, RNNs, and LSTMs. These techniques form the foundation of many NLP
applications, such as language models and text prediction systems. Through this
practical, students gain insight into the inner workings of machine-generated
language and how to assess its quality.
Practical 8
Aim:
To develop and implement a Sentiment Analysis system that classifies the
sentiment of textual data (e.g., positive, negative, or neutral) using NLP
techniques.

Objective:
• To understand how to preprocess text data for sentiment analysis.
• To learn how to use machine learning or deep learning algorithms to
classify text based on sentiment.
• To explore the role of feature extraction techniques (like TF-IDF or word
embeddings) in enhancing the model's performance.
• To evaluate the performance of the sentiment classifier using appropriate
metrics.

Theoretical Background:
Sentiment Analysis, also known as Opinion Mining, is a popular Natural
Language Processing (NLP) application that identifies and classifies the
sentiment expressed in a piece of text. The goal is to determine whether the
sentiment is positive, negative, or neutral. It is widely used in social media
analysis, customer feedback, and product reviews.
1. Text Preprocessing:
o Preprocessing, which involves cleaning and preparing the raw text
data, is critical in sentiment analysis. Common steps include:
▪ Tokenization: Breaking the text into individual words or
tokens.
▪ Stop Word Removal: Removing common words like "the",
"is", etc.
▪ Lemmatization/Stemming: Reducing words to their root
forms to standardize the text.
▪ Filtration: Removing unwanted symbols, punctuation, or
noise from the text.
2. Feature Extraction:
o After preprocessing, text must be converted into a numerical
format for model training. This is achieved through techniques
such as:
▪ Bag of Words (BoW): Represents text as the frequency of
words in a document.
▪ TF-IDF (Term Frequency-Inverse Document
Frequency): Weighs words by their frequency and
uniqueness across multiple documents.
▪ Word Embeddings (e.g., Word2Vec, GloVe): Converts
words into dense vector representations, capturing semantic
relationships.
3. Sentiment Classification:
o The core of sentiment analysis involves training a model to classify
text based on sentiment. Various algorithms can be used for this,
such as:
▪ Logistic Regression: A simple and effective classifier for
binary sentiment analysis.
▪ Naive Bayes: A probabilistic model often used for text
classification.
▪ Support Vector Machines (SVM): A powerful classifier for
text data.
▪ Deep Learning: Models like LSTM (Long Short-Term
Memory) or CNN (Convolutional Neural Networks) can
handle more complex patterns in textual data.
4. Model Evaluation:
o The performance of the sentiment classifier can be evaluated using
metrics such as:
▪ Accuracy: The ratio of correctly predicted sentiments to
total predictions.
▪ Precision, Recall, F1-Score: For more detailed evaluation,
especially in cases of imbalanced datasets.
▪ Confusion Matrix: A visual representation of the model's
predictions compared to the actual sentiment.
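
Sample Code:
An end-to-end sketch, assuming Python with scikit-learn installed; the tiny labeled dataset is invented purely for illustration, and a real experiment would use a much larger corpus with a proper train/test split.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Tiny illustrative dataset (assumed); real work needs a much larger labeled corpus
texts = [
    "I love this product, it works great",
    "Absolutely fantastic experience, highly recommend",
    "This is the worst purchase I have ever made",
    "Terrible quality, very disappointed",
    "The item arrived on time and works fine",
    "Awful support, I want a refund",
]
labels = ["positive", "positive", "negative", "negative", "positive", "negative"]

# Feature extraction (TF-IDF) and classification (logistic regression) in one pipeline
model = make_pipeline(TfidfVectorizer(stop_words="english"), LogisticRegression())
model.fit(texts, labels)

# Evaluated on the training data here only for demonstration
predictions = model.predict(texts)
print(classification_report(labels, predictions))

# Classify a new, unseen sentence
print(model.predict(["the product is great but shipping was slow"]))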

Conclusion:
The Sentiment Analysis project illustrates the application of NLP techniques to
derive insights from text data. By preprocessing text, extracting features, and
applying machine learning algorithms, a sentiment classification model can
effectively analyze the emotional tone of texts. This project demonstrates how
sentiment analysis can be used in real-world applications like monitoring
customer feedback, product reviews, or social media trends.
