ANLP Sem VI Lab Manual
Even Semester
Our Vision
To foster and permeate higher and quality education with value added engineering, technology
programs, providing all facilities in terms of technology and platforms for all round development
with social awareness and nurture the youth with international competencies and exemplary level
of employability even under highly competitive environment so that they are innovative, adaptable
and capable of handling problems faced by our country and world at large.
Our Mission
The Institution is committed to mobilize the resources and equip itself with men and materials of
excellence, thereby ensuring that the Institution becomes a pivotal center of service to Industry,
Academy, and society with the latest technology. RAIT engages different platforms such as
technology enhancing Student Technical Societies, Cultural platforms, Sports excellence centers,
Entrepreneurial Development Centers and a Societal Interaction Cell. To develop the college to
become an autonomous institution & deemed university at the earliest, we provide facilities for
advanced research and development programs on par with international standards. We also seek
to invite international and reputed national Institutions and Universities to collaborate with our
institution on the issues of common interest of teaching and learning sophistication.
Index
Sr. No. Contents
1. List of Experiments
2. Experiment Plan and Course Outcomes
3. Study and Evaluation Scheme
4. Experiment No. 1
5. Experiment No. 2
6. Experiment No. 3
7. Experiment No. 4
8. Experiment No. 5
9. Experiment No. 6
10. Experiment No. 7
11. Experiment No. 8
12. Experiment No. 9
13. Experiment No. 10
List of Experiments
Sr. No. Experiment Name
1 To perform text preprocessing in Python.
2 To perform stemming operations on text.
3 To perform lemmatization operations on text and to understand the morphology of a word by the use of the Add-Delete table.
4 To calculate bigrams from a given corpus and calculate the probability of a sentence.
5 To find POS tags of words in a sentence.
6 To perform text classification using the Naive Bayes classifier.
7 To find POS tags of words in a sentence using Viterbi decoding.
9 To understand the concept of chunking and get familiar with the basic chunk tagset.
10 Mini-Project
Course Outcomes
CO2: Learn about generation of word forms.
CO3: Understand the use of the Add-Delete table for word morphology.
CO4: Apply add-one smoothing on a sparse bigram table.
CO5: Understand POS tagging using the Markov model.
CO6: Understand POS tagging using Viterbi decoding.
Experiment No. 1
Aim: To perform text preprocessing in Python.
1)NLTK LIBRARY:-
NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP. A lot of the data that you may be analyzing is unstructured and contains human-readable text. Before you can analyze that data programmatically, you first need to preprocess it.
2)TEXT LOWERCASE:-
Text is converted to lowercase so that words differing only in case (for example "Text" and "text") are treated as the same token, since string matching in Python is case sensitive.
Function Used:-
text.lower()
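A minimal sketch of this step (the sample sentence is only illustrative):

text = "Machine Learning is FUN and Python is Powerful."
lower_text = text.lower()
print(lower_text)   # machine learning is fun and python is powerful.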
3)REMOVE NUMBERS:-
Python provides a regex module (re) with a built-in function sub() that can be used to remove numbers from a string. This method replaces all occurrences of the given pattern in the string with a replacement string. If the pattern is not found, the original string is returned unchanged.
Function Used:-
re.sub()
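A short illustrative sketch using re.sub() to strip digits (the sample text and pattern are assumptions made for the demo):

import re

text = "There are 3 apples and 12 oranges."
# \d+ matches one or more digits; every match is replaced with the empty string.
no_numbers = re.sub(r"\d+", "", text)
print(no_numbers)   # "There are  apples and  oranges."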
4)REMOVE PUNCTUATION:-
Using Translate():-
A translation table is first built with str.maketrans(): its first two arguments are empty strings, and the third is the string of punctuation characters to delete. Passing this table to text.translate() instructs Python to eliminate those punctuation characters from the string. This is one of the best ways to strip punctuation from a string.
Function Used:-
text.translate()
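A small sketch of this approach; the translation table is built with str.maketrans() and the sample sentence is only illustrative:

import string

text = "Hello, world! Isn't NLP fun?"
# The first two arguments are empty strings; the third lists the characters to delete.
table = str.maketrans("", "", string.punctuation)
no_punct = text.translate(table)
print(no_punct)   # "Hello world Isnt NLP fun"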
Experiment No. 2
Aim: To perform stemming operations on text.
Objective :
To understand how stemming works on text
To learn how to use different algorithms for stemming operations.
Theory:
1)Stemming :
Stemming is the process of obtaining the root form of a word. The root, or stem, is the part to which inflectional affixes (like -ed, -ize, etc.) are added. The stem is created by removing the prefix or suffix of a word, so stemming a word may not result in an actual dictionary word.
going ---> go
If our sentences are not already tokenized, we first need to convert them into tokens. Once the strings of text are converted into word tokens, we can reduce those tokens to their root forms. NLTK provides several stemmers for this: the Porter stemmer, the Snowball stemmer, and the Lancaster stemmer. The Porter stemmer is the one most commonly used.
2) Porter Stemmer:
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the
commoner morphological and inflexional endings from words in English. Its main use is
as part of a term normalisation process that is usually done when setting up
Information Retrieval systems.
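A brief sketch of stemming with NLTK's PorterStemmer; the sentence is illustrative, and the nltk.download() call fetches the tokenizer models on first use:

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")                     # tokenizer models (needed once)

stemmer = PorterStemmer()
sentence = "The children were going and playing happily"
tokens = word_tokenize(sentence)           # convert the string into word tokens
stems = [stemmer.stem(token) for token in tokens]
print(stems)   # e.g. ['the', 'children', 'were', 'go', 'and', 'play', 'happili']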
Experiment No. 3
Aim: To perform lemmatization operations on text and to understand the morphology of a word by the use of the Add-Delete table.
Objective :
To understand how lemmatization works on text
To learn morphology of a word by using Add-Delete table on VLab
Theory:
1)Lemmatization :
Lemmatization does the same as stemming, with the difference that it ensures the root word (the lemma) belongs to the language, so it always produces valid words. In NLTK (Natural Language Toolkit), we use the WordNetLemmatizer to get the lemmas of words. We also need to provide a context for lemmatization, so we pass the pos (part-of-speech) as a parameter.
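A short sketch of lemmatization with NLTK's WordNetLemmatizer; the example words are illustrative, and the nltk.download() call fetches the WordNet data on first use:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")                   # lexical database used by the lemmatizer

lemmatizer = WordNetLemmatizer()
# Without a pos hint the word is treated as a noun; pos="v" treats it as a verb.
print(lemmatizer.lemmatize("running"))             # running
print(lemmatizer.lemmatize("running", pos="v"))    # run
print(lemmatizer.lemmatize("better", pos="a"))     # good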
2) Morphology:
Morphology is the study of the internal structure of words and of how words are formed from smaller meaning-bearing units (morphemes), for example by adding or removing affixes to a root. The Add-Delete table records, for a given root, what is deleted from and added to it to generate its different word forms.
References:
Questions:
• What is meant by lemmatization?
• Define morphology.
• List different applications of lemmatization.
• Differentiate between stemming and lemmatization.
Experiment No. 4
Aim: To learn to calculate bigrams from a given corpus and calculate probability of a sentence.
Objective :
To understand the concept of n-grams
To understand how to calculate probabilities for bigrams
Theory:
1) Probability of Sentence:
By the chain rule of probability, the probability of a sentence w(1), w(2), ..., w(n) is:
P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1) w(2)) * ... * P(w(n) | w(1) w(2) ... w(n-1))
2) Bigrams:
We can avoid this very long calculation by approximating that the probability of a given word depends only on its previous word. This assumption is called the Markov assumption, and such a model is called a Markov model; the bigram model is a first-order Markov model. Bigrams can be generalized to the n-gram, which looks at the (n-1) previous words.
We use the <s> tag to mark the beginning of a sentence and </s> to mark its end.
A bigram table for a given corpus can be generated and used as a lookup table for calculating the probability of sentences, as in the sketch below.
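The sketch below builds such a bigram lookup table from a tiny made-up corpus and uses it to score a sentence; the corpus and sentences are purely illustrative:

from collections import defaultdict

corpus = [
    "<s> i love nlp </s>",
    "<s> i love python </s>",
    "<s> nlp is fun </s>",
]

unigram_counts = defaultdict(int)
bigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = sentence.split()
    for i in range(len(tokens) - 1):
        unigram_counts[tokens[i]] += 1
        bigram_counts[(tokens[i], tokens[i + 1])] += 1

def bigram_prob(prev, word):
    # P(word | prev) = count(prev, word) / count(prev)
    return bigram_counts[(prev, word)] / unigram_counts[prev] if unigram_counts[prev] else 0.0

def sentence_prob(sentence):
    tokens = sentence.split()
    prob = 1.0
    for i in range(len(tokens) - 1):
        prob *= bigram_prob(tokens[i], tokens[i + 1])
    return prob

print(sentence_prob("<s> i love nlp </s>"))   # 2/3 * 1 * 1/2 * 1/2 ≈ 0.167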
References:
Questions:
• What is the role of the n-gram language model in NLP?
• Define chain rule and bigram.
• Explain the term perplexity.
• What is meant by a Markov model?
Experiment No. 5
Aim: To find POS tags of words in a sentence
Objective :
To understand the concept of part-of-speech tagging
To understand how to build a POS tagger
Theory:
POS tagging is often the very first task in text processing, feeding further downstream tasks in NLP such as speech recognition, parsing, machine translation, sentiment analysis, etc. The particular POS tag of a word can be used as a feature by various Machine Learning algorithms used in Natural Language Processing.
For example, in the sentence "Learn NLP from Scaler": Learn -> VERB, NLP -> NOUN, from -> PREPOSITION, Scaler -> NOUN.
There are various techniques that can be used for POS tagging (a short NLTK example follows this list), such as:
• Rule-based POS tagging: The rule-based POS tagging models apply a set of
handwritten rules and use contextual information to assign POS tags to
words. These rules are often known as context frame rules. One such rule
might be: “If an ambiguous/unknown word ends with the suffix ‘ing’ and is
preceded by a Verb, label it as a Verb”.
• Transformation Based Tagging: The transformation-based approaches use
a pre-defined set of handcrafted rules as well as automatically induced
rules that are generated during training.
• Deep learning models: Various Deep learning models have been used for
POS tagging such as Meta-BiLSTM which have shown an impressive
accuracy of around 97 percent.
• Stochastic (Probabilistic) tagging: A stochastic approach includes
frequency, probability or statistics. The simplest stochastic approach finds
out the most frequently used tag for a specific word in the annotated
training data and uses this information to tag that word in the
unannotated text. But sometimes this approach comes up with sequences
of tags for sentences that are not acceptable according to the grammar
rules of a language. One such approach is to calculate the probabilities of
various tag sequences that are possible for a sentence and assign the POS
tags from the sequence with the highest probability. Hidden Markov
Models (HMMs) are probabilistic approaches to assign a POS Tag.
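As mentioned above, a ready-made stochastic tagger can be tried directly in NLTK; the sketch below tags the example sentence from this section (the exact tags returned may vary with the tagger model):

import nltk
from nltk import word_tokenize, pos_tag

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")   # pretrained statistical tagger

tokens = word_tokenize("Learn NLP from Scaler")
print(pos_tag(tokens))
# e.g. [('Learn', 'VB'), ('NLP', 'NNP'), ('from', 'IN'), ('Scaler', 'NNP')]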
Conclusion: We understood the concept of POS tagging and studied how a POS tagger assigns tags.
Experiment No. 6
Objective :
To understand the concept of text classification
To understand how to use the Naive Bayes classifier for text classification
Theory:
Naive Bayes classifiers have been heavily used for text classification and text analysis
machine learning problems.
Text analysis is a major application field for machine learning algorithms. However, the raw data, a sequence of symbols (i.e. strings), cannot be fed directly to the algorithms themselves, as most of them expect numerical feature vectors of fixed size rather than raw text documents of variable length.
In order to address this, scikit-learn provides utilities for the most common ways to
extract numerical features from text content, namely:
• tokenizing strings and giving an integer id for each possible token, for
instance by using white-spaces and punctuation as token separators.
• counting the occurrences of tokens in each document (see the sketch below).
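A minimal sketch of this pipeline in scikit-learn; the training texts and labels are made up for illustration:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set.
texts = [
    "I loved this movie, it was great",
    "What a fantastic experience",
    "This was a terrible film",
    "I hated every minute of it",
]
labels = ["pos", "pos", "neg", "neg"]

# CountVectorizer tokenizes the strings and counts token occurrences;
# MultinomialNB is then trained on the resulting count vectors.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["a great fantastic movie"]))   # expected: ['pos']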
Conclusion: We understood the concept of the Naive Bayes classifier and studied text classification using it.
Experiment No. 7
Objective :
To understand the concept of Viterbi decoding
To find POS tags of words in a sentence using Viterbi decoding.
Theory:
In this experiment, Viterbi decoding will be used to find the POS tag sequence for a given sentence. Once we have the emission and transition matrices, various algorithms can be applied to find the POS tags for the words. Some possible algorithms are the backward algorithm, the forward algorithm, and the Viterbi algorithm. Here, in this experiment, you can get familiar with Viterbi decoding.
Viterbi decoding is based on dynamic programming. The algorithm takes the emission and transition matrices as input. The emission matrix gives the probability of observing a word given a POS tag, and the transition matrix gives the probability of transition from one POS tag to another. The algorithm observes the sequence of words and returns the most probable state sequence of POS tags along with its probability.
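The sketch below implements Viterbi decoding for a toy two-tag example; the tag set and the transition and emission probabilities are made-up illustrative values, not taken from a real corpus:

tags = ["NOUN", "VERB"]
start_p = {"NOUN": 0.6, "VERB": 0.4}                      # P(tag at sentence start)
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},            # P(next tag | current tag)
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {"NOUN": {"dogs": 0.6, "bark": 0.1},             # P(word | tag)
          "VERB": {"dogs": 0.1, "bark": 0.7}}

def viterbi(words):
    # V[i][tag] = (best probability of ending in `tag` at position i, previous tag on that path)
    V = [{t: (start_p[t] * emit_p[t].get(words[0], 1e-6), None) for t in tags}]
    for i in range(1, len(words)):
        V.append({})
        for t in tags:
            prob, prev = max(
                (V[i - 1][pt][0] * trans_p[pt][t] * emit_p[t].get(words[i], 1e-6), pt)
                for pt in tags)
            V[i][t] = (prob, prev)
    # Pick the most probable final tag and backtrack to recover the full sequence.
    best = max(V[-1], key=lambda t: V[-1][t][0])
    path, prob = [best], V[-1][best][0]
    for i in range(len(words) - 1, 0, -1):
        best = V[i][best][1]
        path.insert(0, best)
    return path, prob

print(viterbi(["dogs", "bark"]))   # (['NOUN', 'VERB'], 0.1764)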
Conclusion: We understood how POS tags are assigned with the help of Viterbi decoding.
Experiment No. 8
Experiment No. 9
Aim: To understand the concept of chunking and get familiar with the basic chunk tagset.
Objective :
To understand the concept of chunking
To understand the chunk tagset and how to perform chunking
Theory:
A chunk is a collection of basic familiar units that have been grouped together and
stored in a person’s memory. In natural language, chunks are collective higher order
units that have discrete grammatical meanings (noun groups or phrases, verb groups,
etc.)
Chunking is a process of extracting phrases (chunks) from unstructured text. Instead of using a single word, which may not represent the actual meaning of the text, it is recommended to use a chunk or phrase.
Chunk Types
The chunk types are based on the syntactic category of the chunk head. Besides the head, a chunk also contains modifiers (like determiners, adjectives, and postpositions in NPs).
1. Noun (NP)
2. Verb (VP)
3. Adverb (ADVP)
4. Adjectival (ADJP)
5. Prepositional (PP)
To create NP chunks, we first define a chunk grammar using POS tags, consisting of rules that indicate how sentences should be chunked. In this case we define a simple grammar with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT), followed by any number of adjectives (JJ), followed by a noun (NN). A sketch of this grammar in NLTK is shown below.
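A small sketch of this grammar with NLTK's RegexpParser; the sentence is illustrative, and the download calls fetch the tokenizer and tagger models on first use:

import nltk
from nltk import word_tokenize, pos_tag, RegexpParser

nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

# NP chunk = optional determiner (DT), any number of adjectives (JJ), then a noun (NN).
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = RegexpParser(grammar)

sentence = "The little dog saw a new cat"
tagged = pos_tag(word_tokenize(sentence))
tree = chunker.parse(tagged)
print(tree)   # NP chunks expected around "The little dog" and "a new cat"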