
Lab Manual

Third Year Semester-VI


Information Technology
Subject: Applied Natural
Language Processing Lab

Even Semester

Institutional Vision, Mission

Our Vision
To foster and permeate higher and quality education with value-added engineering and technology
programs, providing all facilities in terms of technology and platforms for all-round development
with social awareness, and to nurture the youth with international competencies and an exemplary
level of employability even in a highly competitive environment, so that they are innovative,
adaptable and capable of handling the problems faced by our country and the world at large.

Our Mission
The Institution is committed to mobilizing resources and equipping itself with people and materials
of excellence, thereby ensuring that the Institution becomes a pivotal center of service to Industry,
Academia and Society with the latest technology. RAIT engages different platforms such as
technology-enhancing Student Technical Societies, Cultural platforms, Sports excellence centers,
Entrepreneurial Development Centers and a Societal Interaction Cell. To develop the college into
an autonomous institution and deemed university at the earliest, we provide facilities for
advanced research and development programs on par with international standards. We also seek
to invite reputed international and national Institutions and Universities to collaborate with our
institution on issues of common interest in teaching and learning.
Index
Sr. No. Contents
1. List of Experiments
2. Experiment Plan and Course Outcomes
3. Study and Evaluation Scheme
4. Experiment No. 1
5. Experiment No. 2
6. Experiment No. 3
7. Experiment No. 4
8. Experiment No. 5
9. Experiment No. 6
10. Experiment No. 7
11. Experiment No. 8
12. Experiment No. 9
13. Experiment No. 10

List of Experiments
Sr. No. Experiment Name
1 To perform text preprocessing in Python.
2 To perform stemming operations on text.
3 To perform lemmatization operations on text and to understand the morphology of a word by the use of an Add-Delete table.
4 To learn to calculate bigrams from a given corpus and calculate the probability of a sentence.
5 To find POS tags of words in a sentence.
6 Text classification using the Naive Bayes classifier.
7 To find POS tags of words in a sentence using Viterbi decoding.
8 To know the importance of context and the size of the training corpus in learning parts of speech.
9 To understand the concept of chunking and get familiar with the basic chunk tagset.
10 Mini-Project

Experiment Plan & Course Outcome


Lab Outcomes:

CO1 Understand the morphological features of a word.
CO2 Learn about the generation of word forms.
CO3 Understand the use of the Add-Delete table for a word.
CO4 Apply add-one smoothing on a sparse bigram table.
CO5 Understand POS tagging using the Markov model.
CO6 Understand POS tagging using Viterbi decoding.

Study and Evaluation Scheme


Course Code: ITLDLO6022
Course Name: Natural Language Processing Lab

Teaching Scheme:
Theory: --    Practical: 02    Tutorial: --

Credits Assigned:
Theory: --    Practical/Oral: 01    Tutorial: --    Total: 01

Examination Scheme:
Term Work: 25    Practical/Oral: 25    Total: 50

Applied Natural Language Processing Lab

Experiment No. 1
To perform text preprocessing in Python.

Experiment No. 1

Aim: To perform text preprocessing in Python.

Objective: To understand how text preprocessing works in Python.


Theory:

1) NLTK LIBRARY:-
NLTK, the Natural Language Toolkit, is a Python package that you can use for NLP. A lot
of the data that you could be analyzing is unstructured and contains human-readable
text. Before you can analyze that data programmatically, you first need to preprocess it.

2) TEXT LOWERCASE:-
It is necessary to convert the text to lowercase because string matching in Python is
case sensitive; without this step, "Apple" and "apple" would be treated as different tokens.
Function Used:-
text.lower()
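
A minimal sketch of this step (the sample sentence is just an illustration):

text = "Hello World! NLP Is Fun."
lower_text = text.lower()   # every uppercase letter becomes lowercase
print(lower_text)           # hello world! nlp is fun.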

3) REMOVE NUMBERS:-
Python provides a regex module, re, whose built-in function sub() can remove numbers
from a string. This method replaces all occurrences of the given pattern in the
string with a replacement string. If the pattern is not found, the string is returned
unchanged.
Function Used:-
re.sub()
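
For example, a short sketch using the digit pattern r"\d+" (the sample text is illustrative):

import re

text = "There are 3 apples and 12 oranges."
no_numbers = re.sub(r"\d+", "", text)   # replace every run of digits with nothing
print(no_numbers)                       # There are  apples and  oranges.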

4) REMOVE PUNCTUATION:-
Using translate():-
A translation table is first built with str.maketrans(); its first two arguments are empty
strings, and the third is the string of punctuation characters that should be removed.
Passing this table to the translate() method instructs Python to eliminate that
punctuation from the string. This is one of the fastest ways to strip punctuation
from a string.
Function Used:-
text.translate()
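
A small sketch combining str.maketrans() with the standard string.punctuation constant:

import string

text = "Hello, world! How's it going?"
table = str.maketrans("", "", string.punctuation)   # map every punctuation mark to None
no_punct = text.translate(table)
print(no_punct)                                     # Hello world Hows it going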

5) REMOVE DEFAULT STOPWORDS:-

Using Python's NLTK Library:-
NLTK is one of the oldest and most commonly used Python libraries for Natural
Language Processing. NLTK supports stop word removal, and you can find the list of
stop words in its corpus module. To remove stop words from a sentence, divide the
text into words and then drop each word that appears in the list of stop words
provided by NLTK.
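
A hedged sketch of stop word removal (the stopwords and punkt resources must be downloaded once):

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

text = "this is a sample sentence showing off stop word filtration"
stop_words = set(stopwords.words("english"))
filtered = [w for w in word_tokenize(text) if w not in stop_words]
print(filtered)   # ['sample', 'sentence', 'showing', 'stop', 'word', 'filtration']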

Conclusion: We understood the Python NLTK functions used for text preprocessing.


References:
Questions:
• What is text preprocessing in Python?
• What is meant by tokenization?
• Explain the various steps of text preprocessing.
Applied Natural Language Processing Lab

Experiment No. 2

To perform stemming operations on text.


Experiment No. 2

Aim: To perform stemming operations on text.

Objective:
To understand how stemming works on text.
To learn how to use different algorithms for stemming operations.

Theory:

1) Stemming:

Stemming is the process of obtaining the root form of a word. The root, or stem, is the
part to which inflectional affixes (like -ed, -ize, etc.) are added. We create the
stem words by removing the prefix or suffix of a word, so stemming a word may not
result in an actual dictionary word.

For Example: Mangoes ---> Mango

Boys ---> Boy

going ---> go

If our sentences are not yet in tokens, we first need to convert them into tokens. After we
have converted strings of text into word tokens, we can reduce those tokens to their
root form. NLTK provides several stemmers for this, among them the Porter stemmer,
the Snowball stemmer and the Lancaster stemmer; the Porter stemmer is the one we
usually use.

2) Porter Stemmer:

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the
commoner morphological and inflexional endings from words in English. Its main use is
as part of a term normalisation process that is usually done when setting up
Information Retrieval systems.
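
A brief sketch of the Porter stemmer in NLTK:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["mangoes", "boys", "going", "connection"]:
    print(word, "--->", stemmer.stem(word))
# mangoes ---> mango, boys ---> boy, going ---> go, connection ---> connect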

Conclusion: We understood the stemming operation on text using the Porter stemmer.


References:
Questions:
• What is meant by Stemming?
• What are the different types of morphology?
• What are different types of Stemming algorithms?
• Explain Porter Stemmer.
Applied Natural Language Processing Lab

Experiment No. 3

To perform lemmatization operations on text and to
understand the morphology of a word by the use of an
Add-Delete table.
Experiment No. 3

Aim: To perform lemmatization operations on text and to understand the morphology of a word by
the use of an Add-Delete table.

Objective:
To understand how lemmatization operations work on text.
To learn the morphology of a word by using the Add-Delete table on VLab.

Theory:

1) Lemmatization:

Like stemming, lemmatization reduces a word to its root form, with one difference:
lemmatization ensures that the root word belongs to the language, so we always get
valid words. In NLTK (Natural Language Toolkit), we use the WordNetLemmatizer
to get the lemmas of words. We also need to provide a context for the lemmatization,
so we pass the POS (part-of-speech) as a parameter.
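
A small sketch with WordNetLemmatizer (the wordnet corpus must be downloaded once):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("mangoes"))           # mango (default POS is noun)
print(lemmatizer.lemmatize("running", pos="v"))  # run   (verb context)
print(lemmatizer.lemmatize("better", pos="a"))   # good  (adjective context)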

2) Morphology:

Morphemes are considered the smallest meaningful units of language. A morpheme
can either be a root word (play) or an affix (-ed), and the combination of these
morphemes is called a morphological process. So the word "played" is made out of
two morphemes, "play" and "-ed". Finding all the parts of a word (its morphemes)
and thereby describing the properties of the word is called "Morphological Analysis".
Conclusion: We understood the lemmatization operation on text, and we also studied the
morphology of a word using the Add-Delete table on VLab.

References:

Questions:
• What is meant by lemmatization?
• Define morphology.
• List different applications of lemmatization.
• Differentiate between stemming and lemmatization.
Applied Natural Language Processing Lab

Experiment No. 4

To learn to calculate bigrams from a given corpus and
calculate the probability of a sentence.
Experiment No. 4

Aim: To learn to calculate bigrams from a given corpus and calculate the probability of a sentence.

Objective:
To understand the concept of the N-gram.
To understand how to calculate probabilities for bigrams.

Theory:

A combination of words forms a sentence. However, such a formation is meaningful
only when the words are arranged in some order.

E.g.: Sit I car in the

Such a sentence is not grammatically acceptable. However, some perfectly grammatical
sentences can be nonsensical too!

E.g.: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to
the strings of words, i.e., how likely the sentence is.

1) Probability of a Sentence:
If we consider each word occurring in its correct location as an independent event, the
probability of the sentence is: P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule:
P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1)w(2)) ... P(w(n) | w(1)w(2) ... w(n-1))

2) Bigrams:

We can avoid this very long calculation by approximating that the probability of a given
word depends only on its previous word. This assumption is called the Markov
assumption, and such a model is called a Markov model; a bigram is a first-order
Markov model. Bigrams can be generalized to the n-gram, which looks at (n-1) words
in the past.

Therefore, P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(1)) P(w(2)|w(1)) P(w(3)|w(2)) ... P(w(n)|w(n-1))

We use the <s> tag to mark the beginning of a sentence and </s> to mark its end.

A bigram table for a given corpus can be generated and used as a lookup table for
calculating the probability of sentences.
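
A minimal sketch of building bigram counts from a toy corpus and scoring a sentence (the corpus and sentence below are invented for illustration):

from collections import Counter

# Toy corpus; each sentence is padded with <s> and </s> markers.
corpus = ["<s> the cat sat </s>", "<s> the cat ran </s>", "<s> the dog sat </s>"]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def bigram_prob(w1, w2):
    # P(w2 | w1) = count(w1 w2) / count(w1)
    return bigrams[(w1, w2)] / unigrams[w1]

sentence = "<s> the cat sat </s>".split()
p = 1.0
for w1, w2 in zip(sentence, sentence[1:]):
    p *= bigram_prob(w1, w2)
print(p)   # 1.0 * (2/3) * (1/2) * 1.0 = 0.333...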

Conclusion: We understood the concept of bigrams and how to calculate the probability of a
sentence using the chain rule.

References:
Questions:
• Role of the n-gram language model in NLP.
• Define chain rule and bigram.
• Explain the term perplexity.
• What is meant by the Markov model?

Experiment No. 5
Aim: To find POS tags of words in a sentence.

Objective:
To understand the concept of part-of-speech tagging.
To understand how to build a POS tagger.

Theory:

Part-of-speech (POS) tagging is an important Natural Language Processing (NLP)
concept that categorizes words in the text corpus with a particular part-of-speech tag
(e.g., Noun, Verb, Adjective, etc.).

POS tagging could be the very first task in text processing for further downstream tasks
in NLP, like speech recognition, parsing, machine translation, sentiment analysis, etc.

The particular POS tag of a word can be used as a feature by various Machine Learning
algorithms used in Natural Language Processing.

Example Sentence: Learn NLP from Scaler

Learn -> VERB, NLP -> NOUN, from -> PREPOSITION, Scaler -> NOUN

There are various techniques that can be used for POS tagging, such as:

• Rule-based POS tagging: The rule-based POS tagging models apply a set of
handwritten rules and use contextual information to assign POS tags to
words. These rules are often known as context frame rules. One such rule
might be: “If an ambiguous/unknown word ends with the suffix ‘ing’ and is
preceded by a Verb, label it as a Verb”.
• Transformation Based Tagging: The transformation-based approaches use
a pre-defined set of handcrafted rules as well as automatically induced
rules that are generated during training.
• Deep learning models: Various Deep learning models have been used for
POS tagging such as Meta-BiLSTM which have shown an impressive
accuracy of around 97 percent.
• Stochastic (Probabilistic) tagging: A stochastic approach includes
frequency, probability or statistics. The simplest stochastic approach finds
out the most frequently used tag for a specific word in the annotated
training data and uses this information to tag that word in the
unannotated text. But sometimes this approach comes up with sequences
of tags for sentences that are not acceptable according to the grammar
rules of a language. One such approach is to calculate the probabilities of
various tag sequences that are possible for a sentence and assign the POS
tags from the sequence with the highest probability. Hidden Markov
Models (HMMs) are probabilistic approaches to assign a POS Tag.

POS tagging - Hidden Markov Model

A Hidden Markov Model has two important components:

1) Transition Probabilities: The one-step transition probability is the probability of
transitioning from one state to another in a single step.

2) Emission Probabilities: The output probabilities for an observation from a state.
Emission probabilities B = { b_i(o_k) = P(o_k | q_i) }, where o_k is an observation.
Informally, B is the probability that the output is o_k given that the current state is q_i.
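
As a quick illustration, NLTK ships a pre-trained stochastic tagger (the required models are downloaded once; the exact tags may vary with the model version):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Learn NLP from Scaler")
print(nltk.pos_tag(tokens))
# e.g. [('Learn', 'VB'), ('NLP', 'NNP'), ('from', 'IN'), ('Scaler', 'NNP')]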

Conclusion: We understood the concept of POS tagging and studied how a POS tagger works.
Experiment No. 6

Aim: Text classification using the Naive Bayes classifier.

Objective:
To understand the concept of text classification.
To understand how to use the Naive Bayes classifier for text classification.

Theory:

Naive Bayes classifiers are a collection of classification algorithms based on Bayes'
Theorem. It is not a single algorithm but a family of algorithms that all share a common
principle: every pair of features being classified is independent of each other.

Naive Bayes classifiers have been heavily used for text classification and text analysis
machine learning problems.

Text analysis is a major application field for machine learning algorithms. However, the
raw data, a sequence of symbols (i.e., strings), cannot be fed directly to the algorithms
themselves, as most of them expect numerical feature vectors with a fixed size rather
than raw text documents of variable length.


In order to address this, scikit-learn provides utilities for the most common ways to
extract numerical features from text content, namely:

• tokenizing strings and giving an integer ID to each possible token, for
instance by using whitespace and punctuation as token separators;
• counting the occurrences of tokens in each document.
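
Putting these together, a hedged sketch using scikit-learn's CountVectorizer and MultinomialNB; the two-label toy dataset is purely illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: tiny, invented examples for two classes.
train_texts = ["free prize money now", "win cash prize", "meeting at noon", "see you at lunch"]
train_labels = ["spam", "spam", "ham", "ham"]

vectorizer = CountVectorizer()          # token counts as numerical features
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()
clf.fit(X_train, train_labels)

X_test = vectorizer.transform(["free cash now"])
print(clf.predict(X_test))              # ['spam']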

Conclusion: We understood the concept of the Naive Bayes classifier and studied text
classification using it.
Experiment No. 7

Aim: To find POS tags of words in a sentence using Viterbi decoding.

Objective:
To understand the concept of Viterbi decoding.
To find POS tags of words in a sentence using Viterbi decoding.

Theory:

In this experiment, Viterbi decoding will be used to find the POS tag sequence for a
given sentence. Once we have the emission and transition matrices, various algorithms
can be applied to find the POS tags for words; possible algorithms include the backward
algorithm, the forward algorithm and the Viterbi algorithm. Here, in this experiment,
you can get familiar with Viterbi decoding.

Viterbi decoding is based on dynamic programming. The algorithm takes the emission
and transition matrices as input. The emission matrix gives us information about the
probabilities of a POS tag for a given word, and the transition matrix gives the
probability of transition from one POS tag to another. The algorithm observes the
sequence of words and returns the state sequence of POS tags along with its probability.
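
A compact sketch of Viterbi decoding over toy matrices; the tags, words and probabilities below are invented for illustration:

# States (POS tags) and a toy observation sequence.
states = ["NOUN", "VERB"]
words = ["dogs", "run"]

start_p = {"NOUN": 0.6, "VERB": 0.4}                    # initial probabilities
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},          # transition matrix
           "VERB": {"NOUN": 0.6, "VERB": 0.4}}
emit_p = {"NOUN": {"dogs": 0.5, "run": 0.1},            # emission matrix
          "VERB": {"dogs": 0.1, "run": 0.6}}

# viterbi[t][s] = best probability of any tag path ending in state s at step t
viterbi = [{s: start_p[s] * emit_p[s][words[0]] for s in states}]
backptr = [{}]
for t in range(1, len(words)):
    viterbi.append({})
    backptr.append({})
    for s in states:
        prev, p = max(((ps, viterbi[t - 1][ps] * trans_p[ps][s]) for ps in states),
                      key=lambda x: x[1])
        viterbi[t][s] = p * emit_p[s][words[t]]
        backptr[t][s] = prev

# Backtrack from the best final state to recover the tag sequence.
last = max(viterbi[-1], key=viterbi[-1].get)
path = [last]
for t in range(len(words) - 1, 0, -1):
    path.insert(0, backptr[t][path[0]])
print(path, viterbi[-1][last])   # ['NOUN', 'VERB'] 0.126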

Conclusion: We understood how the POS tags of a sentence are found with the help of Viterbi decoding.
Experiment No. 8

Aim: To know the importance of context and the size of the training corpus in learning parts of speech.
Experiment No. 9

Aim: To understand the concept of chunking and get familiar with the basic chunk tagset.

Objective:
To understand the concept of chunking.
To understand the chunk tagset and how to do chunking.

Theory:

A chunk is a collection of basic familiar units that have been grouped together and
stored in a person's memory. In natural language, chunks are collective higher-order
units that have discrete grammatical meanings (noun groups or phrases, verb groups,
etc.).
Chunking is the process of extracting phrases (chunks) from unstructured text. Instead
of using a single word, which may not represent the actual meaning of the text, it is
recommended to use a chunk or phrase.

Chunk Types

The chunk types are based on the syntactic category of the head. Besides the head, a
chunk also contains modifiers (like determiners, adjectives and postpositions in NPs).

The basic types of chunks in English are:

Sr. No.  Chunk type     Tag
1        Noun           NP
2        Verb           VP
3        Adverb         ADVP
4        Adjectival     ADJP
5        Prepositional  PP

In order to create an NP chunk, we first define a chunk grammar using POS tags,
consisting of rules that indicate how sentences should be chunked. Here we define a
simple grammar with a single regular-expression rule. This rule says that whenever the
chunker finds an optional determiner (DT) followed by any number of adjectives (JJ)
and then a noun (NN), a Noun Phrase (NP) chunk should be formed.

The result is a tree, which we can either print or display graphically.
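
A minimal sketch of NP chunking with NLTK's RegexpParser (the tokenizer and tagger models are downloaded once; the sample sentence is illustrative):

import nltk

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "the little yellow dog barked at the cat"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# NP rule: optional determiner, any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)     # (S (NP the/DT little/JJ yellow/JJ dog/NN) barked/VBD at/IN (NP the/DT cat/NN))
# tree.draw()   # displays the tree graphically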

Conclusion: We understood the concept of chunking and the tagset of the different chunk types.
