NLP Lab Manual
Final Year Semester-VIII
Subject: Natural Language Processing
Even Semester
Our Vision
To foster and permeate higher and quality education with value-added engineering and technology programs, providing all facilities in terms of technology and platforms for all-round development with societal awareness, and to nurture the youth with international competencies and an exemplary level of employability even in a highly competitive environment, so that they are innovative, adaptable and capable of handling problems faced by our country and the world at large.
Our Mission
The Institution is committed to mobilize resources and equip itself with men and materials of excellence, thereby ensuring that the Institution becomes a pivotal center of service to industry, academia and society with the latest technology. RAIT engages different platforms such as technology-enhancing Student Technical Societies, Cultural platforms, Sports excellence centers, an Entrepreneurial Development Center and a Societal Interaction Cell. To develop the college into an autonomous institution and deemed university at the earliest, with facilities for advanced research and development programs on par with international standards. To invite international and reputed national institutions and universities to collaborate with our institution on issues of common interest in teaching and learning sophistication.
It is our earnest endeavour to produce high quality engineering professionals who are
innovative and inspiring, thought and action leaders, competent to solve problems faced
by society, nation and world at large by striving towards very high standards in learning,
teaching and training methodology.
Dr. Vijay D. Patil
President, RAES
Department of Computer Engineering
Vision
To impart higher and quality education in computer science with value-added engineering and technology programs to prepare technically sound, ethically strong engineers with social awareness. To extend the facilities to meet fast-changing requirements and nurture the youth with international competencies and an exemplary level of employability and research under highly competitive environments.
Mission
• To mobilize the resources and equip the institution with men and materials of excellence to provide knowledge and develop technologies in the thrust areas of computer science and engineering.
• To provide diverse platforms of sports, technical, co-curricular and extracurricular activities for the overall development of students with an ethical attitude.
• To prepare the students to sustain the impact of computer education for social needs encompassing industry, educational institutions and public service.
• To collaborate with IITs, reputed universities and industries for the technical and overall upliftment of students and for continued learning and entrepreneurship.
3. Broad Base
To provide the broad education necessary to understand the science of computer engineering and its impact in a global and social context.
4. Techno-leader
To provide exposure to emerging cutting-edge technologies, adequate training and opportunities to work in teams on multidisciplinary projects with effective communication skills and leadership qualities.
5. Practice citizenship
To provide knowledge of professional and ethical responsibility and to contribute to society
through active engagement with professional societies, schools, civic organizations or other
community activities.
PO2: Problem analysis: Identify, formulate, review research literature, and analyze complex engineering problems, reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
PO5: Modern tool usage: Create, select, and apply appropriate techniques, resources, and
modern engineering and IT tools including prediction and modeling to complex engineering
activities with an understanding of the limitations.
PO6: The engineer and society: Apply reasoning informed by the contextual knowledge to
assess societal, health, safety, legal and cultural issues and the consequent responsibilities
relevant to the professional engineering practice.
PO7: Environment and sustainability: Understand the impact of the professional engineering
solutions in societal and environmental contexts, and demonstrate the knowledge of, and need
for sustainable development.
PO8: Ethics: Apply ethical principles and commit to professional ethics and responsibilities and
norms of the engineering practice.
PO9: Individual and team work: Function effectively as an individual, and as a member or
leader in diverse teams, and in multidisciplinary settings.
PO11: Project management and finance: Demonstrate knowledge and understanding of the
engineering and management principles and apply these to one’s own work, as a member and
leader in a team, to manage projects and in multidisciplinary environments.
PO12: Life-long learning: Recognize the need for, and have the preparation and ability to
engage in independent and life-long learning in the broadest context of technological change.
PSO1: To build competencies towards problem solving with an ability to understand, identify,
analyze and design the problem, implement and validate the solution including both hardware
and software.
PSO2: To build an appreciation of, and acquire knowledge of, current computing techniques with an ability to use the skills and tools necessary for computing practice.
PSO3: To be able to match industry requirements in the area of computer science and engineering, and to equip students with skills to adopt and imbibe new technologies.
Index
1. List of Experiments
2. Experiment Plan and Course Outcomes
3. Mapping of Course Outcomes – Program Outcomes and Program Specific Outcomes
4. Study and Evaluation Scheme
5. Experiment No. 1
6. Experiment No. 2
7. Experiment No. 3
8. Experiment No. 4
9. Experiment No. 5
10. Experiment No. 6
11. Experiment No. 7
12. Experiment No. 8
13. Experiment No. 9
14. Mini Project
List of Experiments
1. Study of R and basic commands to access text data.
2. Perform Preprocessing (Tokenization, Script Validation, Stop word removal and Stemming) of text.
3. Perform Morphological Analysis.
4. Implement N-Gram model (bigram extraction).
5. Implement Part-of-Speech (POS) Tagging.
6. Implement chunking to extract Noun Phrases.
7. Identify semantic relationships between the words from given text (Use WordNet Dictionary).
8. Study on Reference Resolution Algorithm.
9. Perform Named Entity Recognition (NER) on given text.
10. Mini Project: One real-life Natural Language application to be implemented (Use standard Datasets available on the web).
Course Outcomes:
Experiment Plan:
Module No. | Week No. | Experiment Name | Course Outcome | Weightage
1 | W1 | Study of R and basic commands to access text data. | CO1 | 10
2 | W2 | Perform Preprocessing (Tokenization, Script Validation, Stop word removal and Stemming) of text. | CO2 | 03
3 | W3 | Perform Morphological Analysis. | CO2 | 03
4 | W4 | Implement N-Gram model (bigram extraction). | CO2 | 04
5 | W5 | Implement Part-of-Speech (POS) Tagging. | CO3 | 10
6 | W6 | Implement chunking to extract Noun Phrases. | CO4 | 05
7 | W7 | Identify semantic relationships between the words from given text (Use WordNet Dictionary). | CO4 | 05
Contribution to Program Specific Outcomes
Course Outcome | PSO1 | PSO2 | PSO3
CO1: Understand fundamental concepts of natural language text processing and implement basic commands of text processing using the R tool. | 3 | 3 | 2
◦ Assignments: 10 Marks
◦ Mini Project: Report preparation and implementation, along with a survey of research papers related to the selected topic: 25 Marks
Note: Although it is not mandatory, the experiments can be conducted with reference to any Indian regional language.
Computational Lab-II
Experiment No. : 1
Experiment No.1
1. Aim: Study of R tool and basic commands to access text data.
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
4. Theory:
R is one of the most popular open-source software projects for data science. It is used for analyzing data and constructing graphics, and it is one of the popular tools used for processing natural language text.
R has a wide variety of useful packages. The most commonly used packages for text
analysis and natural language processing are:
• OpenNLP
Apache OpenNLP is widely used for most common tasks in NLP, such as
tokenization, POS tagging, named entity recognition (NER), chunking, parsing, and
so on. It provides functions for sentence annotation, word annotation, POS tag
annotation, and annotation parsing using the Apache OpenNLP chunking parser.
• tm Package
It is a text-mining framework which uses a corpus, the main data structure of the tm package, for storing and manipulating text documents.
• koRpus Package
The koRpus package is a set of tools to analyze texts. It includes functions for automatic language detection, hyphenation, and several indices of lexical diversity. Basic import functions for language corpora are also provided, to enable frequency analysis and measures like tf-idf.
• SnowballC
An R interface to the C libstemmer library that implements Porter’s word stemming
algorithm for collapsing words to a common root to aid comparison of vocabulary.
Download link
https://rstudio.com/products/rstudio/download/
1. Creating a vector
> x <- c(1, 2, 3, 4, 5, 6)
2. Summation
> sum(x)
[1] 21
3. Mean
> mean(x)
[1] 3.5
4. Median
> median(x)
[1] 3.5
5. Square root
> sqrt(x)
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490
6. Squaring
> x^2
[1] 1 4 9 16 25 36
7. Creating a sequence
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
11. Regular expressions
> grep("[a-zA-Z]",c(123,"abc"),value=TRUE)
[1] "abc"
> grep("(ab){2}",c("aabaa","abaaabab","abab"),value=TRUE)
[1] "abaaabab" "abab"
> grep("^(ab)",c("aabaa","abaaabab","bab"),value=TRUE)
[1] "abaaabab"
> grep("(ab)$",c("aabaa","abaaabab","bab"),value=TRUE)
[1] "abaaabab" "bab"
12. Getting information about functions
> ?par
5. Conclusion:
R is a language used for statistical computation, data analysis and graphical representation of data. After performing this experiment, we are able to work with R packages and represent output in visual form.
6. Viva Questions:
• What is Natural Language Processing?
• What is Text Analysis?
• What are features of R?
References:
1. Brian Neil Levine, An Introduction to R Programming
2. Niel J le Roux, Sugnet Lubbe, A Step-by-Step Tutorial: An Introduction to R Application and Programming
Computational Lab-II
Experiment No. : 2
Experiment No.2
1. Aim: Perform Pre-processing (Tokenization, Script Validation, Stop word removal and Stemming) of Text.
2. Objectives:
• To understand natural language processing and to learn how to apply basic algorithms in
this field.
• To implement various language Models.
Outcomes: Students will be able to apply morphological analysis on natural language text.
4. Theory:
Text pre-processing is traditionally an important step for natural
language processing (NLP) tasks. It transforms text into a more digestible form so that
algorithms can perform better. It simply means to bring your text into a form that
is predictable and analyzable for your task.
b. Tokenization
Tokenization is the process of splitting a sentence into a sequence of words so that word-by-word processing can be performed easily. Here we use the white space character as the delimiter.
Tokenization of the Marathi sentence 'शिवाजीची आई कोण होती' ('Who was Shivaji's mother?'):
Token 0: शिवाजीची
Token 1: आई
Token 2: कोण
Token 3: होती
d. Stemming
Suffix stripping is done in this step. The widely used method for this processing is a stemmer, which uses a suffix list to remove suffixes from words. The stem is not necessarily the linguistic root of the word.
Stemming in English :
car, cars, car's, cars' => car
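The steps above can be combined into one pipeline. Below is a minimal Python sketch, assuming the NLTK package with its 'punkt' and 'stopwords' resources; the sample sentence is an illustrative assumption.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
# assumed one-time downloads: nltk.download('punkt'), nltk.download('stopwords')

text = "The cars were parked near the banks of the river."
tokens = nltk.word_tokenize(text)                                         # tokenization
stops = set(stopwords.words('english'))
filtered = [t for t in tokens if t.isalpha() and t.lower() not in stops]  # stop word removal
stemmer = PorterStemmer()
print([stemmer.stem(t) for t in filtered])  # suffix stripping, e.g. ['car', 'park', 'near', 'bank', 'river']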
5. Conclusion: We learned text pre-processing steps such as tokenization, script validation, stop word removal and stemming, using inbuilt libraries from Python/R.
6. Viva Questions:
• What is the importance of preprocessing in NLP tasks?
• How does the Porter Stemmer work?
• What is the difference between stemming and lemmatization?
References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.
Computational Lab-II
Experiment No. : 3
Experiment No.3
1. Aim: Perform Morphological Analysis.
2. Objectives:
• To understand natural language processing and to learn how to apply basic algorithms
in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
4. Theory:
A word can be simple or complex. For example, the word 'cat' is simple because one cannot further decompose the word into smaller parts. On the other hand, the word 'cats' is complex, because the word is made up of two parts: root 'cat' and plural suffix '-s'. Analysis of a word into root and affix(es) is called morphological analysis of a word. It is mandatory to identify the root of a word for any natural language processing task. A root word can have various forms. For example, the word 'play' in English has the following forms: 'play', 'plays', 'played' and 'playing'. Hindi has a greater number of forms for the word 'खेल' (khel), which is equivalent to 'play'.
Thus we understand that morphological richness varies from one language to another. Indian languages are generally morphologically rich languages, and therefore morphological analysis of words becomes a very significant task for Indian languages.
1. Inflectional morphology
Deals with word forms of a root, where there is no change in lexical category. For
example, 'played' is an inflection of the root word 'play'. Here, both 'played' and 'play' are
verbs.
2. Derivational morphology
Deals with word forms of a root, where there is a change in the lexical category. For
example, the word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness'
is a derived noun form of the adjective 'happy'.
Morphological Features:
All words will have their lexical category attested during morphological analysis. A noun and a pronoun can take suffixes of the following features: gender, number, person, case.
For example:
Hindi: लडके (ladake)
English: boy – rt=boy, cat=n, gen=m, num=sg
toys – rt=toy, cat=n, num=pl, per=3
'rt' stands for root. 'cat' stands for lexical category. The value of lexical category can be noun, verb, adjective, pronoun, adverb, preposition.
'gen' stands for gender. The value of gender can be masculine or feminine.
'num' stands for number. The value of number can be singular (sg) or plural (pl).
The value of tense can be present, past or future. This feature is applicable for verbs.
• Word Generation:
It is the inverse process, where we generate different forms of a word from a given root word.
Example in English: root 'play' → 'play', 'plays', 'played', 'playing'
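For the analysis direction, the following is a minimal Python sketch, assuming NLTK with its WordNet data downloaded, that recovers the root of inflected forms:

from nltk.stem import WordNetLemmatizer
# assumed one-time download: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('plays', pos='v'))    # play
print(lemmatizer.lemmatize('played', pos='v'))   # play
print(lemmatizer.lemmatize('playing', pos='v'))  # play
print(lemmatizer.lemmatize('cats', pos='n'))     # cat (root 'cat' + plural suffix '-s')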
7. Conclusion:
We have learnt about the different morphological features of a word and understood that an obvious use of morphology in NLP systems is to reduce the number of forms of words to be stored.
8. Viva Questions:
• What is the difference between word analysis and word generation?
• What are the different morphological features?
References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.
2. Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval,
Oxford University Press (2008).
Computational Lab-II
Experiment No. : 4
Experiment No.4
1. Aim: To implement N-Gram model (bi-gram extraction).
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To implement various language Models.
Outcomes: Students will understand the bigram probabilistic language model used in natural language processing tasks.
4. Theory:
An n-gram model is a type of probabilistic language model for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model. N-gram models are now widely used in probability, communication theory, computational linguistics (for instance, statistical natural language processing), computational biology (for instance, biological sequence analysis), and data compression. Two benefits of n-gram models (and algorithms that use them) are simplicity and scalability – with larger n, a model can store more context with a well-understood space–time tradeoff, enabling small experiments to scale up efficiently.
N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). For example, take the sentence "The cow jumps over the moon". If N=2 (known as bigrams), then the n-grams would be:
the cow
cow jumps
jumps over
over the
the moon
So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc., essentially moving one word forward to generate the next bigram. When N=1, the n-grams are known as unigrams, and these are essentially the individual words in a sentence. When N=2 they are called bigrams, and when N=3 they are called trigrams. When N>3 they are usually referred to as four-grams, five-grams, and so on.
If X = the number of words in a given sentence K, the number of n-grams for sentence K would be:
Ngrams(K) = X − (N − 1)
For the sentence above, X = 6 and N = 2, so the count is 6 − (2 − 1) = 5, matching the five bigrams listed.
Bigrams
We can avoid this very long calculation by approximating that the probability of a given word depends only on its previous word. This assumption is called the Markov assumption, and such a model is called a Markov model. Bigrams can be generalized to the n-gram, which looks at (n-1) words in the past. A bigram is a first-order Markov model. Therefore,
P(w(1), w(2), ..., w(n-1), w(n)) ≈ P(w(1)) P(w(2)|w(1)) P(w(3)|w(2)) ... P(w(n)|w(n-1))
N-grams are used for a variety of different tasks. For example, when developing a
language model, n-grams are used to develop not just unigram models but also bigram
and trigram models. Google and Microsoft have developed web scale n-gram models that
can be used in a variety of tasks such as spelling correction, word breaking and text
summarization.
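A minimal Python sketch of bigram extraction and maximum-likelihood bigram probability estimation, assuming NLTK with its 'punkt' resource; the sentence is the example used above:

from collections import Counter
import nltk
# assumed one-time download: nltk.download('punkt')

tokens = nltk.word_tokenize("the cow jumps over the moon")
bigram_list = list(nltk.bigrams(tokens))  # [('the', 'cow'), ('cow', 'jumps'), ...]
print(bigram_list)

# P(w2|w1) is estimated as count(w1, w2) / count(w1)
unigram_counts = Counter(tokens)
bigram_counts = Counter(bigram_list)
print(bigram_counts[('the', 'cow')] / unigram_counts['the'])  # P(cow|the) = 1/2 = 0.5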
7. Conclusion: We learned language modelling and the n-gram, one of the most widely used tools in language processing. Language models offer a way to assign a probability to a sentence or other sequence of words, and to predict a word from preceding words.
8. Viva Questions:
• What are the advantages of using the N-gram model in text classification?
• Explain how bigrams for text classification work, with a suitable example.
References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.
2. Jalaj Thanaki, “Python Natural Language Processing”First edition Kindle edition
Computational Lab-II
Experiment No. : 5
To implement Part-of-Speech (POS) Tagging
Experiment No.5
1. Aim: Implement Part-of-Speech (POS) Tagging.
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
Outcomes: Students will be able to analyze the syntactic structure of a language using syntax analysis techniques.
4. Theory:
POS tagging or part-of-speech tagging is the procedure of assigning a grammatical category like noun, verb, adjective, etc. to a word. In this process, both the lexical information and the context play an important role, as the same lexical form can behave differently in different contexts.
Part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging
or word-category disambiguation, is the process of marking up a word in a text (corpus) as
corresponding to a particular part of speech, based on both its definition and its context—i.e., its
relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified
form of this is commonly taught to school-age children, in the identification of words as nouns,
verbs, adjectives, adverbs, etc.
Part-of-speech tagging is harder than just having a list of words and their parts of speech,
because some words can represent more than one part of speech at different times, and because
some parts of speech are complex or unspoken. This is not rare—in natural languages (as
opposed to many artificial languages), a large percentage of word-forms are ambiguous. For
example, even "dogs", which is usually thought of as just a plural noun, can also be a verb: "The sailor dogs the hatch."
Correct grammatical tagging will reflect that "dogs" is here used as a verb, not as the more
common plural noun. Grammatical context is one way to determine this; semantic analysis can
also be used to infer that "sailor" and "hatch" implicate "dogs" as 1) in the nautical context and
2) an action applied to the object "hatch" (in this context, "dogs" is a nautical term meaning
"fastens (a watertight door) securely").
E. Brill’s tagger, one of the first and most widely used English POS-taggers, employs rule-based
algorithms. Let us first look at a very brief overview of what rule-based tagging is all about.
Rule-Based Tagging
Automatic part of speech tagging is an area of natural language processing where statistical
techniques have been more successful than rule-based methods.
For example, if the preceding word is an article, then the word in question must be a noun. This
information is coded in the form of rules.
Example of a rule: if the preceding word is an article, then tag the current word as a noun.
Defining a set of rules manually is an extremely cumbersome process and is not scalable at all.
So we need some automatic way of doing this.
The Brill’s tagger is a rule-based tagger that goes through the training data and finds out the set
of tagging rules that best define the data and minimize POS tagging errors. The most important
point to note here about Brill’s tagger is that the rules are not hand-crafted, but are instead found
out using the corpus provided. The only feature engineering required is a set of rule templates
that the model can use to come up with new features.
The term ‘stochastic tagger’ can refer to any number of different approaches to the problem of
POS tagging. Any model which somehow incorporates frequency or probability may be properly
labelled stochastic.
The simplest stochastic taggers disambiguate words based solely on the probability that a word
occurs with a particular tag. In other words, the tag encountered most frequently in the training
set with the word is the one assigned to an ambiguous instance of that word. The problem with
this approach is that while it may yield a valid tag for a given word, it can also yield
inadmissible sequences of tags.
An alternative to the word frequency approach is to calculate the probability of a given sequence
of tags occurring. This is sometimes referred to as the n-gram approach, referring to the fact that
the best tag for a given word is determined by the probability that it occurs with the n previous
tags. This approach makes much more sense than the one defined before, because it considers
the tags for individual words based on context.
The next level of complexity that can be introduced into a stochastic tagger combines the
previous two approaches, using both tag sequence probabilities and word frequency
measurements. This is known as the Hidden Markov Model (HMM).
For example, the word "Park" can have two different lexical categories based on the context.
Assigning part of speech to words by hand is a common exercise one can find in an elementary
grammar class. But here we wish to build an automated tool which can assign the appropriate
part-of-speech tag to the words of a given sentence. One can think of creating handcrafted rules
by observing patterns in the language, but this would limit the system's performance to the
quality and number of patterns identified by the rule crafter. Thus, this approach is not
practically adopted for building POS Tagger. Instead, a large corpus annotated with correct POS
tags for each word is given to the computer and algorithms then learn the patterns automatically
from the data and store them in the form of a trained model. Later this model can be used to POS
tag new sentences.
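A short Python sketch of tagging with such a pre-trained model, assuming NLTK with its 'punkt' and 'averaged_perceptron_tagger' resources:

import nltk
# assumed one-time downloads: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The cow jumps over the moon.")
print(nltk.pos_tag(tokens))
# expected output (tags may vary with the model):
# [('The', 'DT'), ('cow', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('moon', 'NN'), ('.', '.')]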
5. Conclusion: We learned about Part-of-Speech (POS) tags, which are useful for building parse trees, for building NERs (most named entities are nouns), and for extracting relations between words.
6. Viva Questions:
• What are the different types of POS taggers?
• Explain with an example how POS tags are useful in building a lemmatizer.
7. References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.
2. Jalaj Thanaki, “Python Natural Language Processing”First edition Kindle edition
Computational Lab-II
Experiment No. : 6
To implement chunking to extract Noun
Phrases
Experiment No.6
1. Aim: To implement chunking to extract Noun Phrases.
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
Outcomes: Students will be able to analyze a sentence by identifying the correlated constituents (noun groups, verbs, verb groups, etc.).
4. Theory:
Chunking is an analysis of a sentence which identifies the constituents (noun groups, verbs, verb groups, etc.) which are correlated. These are non-overlapping regions of text. Usually, each chunk contains a head, with the possible addition of some function words and modifiers either before or after, depending on the language. Chunks are non-recursive in nature, i.e. a chunk cannot contain another chunk of the same category.
1. Noun Group
2. Verb Group
For example, the sentence 'He reckons the current account deficit will narrow to only 1.8 billion
in September.' can be divided as follows:
[NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only 1.8
billion ] [PP in ] [NP September ]
Each chunk has an open boundary and close boundary that delimit the word groups as a minimal
non-recursive unit.
Chunking of text involves dividing a text into syntactically correlated word groups. For example, the sentence 'He ate an apple to satiate his hunger.' can be divided as follows:
[NP He ] [VP ate ] [NP an apple] [VP to satiate] [NP his hunger]
Chunk Types
The chunk types are based on the syntactic category part. Besides the head, a chunk also contains modifiers (like determiners, adjectives, postpositions in NPs). The five chunk types and their tags are:
1. Noun chunk – NP
2. Verb chunk – VP
3. Adverb chunk – ADVP
4. Adjectival chunk – ADJP
5. Prepositional chunk – PP
A. NP Noun Chunks
Noun Chunks will be given the tag NP and include non-recursive noun phrases and postposition
for Indian languages and preposition for English. Determiners, adjectives and other modifiers
will be part of the noun chunk.
Eg:
B. Verb Chunks
The verb chunks are marked as VP for English; however, they would be of several types for Indian languages. A verb group will include the main verb and its auxiliaries, if any.
For English:
The types of verb chunks and their tags are described below.
The auxiliaries in the verb group mark the finiteness of the verb at the chunk level. Thus, any verb group which is finite will be tagged as VGF.
3. VGNN Gerunds
An adjectival chunk will be tagged as ADJP for English and JJP for Indian languages. This chunk will consist of all adjectival chunks including the predicative adjectives.
Eg:
वह लड़की है (सुन्दर/JJ)JJP ('That girl is beautiful')
Note: Adjectives appearing before a noun will be grouped together within the noun chunk.
An adverbial chunk will be tagged as ADVP for English.
Eg:
He walks (slowly/ADV)/ADVP
PP Prepositional Chunk
This chunk type is present only for English and not for Indian languages. It consists of only the preposition, not the NP argument.
Eg:
(with/IN)PP a pen
IOB prefixes
Each chunk has an open boundary and close boundary that delimit the word groups as a minimal
non-recursive unit. This can be formally expressed by using IOB prefixes: B-CHUNK for the
first word of the chunk and I-CHUNK for each other word in the chunk. Here is an example of
the file format:
He PRP B-NP
ate VBD B-VP
an DT B-NP
apple NN I-NP
to TO B-VP
satiate VB I-VP
his PRP$ B-NP
hunger NN I-NP
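A minimal Python sketch of noun phrase chunking with a regular-expression grammar, assuming NLTK with its tokenizer and tagger resources; the grammar itself is an illustrative assumption:

import nltk
# assumed downloads: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger')

tagged = nltk.pos_tag(nltk.word_tokenize("He ate an apple to satiate his hunger."))
# NP chunk = optional determiner or possessive, any adjectives, one or more nouns
chunker = nltk.RegexpParser(r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}")
tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    print(' '.join(word for word, tag in subtree.leaves()))  # e.g. 'an apple', 'his hunger'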
4. Conclusion: We learned chunking, which builds on POS tags and is used to extract short phrases such as noun phrases in NLP.
5. Viva Questions:
• Define chunking with an example.
• What is the difference between POS tagging and chunking in NLP?
References:
1. Daniel Jurafsky, James H. Martin “Speech and Language Processing” Second
Edition, Prentice Hall, 2008.
2. Jalaj Thanaki, “Python Natural Language Processing”First edition Kindle edition
Computational Lab-II
Experiment No. : 7
Identify semantic relationships between the
words from given text (Use WordNet Dictionary)
Experiment No.7
1. Aim: Using the WordNet dictionary, identify synonyms from given text.
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
• To design and implement applications based on natural language processing
• To implement various language Models.
• To design systems that use NLP techniques
4. Theory:
Word Senses
Consider two uses of the lemma bank, meaning something like "financial institution" and "sloping mound", respectively:
-- Instead, a bank can hold the investments in a custodial account in the client's name.
-- But as agriculture burgeons on the east bank, the river will shrink even more.
A sense (or word sense) is a discrete representation of one aspect of the meaning of a word. Loosely following lexicographic tradition, we represent each sense by placing a superscript on the lemma, as in bank¹ and bank².
• synonym – words having the same meaning. When two senses of two different words (lemmas) are identical, or nearly identical, we say the two senses are synonyms. Eg. couch/sofa, vomit/throw up, filbert/hazelnut, car/automobile.
• hyponym – one sense is a hyponym of another sense if the first sense is more specific, i.e. a subclass. For example, car is a hyponym of vehicle; dog is a hyponym of animal; and mango is a hyponym of fruit.
• hypernym – the generic term used to designate a class of specifics (i.e., meal is a hypernym of breakfast); vehicle is a hypernym of car, and animal is a hypernym of dog. It is unfortunate that the two words hypernym and hyponym are very similar and hence easily confused; for this reason, the word superordinate is often used instead of hypernym.
• Meronyms & holonyms - Another common relation is meronymy, the part-whole relation.
A leg is part of a chair; a wheel is part of a car. We say that wheel is a meronym of car, and
car is a holonym of wheel.
• Homophones- Two words can be homonyms in a different way if they are spelled differently
but pronounced the same, like write and right, or piece and peace.
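Several of these relations can be queried directly from the WordNet dictionary described next. A minimal Python sketch, assuming NLTK with its WordNet corpus downloaded:

from nltk.corpus import wordnet
# assumed one-time download: nltk.download('wordnet')

car = wordnet.synset('car.n.01')
print(car.hypernyms())          # more general synsets, e.g. [Synset('motor_vehicle.n.01')]
print(car.hyponyms()[:3])       # more specific synsets (kinds of car)
print(car.part_meronyms()[:3])  # parts of a car, e.g. wheels and doors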
What is Wordnet?
WordNet provides information on co-ordinate terms, derivatives, senses and more. It is used to find the similarities between any two words. It also holds information on the results of the related word. In short, one can treat it as a dictionary or thesaurus. Going deeper, WordNet is divided into four subnets:
1. Noun
2. Verb
3. Adjective
4. Adverb
WordNet is an NLTK corpus reader, a lexical database for English. It can be used to find the meanings of words, synonyms or antonyms. One can define it as a semantically oriented dictionary of English. It is imported with the following command:
from nltk.corpus import wordnet
Synset: It is also called a synonym set, i.e., a collection of synonymous words. Let us check an example; a sketch and its output are shown below.
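A minimal sketch, assuming NLTK's WordNet corpus has been downloaded; the word 'dog' is an illustrative choice:

from nltk.corpus import wordnet
syns = wordnet.synsets('dog')
print(syns[0].name())         # the first sense of 'dog'
print(syns[0].definition())   # the dictionary gloss of that sense
print(syns[0].lemma_names())  # the synonymous lemmas grouped in this synset

With current WordNet data, the output looks like:
dog.n.01
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
['dog', 'domestic_dog', 'Canis_familiaris']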
Lexical Relations: These are semantic relations which are reciprocated. If there is a relationship between {x1,x2,...,xn} and {y1,y2,...,yn}, then there is also a relation between {y1,y2,...,yn} and {x1,x2,...,xn}. For example, synonymy is the opposite of antonymy, and hypernymy and hyponymy are types of lexical relations.
Let us write a program in Python to find the synonyms and antonyms of the word "active" using WordNet.
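The body of such a program, as a minimal sketch assuming NLTK's WordNet corpus has been downloaded:

from nltk.corpus import wordnet

synonyms = []
antonyms = []
for syn in wordnet.synsets("active"):
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())                    # every lemma of every sense is a candidate synonym
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())  # antonyms are defined on lemmas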
print(set(synonyms))
print(set(antonyms))
5. Conclusion:
WordNet is a lexical database that has been used by major search engines. From WordNet, information about a given word or phrase can be retrieved. It can be used in the area of artificial intelligence for text analysis. With the help of WordNet, you can create a corpus for spell checking, language translation, spam detection and many more applications.
6. Viva Questions:
• What is Word Sense?
• What are different relations between word sense?
• What is the command to import the WordNet dictionary?
References:
TB1: Daniel Jurafsky, James H. Martin, "Speech and Language Processing", Second Edition, Prentice Hall, 2008.
RB1: Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval, Oxford University Press (2008).
Experiment No. : 8
Study on Reference Resolution Algorithm
Experiment No.8
1. Aim: Study on Reference Resolution Algorithm
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
• To design and implement applications based on natural language processing
• To implement various language Models.
• To design systems that use NLP techniques
Outcomes: Students will be able to identify and resolve references between sentences from the discourse.
3. Hardware / Software Required : Study experiment.
4. Theory:
A discourse is a collocated group of sentences which convey a clear understanding only when read together. The etymology of anaphora is ana (Greek for "back") and pheri (Greek for "to bear"), which in simple terms means repetition. The most prevalent type of anaphora in natural language is the pronominal anaphora: for example, in "Ram went home because he was tired", the pronoun "he" refers back to "Ram". Coreference, as the term suggests, refers to words or phrases referring to a single unique entity in the world. Anaphoric and co-referent entities themselves form a subset of the broader term "discourse parsing", which is crucial for full text understanding.
Reference resolution task in NLP has been widely considered as a task which inevitably
depends on some hand-crafted rules. These rules are based on syntactic and semantic
features of the text under consideration. Which features aid entity resolution and which
do not has been a constant topic of debate. There have also been studies conducted
specifically targeting this. Thus, most of the earlier anaphora resolution (AR) and
coreference resolution (CR) algorithms were dependent on a set of hand-crafted rules.
The field of entity resolution underwent a shift during the late nineties from heuristic-
and rule-based approaches to learning-based approaches. Some of the early learning-
based and probabilistic approaches for AR used decision trees, genetic algorithms and
Bayesian rule. These approaches set the foundation for the learning-based approaches
for entity resolution which improved successively over time and, finally, outperformed
the rule-based algorithms.
Since its inception, the aim of entity resolution research has been to reduce the
dependency on hand-crafted features. With the introduction of deep learning in
NLP, words could be represented as vectors conveying semantic dependencies .
This gave an impetus to approaches which deployed deep learning for entity
resolution.
The first non-linear mention ranking model for CR aimed at learning different
feature representations for anaphoricity detection and antecedent ranking by pre-
training on these two individual subtasks. This approach addressed two major issues
in entity resolution: the first being the identification of non-anaphoric references
which abound in text, and the second was the complicated feature conjunction in
linear models which was necessary because of the inability of simpler features to
make a clear distinction between truly co-referent and non-coreferent mentions.
This model handled the above issues by introducing a new neural network model
which took only raw un-conjoined features as inputs and attempted to learn
intermediate representations.
5. Conclusion:
Entity resolution aims at resolving repeated references to an entity in a document and forms a
core component of natural language processing (NLP) research. This field possesses immense
potential to improve the performance of other NLP fields like machine translation, sentiment
analysis, paraphrase detection, summarization, etc. The area of entity resolution in NLP has seen
proliferation of research in two separate sub-areas namely: anaphora resolution and coreference
resolution.
6. Viva Questions:
• What are anaphora resolution (AR) and coreference resolution (CR) problems in
NLP?
• What are different types of References in Natural Language?
Eg. Zero Anaphora, One Anaphora, Demonstratives etc.
References:
TB1: Daniel Jurafsky, James H. Martin, "Speech and Language Processing", Second Edition, Prentice Hall, 2008.
RB1: Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval, Oxford University Press (2008).
Computational Lab-II
Experiment No. : 9
Perform Named Entity Recognition (NER) on
given text.
Experiment No.9
1. Aim: Perform Named Entity Recognition (NER) on given text.
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To get acquainted with the basic concepts and algorithmic description of the main
language levels: morphology, syntax, semantics, and pragmatics.
• To design and implement applications based on natural language processing
• To implement various language Models.
• To design systems that use NLP techniques
Outcomes: Students will understand Named Entity Recognition (NER), which refers to the identification of words in a sentence as entities.
4. Theory:
Named entity recognition (NER) is probably the first step towards information extraction that
seeks to locate and classify named entities in text into pre-defined categories such as the
names of persons, organizations, locations, expressions of times, quantities, monetary values,
percentages, etc.
spaCy's pre-trained English models, for example, recognize entity types such as: PERSON, NORP (nationalities, religious and political groups), FAC (buildings, airports, etc.), ORG (organizations), GPE (countries, cities, etc.), LOC (mountain ranges, water bodies, etc.), PRODUCT (products), EVENT (event names), WORK_OF_ART (books, song titles), LAW (legal document titles), LANGUAGE (named languages), DATE, TIME, PERCENT, MONEY, QUANTITY, ORDINAL and CARDINAL.
spaCy is easy to learn and use, and one can perform simple tasks using a few lines of code.
Installation:
pip install spacy
python -m spacy download en_core_web_sm
Example 1:
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
Example 2: Further, it is interesting to note that spaCy's NER model uses capitalization as one
of the cues to identify named entities. The same example, when tested with a slight modification,
produces a different result.
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output
U.K. 27 31 GPE
$1 billion 44 54 MONEY
5. Conclusion:
NER is used in many fields in Natural Language Processing (NLP), and it can help answer many real-world questions, such as: Which companies were mentioned in a news article? Were specified products mentioned in complaints or reviews? Does a tweet contain the name of a person and their location?
6. Viva Questions:
• What is Named Entity Recognition?
• What are the libraries used in NER?
• What is the difference between chunking and NER?
References:
RB1: Siddiqui and Tiwary U.S., Natural Language Processing and Information Retrieval,
Oxford University Press (2008).
https://spacy.io/
Computational Lab-II
Mini Project
2. Objectives:
• To understand natural language processing and to learn how to apply basic
algorithms in this field.
• To design and implement applications based on natural language processing
Outcomes: Be able to apply NLP techniques to design real-world NLP applications such as machine translation, text categorization, text summarization, information extraction, etc.
4. Theory:
1. Abstract
2. Introduction
3. Literature Survey – a survey of a minimum of 4 research papers on the selected application.
4. Implementation – implement the specified problem using any standard algorithm.
5. Result and Analysis
6. Conclusion
References
Note: - The mini project is a group activity; maximum 3 students in a group.
- The report must be submitted as a hard copy in the same Computational Lab-II course file.
- Include at minimum the points mentioned above.
- Also attach printouts of the presentation.