
Natural Language Processing (NLP) Unit-I


1. Natural Language Processing – Introduction

 Humans communicate through some form of language either by text or speech.


 To make interactions between computers and humans, computers need to understand
natural languages used by humans.
 Natural language processing is all about making computers learn, understand,
analyze, manipulate and interpret natural(human) languages.
 NLP stands for Natural Language Processing, which lies at the intersection of computer
science, linguistics, and artificial intelligence.
 Processing of natural language is required when you want an intelligent system such as a
robot to act on your instructions, or when you want to hear a decision from a
dialogue-based clinical expert system, etc.
 The ability of machines to interpret human language is now at the core of many
applications that we use every day - chatbots, Email classification and spam filters,
search engines, grammar checkers, voice assistants, and social language translators.
 The input and output of an NLP system can be Speech or Written Text.

2. Applications of NLP or Use cases of NLP

1. Sentiment analysis
 Sentiment analysis, also referred to as opinion mining, is an approach to natural
language processing (NLP) that identifies the emotional tone behind a body of text.
 This is a popular way for organizations to determine and categorize opinions about a
product, service or idea.
 Sentiment analysis systems help organizations gather insights into real-time customer
sentiment, customer experience and brand reputation.
 Generally, these tools use text analytics to analyze online sources such as emails, blog
posts, online reviews, news articles, survey responses, case studies, web chats, tweets,
forums and comments.
 Sentiment analysis uses machine learning models to perform text analysis of human
language. The metrics used are designed to detect whether the overall sentiment of a
piece of text is positive, negative or neutral.
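As a minimal sketch, sentiment scoring can be done with NLTK's VADER analyzer (VADER is an assumed tool choice here, consistent with the NLTK examples used later; the review text is made up):

import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
review = "The battery life is great, but the screen is disappointing."
scores = sia.polarity_scores(review)   # {'neg': ..., 'neu': ..., 'pos': ..., 'compound': ...}
print(scores)

# A common convention: compound >= 0.05 is positive, <= -0.05 is negative, else neutral.
if scores['compound'] >= 0.05:
    print("positive")
elif scores['compound'] <= -0.05:
    print("negative")
else:
    print("neutral")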
2. Machine Translation
 Machine translation, sometimes referred to by the abbreviation MT, is a sub-field
of computational linguistics that investigates the use of software to translate text or
speech from one language to another.
 On a basic level, MT performs mechanical substitution of words in one language for
words in another, but that alone rarely produces a good translation because
recognition of whole phrases and their closest counterparts in the target language is
needed.
 Not all words in one language have equivalent words in another language, and many
words have more than one meaning.

 Solving this problem with corpus statistical and neural techniques is a rapidly growing
field that is leading to better translations, handling differences in linguistic typology,
translation of idioms, and the isolation of anomalies.
 Corpus: A collection of written texts, especially the entire works of a particular
author.

3. Text Extraction
 There are a number of natural language processing techniques that can be
used to extract information from text or unstructured data.
 These techniques can be used to extract information such as entity names,
locations, quantities, and more.
 With the help of natural language processing, computers can make sense
of the vast amount of unstructured text data that is generated every day,
and humans can reap the benefits of having this information readily
available.
 Industries such as healthcare, finance, and e-commerce are already using
natural language processing techniques to extract information and
improve business processes.
 As the machine learning technology continues to develop, we will only
see more and more information extraction use cases covered.

4. Text Classification

 Unstructured text is everywhere, such as emails, chat conversations, websites, and


social media. Nevertheless, it’s hard to extract value from this data unless it’s
organized in a certain way.
 Text classification also known as text tagging or text categorization is the process of
categorizing text into organized groups. By using Natural Language
Processing (NLP), text classifiers can automatically analyze text and then assign a set
of pre-defined tags or categories based on its content.
 Text classification is becoming an increasingly important part of businesses as it
allows them to easily get insights from data and automate business processes.
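A minimal text classification sketch using scikit-learn (an assumed library choice; the notes do not prescribe one, and the tiny training set and the "billing"/"shipping" tags are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["I was charged twice for my order",
               "my invoice shows the wrong amount",
               "the parcel arrived two weeks late",
               "where is my package"]
train_labels = ["billing", "billing", "shipping", "shipping"]

# Bag-of-words features feeding a Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["my refund has not appeared on my card"]))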

5. Speech Recognition
 Speech recognition is an interdisciplinary subfield of computer
science and computational linguistics that develops methodologies and technologies
that enable the recognition and translation of spoken language into text by computers.
 It is also known as automatic speech recognition (ASR), computer speech
recognition or speech to text (STT).
 It incorporates knowledge and research in the computer
science, linguistics and computer engineering fields. The reverse process is speech
synthesis.

Speech recognition use cases


 A wide number of industries are utilizing different applications of speech technology
today, helping businesses and consumers save time and even lives. Some examples
include:
 Automotive: Speech recognizers improve driver safety by enabling voice-activated
navigation systems and search capabilities in car radios.
 Technology: Virtual agents are increasingly becoming integrated within our daily
lives, particularly on our mobile devices. We use voice commands to access them
through our smartphones, such as through Google Assistant or Apple’s Siri, for tasks,
such as voice search, or through our speakers, via Amazon’s Alexa or Microsoft’s
Cortana, to play music. They’ll only continue to integrate into the everyday products
that we use, fueling the “Internet of Things” movement.
 Healthcare: Doctors and nurses leverage dictation applications to capture and log
patient diagnoses and treatment notes.
 Sales: Speech recognition technology has a couple of applications in sales. It can help
a call center transcribe thousands of phone calls between customers and agents to
identify common call patterns and issues. AI chatbots can also talk to people via a
webpage, answering common queries and solving basic requests without needing to
wait for a contact center agent to be available. In both instances speech recognition
systems help reduce time to resolution for consumer issues.
6. Chatbot
 Chatbots are computer programs that conduct automatic conversations with people.
They are mainly used in customer service for information acquisition. As the name
implies, these are bots designed with the purpose of chatting and are also simply
referred to as “bots.”

 You’ll come across chatbots on business websites or messengers that give pre-scripted
replies to your questions. As the entire process is automated, bots can provide quick
assistance 24/7 without human intervention.

7. Email Filter
 One of the most fundamental and essential applications of NLP online is email
filtering. It began with spam filters, which identified specific words or phrases that
indicate a spam message. Like other early NLP applications, filtering has since
improved.
 Gmail's email categorization is one of the more common, newer implementations of
NLP. Based on the contents of emails, the algorithm determines whether they belong
in one of three categories (primary, social, or promotions).
 This keeps the inbox manageable for Gmail users, surfacing the critical, relevant
emails they want to see and reply to quickly.
8. Search Autocorrect and Autocomplete
 When you type 2-3 letters into Google to search for anything, it displays a list of
probable search keywords. Alternatively, if you search for anything with mistakes, it
corrects them for you while still returning relevant results. Isn't it incredible?

 Everyone uses Google search autocorrect and autocomplete on a regular basis but seldom
gives them any thought. They are a fantastic illustration of how natural language processing
touches millions of people across the world, including you and me.
 Both search autocomplete and autocorrect make it much easier to locate accurate
results.
3. Components of NLP
 There are two components of NLP: Natural Language Understanding (NLU) and
Natural Language Generation (NLG).
 Natural Language Understanding (NLU) involves transforming human
language into a machine-readable format. It helps the machine to understand
and analyze human language by extracting elements from large amounts of text, such as
keywords, emotions, relations, and semantics.
 Natural Language Generation (NLG) acts as a translator that converts
computerized data into a natural language representation.
 It mainly involves text planning, sentence planning, and text realization.
 NLU is harder than NLG.

4. Steps in NLP
There are five general steps:
 1. Lexical Analysis
 2. Syntactic Analysis (Parsing)
 3. Semantic Analysis
 4. Discourse Integration
 5. Pragmatic Analysis

Lexical Analysis:
 The first phase of NLP is the Lexical Analysis.
 This phase scans the source text as a stream of characters and converts it into
meaningful lexemes.
 It divides the whole text into paragraphs, sentences, and words.
 Lexeme: A lexeme is a basic unit of meaning. In linguistics, the abstract unit of
morphological analysis that corresponds to a set of forms taken by a single word is
called lexeme.
 The way in which a lexeme is used in a sentence is determined by its grammatical
category.

 Lexeme can be individual word or multiword.


 For example, the word talk is an example of an individual word lexeme,
which may have many grammatical variants like talks, talked, and talking.
 A multiword lexeme can be made up of more than one orthographic word. For
example, speak up, pull through, etc. are examples of multiword lexemes.

Syntax Analysis (Parsing)


 Syntactic analysis is used to check grammar and word arrangement, and
shows the relationship among the words.
 A sentence such as "The school goes to boy" is rejected by an English
syntactic analyzer.

Semantic Analysis
 Semantic analysis is concerned with the meaning representation.
 It mainly focuses on the literal meaning of words, phrases, and sentences.
 The semantic analyzer disregards sentences such as "hot ice-cream".
 Another Example is “Manhattan calls out to Dave” passes a syntactic analysis because it’s
a grammatically correct sentence. However, it fails a semantic analysis. Because
Manhattan is a place (and can’t literally call out to people), the sentence’s meaning doesn’t
make sense.

Discourse Integration
 Discourse Integration depends upon the sentences that precede it and also
invokes the meaning of the sentences that follow it.

 For instance, if one sentence reads, “Manhattan speaks to all its people,” and the
following sentence reads, “It calls out to Dave,” discourse integration checks the first
sentence for context to understand that “It” in the latter sentence refers to Manhattan.

Pragmatic Analysis
 During pragmatic analysis, what was said is re-interpreted in terms of what was actually meant.
 It involves deriving those aspects of language which require real-world knowledge.
 For instance, a pragmatic analysis can uncover the intended meaning of "Manhattan
speaks to all its people." Methods like neural networks assess the context to
understand that the sentence isn't literal, and most people won't interpret it as such. A
pragmatic analysis deduces that this sentence is a metaphor for how people
emotionally connect with a place.

5. Finding the structure of Words


Words and Their Components
 Words are defined in most languages as the smallest linguistic units that
can form a complete utterance by themselves.
 The minimal parts of words that deliver aspects of meaning to them are called
morphemes.

Tokens:
 Suppose, for a moment, that words in English are delimited only by
whitespace and punctuation (marks such as the full stop, comma, and
brackets).
 Example: Will you read the newspaper? Will you read it? I won't read it.
 If we confront our assumption with insights from syntax, we notice two
interesting words here: newspaper and won't.

Being a compound word, newspaper has an interesting derivational
structure.
In writing, newspaper and the associated concept are distinguished from
the isolated words news and paper.
For reasons of generality, linguists prefer to analyze won't as two
syntactic words, or tokens, each of which has its independent role and can
be reverted to its normalized form.
 The structure of won't could be parsed as will followed by not.
 In English, this kind of tokenization and normalization may apply to just
a limited set of cases, but in other languages, these phenomena have to be
treated in a different way.

Lexemes
 By the term word, we often denote not just the one linguistic form in the given
context but also the concept behind the form and the set of alternative forms that can
express it.
 Such sets are called lexemes or lexical items, and they constitute the lexicon of a
language.
 Lexemes can be divided by their behaviour into the lexical categories of verbs, nouns,
adjectives, conjunctions or other parts of speech.
 The citation form of a lexeme, by which it is commonly identified, is also called its
lemma.
 When we convert a word into its other forms, such as turning the singular mouse into
the plural mice or mouses, we say we inflect the lexeme.
 When we transform a lexeme into another one that is morphologically related,
regardless of its lexical category, we say we derive the lexeme: for instance, the nouns
receiver and reception are derived from the verb receive.
 Example: Did you see him? I didn't see him. I didn't see anyone.
This example presents the problem of the tokenization of didn't and the investigation of the
internal structure of anyone.
 The difficulty with the definition of what counts as a word need not pose a problem
for the syntactic description if we understand no one as two closely connected tokens
treated as one fixed element.

Morphemes
The components of words that realize morphemes are usually called segments or morphs.
Morphology
Morphology is the domain of linguistics that analyses the internal structure of words.
 Morphological analysis – exploring the structure of words
 Words are built up of minimal meaningful elements called morphemes:
played = play-ed
cats = cat-s
unfriendly = un-friend-ly

Two types of morphemes:
i Stems: play, cat, friend
ii Affixes: -ed, -s, un-, -ly
Two main types of affixes:
i Prefixes precede the stem: un-
ii Suffixes follow the stem: -ed, -s, -ly
Stemming = find the stem by stripping off affixes:
play = play
replayed = re-play-ed
computerized = comput-er-ize-d

Problems in morphological processing

Inflectional morphology: inflected forms are constructed from base forms and
inflectional affixes.
Inflection relates different forms of the same word:
Lemma    Singular    Plural
cat      cat         cats
mouse    mouse       mice
Derivational morphology: words are constructed from roots (or stems) and
derivational affixes:
inter + national = international
international + ize = internationalize
internationalize + ation = internationalization

 The simplest morphological process concatenates morphs one by one, as in
dis-agree-ment-s, where agree is a free lexical morpheme and the other elements are bound
grammatical morphemes contributing some partial meaning to the whole word.
 In a more complex scheme, morphs can interact with each other, and their forms may
become subject to additional phonological and orthographic changes denoted as
morphophonemic.
 The alternative forms of a morpheme are termed allomorphs.
 The ending -s, indicating plural in “cats,” “dogs,” the -es in “dishes,” and the -en of
“oxen” are all allomorphs of the plural morpheme.
Typology
 Morphological typology divides languages into groups by characterizing the prevalent
morphological phenomena in those languages.
 It can consider various criteria, and during the history of linguistics, different
classifications have been proposed.
 Let us outline the typology that is based on quantitative relations between words, their
morphemes, and their features:

 Isolating, or analytic, languages include no or relatively few words that would


comprise more than one morpheme (typical members are Chinese, Vietnamese, and
Thai; analytic tendencies are also found in English).
 Synthetic languages can combine more morphemes in one word and are further
divided into agglutinative and fusional languages.
 Agglutinative languages have morphemes associated with only a single function at a
time (as in Korean, Japanese, Finnish, and Tamil, etc.)
 Fusional languages are defined by their feature-per-morpheme ratio higher than one
(as in Arabic, Czech, Latin, Sanskrit, German, etc.).
 In accordance with the notions about word formation processes mentioned earlier, we
can also classify languages as concatenative or nonlinear:
 Concatenative languages link morphs and morphemes one after another.
 Nonlinear languages allow structural components to merge nonsequentially, applying
tonal morphemes or changing the consonantal or vocalic templates of words.

Morphological Typology
 Morphological typology is a way of classifying the languages of the world that groups
languages according to their common morphological structures.
 The field organizes languages on the basis of how those languages form words by
combining morphemes.
 The morphological typology classifies languages into two broad classes like synthetic
languages and analytical languages.
 The synthetic class is then further sub classified as either agglutinative languages or
fusional languages.
 Analytic languages contain very little inflection, instead relying on features like word
order and auxiliary words to convey meaning.
 Synthetic languages, ones that are not analytic, are divided into two categories:
agglutinative and fusional languages.
 Agglutinative languages rely primarily on discrete particles (prefixes, suffixes, and
infixes) for inflection, e.g. inter+national = international, international+ize =
internationalize.
 Fusional languages "fuse" inflectional categories together, often allowing one
word ending to contain several categories, such that the original root can be difficult
to extract (anybody, newspaper).

6. Natural Language Processing With Python's NLTK Package

• NLTK, or Natural Language Toolkit, is a Python package that you can use for NLP.
• A lot of the data that you could be analyzing is unstructured data and contains human-
readable text.
• Before you can analyze that data programmatically, you first need to preprocess it.
• Now we are going to see the kinds of text preprocessing tasks you can do with NLTK so
that you'll be ready to apply them in future projects.

1. Tokenizing
 By tokenizing, you can conveniently split up text by word or by sentence.
 This will allow you to work with smaller pieces of text that are still relatively
coherent and meaningful even outside of the context of the rest of the text.
 It’s your first step in turning unstructured data into structured data, which is easier
to analyze.
 When you’re analyzing text, you’ll be tokenizing by word and tokenizing by
sentence.

Tokenizing by word
• Words are like the atoms of natural language. They’re the smallest unit of meaning
that still makes sense on its own.
• Tokenizing your text by word allows you to identify words that come up particularly
often.
• For example, if you were analyzing a group of job ads, then you might find that the
word “Python” comes up often.
• That could suggest high demand for Python knowledge, but you’d need to look deeper
to know more.
Tokenizing by sentence
• When you tokenize by sentence, you can analyze how those words relate to one
another and see more context.
• Are there a lot of negative words around the word “Python” because the hiring
manager doesn’t like Python?
• Are there more terms from the domain of herpetology than the domain of software
development, suggesting that you may be dealing with an entirely different kind
of python than you were expecting?

Python Program for Tokenizing by Sentence


from nltk.tokenize import sent_tokenize, word_tokenize
example_string = """
Muad'Dib learned rapidly because his first training was in how to
learn. And the first lesson of all was the basic trust that he could
learn.It's shocking to find how many people do not believe
they can learn,and how many more believe learning to be
difficult."""
sent_tokenize(example_string)
Output
["\n Muad'Dib learned rapidly because his first training was in how to learn.",

'And the first lesson of all was the basic trust that he could learn.’,
"It's shocking to find how many people do not believe they can learn,\n and how
many more believe learning to be difficult."]
Note:
import nltk
nltk.download('punkt')

Python Program for Tokenizing by Word


from nltk.tokenize import sent_tokenize, word_tokenize
example_string = """
Muad'Dib learned rapidly because his first training was in how to learn. And
the first lesson of all was the basic trust that he could learn.It's
shocking to find how many people do not believe they can learn,and
how many more believe learning to be difficult."""
word_tokenize(example_string)
Output:
["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training', 'was', 'in', 'how', 'to',
'learn', '.', 'And', 'the', 'first', 'lesson', 'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he',
'could', 'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many', 'people', 'do', 'not',
'believe', 'they', 'can', 'learn', ',', 'and', 'how', 'many', 'more', 'believe', 'learning', 'to', 'be',
'difficult', '.']

2. Filtering Stop Words


 Stop words are words that you want to ignore, so you filter them out of your text
when you’re processing it. Very common words like 'in', 'is', and 'an' are often
used as stop words since they don’t add a lot of meaning to a text in and of
themselves.
 Note: nltk.download("stopwords")

Python program to eliminate stopwords


from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
worf_quote = "Sir, I protest. I am not a merry man!"
words_in_quote = word_tokenize(worf_quote)
print(words_in_quote)
stop_words = set(stopwords.words("english"))
filtered_list = []

for word in words_in_quote:
    # keep only the words that are not in the stop word list (case-insensitive check)
    if word.casefold() not in stop_words:
        filtered_list.append(word)
print(filtered_list)
Output:
• ['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!’]
• ['Sir', ',', 'protest', '.', 'merry', 'man', '!’]
• 'I' is a pronoun; it is a context word rather than a content word.
• Content words give you information about the topics covered in the text or the
sentiment that the author has about those topics.
• Context words give you information about writing style. You can observe patterns in
how authors use context words in order to quantify their writing style.
• Once you’ve quantified their writing style, you can analyze a text written by an
unknown author to see how closely it follows a particular writing style so you can try
to identify who the author is.

3. Stemming
 Stemming is a text processing task in which you reduce words to their root, which
is the core part of a word.
 For example, the words “helping” and “helper” share the root “help.”
 Stemming allows you to zero in on the basic meaning of a word rather than all the
details of how it’s being used.
 NLTK has more than one stemmer, but we’ll be using the Porter stemmer.
Python program for Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
string_for_stemming = "The crew of the USS Discovery discovered many
discoveries. Discovering is what explorers do."
words = word_tokenize(string_for_stemming)
print(words)
stemmed_words = [stemmer.stem(word) for word in words]
print(stemmed_words)
Output

• ['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many', 'discoveries', '.',
'Discovering', 'is', 'what', 'explorers', 'do', '.’]
• ['the', 'crew', 'of', 'the', 'uss', 'discoveri', 'discov', 'mani', 'discoveri', '.', 'discov', 'is',
'what', 'explor', 'do', '.’]

Original word Stemmed version

'Discovery' 'discoveri'

'discovered' 'discov'

'discoveries' 'discoveri'

'Discovering' 'discov'

4. Tagging Parts of Speech


Part of speech is a grammatical term that deals with the roles words play when
you use them together in sentences. Tagging parts of speech, or POS tagging, is
the task of labeling the words in your text according to their part of speech.

Part of speech: Role (Examples)

Noun: a person, place, or thing (mountain, bagel, Poland)
Pronoun: replaces a noun (you, she, we)
Adjective: gives information about what a noun is like (efficient, windy, colorful)
Verb: an action or a state of being (learn, is, go)
Adverb: gives information about a verb, an adjective, or another adverb (efficiently, always, very)
Preposition: gives information about how a noun or pronoun is connected to another word (from, about, at)
Conjunction: connects two other words or phrases (so, because, and)
Interjection: an exclamation (yay, ow, wow)

• Some sources also include the category articles (like “a” or “the”) in the list of parts
of speech, but other sources consider them to be adjectives. NLTK uses the
word determiner to refer to articles.

Python program for Tagging Parts of Speech


import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tokenize import word_tokenize
sagan_quote = """
If you wish to make an apple pie from scratch,
you must first invent the universe."""
words_in_sagan_quote = word_tokenize(sagan_quote)
nltk.pos_tag(words_in_sagan_quote)
Output:
• [('If', 'IN'), ('you', 'PRP'), ('wish', 'VBP'), ('to', 'TO'), ('make', 'VB'), ('an', 'DT'), ('apple',
'NN'), ('pie', 'NN'), ('from', 'IN'), ('scratch', 'NN'), (',', ','), ('you', 'PRP'), ('must', 'MD'),
('first', 'VB'), ('invent', 'VB'), ('the', 'DT'), ('universe', 'NN'), ('.', '.')]
POS Tag information
• nltk uses The Penn Treebank's POS tags
nltk.download('tagsets')
nltk.help.upenn_tagset()

5. Lemmatizing
• Like stemming, lemmatizing reduces words to their core meaning, but it will give
you a complete English word that makes sense on its own instead of just a fragment of
a word like 'discoveri'.
• A lemma is a word that represents a whole group of words, and that group of words is
called a lexeme.
• For example, if you were to look up the word “blending” in a dictionary, then you’d
need to look at the entry for “blend,” but you would find “blending” listed in that
entry.
• In this example, “blend” is the lemma, and “blending” is part of the lexeme. So when
you lemmatize a word, you are reducing it to its lemma.

Python Program for Lemmatization


import nltk
nltk.download('punkt')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
lemmatizer = WordNetLemmatizer()
string_for_lemmatizing = "The friends of DeSoto love scarves."
words = word_tokenize(string_for_lemmatizing)
lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
print(lemmatized_words)
Output:
['The', 'friend', 'of', 'DeSoto', 'love', 'scarf', '.']
The lemmatizer treats words as nouns by default; pass pos="a" to lemmatize a word as an adjective:
 lemmatizer.lemmatize("worst")            o/p: 'worst'
 lemmatizer.lemmatize("worst", pos="a")   o/p: 'bad'

6. Chunking
 chunking allows you to identify phrases.
 A phrase is a word or group of words that works as a single unit to perform a
grammatical function. Noun phrases are built around a noun.
 Here are some examples:
 “A planet”
 “A tilting planet”
 “A swiftly tilting planet”
 Chunking makes use of POS tags to group words and apply chunk tags to those
groups. Chunks don’t overlap, so one instance of a word can be in only one chunk
at a time.

 After getting a list of tuples of all the words in the quote along with their POS
tags, you need to define a chunk grammar in order to chunk.

 Note: A chunk grammar is a combination of rules on how sentences should be


chunked. It often uses regular expressions, or regexes.
 Create a chunk grammar with one regular expression rule:
 grammar = "NP: {<DT>?<JJ>*<NN>}"
 Create a chunk parser with this grammar:

Python program for chunking

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)
nltk.download("averaged_perceptron_tagger")
tags = nltk.pos_tag(words_quote)
print(tags)
#Regular expression for Noun Phrase
grammar = "NP: {<DT>?<JJ>*<NN>}"
#Create a chunk parser with this grammar:
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
print(tree)

Output:
• ['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']
• [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','),
('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
('.', '.')]
• (S
• It/PRP
• 's/VBZ
• (NP a/DT dangerous/JJ business/NN)
• ,/, Frodo/NNP
• ,/, going/VBG
• out/RP
• your/PRP$
• (NP door/NN)
• ./.)

Tree Representation

7. Chinking
• Chinking is used together with chunking, but while chunking is used to include a
pattern, chinking is used to exclude a pattern.
Python program to perform chinking
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)
nltk.download("averaged_perceptron_tagger")
tags = nltk.pos_tag(words_quote)
print(tags)
#Regular expression: chunk everything, then chink (exclude) adjectives
grammar = """
Chunk: {<.*>+}
}<JJ>{"""
chunk_parser = nltk.RegexpParser(grammar)
tree = chunk_parser.parse(tags)
print(tree)
Output:
• ['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']
• [('It', 'PRP'), ("'s", 'VBZ'), ('a', 'DT'), ('dangerous', 'JJ'), ('business', 'NN'), (',', ','),
('Frodo', 'NNP'), (',', ','), ('going', 'VBG'), ('out', 'RP'), ('your', 'PRP$'), ('door', 'NN'),
('.', '.')]

• (S
• (Chunk It/PRP 's/VBZ a/DT)
• dangerous/JJ
• (Chunk business/NN ,/, Frodo/NNP ,/, going/VBG out/RP your/PRP$ door/NN ./.))
Tree Representation

8. Using Named Entity Recognition (NER)

 Named entities are noun phrases that refer to specific kinds of things, such as people, organizations, locations, dates, and monetary amounts.
 NER locates these entities in a text and labels them with their type. Some examples of the entity types NLTK can assign are PERSON, ORGANIZATION, GPE (geo-political entity), LOCATION, DATE, TIME, MONEY, PERCENT, and FACILITY.

Python Program for Named Entity Recognition


import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
quote = "It's a dangerous business, Frodo, going out your door."
words_quote = word_tokenize(quote)
print(words_quote)
nltk.download("averaged_perceptron_tagger")
tags = nltk.pos_tag(words_quote)
nltk.download("maxent_ne_chunker")
nltk.download("words")
tree = nltk.ne_chunk(tags)
print(tree)

Output
['It', "'s", 'a', 'dangerous', 'business', ',', 'Frodo', ',', 'going', 'out', 'your', 'door', '.']

(S
It/PRP
's/VBZ
a/DT
dangerous/JJ
business/NN
,/,
(PERSON Frodo/NNP)
,/,
going/VBG
out/RP
your/PRP$
door/NN
./.)

Note: If we pass binary=True, the tree simply marks each named entity as NE without
giving its specific type.

• tree = nltk.ne_chunk(tags, binary=True)
• print(tree)
Output: the same tree as above, but with (NE Frodo/NNP) in place of (PERSON Frodo/NNP).
9. Term Frequency - Inverse Document Frequency (TF-IDF)


• Term Frequency - Inverse Document Frequency (TF-IDF) is a widely used
statistical method in natural language processing and information retrieval.
• It measures how important a term is within a document relative to a collection of
documents (i.e., relative to a corpus).
• Words within a text document are transformed into important numbers by a text
vectorization process.
• There are many different text vectorization scoring schemes, with TF-IDF being
one of the most common.
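A minimal sketch using scikit-learn's TfidfVectorizer (an assumed library choice; the corpus below is made up). Conceptually, each term t in document d gets a weight tf(t, d) multiplied by idf(t), where idf(t) grows as the term appears in fewer documents; scikit-learn uses a smoothed variant of this formula.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are popular pets"]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)      # one row per document, one column per term
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))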

Natural Language Processing


Unit-II

2.1 Parsing Natural Language
2.2 Treebanks: A Data-Driven Approach to Syntax
2.3 Representation of Syntactic Structure
2.4 Parsing Algorithms
2.5 Models for Ambiguity Resolution in Parsing

Parsing in NLP is the process of determining the syntactic structure of a text by analysing
its constituent words based on an underlying grammar.
Example grammar: a small set of rewrite rules such as sentence -> noun_phrase verb_phrase,
together with rules that expand noun_phrase and verb_phrase down to words like 'Tom', 'ate', 'an', 'apple'.
The outcome of the parsing process is then a parse tree, where sentence is the root;
intermediate nodes such as noun_phrase, verb_phrase, etc. have children and are hence
called non-terminals; and the leaves of the tree, 'Tom', 'ate', 'an', 'apple', are called terminals.
Parse tree: the sketch below shows one way to write such a grammar and print the resulting tree.
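A minimal sketch with NLTK (the exact rules are a plausible reconstruction of the example grammar named above, not the original figure):

import nltk

# Toy grammar for "Tom ate an apple"; unquoted symbols are non-terminals,
# quoted symbols are terminals.
grammar = nltk.CFG.fromstring("""
sentence -> noun_phrase verb_phrase
verb_phrase -> verb noun_phrase
noun_phrase -> 'Tom' | article noun
article -> 'an'
noun -> 'apple'
verb -> 'ate'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse(['Tom', 'ate', 'an', 'apple']):
    print(tree)
# (sentence (noun_phrase Tom)
#   (verb_phrase (verb ate) (noun_phrase (article an) (noun apple))))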

• A sentence is parsed by relating each word to other words in the sentence which depend
on it.
• The syntactic parsing of a sentence consists of finding the correct syntactic structure of
that sentence in the given formalism/grammar.
• Dependency grammar (DG) and phrase structure grammar (PSG) are two such
formalisms.
• PSG breaks a sentence into constituents (phrases), which are then broken into smaller
constituents.
• It describes phrase and clause structure, for example NP, PP, VP, etc.
• DG: the syntactic structure consists of lexical items, linked by binary asymmetric relations
called dependencies.
• It is interested in the grammatical relations between individual words.
• It does not propose a recursive structure but rather a network of relations.
• These relations can also have labels.

• A treebank can be defined as a linguistically annotated corpus that includes some kind
of syntactic analysis over and above part-of-speech tagging.
Constituency tree vs Dependency tree
• Dependency structures explicitly represent
- Head-dependent relations (directed arcs)
- Functional categories (arc labels)
- Possibly some structural categories (POS)
• Phrase structure trees explicitly represent
- Phrases (non-terminal nodes)
- Structural categories (non-terminal labels)
- Possibly some functional categories (grammatical functions)

Data-driven dependency parsing involves defining candidate dependency trees for an input sentence:

➢ Learning: scoring possible dependency graphs for a given sentence, usually by
factoring the graphs into their component arcs.
➢ Parsing: searching for the highest scoring graph for a given sentence.

Syntax:
• In NLP, the syntactic analysis of natural language input can vary from being very low-
level, such as simply tagging each word in the sentence with a part of speech (POS), to
very high level, such as full parsing.
• In syntactic parsing, ambiguity is a particularly difficult problem because the most
plausible analysis has to be chosen from an exponentially large number of alternative
analyses.
• From tagging to full parsing, algorithms that can handle such ambiguity have to be
carefully chosen.
• Here we explore the syntactic analysis methods from tagging to full parsing and the use
of supervised machine learning to deal with ambiguity.
2.1 Parsing Natural Language
• In a text-to-speech application, input sentences are to be converted to a spoken output
that should sound like it was spoken by a native speaker of the language.
• Example: He wanted to go for a drive in the country.
• There is a natural pause between the words drive and in in this sentence that reflects an
underlying hidden structure of the sentence.
• Parsing can provide a structural description that identifies such a break in the intonation.
• A simpler case: The cat who lives dangerously had nine lives.
• In this case, a text-to-speech system needs to know that the first instance of the word
lives is a verb and the second instance is a noun before it can begin to produce the
natural intonation for this sentence.
• This is an instance of the part-of-speech (POS) tagging problem where each word in
the sentence is assigned a most likely part of speech.
• Another motivation for parsing comes from the natural language task of summarization,
in which several documents about the same topic should be condensed down to a small
digest of information.
• Such a summary may be in response to a question that is answered in the set of
documents.

• In this case, a useful subtask is to compress an individual sentence so that only the
relevant portions of the sentence are included in the summary.
• For example: Beyond the basic level, the operations of the three products vary widely
can be compressed to The operations of the products vary.
• The elegant way to approach this task is to first parse the sentence to find the various
constituents, recursively partitioning the words in the sentence into individual
phrases such as a verb phrase or a noun phrase.

2.2 Treebanks: A Data-Driven Approach to Syntax


➢ Parsing recovers information that is not explicit in the input sentence.
➢ This implies that a parser requires some knowledge (syntactic rules) in addition to the
input sentence about the kind of syntactic analysis that should be produced as output.
➢ One method to provide such knowledge to the parser is to write down a grammar of the
language – a set of rules of syntactic analysis as a CFGs.
➢ Natural language is far too complex for us to simply list all the syntactic rules in terms
of a CFG.
➢ There is also a second knowledge acquisition problem: not only do we need to know the syntactic
rules for a particular language, but we also need to know which analysis is the most
plausible for a given input sentence.
➢ The construction of treebank is a data driven approach to syntax analysis that allows us
to address both of these knowledge acquisition bottlenecks in one stroke.
➢ A treebank is simply a collection of sentences (also called a corpus of text), where each
sentence is provided a complete syntax analysis.
➢ The syntactic analysis for each sentence has been judged by a human expert as the most
plausible analysis for that sentence.
➢ A lot of care is taken during the human annotation process to ensure that a consistent
treatment is provided across the treebank for related grammatical phenomena.
➢ There is no set of syntactic rules or linguistic grammar explicitly provided by a
treebank, and typically there is no list of syntactic constructions provided explicitly in
a treebank.
➢ A detailed set of assumptions about the syntax is typically used as an annotation
guideline to help the human experts produce the single-most plausible syntactic
analysis for each sentence in the corpus.
➢ Treebanks provide a solution to the two kinds of knowledge acquisition bottlenecks.
➢ Treebanks solve the first knowledge acquisition problem of finding the grammar
underlying the syntax analysis because the syntactic analysis is directly given instead
of a grammar.
➢ In fact, the parser does not necessarily need any explicit grammar rules as long as it can
faithfully produce a syntax analysis for an input sentence.
➢ Treebank solve the second knowledge acquisition problem as well.
➢ Because each sentence in a treebank has been given its most plausible
syntactic analysis, supervised machine learning methods can be used to learn a scoring
function over all possible syntax analyses.
➢ Two main approaches to syntax analysis are used to construct treebanks: dependency
graph and phrase structure trees.
➢ These two representations are very closely related to each other and under some
assumptions, one representation can be converted to another.

➢ Dependency analysis is typically favoured for languages such as Czech and Turkish
that have free word order.
➢ Phrase structure analysis is often used to provide additional information about long-
distance dependencies, mostly for languages like English and French.
➢ NLP is the capability of computer software to understand natural language.
➢ There is a wide variety of languages in the world.
➢ Each language has its own structure, called its grammar: a certain set of rules
that determines what is allowed and what is not allowed.
➢ English follows subject-verb-object (SVO) order, as in "I eat mango"; other languages
follow orders such as SOV or OSV.
➢ Grammar is defined as the rules for forming well-structured sentences.
➢ Different Types of Grammar in NLP:
1. Context-Free Grammar (CFG)
2. Constituency Grammar (CG) or Phrase Structure Grammar
3. Dependency Grammar (DG)

Context-Free Grammar (CFG)


• Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P), where:
• N or VN = set of non-terminal symbols, or variables.
• T or ∑ = set of terminal symbols.
• S = start symbol, where S ∈ N.
• P = set of production rules for terminals as well as non-terminals.
• Each production has the form α → β, where α and β are strings over VN ∪ ∑ and α contains at least one non-terminal symbol.

2.3 Representation of Syntactic Structure


2.3.1 Syntax Analysis Using Dependency Graphs
➢ The main philosophy behind dependency graphs is to connect a word- the head of a
phrase- with the dependents in that phrase.
➢ The notation connects a head with its dependents using directed (asymmetric)
connections.
➢ Dependency graphs, just like phrase structure trees, are a representation that is
consistent with many different linguistic frameworks.
➢ The words in the input sentence are treated as the only vertices in the graph, which are
linked together by directed arcs representing syntactic dependencies.

➢ In dependency-based syntactic parsing, the task is to derive a syntactic structure for an


input sentence by identifying the syntactic head of each word in the sentence.
➢ This defines a dependency graph, where the nodes are the words of the input sentence
and arcs are the binary relations from head to dependent.

• In a dependency tree analysis, each word depends on exactly one parent,
either another word or a dummy root symbol.
• By convention, in a dependency tree the index 0 is used to indicate the root symbol,
and the directed arcs are drawn from the head word to the dependent word.
• The figure shows a dependency tree for a Czech sentence taken from the Prague
Dependency Treebank.

▪ Each node in the graph is labelled with a word, its part of speech, and the position of the
word in the sentence. For example, [fakulte, N3, 7] is the seventh word in the sentence,
with POS tag N3.
▪ The node [#, ZSB, 0] is the root node of the dependency tree.

▪ There are many variations of dependency syntactic analysis, but the basic textual format
for a dependency tree can be written in the following form, where each dependent word
specifies its head word in the sentence, and exactly one word is dependent on the root of
the sentence.

• An important notion in dependency analysis is that of projectivity, which is a
constraint imposed by the linear order of words on the dependencies between words.
• A projective dependency tree is one where, if we put the words in a linear order based
on the sentence with the root symbol in the first position, the dependency arcs can be
drawn above the words without any crossing dependencies.
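A small illustrative sketch with NLTK's dependency-grammar machinery; the head-dependent rules and the sentence below are assumptions for illustration, not the Czech treebank example from the text:

import nltk

# Each rule lists a head word and the words that may depend on it.
dep_grammar = nltk.DependencyGrammar.fromstring("""
'ate' -> 'Tom' | 'apple'
'apple' -> 'an'
""")

parser = nltk.ProjectiveDependencyParser(dep_grammar)
for tree in parser.parse(['Tom', 'ate', 'an', 'apple']):
    print(tree)   # expected output along the lines of (ate Tom (apple an))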

2.3.2 Syntax Analysis Using Phrase Structures Trees


▪ A phrase structure syntax analysis of a sentence derives from the traditional sentence
diagrams that partition a sentence into constituents; larger constituents are formed
by merging smaller ones.
▪ Phrase structure analysis also typically incorporate ideas from generative grammar
(from linguistics) to deal with displaced constituents or apparent long-distance
relationships between heads and constituents.
▪ A phrase structure tree can be viewed as implicitly having a predicate-argument
structure associated with it.
▪ Sentence includes a subject and a predicate. The subject is a noun phrase (NP) and the
predicate is a verb phrase.
▪ For example, the phrase structure analysis: Mr. Baker seems especially sensitive, taken
from the Penn Treebank.
▪ The subject of the sentence is marked with the SBJ marker and predicate of the sentence
is marked with the PRD marker.

• Tag glossary: NNP = proper noun, singular; VBZ = verb, third person singular present;
ADJP = adjective phrase; RB = adverb; JJ = adjective.
• The same sentence gets the following dependency tree analysis: some of the
information from the bracketing labels from the phrase structure analysis gets mapped
onto the labelled arcs of the dependency analysis.

• To explain some details of phrase structure analysis, we use the Penn Treebank, a project
that annotated 40,000 sentences from the Wall Street Journal with phrase structure trees.

➢ The SBARQ label marks wh-questions, i.e. those that contain a gap and therefore
require a trace.
➢ Wh-moved noun phrases are labelled WHNP and put inside SBARQ. They bear an
identity index that matches the reference index on the *T* in the position of the gap.
➢ However, questions that are missing both subject and auxiliary are labelled SQ.
➢ NP-SBJ marks noun phrases that are subjects.
➢ *T* marks traces for wh-movement; the empty trace has an index (here it is 1) and is
associated with the WHNP constituent bearing the same index.

Parsing Algorithms
• Given an input sentence, a parser produces an output analysis of that sentence.
• Treebank parsers do not need an explicit grammar, but to make the discussion of parsing
algorithms simpler, we use a CFG.
• Consider a simple CFG G that can be used to derive strings such as a and b or c from the
start symbol N (a reconstruction of this grammar appears in the sketch after the shift-reduce description below).

• An important concept for parsing is a derivation.


• For the input string a and b or c, the following sequence of steps, each separated by the
derivation symbol, is called a derivation.

➢ In this derivation, each line is called a sentential form.


➢ In the above derivation, we restricted ourselves to expanding only the
rightmost nonterminal in each sentential form.
➢ This method is called the rightmost derivation of the input using a CFG.
➢ This derivation sequence exactly corresponds to the construction of the
following parse tree from left to right, one symbol at a time.

▪ However, a unique derivation sequence is not guaranteed.


▪ There can be many different derivations.
▪ For example, one more rightmost derivation that results following parse tree.

Shift Reduce Parsing


➢ To build a parser, we need an algorithm that can perform the steps in the above
rightmost derivation for any grammar and for any input string.
➢ Every CFG turns out to have an automaton that is equivalent to it, called a
pushdown automaton (just as a regular expression can be converted to a finite-state
automaton).
➢ Shift-reduce parsing is an algorithm that works for any given CFG and input string.
➢ It uses two data structures: a buffer for input symbols and a stack for storing CFG
symbols.
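A sketch of shift-reduce parsing with NLTK's ShiftReduceParser; the grammar below is a plausible reconstruction of the CFG G described above (deriving strings such as a and b or c from the start symbol N), not the original figure:

import nltk

grammar = nltk.CFG.fromstring("""
N -> N 'and' N
N -> N 'or' N
N -> 'a' | 'b' | 'c'
""")

# trace=2 prints each shift and reduce step as words are consumed from the
# buffer and symbols are pushed onto or popped off the stack.
sr_parser = nltk.ShiftReduceParser(grammar, trace=2)
for tree in sr_parser.parse("a and b or c".split()):
    print(tree)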

Hypergraphs and Chart Parsing (CYK Parsing)


➢ With CFGs, in the worst case such a parser might have to resort to backtracking, which means
re-parsing the input and leads to a running time that is exponential in the grammar size in the
worst case.
➢ Variants of the CYK algorithm are often used in statistical parsers that attempt to
search the space of possible parse trees without the limitation of purely left-to-right
parsing.
➢ One of the earliest recognition parsing algorithms is the CYK (Cocke, Kasami and Younger)
parsing algorithm, and it works only with grammars in CNF (Chomsky normal form).
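A minimal CYK recognizer sketch; the CNF grammar at the bottom is a hypothetical example, not one taken from the notes:

# CYK recognition for a grammar in Chomsky normal form.
def cyk_recognize(words, binary_rules, lexical_rules, start="S"):
    n = len(words)
    # table[i][j] holds the non-terminals that can derive words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        for lhs, terminal in lexical_rules:
            if terminal == w:
                table[i][i + 1].add(lhs)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):                 # split point
                for lhs, (b, c) in binary_rules:
                    if b in table[i][k] and c in table[k][j]:
                        table[i][j].add(lhs)
    return start in table[0][n]

# Hypothetical CNF grammar: S -> NP VP, VP -> V NP, NP -> DT N, plus lexical rules.
binary_rules = [("S", ("NP", "VP")), ("VP", ("V", "NP")), ("NP", ("DT", "N"))]
lexical_rules = [("NP", "Tom"), ("V", "ate"), ("DT", "an"), ("N", "apple")]
print(cyk_recognize("Tom ate an apple".split(), binary_rules, lexical_rules))   # True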

Models for Ambiguity Resolution in Parsing

Here we discuss the modelling aspects of parsing: how to design features and ways to resolve
ambiguity in parsing.
Probabilistic context-free grammar
• Ex: John bought a shirt with pockets

• Here we want to provide a model that matches the intuition that the second tree above
is preferred over the first.
• The parses can be thought of as ambiguous (leftmost to rightmost) derivation of the
following CFG:

• We assign scores or probabilities to the rules in the CFG in order to provide a score or
probability for each derivation.

➢ Given these rule probabilities, the only deciding factor for choosing between the two
parses for John bought a shirt with pockets is the pair of rules NP -> NP PP and VP -> VP
PP. The probability for NP -> NP PP is set higher in the preceding PCFG.
➢ The rule probabilities can be derived from a treebank. Consider a treebank with three
trees t1, t2, t3.

• If we assume that tree t1 occurred 10 times in the treebank, t2 occurred 20 times and
t3 occurred 50 times, then the PCFG we obtain from this treebank is:

• For the input a a a there are two parses using the above PCFG, with probabilities
P1 = 0.125 × 0.334 × 0.285 = 0.0119 and P2 = 0.25 × 0.667 × 0.714 = 0.119.
• The parse tree p2 is therefore the most likely tree for that input.
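A sketch with NLTK's PCFG and Viterbi parser; the rule probabilities below are illustrative assumptions, chosen so that attaching "with pockets" to the noun phrase wins, in the spirit of the example above:

import nltk

pcfg = nltk.PCFG.fromstring("""
S -> NP VP        [1.0]
VP -> V NP        [0.7]
VP -> VP PP       [0.3]
NP -> NP PP       [0.4]
NP -> 'John'      [0.2]
NP -> 'a' N       [0.4]
PP -> 'with' N    [1.0]
N -> 'shirt'      [0.6]
N -> 'pockets'    [0.4]
V -> 'bought'     [1.0]
""")

# ViterbiParser returns the single highest-probability parse tree.
parser = nltk.ViterbiParser(pcfg)
for tree in parser.parse("John bought a shirt with pockets".split()):
    print(tree)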

Generative models
• To find the most plausible parse tree, the parser has to choose between the possible
derivations each of which can be represented as a sequence of decisions.
• Let each derivation D = d1, d2, ..., dn be the sequence of decisions used to
build the parse tree.
• Then for input sentence x, the output parse tree y is defined by the sequence of
steps in the derivation.
• The probability for each derivation:

• The conditioning context in the probability P(di | d1, ..., di-1) is called the history
and corresponds to a partially built parse tree (as defined by the derivation
sequence).
• We make a simplifying assumption that keeps the conditioning context to a finite
set by grouping the histories into equivalence classes using a function Φ, so that each
probability becomes P(di | Φ(d1, ..., di-1)).

Discriminative models for Parsing


• Collins created a simple notation and framework that describes various discriminative
approaches to learning for parsing.

• This framework is called global linear model.


• Let X be a set of inputs and Y be a set of possible outputs, which can be sequences of
POS tags, parse trees, or dependency analyses.
• Each x ∈ X and y ∈ Y is mapped to a d-dimensional feature vector φ(x, y), with each
dimension being a real number.
• A weight parameter vector w ∈ R^d assigns a weight to each feature in φ(x, y),
representing the importance of that feature.

• The value of φ(x, y) · w is the score of (x, y). The higher the score, the more likely it is
that y is the correct output for x.
• The function GEN(x) generates the set of possible outputs y for a given x.
• Having φ(x, y) · w and GEN(x) specified, we would like to choose the highest scoring
candidate y* from GEN(x) as the most likely output:
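F(x) = \arg\max_{y \in GEN(x)} \phi(x, y) \cdot w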

where F(x) returns the highest scoring output 𝑦∗ from GEN(x)


• A conditional random field (CRF) defines the conditional probability as a linear score
for each candidate y and a global normalization term:
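P(y \mid x; w) = \frac{\exp\big(\phi(x, y) \cdot w\big)}{\sum_{y' \in GEN(x)} \exp\big(\phi(x, y') \cdot w\big)}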

• A simple linear model that ignores the normalization term is:
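score(x, y) = \phi(x, y) \cdot w, \qquad F(x) = \arg\max_{y \in GEN(x)} \phi(x, y) \cdot w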



Unit-3: N-gram Language Models (Part-I)

Uses of Language Modelling:


1. Predicting is difficult, especially about the future. What word, for example, is likely to
follow
Please turn your homework ...
Hopefully, most of you concluded that a very likely word is in, or possibly over, but probably
not refrigerator or the.

We will formalize this intuition by introducing models that assign a probability to each possible
next word. The same models will also serve to assign a probability to an entire sentence. Such
a model, for example, could predict that the following sequence has a much higher probability
of appearing in a text:

all of a sudden I notice three guys standing on the sidewalk


than does this same set of words in a different order:
on guys all I of notice sidewalk three a sudden standing the

2. Why would you want to predict upcoming words, or assign probabilities to sentences?
Probabilities are essential in any task in which we have to identify words in noisy, ambiguous
input, like speech recognition. For a speech recognizer to realize that you said

I will be back soonish and not I will be bassoon dish, it helps to know
that back soonish is a much more probable sequence than bassoon dish.

3. For writing tools like spelling correction or grammatical error correction, we need to
find and correct errors in writing like Their are two midterms, in which There was mistyped
as Their, or Everything has improve, in which improve should have been improved. The
phrase There are will be much more probable than Their are, and has improved than has
improve, allowing us to help users by detecting and correcting these errors.

4. Assigning probabilities to sequences of words is also essential in machine translation.


Suppose we are translating a Chinese source sentence:

他 向 记者 介绍了 主要 内容

He to reporters introduced main content



As part of the process we might have built the following set of potential rough
English translations:
he introduced reporters to the main contents of the statement
he briefed to reporters the main contents of the statement
he briefed reporters on the main contents of the statement

5. Probabilities are also important for augmentative and alternative communication AAC
systems. People often use such AAC devices if they are physically unable to speak or sign but
can instead use eye gaze or other specific movements to select words from a menu to be
spoken by the system. Word prediction can be used to suggest likely words for the menu.

Language Models: Models that assign probabilities to sequences of words are called language
models or LMs. The simplest models that assign probabilities to sentences and sequences of
words are n-gram models. An n-gram is a sequence of n words: a 2-gram (which we'll call a bigram)
is a two-word sequence of words like “please turn”, “turn your”, or ”your homework”, and a
3-gram (a trigram) is a three-word sequence of words like “please turn your”, or “turn your
homework”.

We’ll see how to use n-gram models to estimate the probability of the last word of an n-gram
given the previous words, and also to assign probabilities to entire sequences. The n-gram
models are much simpler than state-of-the-art neural language models based on RNNs and
transformers.

N-Grams
P(w|h), the probability of a word w given some history h. Suppose the history h is “its water
is so transparent that” and we want to know the probability that the next word is the:
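P(\text{the} \mid \text{its water is so transparent that})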

One way to estimate this probability is from relative frequency counts: take a very large corpus,
count the number of times we see its water is so transparent that, and count the number of times
this is followed by the. This would be answering the question “Out of the times we saw the
history h, how many times was it followed by the word w”, as follows:
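P(\text{the} \mid \text{its water is so transparent that}) = \frac{C(\text{its water is so transparent that the})}{C(\text{its water is so transparent that})}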

With a large enough corpus, such as the web, we can compute these counts and estimate the
probability. While this method of estimating probabilities directly from counts works fine in

many cases, it turns out that even the web isn’t big enough to give us good estimates in most
cases. This is because language is creative; new sentences are created all the time, and we won’t
always be able to count entire sentences. Even simple extensions of the example sentence may
have counts of zero on the web (such as “Walden Pond’s water is so transparent that the”; well,
used to have counts of zero). Similarly, if we wanted to know the joint probability of an entire
sequence of words like its water is so transparent, we could do it by asking “out of all possible
sequences of five words, how many of them are its water is so transparent?” We would have
to get the count of its water is so transparent and divide by the sum of the counts of all possible
five word sequences. That seems rather a lot to estimate!

For this reason, we’ll need to introduce more clever ways of estimating the probability of a
word w given a history h, or the probability of an entire word sequence W. Now, how can we
compute probabilities of entire sequences like P(w1, w2, ..., wn)? One thing we can do is
decompose this probability using the chain rule of probability. Applying the chain rule to
words, we get:
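P(w_{1:n}) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_{1:2}) \cdots P(w_n \mid w_{1:n-1}) = \prod_{k=1}^{n} P(w_k \mid w_{1:k-1})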

The chain rule shows the link between computing the joint probability of a sequence and
computing the conditional probability of a word given previous words. But using the chain rule
doesn’t really seem to help us! We don’t know any way to compute the exact probability of a
word given a long sequence of preceding words, P(wn|w1:n-1).

The intuition of the n-gram model is that instead of computing the probability of a word given
its entire history, we can approximate the history by just the last few words. The bigram model,
approximates the probability of a word given all the previous words P(wn|w1:n-1) by using only
the conditional probability of the preceding word P(wn|wn-1). In other words, instead of
computing the probability

we approximate it with the probability


When we use a bigram model to predict the conditional probability of the next word, we are
thus making the following approximation:
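P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})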

The assumption that the probability of a word depends only on the previous word is called a
Markov assumption. Markov models are the class of probabilistic models that assume

we can predict the probability of some future unit without looking too far into the past. We can
generalize the bigram (which looks one word into the past) to the trigram (which looks two
words into the past) and thus to the n-gram (which looks n-1 words into the past).

Let’s see a general equation for this n-gram approximation to the conditional probability of the
next word in a sequence. We’ll use N here to mean the n-gram size, so N = 2 means bigrams
and N = 3 means trigrams. Then we approximate the probability of a word given its entire
context as follows:
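P(wn | w1:n-1) ≈ P(wn | wn-N+1:n-1)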

Given the bigram assumption for the probability of an individual word, we can compute the
probability of a complete word sequence
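P(w1:n) ≈ ∏ (k=1 to n) P(wk | wk-1)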

An intuitive way to estimate probabilities is called maximum likelihood estimation or MLE.


We get the MLE estimate for the parameters of an n-gram model by getting counts from a
corpus, and normalizing the counts so that they lie between 0 and 1.

For example, to compute a particular bigram probability of a word wn given a previous word
wn-1, we’ll compute the count of the bigram C(wn-1wn) and normalize by the sum of all the
bigrams that share the same first word wn-1:
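P(wn | wn-1) = C(wn-1 wn) / Σw C(wn-1 w) = C(wn-1 wn) / C(wn-1)

(The sum of all bigram counts that start with a given word wn-1 is simply the unigram count of wn-1.)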

Let’s work through an example using a mini-corpus of three sentences. We’ll first need to
augment each sentence with a special symbol <s> at the beginning of the sentence, to give us
the bigram context of the first word. We’ll also need a special end-symbol, </s>.

<s> I am Sam </s>


<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus.
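P(I|<s>) = 2/3 = .67        P(Sam|<s>) = 1/3 = .33        P(am|I) = 2/3 = .67
P(</s>|Sam) = 1/2 = .5      P(Sam|am) = 1/2 = .5          P(do|I) = 1/3 = .33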

Maximum Likelihood Estimate:
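For the general case of MLE n-gram parameter estimation:

P(wn | wn-N+1:n-1) = C(wn-N+1:n-1 wn) / C(wn-N+1:n-1)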

The above equation estimates the n-gram probability by dividing the observed frequency of a
particular sequence by the observed frequency of a prefix. This ratio is called a relative
frequency. We said above that this use of frequencies as a way to estimate probabilities is an
example of maximum likelihood estimation or MLE. In MLE, the resulting parameter set
maximizes the likelihood of the training set T given the model M (i.e., P(T|M)). For example,
suppose the word Chinese occurs 400 times in a corpus of a million words like the Brown
corpus. What is the probability that a random word selected from some other text of, say, a
million words will be the word Chinese? The MLE of its probability is 400/1000000 or .0004.
Now .0004 is not the best possible estimate of the probability of Chinese occurring in all
situations; it might turn out that in some other corpus or context Chinese is a very unlikely
word. But it is the probability that makes it most likely that Chinese will occur 400 times in a
million-word corpus.
Let’s move on to some examples from a slightly larger corpus than our 14-word example above.
We’ll use data from the now-defunct Berkeley Restaurant Project, a dialogue system from the
last century that answered questions about a database of restaurants in Berkeley, California.
Here are some text normalized sample user queries (a sample of 9332 sentences is on the
website):
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day

Figure below shows the bigram counts from a piece of a bigram grammar from the Berkeley
Restaurant Project. Note that the majority of the values are zero. In fact, we have chosen the
sample words to cohere with each other; a matrix selected from a random set of eight words
would be even more sparse.

Figure below shows the bigram probabilities after normalization (dividing each cell in the
counts figure above by the appropriate unigram count for its row, taken from the following set of
unigram counts):

Now we can compute the probability of sentences like I want English food or I want Chinese
food by simply multiplying the appropriate bigram probabilities together, as follows:
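P(<s> i want english food </s>)
    = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)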

As an exercise, compute the probability of i want chinese food in the same way.

Some practical issues: Although for pedagogical purposes we have only described bigram
models, in practice it’s more common to use trigram models, which condition on the
previous two words rather than the previous word, or 4-gram or even 5-gram models, when
there is sufficient training data. Note that for these larger n-grams, we’ll need to assume extra
contexts to the left and right of the sentence end.
For example, to compute trigram probabilities at the very beginning of the sentence, we use
two pseudo-words for the first trigram (i.e., P(I|<s><s>)).

We always represent and compute language model probabilities in log format as log
probabilities. Since probabilities are (by definition) less than or equal to 1, the more
probabilities we multiply together, the smaller the product becomes. Multiplying enough
n-grams together would result in numerical underflow. By using log probabilities instead of raw
probabilities, we get numbers that are not as small.
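This works because multiplication in probability space corresponds to addition in log space:

p1 × p2 × p3 × p4 = exp(log p1 + log p2 + log p3 + log p4)

so we add the log probabilities as we go and exponentiate only if a raw probability is needed at the end.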

Evaluating Language Models


The best way to evaluate the performance of a language model is to embed it in an application
and measure how much the application improves. Such end-to-end evaluation is called extrinsic
evaluation. Extrinsic evaluation is the only way to know if a particular improvement in a
component is really going to help the task at hand. Thus, for speech recognition, we can
compare the performance of two language models by running the speech recognizer twice,
once with each language model, and seeing which gives the more accurate transcription.
Unfortunately, running big NLP systems end-to-end is often very expensive. Instead, it would
be nice to have a metric that can be used to quickly evaluate potential improvements in a
language model. An intrinsic evaluation metric is one that measures the quality of a model
independent of any application.

For an intrinsic evaluation of a language model we need a test set. As with many of the
statistical models in our field, the probabilities of an n-gram model come from the corpus it is
trained on, the training set or training corpus. We can then measure the quality of an n-gram
model by its performance on some unseen data called the test set or test corpus. So if we are
given a corpus of text and want to compare two different n-gram models, we divide the data
into training and test sets, train the parameters of both models on the training set, and then
compare how well the two trained models fit the test set. But what does it mean to “fit the test
set”? The answer is simple: whichever model assigns a higher probability to the test set—
meaning it more accurately predicts the test set, is a better model.
Perplexity
In practice we don’t use raw probability as our metric for evaluating language models, but a
variant called perplexity. The perplexity (sometimes called PPL for short) of a language model
on a test set is the inverse probability of the test set, normalized by the number of words. For a
test set W = w1w2…wN:
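PP(W) = P(w1w2…wN)^(-1/N) = (1 / P(w1w2…wN))^(1/N)

that is, the Nth root of the inverse of the test-set probability.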

The perplexity of a test set W depends on which language model we use. Here’s the perplexity
of W with a unigram language model (just the geometric mean of the unigram probabilities):
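PP(W) = ( ∏ (i=1 to N) 1 / P(wi) )^(1/N)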

The perplexity of W computed with a bigram language model is still a geometric mean, but
now of the bigram probabilities:
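PP(W) = ( ∏ (i=1 to N) 1 / P(wi | wi-1) )^(1/N)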

Minimizing perplexity is equivalent to maximizing the test set probability according to the
language model.
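A minimal sketch of this computation in code, assuming a toy bigram_prob function; the probability table and the small floor used for unseen bigrams are illustrative only and not part of the definition above.

import math

toy_probs = {("<s>", "i"): 0.67, ("i", "am"): 0.67, ("am", "sam"): 0.5, ("sam", "</s>"): 0.5}

def bigram_prob(w_prev, w):
    # tiny floor so an unseen bigram does not make log() blow up in this toy example
    return toy_probs.get((w_prev, w), 1e-10)

def perplexity(tokens):
    log_prob = 0.0
    for w_prev, w in zip(tokens, tokens[1:]):
        log_prob += math.log(bigram_prob(w_prev, w))
    n = len(tokens) - 1          # number of predicted words
    return math.exp(-log_prob / n)

print(perplexity("<s> i am sam </s>".split()))   # about 1.73 for this toy model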
Given a text W, different language models will have different perplexities. Because of this,
perplexity can be used to compare different n-gram models. Let’s look at an example, in which
we trained unigram, bigram, and trigram grammars on 38 million words (including start-of-
sentence tokens) from the Wall Street Journal, using a 19,979 word vocabulary. We then
computed the perplexity of each of these models on a test set of 1.5 million words, using Eq.
for unigrams, for bigrams, and the corresponding equation for trigrams. The table below shows
the perplexity of a 1.5 million word WSJ test set according to each of these grammars.
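The perplexities reported for that experiment are approximately:

Unigram: 962        Bigram: 170        Trigram: 109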

As we see above, the more information the n-gram gives us about the word sequence, the higher
the probability the n-gram will assign to the string, and hence the lower the perplexity.

Sampling sentences from a language model


One important way to visualize what kind of knowledge a language model embodies is to
sample from it. Sampling from a distribution means to choose random points according to their
likelihood. Thus, sampling from a language model, which represents a distribution over
sentences, means to generate some sentences, choosing each sentence according to its
likelihood as defined by the model. Thus, we are more likely to generate sentences that the
model thinks have a high probability and less likely to generate sentences that the model thinks
have a low probability.
This technique of visualizing a language model by sampling was first suggested very early on
by Shannon (1951) and Miller and Selfridge (1950). It’s simplest to visualize how this works
for the unigram case. Imagine all the words of the English language covering the probability
space between 0 and 1, each word covering an interval proportional to its frequency. Figure
shows a visualization, using a unigram LM computed from the text of this book. We choose a
random value between 0 and 1, find that point on the probability line, and print the word whose
interval includes this chosen value. We continue choosing random numbers and generating
words until we randomly generate the sentence-final token </s>.

We can use the same technique to generate bigrams by first generating a random bigram that
starts with <s> (according to its bigram probability). Let’s say the second word of that bigram
is w. We next choose a random bigram starting with w (again, drawn according to its bigram
probability), and so on.
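A minimal sketch of this procedure in code, assuming the bigram model is stored as a plain dictionary mapping each word to a list of (next word, probability) pairs; the probabilities below are the ones derived from the three-sentence Sam mini-corpus earlier, so the whole model is illustrative only.

import random

# bigram probabilities estimated from the <s> I am Sam </s> mini-corpus
bigram_model = {
    "<s>": [("i", 2/3), ("sam", 1/3)],
    "i": [("am", 2/3), ("do", 1/3)],
    "am": [("sam", 1/2), ("</s>", 1/2)],
    "sam": [("i", 1/2), ("</s>", 1/2)],
    "do": [("not", 1.0)],
    "not": [("like", 1.0)],
    "like": [("green", 1.0)],
    "green": [("eggs", 1.0)],
    "eggs": [("and", 1.0)],
    "and": [("ham", 1.0)],
    "ham": [("</s>", 1.0)],
}

def sample_sentence(model, max_len=20):
    word, sentence = "<s>", []
    while len(sentence) < max_len:
        next_words, probs = zip(*model[word])
        word = random.choices(next_words, weights=probs, k=1)[0]
        if word == "</s>":
            break
        sentence.append(word)
    return " ".join(sentence)

print(sample_sentence(bigram_model))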
Generalization and Zeros
The n-gram model, like many statistical models, is dependent on the training corpus. One
implication of this is that the probabilities often encode specific facts about a given training
corpus. Another implication is that n-grams do a better and better job of modelling the training
corpus as we increase the value of N. We can use the sampling method from the prior section
to visualize both of these facts! To give an intuition for the increasing power of higher-order
n-grams, Figure below shows random sentences generated from unigram, bigram, trigram, and
4-gram models trained on Shakespeare’s works.
Figure 3: Random sentences generated from unigram, bigram, trigram, and 4-gram models trained on Shakespeare's works (figure not reproduced).
The longer the context on which we train the model, the more coherent the sentences. In the
unigram sentences, there is no coherent relation between words or any sentence-final
punctuation. The bigram sentences have some local word-to-word coherence (especially if we
consider that punctuation counts as a word). The trigram and 4-gram sentences are beginning
to look a lot like Shakespeare. Indeed, a careful investigation of the 4-gram sentences shows
that they look a little too much like Shakespeare. The words It cannot be but so are directly
from King John. From Shakespeare (N = 884,647, V = 29,066), our n-gram probability matrices
are ridiculously sparse. There are V^2 = 844,000,000 possible bigrams alone, and the number of
possible 4-grams is V^4 = 7 × 10^17. Thus, once the generator has chosen the first 4-gram (It
cannot be but), there are only five possible continuations (that, I, he, thou, and so); indeed, for
many 4-grams, there is only one continuation.
To get an idea of the dependence of a grammar on its training set, let’s look at an n-gram
grammar trained on a completely different corpus: the Wall Street Journal (WSJ) newspaper.
Shakespeare and the Wall Street Journal are both English, so we might expect some overlap
between our n-grams for the two genres. Figure below shows sentences generated by unigram,
bigram, and trigram grammars trained on 40 million words from WSJ.
Compare these examples to the pseudo-Shakespeare in above figure. While they both model
“English-like sentences”, there is clearly no overlap in generated sentences, and little overlap
even in small phrases. Statistical models are likely to be pretty useless as predictors if the
training sets and the test sets are as different as Shakespeare and WSJ.

How should we deal with this problem when we build n-gram models? One step is to be sure
to use a training corpus that has a similar genre to whatever task we are trying to accomplish.
To build a language model for translating legal documents, we need a training corpus of legal
documents. To build a language model for a question-answering system, we need a training
corpus of questions. It is equally important to get training data in the appropriate dialect or
variety, especially when processing social media posts or spoken transcripts.

Matching genres and dialects is still not sufficient. Our models may still be subject to the
problem of sparsity. For any n-gram that occurred a sufficient number of times, we might have
a good estimate of its probability. But because any corpus is limited, some perfectly acceptable
English word sequences are bound to be missing from it. That is, we’ll have many cases of
putative “zero probability n-grams” that should really have some non-zero probability.
Consider the words that follow the bigram denied the in the WSJ Treebank3 corpus, together
with their counts:
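In the textbook data this example is drawn from, the observed continuations are roughly:

denied the allegations: 5
denied the speculation: 2
denied the rumors: 1
denied the report: 1

But suppose our test set has phrases like denied the offer or denied the loan, which never occurred in the training data and therefore have counts of zero.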

These zeros—things that don’t ever occur in the training set but do occur in the test set—are a
problem for two reasons. First, their presence means we are underestimating the probability of
all sorts of words that might occur, which will hurt the performance of any application we want
to run on this data. Second, if the probability of any word in the test set is 0, the entire
probability of the test set is 0. By definition, perplexity is based on the inverse probability of
the test set. Thus, if some words have zero probability, we can’t compute perplexity at all, since
we can’t divide by 0! There are two solutions, depending on the kind of zero. For words whose
n-gram probability is zero because they occur in a novel test set context, like the example of
denied the offer above, we’ll introduce algorithms called smoothing or discounting.
Smoothing algorithms shave off a bit of probability mass from some more frequent events and
give it to these unseen events. But first, let’s talk about an even more insidious form of zero:
words that the model has never seen before at all (in any context): unknown words!
Unknown Words
What do we do about words we have never seen before? Perhaps the word Jurafsky simply
did not occur in our training set, but pops up in the test set! We can choose to disallow this
situation from occurring, by stipulating that we already know all the words that can occur. In
such a closed vocabulary system the test set can only contain words from this known lexicon,
and there will be no unknown words.
In most real situations, however, we have to deal with words we haven’t seen before, which
we’ll call unknown words, or out of vocabulary (OOV) words. The percentage of OOV words
that appear in the test set is called the OOV rate. One way to create an open vocabulary system
is to model these potential unknown words in the test set by adding a pseudo-word called
<UNK>.
There are two common ways to train the probabilities of the unknown word model <UNK>.
The first one is to turn the problem back into a closed vocabulary one by choosing a fixed
vocabulary in advance:
1. Choose a vocabulary (word list) that is fixed in advance.
2. Convert in the training set any word that is not in this set (any OOV word) to the unknown
word token <UNK> in a text normalization step.
3. Estimate the probabilities for <UNK> from its counts just like any other regular word in the
training set.
The second alternative, in situations where we don’t have a prior vocabulary in advance, is to
create such a vocabulary implicitly, replacing words in the training data by <UNK> based on
their frequency. For example, we can replace by <UNK> all words that occur fewer than n
times in the training set, where n is some small number, or equivalently select a vocabulary
size V in advance (say 50,000) and choose the top V words by frequency and replace the rest
by <UNK>. In either case we then proceed to train the language model as before, treating
<UNK> like a regular word.
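A minimal sketch of the second strategy in code; the tiny corpus and the frequency threshold below are illustrative only.

from collections import Counter

train_tokens = "i am sam sam i am i do not like green eggs and ham".split()
min_count = 2                       # words seen fewer than this become <UNK>

counts = Counter(train_tokens)
vocab = {w for w, c in counts.items() if c >= min_count}

def normalize(tokens, vocab):
    # map every out-of-vocabulary token to the pseudo-word <UNK>
    return [w if w in vocab else "<UNK>" for w in tokens]

print(normalize(train_tokens, vocab))
print(normalize("i like sam and his green hat".split(), vocab))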
Smoothing
What do we do with words that are in our vocabulary (they are not unknown words) but appear
in a test set in an unseen context (for example they appear after a word they never appeared
after in training)? To keep a language model from assigning zero probability to these unseen
events, we’ll have to shave off a bit of probability mass from some more frequent events and
give it to the events we’ve never seen. This modification is called smoothing or discounting.
Now we’ll see a variety of ways to do smoothing: Laplace (add-one) smoothing, add-k
smoothing, stupid backoff, and Kneser-Ney smoothing.
Laplace Smoothing
The simplest way to do smoothing is to add one to all the n-gram counts, before we normalize
them into probabilities. All the counts that used to be zero will now have a count of 1, the
counts of 1 will be 2, and so on. This algorithm is called Laplace smoothing. Laplace smoothing
does not perform well enough to be used in modern n-gram models, but it usefully
introduces many of the concepts that we see in other smoothing algorithms, gives a useful
baseline, and is also a practical smoothing algorithm for other tasks like text classification.
Let’s start with the application of Laplace smoothing to unigram probabilities. Recall that the
unsmoothed maximum likelihood estimate of the unigram probability of the word wi is its
count ci normalized by the total number of word tokens N:
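P(wi) = ci / N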

Laplace smoothing merely adds one to each count (hence its alternate name, add-one
smoothing). Since there are V words in the vocabulary and each one was incremented, we also
need to adjust the denominator to take into account the extra V observations.
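P_Laplace(wi) = (ci + 1) / (N + V)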

Let’s smooth our Berkeley Restaurant Project bigrams. Figure below shows the add-one
smoothed counts for the bigrams in Berkeley Restaurant Project.

Recall that normal bigram probabilities are computed by normalizing each row of counts by
the unigram count:
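P(wn | wn-1) = C(wn-1 wn) / C(wn-1)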

For add-one smoothed bigram counts, we need to augment the unigram count by the number
of total word types in the vocabulary V:
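P_Laplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)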

Thus, each of the unigram counts given in the previous section will need to be augmented by
V =1446. The result is the smoothed bigram probabilities in Figure below.

The sharp change in counts and probabilities occurs because too much probability mass is
moved to all the zeros.
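A minimal sketch of add-one smoothed bigram estimation in code, using the small Sam corpus from earlier; storing the counts in Counter dictionaries is just one possible implementation choice.

from collections import Counter

sentences = [
    "<s> i am sam </s>",
    "<s> sam i am </s>",
    "<s> i do not like green eggs and ham </s>",
]
unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    tokens = sent.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)                     # number of word types in the vocabulary

def p_laplace(w_prev, w):
    # (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_laplace("i", "am"))       # a bigram that was seen in training
print(p_laplace("i", "green"))    # an unseen bigram now gets a small nonzero probability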
Add-k smoothing
One alternative to add-one smoothing is to move a bit less of the probability mass from the
seen to the unseen events. Instead of adding 1 to each count, we add a fractional count k (.5?
.05? .01?). This algorithm is therefore called add-k smoothing.
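P_Add-k(wn | wn-1) = (C(wn-1 wn) + k) / (C(wn-1) + kV)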

Add-k smoothing requires that we have a method for choosing k; this can be done, for example,
by optimizing on a devset. Although add-k is useful for some tasks (including text
classification), it turns out that it still doesn’t work well for language modelling, generating
counts with poor variances and often inappropriate discounts.
Backoff and Interpolation
The discounting we have been discussing so far can help solve the problem of zero frequency
n-grams. But there is an additional source of knowledge we can draw on. If we are trying to
compute P(wn|wn-2wn-1) but we have no examples of a particular trigram wn-2wn-1wn, we can
instead estimate its probability by using the bigram probability P(wn|wn-1). Similarly, if we
don’t have counts to compute P(wn|wn-1), we can look to the unigram P(wn). In other words,
sometimes using less context is a good thing, helping to generalize more for contexts that the
model hasn’t learned much about. There are two ways to use this n-gram “hierarchy”. In
backoff, we use the trigram if the evidence is sufficient, otherwise we use the bigram, otherwise
the unigram. In other words, we only “back off” to a lower-order n-gram if we have zero
evidence for a higher-order n-gram.

By contrast, in interpolation, we always mix the probability estimates from all the n-gram
estimators, weighting and combining the trigram, bigram, and unigram counts. In simple linear
interpolation, we combine different order n-grams by linearly interpolating them. Thus, we
estimate the trigram probability P(wn|wn-2wn-1) by mixing together the unigram, bigram, and
trigram probabilities, each weighted by a λ:
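P̂(wn | wn-2 wn-1) = λ1 P(wn) + λ2 P(wn | wn-1) + λ3 P(wn | wn-2 wn-1)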

The λs must sum to 1 (λ1 + λ2 + λ3 = 1), making the equation equivalent to a weighted average.



UNIT - IV

Semantic Parsing

1. Introduction

• Two approaches have emerged in NLP for language understanding.


• In the first approach, a specific, rich meaning representation is created for a
limited domain for use by applications that are restricted to that domain, such as
travel reservations, football game simulations, or querying a geographic
database.
• In the second approach, a related set of generic, intermediate meaning representations
is created, going from a low-level analysis to a mid-level analysis, and the bigger
understanding task is divided into multiple, smaller pieces that are more manageable,
such as word sense disambiguation followed by predicate-argument structure
recognition.
• This gives us two types of meaning representations: a domain-dependent, deeper
representation and a set of relatively shallow but general-purpose, low-level, and
intermediate representations.
• The task of producing the output of the first type is often called deep semantic
parsing, and the task of producing the output of the second type is often called
shallow semantic parsing.
• The first approach is so specific that porting to every new domain can require
anywhere from a few modifications to almost reworking the solution from scratch.

• In other words, the reusability of the representation across domains is very limited.
• The problem with the second approach is that it is extremely difficult to construct a
general-purpose ontology and create symbols that are shallow enough to be learnable
but detailed enough to be useful for all possible applications.

• Ontology means
1. The branch of metaphysics dealing with the nature of being.
2. a set of concepts and categories in a subject area or domain that shows their properties
and the relations between them.
"what's new about our ontology is that it is created automatically from large datasets"

• Therefore, an application-specific translation layer between the more general
representation and the more specific representation becomes necessary.

2. Semantic Interpretation

Semantic parsing can be considered part of semantic interpretation, which
involves various components that together define a representation of text that can be
fed into a computer to allow further computations, manipulations, and search, which
are prerequisites for any language understanding system or application. Here we
discuss the structure of semantic theory.

A Semantic theory should be able to:


➢ Explain sentences having ambiguous meaning: The bill is large is ambiguous in the
sense that it could refer to money or the beak of a bird.

➢ Resolve the ambiguities of words in context: in The bill is large but need not be paid, the
theory should be able to select the monetary meaning of bill.
➢ Identify meaningless but syntactically well-formed sentence: Colorless green ideas
sleep furiously.
➢ Identify syntactically or transformationally unrelated paraphrases of a concept having
the same semantic content.

➢ Here we look at some requirements for achieving a semantic representation.

2.1 Structural Ambiguity

➢ Structure means syntactic structure of sentences.


➢ Obtaining the syntactic structure means transforming a sentence into its underlying syntactic
representation, and a theory of semantic interpretation operates on this underlying syntactic
representation.

2.2 Word Sense

➢ In any given language, the same word type is used in different contexts and with
different morphological variants to represent different entities or concepts in the
world.
➢ For example, we use the word nail to represent a part of the human anatomy
and also to represent the generally metallic object used to secure other objects.

2.3 Entity and Event Resolution

➢ Any discourse consists of a set of entities participating in a series of explicit or


implicit events over a period of time.
➢ So, the next important component of semantic interpretation is the
identification of the various entities that are sprinkled across the discourse using the
same or different phrases.
➢ Two predominant tasks have become popular over the years: named entity
recognition and coreference resolution.
➢ Coreference resolution is the task of finding all expressions that refer to the same
entity in a text.

2.4 Predicate Argument Structure



➢ Once we have the word-sense, entities and events identified, another level of semantics
structure comes into play: identifying the participants of the entities in these events.
➢ Resolving the argument structure of a predicate in a sentence is where we identify which
entities play what part in which event.
➢ A word which functions as the verb is called a predicate, and words which function as
nouns are called its arguments. For example, in Ram gave Shyam a book, gave is the
predicate and Ram, Shyam, and a book are its arguments.

2.5 Meaning Representation

➢ The final process of semantic interpretation is to build a semantic representation
or meaning representation that can then be manipulated by algorithms to serve various
application ends.
➢ The resulting representation is sometimes called the deep representation.


3. System Paradigms

• It is important to get a perspective on the various primary dimensions on which the


problem of semantic interpretation has been tackled.
• The approaches generally fall into the following three categories: 1. System architecture,
2. Scope, 3. Coverage.

1. System Architectures
a. Knowledge based: These systems use a predefined set of rules or a knowledge base to
obtain a solution to a new problem.
b. Unsupervised: These systems tend to require minimal human intervention to be
functional by using existing resources that can be bootstrapped for a particular
application or problem domain.
c. Supervised: these systems involve the manual annotation of some phenomena
that appear in a sufficient quantity of data so that machine learning algorithms can
be applied.
d. Semi-Supervised: Manual annotation is usually very expensive and does not yield
enough data to completely capture a phenomenon. In such instances, researchers
can automatically expand the data set on which their models are trained, either
by employing machine-generated output directly or by bootstrapping off an
existing model by having humans correct its output.

2. Scope:
➢ Domain Dependent: These systems are specific to certain domains, such as
air travel reservations or simulated football coaching.
➢ Domain Independent: These systems are general enough that the techniques can be
applicable to multiple domains with little or no change.

3. Coverage
a. Shallow: These systems tend to produce an intermediate representation that can
then be converted to one that a machine can base its action on.
b. Deep: These systems usually create a terminal representation that is directly consumed by
a machine or application.

4. Word Sense
➢ Word Sense Disambiguation is an important NLP task by which the meaning
of a word, as used in a particular context, is determined.
➢ In a compositional approach to semantics, where the meaning of the whole is
composed from the meaning of its parts, the smallest parts under consideration in textual
discourse are typically the words themselves: either tokens as they appear in the text
or their lemmatized forms.
➢ Word sense has been examined and studied for a very long time.
➢ Attempts to solve this problem range from rule based and knowledge based to
completely unsupervised, supervised, and semi-supervised learning methods.
➢ Very early systems were predominantly rule based or knowledge based and used
dictionary definitions of senses of words.
➢ Unsupervised word sense induction or disambiguation techniques try to induce the
senses of a word as it appears in various corpora.
➢ These systems perform either a hard or soft clustering of words and tend to allow the
tuning of these clusters to suit a particular application.
➢ Most recent supervised approaches to word sense disambiguation operate at an
application-independent level of granularity (including small details), although the
output of supervised approaches can still be amenable to generating a ranking,
or distribution, over senses.


➢ Word sense ambiguities can be of three principal types: i. homonymy, ii. polysemy,
iii. categorial ambiguity.
➢ Homonymy is defined as words having the same spelling or form but different and
unrelated meanings. For example, the word bat is a homonym because a bat can be
an implement to hit a ball or a nocturnal flying mammal.
➢ Polysemy comes from a Greek word meaning “many signs”. A polysemous word has the
same spelling but different and related meanings.
➢ Both polysemous and homonymous words have the same spelling or form. The main
difference between them is that in polysemy the meanings of the word are related,
but in homonymy the meanings are not related.
➢ For example, bank. Homonymy: financial bank and river bank.
Polysemy: financial bank, bank of clouds, and book bank, each indicating a collection of
things.
➢ Categorial ambiguity: the word book can mean a book which contains chapters, or a
police register used to enter charges against someone, or the act of booking someone.
➢ In note book and text book, book belongs to the grammatical category of noun, whereas
in to book someone it is a verb.
➢ Distinguishing between these two categories effectively helps disambiguate these two
senses.
➢ Therefore, categorial ambiguity can be resolved with syntactic information (part
of speech) alone, but polysemy and homonymy need more than syntax.
➢ Traditionally, in English, word senses have been annotated for each part of speech
separately, whereas in Chinese, the sense annotation has been done per lemma.

Resources:
➢ As with any language understanding task, the availability of resources is a key
factor in the disambiguation of word senses in corpora.
➢ Early work on word sense disambiguation used machine readable dictionaries or
thesaurus as knowledge sources.
➢ Two prominent sources were the Longman dictionary of contemporary English
(LDOCE) and Roget’s Thesaurus.
➢ The biggest sense-annotated corpus, OntoNotes, was released through the Linguistic Data
Consortium (LDC).
➢ The corresponding resource for Chinese sense annotation is HowNet.

Systems:
Researchers have explored various system architectures to address the sense disambiguation
problem.
We can classify these systems into four main categories: i. rule based or knowledge
based, ii. supervised, iii. unsupervised, iv. semi-supervised.

Rule Based:
➢ The first generation of word sense disambiguation systems was primarily based on
dictionary sense definitions.
➢ Much of this information is historical and cannot readily be translated and made
available for building systems today, but some of the techniques and algorithms are still
available.

➢ The simplest and oldest dictionary-based sense disambiguation algorithm was


introduced by Lesk.
The core of the algorithm is to select the dictionary sense whose definition terms most closely
overlap with the terms in the context of the target word.
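A minimal sketch of this idea in code, using a small hand-written sense inventory; the sense labels and definitions below are illustrative and not taken from a real dictionary.

def simplified_lesk(word, context, sense_definitions):
    # pick the sense whose definition shares the most words with the context
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, definition in sense_definitions[word].items():
        overlap = len(context_words & set(definition.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

senses = {
    "bank": {
        "bank/finance": "an institution that accepts deposits of money and makes loans",
        "bank/river": "sloping land beside a body of water such as a river",
    }
}
print(simplified_lesk("bank", "he sat by the river and watched the water", senses))
# expected output: bank/river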

Another dictionary-based algorithm was suggested by Yarowsky.

This study used Roget’s Thesaurus categories and classified unseen words into one
of these 1042 categories based on a statistical analysis of 100-word concordances for each
member of each category.

The method consists of three steps, as shown in Fig below.


• The first step is a collection of contexts.
• The second step computes weights for each of the salient words.
• P(w|Rcat) is the probability of a word w occurring in the context of a Roget’s
Thesaurus category Rcat.
• The salient words are weighted by P(w|Rcat) / P(w), the probability of a word w appearing
in the context of a Roget category divided by its overall probability in the corpus.
• Finally, in the third step, the unseen words in the test set are classified into the category
that has the maximum weight.
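A commonly cited form of this weighting (stated here as a sketch; the smoothing details of the original study are omitted) is

weight(w, Rcat) = log( P(w|Rcat) / P(w) )

and an unseen occurrence of the target word is assigned to the category Rcat that maximizes the sum of weight(w, Rcat) over the words w in its context.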

Supervised:
• The simplest form of word sense disambiguation system is the supervised approach,
which tends to transfer all the complexity to the machine learning machinery; while it
still requires hand annotation, it tends to be superior to unsupervised approaches and
performs best when tested on annotated data.


• These systems typically consist of a machine learning classifier trained on various
features extracted for words that have been manually disambiguated in a given
corpus and the application of the resulting models to disambiguating words in the
unseen test sets.
• A good feature of these systems is that the user can incorporate rules and knowledge
in the form of features.
Classifier:
Probably the most common and highest-performing classifiers are support vector machines
(SVMs) and maximum entropy classifiers.
Features: Here we discuss a more commonly found subset of features that have been useful in
supervised learning of word sense.
Lexical context: This feature comprises the words and lemmas of words occurring in the entire
paragraph or in a smaller window, usually of five words.
Parts of speech: This feature comprises the parts of speech of the words surrounding the word that is being sense tagged.
Bag of words context: this feature comprises using an unordered set of words in the context
window.
Local Collocations: Local collocations are an ordered sequence of phrases near the target word that
provide semantic context for disambiguation. Usually, a very small window of about three tokens
on each side of the target word, most often in contiguous pairs or triplets, are added as a list of
features.
Syntactic relations: if the parse of the sentence containing the target word is available, then we can
use syntactic features.
Topic features: The broad topic, or domain, of the article that the word belongs to is also a good
indicator of what sense of the word might be most frequent.

"Word sense disambiguation" (WSD) is a natural language processing (NLP) task that involves
determining the correct sense or meaning of a word within a given context. Many words in
natural language have multiple meanings or senses, and WSD aims to choose the most
appropriate sense for a word in a specific sentence or context.

Supervised learning with Support Vector Machines (SVM) is one approach to solving the WSD
problem. Here's how it works:

1. Data Collection: To train an SVM for WSD, you need a labeled dataset where each word
is tagged with its correct sense in various contexts. This dataset is typically created by
human annotators who assign senses to words in sentences.
2. Feature Extraction: For each word in the dataset, you need to extract relevant features
from its context. These features could include the words surrounding the target word,
part-of-speech tags, syntactic information, and more. These features serve as the input to
the SVM.
3. Training: Once you have the labeled dataset and extracted features, you can train an
SVM classifier. The goal is to teach the SVM to learn patterns in the features that are
indicative of specific word senses.
4. Testing/Predicting: After training, you can use the SVM to predict the sense of an
ambiguous word in a new, unseen sentence. The SVM considers the context features and
assigns the word the most likely sense based on what it learned during training.
5. Evaluation: To assess the performance of your WSD system, you can use various
evaluation metrics, such as accuracy, precision, recall, and F1-score. These metrics help
you measure how well your SVM-based WSD system is performing in disambiguating
word senses.

SVMs are popular for WSD because they are effective at handling high-dimensional feature
spaces and can learn complex decision boundaries. However, the success of the SVM-based WSD
system heavily depends on the quality of the labeled dataset and the choice of features used for
training.
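A minimal sketch of these five steps with scikit-learn, assuming a tiny hand-labeled dataset for the ambiguous word "bank"; the sentences, labels, and feature choice (bag of words and word pairs) are illustrative only.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# step 1: labeled data, each item is (context sentence, sense label)
train_sentences = [
    "he deposited the cheque at the bank on monday",
    "the bank raised its interest rates again",
    "they had a picnic on the bank of the river",
    "the fisherman stood on the muddy bank",
]
train_senses = ["bank/finance", "bank/finance", "bank/river", "bank/river"]

# steps 2-3: extract bag-of-words context features and train the SVM
model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(train_sentences, train_senses)

# step 4: predict the sense of the ambiguous word in a new context
print(model.predict(["she opened a savings account at the bank"]))

# step 5: on a held-out labeled test set, metrics such as accuracy, precision,
# recall, and F1 can be computed with functions like sklearn.metrics.accuracy_score.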

The identification of the head word is important in syntax because it helps determine the
grammatical structure of a phrase or sentence. For feature selection in NLP tasks like parsing or
word sense disambiguation, knowing the head word and its relationships with other words in a
sentence can be valuable information. Syntactic relations often involve the relationship between a
head word and its dependents or modifiers, and these relations can be used as features in
various natural language processing applications.

Unsupervised:

Unsupervised learning in Natural Language Processing (NLP) is a category of machine


learning where the model is trained on unlabeled data without explicit supervision or
predefined categories. It aims to discover patterns, structures, or representations within the
data. One concept related to unsupervised learning in NLP is "Conceptual Density."

Conceptual Density is a WordNet-based measure: for each candidate sense of the target word, it
measures how densely the senses of the surrounding context words populate that sense's WordNet
sub-hierarchy, and the sense with the densest sub-hierarchy is chosen. HyperLex is another
unsupervised technique: it builds a co-occurrence graph for the target word and detects highly
connected "hub" nodes, each hub corresponding to one induced sense.

Figure: Conceptual Density (original figure not reproduced).

Semi Supervised:
Semi-supervised learning is a machine learning paradigm that combines both labeled and
unlabeled data to improve model performance. In the context of word sense disambiguation
(WSD) in Natural Language Processing (NLP), semi-supervised learning techniques can be


quite beneficial because labeled data for WSD is often limited and expensive to obtain. Here's
an overview of a semi-supervised learning algorithm for WSD:

Self-Training for WSD:

Self-training is a popular semi-supervised learning approach that can be adapted for WSD. In
self-training for WSD, you start with a small set of labeled examples and a larger set of
unlabeled examples. The process involves iterative steps:

1. Initialization: Begin with a small labeled dataset where each example consists of a
sentence containing an ambiguous word and its corresponding sense label.
2. Feature Extraction: Extract relevant features from the labeled examples, which
typically include information about the target word, its context words, part-of-speech
tags, syntactic relations, and more.
3. Model Training: Train a WSD model using the labeled data. This can be a
supervised machine learning model like Support Vector Machines (SVM), Naive
Bayes, or a neural network-based model.
4. Prediction: Use the trained model to predict word senses for the unlabeled data.
Apply the model to the sentences containing the ambiguous word from the unlabeled
dataset to assign senses to those instances.
5. Confidence Threshold: Introduce a confidence threshold or some criteria to filter the
predictions. For instance, you can choose to keep only the predictions where the
model is highly confident.
6. Adding Labeled Data: Add the confidently predicted examples to the labeled
dataset, marking them as newly labeled instances.
7. Iteration: Repeat steps 2-6 for a fixed number of iterations or until convergence.
8. Final Model: Train a final model using the combined labeled data (original labeled
dataset plus the newly labeled instances) to create a more robust WSD model.
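A minimal sketch of this loop, under the assumptions above: a tiny labeled seed set, a pool of unlabeled contexts, a simple bag-of-words classifier, and an illustrative confidence threshold. None of these specific choices are prescribed by the algorithm itself.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled_X = ["deposit money at the bank", "the bank of the river flooded"]
labeled_y = ["finance", "river"]
unlabeled_X = [
    "the bank approved my loan",
    "we walked along the grassy bank",
    "the bank charges a monthly fee",
]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
for _ in range(3):                       # a fixed number of self-training iterations
    model.fit(labeled_X, labeled_y)      # steps 2-3: featurize and train
    if not unlabeled_X:
        break
    probs = model.predict_proba(unlabeled_X)        # step 4: predict senses
    preds = model.predict(unlabeled_X)
    confident = np.max(probs, axis=1) >= 0.6        # step 5: confidence filter
    # step 6: move confidently labeled examples into the training set
    labeled_X += [s for s, ok in zip(unlabeled_X, confident) if ok]
    labeled_y += [p for p, ok in zip(preds, confident) if ok]
    unlabeled_X = [s for s, ok in zip(unlabeled_X, confident) if not ok]

print(model.predict(["the bank reported record profits"]))   # step 8: use the final model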

Advantages of Self-Training for WSD:

• It leverages a larger pool of unlabeled data, which can be especially beneficial when
labeled data is scarce.
• It allows the model to learn from its own predictions and iteratively improve.
• Self-training is a flexible approach and can be used with various machine learning
models.

Challenges:

• Labeling errors: The initial labeled dataset should be of high quality because errors
can accumulate during self-training iterations.

Semi-supervised learning with self-training can be effective for WSD, but it's essential to
carefully design the process, monitor model performance, and apply filtering criteria to
ensure the quality of the added labeled instances.

Motivation and concept of the Yarowsky algorithm

The Yarowsky algorithm is a semi-supervised (bootstrapping) approach to word sense
disambiguation. It starts from a small set of seed collocations labeled for each sense and relies
on two key observations: one sense per collocation (a word tends to keep one sense within a
given collocation) and one sense per discourse (occurrences of a word within the same document
tend to share the same sense). A classifier trained on the seed examples labels the most confident
unlabeled examples, these are added to the training set, and the process repeats until it converges.

Figure: Yarowsky algorithm (original figure not reproduced).


Discourse Processing (Unit-5 Part-2)

Definition of Discourse: Discourse is the coherent structure of language above the level of
sentences or clauses. A discourse is a coherent structured group of sentences.
What makes a passage coherent? A practical answer: It has meaningful
connections between its utterances.
Cohesion
Relations between words in two units (sentences, paragraphs) “glue” them together.
Example: Before winter I built a chimney, and shingled the sides of my house… I have thus
a tight shingled and plastered house.

There are Three Main Classes of Features for Discourse Cohesion


• Lexical overlap/lexical chains
• Coreference chains
• Cue words/discourse markers

Discourse Processing:

One of the major problems in NLP is discourse processing − building theories and models of
how utterances stick together to form coherent discourse. Actually, language always
consists of collocated, structured, and coherent groups of sentences rather than isolated and
unrelated sentences, much as a movie consists of connected scenes. These coherent groups of
sentences are referred to as discourse.

Concept of Coherence

Coherence and discourse structure are interconnected in many ways. Coherence, a
property of good text, is used to evaluate the output quality of natural language generation
systems. The question that arises here is: what does it mean for a text to be coherent? Suppose
we collected one sentence from every page of a newspaper; would it be a discourse? Of
course not, because these sentences do not exhibit coherence. A coherent discourse must
possess the following properties −

Coherence relation between utterances

The discourse would be coherent if it has meaningful connections between its utterances. This
property is called coherence relation. For example, some sort of explanation must be there to
justify the connection between utterances.

Discourse structure

An important question regarding discourse is what kind of structure the discourse must have.
The answer to this question depends upon the segmentation we apply to the discourse. Discourse
segmentation may be defined as determining the types of structures in a large discourse. It is
quite difficult to implement discourse segmentation, but it is very important for applications
such as information retrieval, text summarization, and information extraction.

Algorithms for Discourse Segmentation


In this section, we will learn about the algorithms for discourse segmentation. The algorithms
are described below −

Unsupervised Discourse Segmentation

The class of unsupervised discourse segmentation is often represented as linear segmentation.


We can understand the task of linear segmentation with the help of an example. In the example,
there is a task of segmenting the text into multi-paragraph units; the units represent the passage
of the original text. These algorithms depend on cohesion, which may be defined as the use
of certain linguistic devices to tie the textual units together. Lexical cohesion, in particular,
is cohesion indicated by the relationship between two or more words in two units,
such as the use of synonyms.

Supervised Discourse Segmentation

The earlier method does not have any hand-labeled segment boundaries. On the other hand,
supervised discourse segmentation needs to have boundary-labeled training data. It is very easy
to acquire the same. In supervised discourse segmentation, discourse marker or cue words play
an important role. Discourse marker or cue word is a word or phrase that functions to signal
discourse structure. These discourse markers are domain-specific.

Text Coherence

Lexical repetition is a way to find structure in a discourse, but by itself it does not satisfy the
requirement of coherent discourse. To achieve coherent discourse, we must focus on coherence
relations specifically. As we know, a coherence relation defines the possible
connection between utterances in a discourse. Hobbs proposed such relations as
follows −

We are taking two terms S0 and S1 to represent the meaning of the two related sentences −

Result

It infers that the state asserted by term S0 could cause the state asserted by S1. For example,
two statements show the relationship result: Ram was caught in the fire. His skin burned.

Explanation

It infers that the state asserted by S1 could cause the state asserted by S0. For example, two
statements show the relationship − Ram fought with Shyam’s friend. He was drunk.

Parallel

It infers p(a1,a2,…) from assertion of S0 and p(b1,b2,…) from assertion S1. Here ai and bi are
similar for all i. For example, two statements are parallel − Ram wanted car. Shyam wanted
money.

Elaboration
It infers the same proposition P from both the assertions − S0 and S1 For example, two
statements show the relation elaboration: Ram was from Chandigarh. Shyam was from Kerala.

Occasion

It happens when a change of state can be inferred from the assertion of S0, final state of which
can be inferred from S1 and vice-versa. For example, the two statements show the relation
occasion: Ram picked up the book. He gave it to Shyam.

Building Hierarchical Discourse Structure

The coherence of entire discourse can also be considered by hierarchical structure between
coherence relations. For example, the following passage can be represented as hierarchical
structure −

• S1 − Ram went to the bank to deposit money.


• S2 − He then took a train to Shyam’s cloth shop.
• S3 − He wanted to buy some clothes.
• S4 − He did not have new clothes for the party.
• S5 − He also wanted to talk to Shyam regarding his health.

Reference Resolution

Interpretation of the sentences from any discourse is another important task and to achieve this
we need to know who or what entity is being talked about. Here, the interpretation of references is
the key element. A reference may be defined as a linguistic expression used to denote an entity or
individual. For example, in the passage Ram, the manager of ABC bank, saw his friend Shyam
at a shop. He went to meet him, the linguistic expressions Ram, his, and he are references.

On the same note, reference resolution may be defined as the task of determining what entities
are referred to by which linguistic expression.

Terminology Used in Reference Resolution

We use the following terminologies in reference resolution −

• Referring expression − The natural language expression that is used to perform


reference is called a referring expression. For example, in the passage used above, Ram, the
manager of ABC bank is a referring expression.
• Referent − It is the entity that is referred. For example, in the last given example Ram
is a referent.
• Corefer − When two expressions are used to refer to the same entity, they are called
corefers. For example, Ram and he are corefers.
• Antecedent − The term that licenses the use of another referring term. For example, Ram is the
antecedent of the reference he.
• Anaphora & Anaphoric − It may be defined as the reference to an entity that has been
previously introduced into the sentence. And, the referring expression is called
anaphoric.
• Discourse model − The model that contains the representations of the entities that have
been referred to in the discourse and the relationships they are engaged in.

Types of Referring Expressions

Let us now see the different types of referring expressions. The five types of referring
expressions are described below −

Indefinite Noun Phrases

Such a reference represents an entity that is new to the hearer in the discourse
context. For example − in the sentence Ram had gone around one day to bring him some food
− some food is an indefinite reference.

Definite Noun Phrases

Opposite to the above, such a reference represents an entity that is not new and is identifiable
by the hearer in the discourse context. For example, in the sentence - I used to read The Times
of India – The Times of India is a definite reference.

Pronouns

It is a form of definite reference. For example, Ram laughed as loud as he could. The
word he represents pronoun referring expression.

Demonstratives
These demonstrate and behave differently than simple definite pronouns. For example, this and
that are demonstrative pronouns.

Names

It is the simplest type of referring expression. It can be the name of a person, an organization,
or a location. For example, in the above examples, Ram is a name referring expression.

Reference Resolution Tasks

The two reference resolution tasks are described below.

Coreference Resolution

It is the task of finding referring expressions in a text that refer to the same entity. In simple
words, it is the task of finding corefer expressions. A set of coreferring expressions is called a
coreference chain. For example, Ram, the manager, he, and his are coreferring expressions
in the first passage given as an example.

Constraint on Coreference Resolution

In English, the main problem for coreference resolution is the pronoun it. The reason behind
this is that the pronoun it has many uses. For example, it can refer to entities much like he and she,
but it can also be used without referring to any specific thing, as in It’s raining or
It is really good.

Pronominal Anaphora Resolution

Unlike coreference resolution, pronominal anaphora resolution may be defined as the task
of finding the antecedent for a single pronoun. For example, given the pronoun his in the passage
above, the task of pronominal anaphora resolution is to find the word Ram, because Ram is the
antecedent.
