NLP - Unit II
Grammars and Parsing- Top-Down and Bottom-Up Parsers, Transition Network Grammars, Feature
Systems and Augmented Grammars, Morphological Analysis and the Lexicon, Parsing with Features,
Augmented Transition Networks, Bayes' Rule, Shannon's Game, Entropy and Cross Entropy.
______________________________________________________________________________
Purpose of Parsing
The primary purpose of parsing is to break down a sentence into its constituent parts (such as nouns,
verbs, adjectives, etc.) and to reconstruct these parts into a parse tree that represents the syntactic
structure of the sentence. This structured representation helps machines understand the
relationships and hierarchies between different elements of the sentence, aiding in more complex
language processing tasks.
Types of Parsing
There are two main types of parsing used in computational linguistics: top-down parsing and bottom-
up parsing.
1. Top-Down Parsing:
o This method begins at the highest level of the parse tree and attempts to match the
input sentence against the predicted components of the grammar. It starts with the
start symbol and applies grammar rules to predict the structure of the sentence until
the entire input is consumed or a mismatch occurs.
2. Bottom-Up Parsing:
o In contrast to top-down parsing, this method starts with the input tokens and
constructs the parse tree by gradually building up to the start symbol. It combines
the smallest units first according to the grammar rules and works its way up to the
overall sentence structure.
To understand how parsing is applied, let's define a set of simplified grammar rules:
• S → NP VP: A sentence is composed of a noun phrase (NP) and a verb phrase (VP).
• NP → N | D ADJ N N: A noun phrase can be a bare noun, or a determiner followed by an adjective, a noun modifier, and a head noun.
• VP → V | V NP | AUX V | AUX V NP: A verb phrase can be a verb, a verb followed by a noun phrase, an auxiliary verb followed by a verb, and so on.
Consider the sentence: "Students are attending the Accenture placement drive."
1. Match words to grammar categories:
o "Students" → NP (from N).
o "are" → AUX.
o "attending" → V.
o "the Accenture placement drive" → NP (from D ADJ N N).
2. Combine to form S:
o S → NP VP: the subject NP and the verb phrase together form a complete sentence.
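The walkthrough above can be automated with a minimal top-down (recursive-descent) parser. The sketch below, in Python, handles a fragment of the toy grammar; the GRAMMAR and LEXICON tables and the word "lectures" are illustrative choices for this sketch, not part of any standard library.

# A minimal top-down (recursive-descent) parser for a fragment of the
# toy grammar above. All tables are illustrative.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["AUX", "V", "NP"], ["V", "NP"], ["V"]],
}
LEXICON = {"students": "N", "are": "AUX", "attending": "V", "lectures": "N"}

def parse(symbol, tokens, pos):
    """Try to expand `symbol` starting at tokens[pos].
    Returns (parse_tree, next_pos) or None on failure."""
    if symbol in LEXICON.values():                # preterminal: match one word
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            return (symbol, tokens[pos]), pos + 1
        return None
    for production in GRAMMAR.get(symbol, []):    # try each rule in turn
        children, cur = [], pos
        for child in production:
            result = parse(child, tokens, cur)
            if result is None:
                break                             # rule fails: try the next one
            subtree, cur = result
            children.append(subtree)
        else:                                     # every child matched
            return (symbol, children), cur
    return None

tokens = "students are attending lectures".split()
result = parse("S", tokens, 0)
if result and result[1] == len(tokens):           # a full parse consumes all tokens
    print(result[0])

A bottom-up parser would instead start from the words themselves, repeatedly replacing matched right-hand sides with their left-hand-side categories until only S remains.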
Transition Network Grammars
A transition network grammar (TNG) represents a grammar as a network of states and labeled transitions rather than as a list of rewrite rules. In TNGs, nodes (or states) represent positions in the parsing process, and directed arcs between these states are labeled with syntactic rules or categories. The parser traverses the network based on the input tokens, eventually reaching a final state if the sentence is grammatical.
Key Components of Transition Network Grammars:
• States (Nodes): Positions in the parsing process, including a designated start state.
• Arcs (Transitions): Directed edges between states, labeled with what must be matched to move along them.
• Categories: Labels on transitions, representing parts of speech (e.g., N for noun, V for verb).
• Final State: The end state in the network signifies a successfully parsed sentence.
Example: parsing "Students are attending the Accenture placement drive" with a TNG:
1. Grammar Rules:
o S → NP VP: A sentence is composed of a noun phrase (NP) and a verb phrase (VP).
o VP → AUX V NP: A verb phrase can be an auxiliary verb followed by a verb and a noun phrase.
o Start with state S, which expects a sentence in the form of NP followed by VP.
2. NP state:
o In this state, we are expecting a noun phrase. We transition to a state where NP can
be satisfied by a noun (N), so we move to a state for nouns.
o N → "Students": This matches the first word, so we move forward to parse the verb
phrase (VP).
3. VP state:
o In the VP state, we need to parse an auxiliary verb followed by a verb (VP → AUX V
NP).
o AUX → "are": Matches the auxiliary verb, and we move to a state that expects a
verb (V).
o V → "attending": Matches the verb, so we proceed to parse the noun phrase (NP).
4. NP for VP:
o The object noun phrase "the Accenture placement drive" is parsed next:
o "the" → D.
o "Accenture" → ADJ.
o "placement" → N.
o "drive" → N.
5. Final state:
o The entire sentence is parsed successfully, and we reach the final state in the TNG, signifying that the sentence is grammatically correct.
The complete category assignment is: "Students" → N, "are" → AUX, "attending" → V, "the" → D, "Accenture" → ADJ, "placement" → N, "drive" → N.
Advantages of Transition Network Grammars:
• They are more suited for natural language processing, where ambiguity and variability in sentence structures are common.
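The traversal just described can be simulated with a small table of states and arcs. The sketch below flattens the NP and VP sub-networks into a single word-level network for brevity; the state names and the tiny lexicon are illustrative choices, not a standard representation.

# A word-level transition network for the example sentence.
# ARCS[state] maps the category of the next word to the following state.
LEXICON = {"students": "N", "are": "AUX", "attending": "V",
           "the": "D", "accenture": "ADJ", "placement": "N", "drive": "N"}

ARCS = {
    "q0": {"N": "q1"},              # subject NP
    "q1": {"AUX": "q2"},            # auxiliary verb
    "q2": {"V": "q3"},              # main verb
    "q3": {"D": "q4", "N": "qF"},   # object NP: determiner phrase or bare noun
    "q4": {"ADJ": "q5", "N": "q6"},
    "q5": {"N": "q6"},
    "q6": {"N": "qF"},
}

def accepts(sentence):
    state = "q0"
    for word in sentence.lower().split():
        category = LEXICON.get(word)
        state = ARCS.get(state, {}).get(category)
        if state is None:
            return False            # no arc for this category: parse fails
    return state == "qF"            # grammatical iff we end in the final state

print(accepts("Students are attending the Accenture placement drive"))  # True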
Feature Systems and Augmented Grammars
Feature systems and augmented grammars extend traditional grammars by associating syntactic
categories with additional information (features) to handle various linguistic properties more
effectively. These approaches are particularly useful in natural language processing (NLP) to deal with
phenomena like agreement (e.g., subject-verb agreement), case marking, tense, and gender.
Feature Systems
A feature system is a formal way to represent linguistic properties such as number, person, gender,
tense, etc., using attributes called features. These features are associated with different syntactic
categories (like nouns, verbs, etc.) to enforce agreement and capture grammatical constraints.
o Example features: Number (SG, PL), Person (1st, 2nd, 3rd), Gender (MASC, FEM, NEUT), and Tense (PAST, PRES, FUT).
The feature system ensures that in a sentence like "Students are attending the lecture," both the
subject "students" (plural) and the verb "are" (plural) agree in number.
Feature Structures
A feature structure is an organized set of feature-value pairs that describe the syntactic,
morphological, and semantic properties of a word or phrase. These structures are typically
represented as attribute-value matrices (AVMs).
For the noun phrase "students":
NP
|-- N: "students"
|-- Number: PL
|-- Person: 3
For the verb phrase "are attending":
VP
|-- AUX: "are"
|-- V: "attending"
|-- Number: PL
|-- Person: 3
Augmented Grammars
An augmented grammar associates feature structures with syntactic categories, and production rules
specify how features combine during parsing.
Let's extend a simple grammar with features for number and person:
• S → NP[Number=?n, Person=?p] VP[Number=?n, Person=?p]
A sentence requires its noun phrase and verb phrase to agree in number and person (the variables ?n and ?p must unify).
• NP → Det N[Number=?n, Person=?p]
The noun determines the number and person for the entire noun phrase.
• VP → AUX[Number=?n, Person=?p] V
A verb phrase consists of an auxiliary verb and a main verb, and the auxiliary must agree in number and person with the noun phrase.
• Det → "the"
Determiners have no features for number or person.
• N → "students"[Number=PL, Person=3]
The noun "students" is plural and third person.
• AUX → "are"[Number=PL, Person=3]
The auxiliary "are" is plural and third person.
• V → "attending"
The main verb has no person or number agreement but follows the auxiliary in tense.
When parsing "The students are attending", the features [Number=PL, Person=3] propagate from the noun "students" to the whole NP, the auxiliary "are" contributes matching features to the VP, and the sentence is successfully parsed since the features match between NP and VP.
Feature Unification
Unification is the process of combining two feature structures to ensure consistency in their values.
If two structures have compatible features, they are merged. If they have conflicting features (e.g.,
one is singular and the other is plural), unification fails, and the sentence is considered
ungrammatical.
Example of Unification:
• "The students are attending" → Unification succeeds because the NP and VP agree in
number (PL) and person (3).
• "The student are attending" → Unification fails because the NP has singular number (SG) and
the VP has plural number (PL), resulting in a grammatical error.
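A minimal sketch of unification over flat feature structures, represented here as plain Python dicts (a simplification: real feature structures can nest and share values):

def unify(fs1, fs2):
    """Unify two flat feature structures (dicts).
    Returns the merged structure, or None if any feature conflicts."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature in result and result[feature] != value:
            return None             # conflicting values: unification fails
        result[feature] = value
    return result

np_sg = {"Number": "SG", "Person": 3}   # "the student"
np_pl = {"Number": "PL", "Person": 3}   # "the students"
vp    = {"Number": "PL", "Person": 3}   # "are attending"

print(unify(np_pl, vp))   # {'Number': 'PL', 'Person': 3}: grammatical
print(unify(np_sg, vp))   # None: "The student are attending" is rejected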
Applications of feature systems and augmented grammars include:
1. Subject-Verb Agreement: Ensures that the subject and verb agree in features like number and person.
2. Case Assignment: Handles case marking (e.g., nominative, accusative) in languages with
complex case systems.
3. Gender Agreement: Useful in languages where adjectives and verbs must agree with the
gender of the noun.
4. Tense and Aspect: Ensures consistency in tense and aspect between auxiliary verbs and main
verbs.
5. Semantic Roles: Captures roles like agent, patient, and theme, helping in more detailed
sentence analysis.
Morphological Analysis and the Lexicon
Morphological analysis is the study of the internal structure of words and the rules for word
formation in a language. It plays a critical role in natural language processing (NLP) by breaking down
words into their smallest units of meaning, called morphemes. These morphemes can be combined
in various ways to generate the complex words used in human language.
The lexicon is a structured database of the vocabulary of a language. It contains words (or lexemes)
along with their grammatical, morphological, and semantic properties. Morphological analysis often
interacts with the lexicon to determine the meaning and form of words, making it a core task in NLP
for applications like speech recognition, machine translation, and information retrieval.
Key concepts in morphology:
1. Morpheme: The smallest unit of meaning in a language.
o Types of morphemes: free morphemes, which can stand alone as words (e.g., "happy", "class"), and bound morphemes, which must attach to other morphemes (e.g., "un-", "-ness").
2. Word Formation:
o Derivation: The process of creating a new word by adding affixes, which may change the word's meaning or part of speech.
o Inflection: Modifying a word to express grammatical features such as tense or number without changing its part of speech (e.g., "jump" → "jumps").
3. Types of Morphology: Inflectional morphology (grammatical variation of a single lexeme) and derivational morphology (creation of new lexemes).
4. Compound Words: Words formed by joining two or more free morphemes (e.g., "classroom").
Key tasks in morphological analysis:
1. Lemmatization: Identifying the base form or dictionary form of a word. For example, lemmatizing "running" would yield the lemma "run."
2. Stemming: Reducing a word to its root form by cutting off affixes. For example, stemming
"jumps" might result in "jump." Stemming is generally more aggressive and less accurate
than lemmatization.
3. Morphological Parsing: Breaking a word down into its morphemes. For instance, "unhappily"
can be parsed into "un-" (prefix), "happy" (root), and "-ly" (suffix).
Examples:
• Word: "running"
o Morpheme breakdown: "run" (root) + "-ing" (suffix marking the progressive).
• Word: "unhappiness"
o Morpheme breakdown: "un-" (prefix for negation) + "happy" (root) + "-ness" (suffix
to form noun)
• Word: "classroom"
o Both "class" and "room" are free morphemes that form a compound noun.
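A toy morphological parser can be built by stripping affixes, undoing simple spelling rules, and checking candidate roots against the lexicon. The affix inventory, spelling rules, and lexicon below are tiny illustrative samples; production systems typically use finite-state transducers over far larger resources.

# Toy morphological parser: strip affixes, undo spelling changes,
# and check candidate roots against a small illustrative lexicon.
PREFIXES = {"un": "NEGATION"}
SUFFIXES = {"ness": "NOUN-FORMING", "ly": "ADVERB-FORMING", "ing": "PROGRESSIVE"}
LEXICON = {"happy", "run", "class", "room"}

def stem_variants(stem):
    """Undo common spelling rules: y -> i ("happi" -> "happy"),
    consonant doubling ("runn" -> "run")."""
    variants = [stem]
    if stem.endswith("i"):
        variants.append(stem[:-1] + "y")
    if len(stem) > 2 and stem[-1] == stem[-2]:
        variants.append(stem[:-1])
    return variants

def parse_word(word):
    """Return a list of (morpheme, role) pairs, or None if unanalyzable."""
    if word in LEXICON:
        return [(word, "ROOT")]
    for prefix, role in PREFIXES.items():
        if word.startswith(prefix):
            rest = parse_word(word[len(prefix):])
            if rest:
                return [(prefix + "-", role)] + rest
    for suffix, role in SUFFIXES.items():
        if word.endswith(suffix):
            for stem in stem_variants(word[:-len(suffix)]):
                rest = parse_word(stem)
                if rest:
                    return rest + [("-" + suffix, role)]
    return None

print(parse_word("unhappiness"))  # [('un-', 'NEGATION'), ('happy', 'ROOT'), ('-ness', 'NOUN-FORMING')]
print(parse_word("running"))      # [('run', 'ROOT'), ('-ing', 'PROGRESSIVE')]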
Lexicon in NLP
The lexicon is the repository that stores the vocabulary of a language, including words and
morphemes. In NLP, the lexicon is essential for understanding word meanings, part-of-speech (POS)
tagging, and mapping between inflected forms of words and their lemmas.
1. Lexical Entries:
o A lexical entry contains the lemma (base form), part of speech (noun, verb,
adjective, etc.), and morphological features (tense, number, gender, etc.).
o For instance, an entry for "dogs" would include the lemma "dog" and features like
[Number=Plural].
2. Lexicon and Morphological Analysis:
o During morphological analysis, a word is looked up in the lexicon to identify its base form, inflections, and any relevant grammatical features.
o For example, the lexicon helps in recognizing that "went" is the past tense of "go," a non-obvious relationship that is critical for correct language processing.
3. Lexicon Structure:
o The lexicon includes not just words but also morphemes (both free and bound),
along with their syntactic and semantic features.
Morphological analysis with the lexicon typically proceeds in the following steps:
1. Word Lookup: The system first looks up a word in the lexicon to retrieve its base form and associated features.
2. Feature Analysis: The features of the word (like tense, number, or case) are determined
through its morphemes.
3. Morphological Decomposition: If the word is not found directly, the system breaks it down
into its constituent morphemes to reconstruct its meaning or part of speech.
4. Lexical Disambiguation: In cases of ambiguity (e.g., "lead" as a noun vs. "lead" as a verb), the
lexicon helps disambiguate based on context or syntactic role.
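The lookup-then-decompose flow can be sketched as follows; the lexical entries and the SUFFIX_RULES table are invented for illustration, not drawn from a real lexical resource.

# Word lookup with morphological decomposition as a fallback.
LEXICON = {
    "dog":  {"lemma": "dog", "pos": "N", "Number": "SG"},
    "go":   {"lemma": "go",  "pos": "V", "Tense": "PRES"},
    "went": {"lemma": "go",  "pos": "V", "Tense": "PAST"},  # irregular form
}
SUFFIX_RULES = {"s": {"pos": "N", "Number": "PL"}}           # toy inflection rule

def analyze(word):
    entry = LEXICON.get(word)
    if entry:                                      # step 1: direct lookup
        return entry
    for suffix, features in SUFFIX_RULES.items():  # step 3: decomposition
        if word.endswith(suffix):
            base = LEXICON.get(word[:-len(suffix)])
            if base and base["pos"] == features["pos"]:
                return {**base, **features}        # base entry plus affix features
    return None                                    # unknown word

print(analyze("went"))  # {'lemma': 'go', 'pos': 'V', 'Tense': 'PAST'}: found directly
print(analyze("dogs"))  # {'lemma': 'dog', 'pos': 'N', 'Number': 'PL'}: decomposed as dog + -s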
Applications of morphological analysis:
1. Spell Checking: Morphological analysis helps spell checkers by recognizing various inflected forms of a word.
2. Machine Translation: Morphological analysis aids in understanding how words change based on grammatical rules, which is crucial for translating between languages with different morphological systems.
3. Part-of-Speech Tagging: Morphological analysis helps assign the correct part of speech to words by examining their form and features.
4. Information Retrieval: Reducing words to their base forms lets search systems match queries against documents that use different inflected forms of the same word.
5. Speech Recognition: Speech systems use morphological analysis to predict word endings or inflected forms based on context, improving the accuracy of transcription.
Parsing with Features
• Unification: During parsing, features from different parts of the sentence are unified,
ensuring agreement between them. For example, a sentence might require subject-verb
agreement in number and person.
Example: In parsing "The students are attending the lecture," the noun phrase (NP) "the students"
has the feature [Number=PL] (plural), and the verb phrase (VP) "are attending" must also have
[Number=PL].
Augmented Grammars: These grammars include not only syntactic rules but also rules for checking features during parsing. The production rules ensure that features are carried and checked during the parsing process, preventing ungrammatical constructions from being accepted.
• NP → Det N[Number=?n, Person=?p]
o The noun determines the number and person for the entire NP.
• VP → AUX[Number=?n, Person=?p] V
o The auxiliary verb and the main verb must agree in number and person with the subject.
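Rules in exactly this style can be executed with NLTK's feature grammars (assuming NLTK is installed); the grammar string below is a toy fragment written for this example.

from nltk.grammar import FeatureGrammar
from nltk.parse import FeatureChartParser

# NUM must unify across NP and VP, enforcing subject-verb agreement.
grammar = FeatureGrammar.fromstring("""
% start S
S -> NP[NUM=?n] VP[NUM=?n]
NP[NUM=?n] -> Det N[NUM=?n]
VP[NUM=?n] -> AUX[NUM=?n] V
Det -> 'the'
N[NUM=pl] -> 'students'
N[NUM=sg] -> 'student'
AUX[NUM=pl] -> 'are'
V -> 'attending'
""")

parser = FeatureChartParser(grammar)
for tree in parser.parse("the students are attending".split()):
    print(tree)            # one parse: agreement succeeds
# "the student are attending" yields no parses: NUM=sg vs NUM=pl fails to unify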
Augmented Transition Networks
An augmented transition network (ATN) extends a transition network with registers and with tests and actions attached to its arcs, so that feature constraints can be checked during parsing. Key components:
• States (Nodes): Positions in the parsing process, as in a basic transition network.
• Arcs (Transitions): Represent movement between states based on grammar rules. Transitions can include tests (conditions on features like number or tense) and actions (to store information or adjust state).
• Registers: Store intermediate results during parsing, such as the subject or verb, and their associated features.
Example:
Let's consider an ATN parsing the sentence "The students are attending":
1. The subject NP "the students" is parsed first; an action stores it in a register together with its features [Number=PL, Person=3].
2. The arc for the auxiliary "are" carries a test requiring its number and person to match the stored subject features; the test succeeds.
3. "attending" is matched as the main verb, and the network reaches its final state: the parse succeeds.
ATNs also allow for non-determinism, meaning multiple transitions can be tried if ambiguity arises (e.g., parsing "lead" as a noun or a verb).
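This trace can be sketched in a few lines of Python; the state names, register names, and lexicon below are illustrative, and a real ATN would use recursive sub-networks rather than one flat loop.

# Minimal ATN-style parse: arcs carry tests on features and actions
# that fill registers. Everything here is an illustrative toy.
LEXICON = {"the": ("Det", {}), "students": ("N", {"Number": "PL"}),
           "are": ("AUX", {"Number": "PL"}), "attending": ("V", {})}

def parse(tokens):
    registers = {}                      # filled by actions on the arcs
    state = "S0"
    for word in tokens:
        cat, feats = LEXICON[word]
        if state == "S0" and cat == "Det":
            state = "S1"                # inside the subject NP
        elif state == "S1" and cat == "N":
            registers["SUBJ"] = word    # action: store the subject
            registers["SUBJ_FEATS"] = feats
            state = "S2"
        elif state == "S2" and cat == "AUX":
            # test: the auxiliary must agree with the stored subject
            if feats.get("Number") != registers["SUBJ_FEATS"].get("Number"):
                return None             # test fails: ungrammatical
            state = "S3"
        elif state == "S3" and cat == "V":
            registers["VERB"] = word    # action: store the main verb
            state = "FINAL"
        else:
            return None
    return registers if state == "FINAL" else None

print(parse("the students are attending".split()))
# {'SUBJ': 'students', 'SUBJ_FEATS': {'Number': 'PL'}, 'VERB': 'attending'}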
Bayes' Rule
Bayes' Rule relates the probability of a hypothesis H given evidence E to the probability of the evidence given the hypothesis:
P(H | E) = P(E | H) × P(H) / P(E)
Example:
In word-sense disambiguation, we use Bayes' Rule to determine the most likely meaning of a word given its context. If H represents the hypothesis that the word "bank" refers to a financial institution, and E represents the surrounding words in a sentence, Bayes' Rule helps calculate the probability that "bank" means a financial institution given the surrounding context.
Shannon's Game
Shannon's Game is based on Claude Shannon’s idea of predictive text generation. The goal is to
predict the next word or letter in a sequence based on prior text, using probabilistic models of
language.
Shannon's work introduced the idea of language as an information source with a certain amount of
uncertainty (entropy). In the game, participants try to guess the next letter or word in a sentence,
and the difficulty of the guess reflects the entropy of the language.
Example:
If given the sentence fragment: "The cat sat on the ...", Shannon's Game would predict the next
word using a probabilistic model of language. The word "mat" might have a high probability because
it frequently follows that phrase in English.
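A bare-bones version of the game can be played with bigram counts; the tiny training text below is an invented sample.

from collections import Counter, defaultdict

# Predict the next word from bigram counts, in the spirit of Shannon's game.
text = ("the cat sat on the mat . the dog sat on the rug . "
        "the cat slept on the mat .").split()

bigrams = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    bigrams[prev][nxt] += 1

def guess_next(word):
    """Return candidate next words, most probable first."""
    counts = bigrams[word]
    total = sum(counts.values())
    return [(w, c / total) for w, c in counts.most_common()]

print(guess_next("the"))
# [('cat', 0.33...), ('mat', 0.33...), ('dog', 0.16...), ('rug', 0.16...)]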
Entropy
Entropy is a measure of the uncertainty or unpredictability in a system. In NLP, entropy quantifies the
amount of information or the degree of surprise associated with a set of linguistic data. Lower
entropy means that the next word or letter is more predictable, while higher entropy means it is
more uncertain.
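Numerically, using the standard definition H(X) = -Σ p(x) log2 p(x), a minimal sketch:

import math

def entropy(dist):
    """Entropy in bits of a probability distribution given as a list."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair coin is maximally uncertain
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin is more predictable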
Cross Entropy
Cross entropy measures the difference between two probability distributions: the true distribution
of the data and a predicted distribution. In NLP, cross entropy is used to evaluate the performance of
models like language models or classifiers by measuring how well they predict the true distribution
of words or phrases.
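The same sketch extends to cross entropy, H(p, q) = -Σ p(x) log2 q(x); the "true" distribution and the two candidate models below are invented for illustration.

import math

def cross_entropy(p, q):
    """Average bits needed to encode data from p using model q."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

true_dist  = [0.5, 0.3, 0.2]    # illustrative "true" word distribution
good_model = [0.5, 0.3, 0.2]    # matches the true distribution
poor_model = [0.1, 0.1, 0.8]    # badly miscalibrated model

print(cross_entropy(true_dist, good_model))  # ~1.485 bits = H(p): the minimum
print(cross_entropy(true_dist, poor_model))  # ~2.722 bits: the worse model pays more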
Applications in NLP:
• Entropy: Used in evaluating the uncertainty in predicting the next word or character in a sentence.
• Cross Entropy: Used in machine learning models, particularly for evaluating classification tasks
such as word-sense disambiguation, POS tagging, or translation.