
UNIT II: Grammars and Parsing (Lecture: 9 Hrs)

Grammars and Parsing- Top-Down and Bottom-Up Parsers, Transition Network Grammars, Feature
Systems and Augmented Grammars, Morphological Analysis and the Lexicon, Parsing with Features,
Augmented Transition Networks, Bayes' Rule, Shannon Game, Entropy and Cross Entropy.

______________________________________________________________________________

Grammars and Parsing


Parsing is a fundamental process in natural language processing (NLP) where the goal is to analyze
sentences according to the rules of a formal grammar. This analysis helps in understanding the
grammatical structure of sentences, which is essential for many NLP applications such as machine
translation, speech recognition, and syntactic analysis.

Purpose of Parsing

The primary purpose of parsing is to break down a sentence into its constituent parts (such as nouns,
verbs, adjectives, etc.) and to reconstruct these parts into a parse tree that represents the syntactic
structure of the sentence. This structured representation helps machines understand the
relationships and hierarchies between different elements of the sentence, aiding in more complex
language processing tasks.

Types of Parsing

There are two main types of parsing used in computational linguistics: top-down parsing and bottom-up parsing.

1. Top-Down Parsing:

o This method begins at the highest level of the parse tree and attempts to match the
input sentence against the predicted components of the grammar. It starts with the
start symbol and applies grammar rules to predict the structure of the sentence until
the entire input is consumed or a mismatch occurs.

2. Bottom-Up Parsing:

o Opposite to top-down parsing, this method starts with the input tokens and
constructs the parse tree by gradually building up to the start symbol. It combines
the smallest units first according to the grammar rules and works its way up to the
overall sentence structure.

Grammar Rules for Parsing

To understand how parsing is applied, let's define a set of simplified grammar rules:

• S → NP VP: A sentence is typically composed of a noun phrase followed by a verb phrase.

• NP → N | D N | D ADJ N | D N N | D ADJ N N: A noun phrase can be a single noun, a determiner followed by a noun, etc.

• VP → V | V NP | AUX V | AUX V NP: A verb phrase can be a verb, a verb followed by a noun
phrase, an auxiliary verb followed by a verb, etc.

• D → "the" | "a": Determiners include "the" and "a."


• N → "Students" | "Accenture" | "placement" | "drive": Specific nouns in the example
sentence.

• V → "attending": The main verb in our example.

• AUX → "are": The auxiliary verb in our example.

• ADJ → "Accenture": Used as an adjective in this context.

Example Sentence for Parsing

Consider the sentence: "Students are attending the Accenture placement drive."

Top-Down Parsing Example:

1. Start with S and predict S → NP VP.

2. Break down into NP and VP:

o NP → "Students".

o VP → "are attending the Accenture placement drive".

3. Further analyze VP using AUX V NP:

o AUX → "are".

o V → "attending".

o NP → "the Accenture placement drive" (D ADJ N N).

Bottom-Up Parsing Example:

1. Identify parts of speech for each word and start building:

o "Students" → NP from N.

o "are attending" → VP from AUX V.

o "the Accenture placement drive" → NP from D ADJ N N.

2. Combine to form S:

o S → NP VP.
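The two walkthroughs above can be reproduced in code. Below is a minimal sketch using the NLTK library (an assumption on my part; any CFG toolkit would do). The grammar string transcribes the rules listed earlier; RecursiveDescentParser parses top-down, while BottomUpChartParser builds constituents bottom-up.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> N | D N | D ADJ N | D N N | D ADJ N N
VP -> V | V NP | AUX V | AUX V NP
D -> 'the' | 'a'
N -> 'Students' | 'Accenture' | 'placement' | 'drive'
V -> 'attending'
AUX -> 'are'
ADJ -> 'Accenture'
""")

sentence = "Students are attending the Accenture placement drive".split()

# Top-down: expand from S and match predictions against the input.
for tree in nltk.RecursiveDescentParser(grammar).parse(sentence):
    tree.pretty_print()

# Bottom-up: combine input tokens into larger constituents up to S.
for tree in nltk.BottomUpChartParser(grammar).parse(sentence):
    print(tree)

Both parsers recover the same structure, S → NP(N) VP(AUX V NP(D ADJ N N)); they differ only in the direction in which the tree is built.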

Transition Network Grammars

Transition Network Grammars (TNGs) are a form of grammar used in computational linguistics that involve finite state automata with transitions labeled by grammar rules. These networks
define a path through a sentence, determining how words or sequences of words fit into a syntactic
structure. TNGs are more flexible than context-free grammars because they allow non-deterministic
transitions, making them useful for parsing sentences in natural language. The primary purpose of a
TNG is to parse natural language sentences by modeling the syntactic structure as a network of states,
with each state corresponding to a part of the sentence. Parsing using TNGs involves moving from state
to state based on the input tokens (words), with transitions representing the application of grammar
rules.

In TNGs, nodes (or states) represent positions in the parsing process, and directed arcs between
these states are labeled with syntactic rules or categories. The parser traverses the network based on
the input tokens, eventually reaching a final state if the sentence is grammatical.
Key Components of Transition Network Grammars:

• States: Represent positions in the parsing process.

• Transitions: Directed arcs between states labeled with grammar rules.

• Categories: Labels on transitions, representing parts of speech (e.g., N for noun, V for verb).

• Final State: The end state in the network signifies a successfully parsed sentence.

Example Transition Network Grammar:

Let’s construct a simple TNG to parse the sentence:


"Students are attending the Accenture placement drive."

1. Grammar Rules:

o S → NP VP: A sentence is composed of a noun phrase (NP) and a verb phrase (VP).

o NP → N | D N | D ADJ N N: A noun phrase can be a noun, a determiner with a noun, or a determiner with adjectives and nouns.

o VP → AUX V NP: A verb phrase can be an auxiliary verb followed by a verb and a
noun phrase.

o D → "the": Determiners include "the."

o N → "Students" | "Accenture" | "placement" | "drive": Nouns in our sentence.

o V → "attending": Verb in our sentence.

o AUX → "are": Auxiliary verb in our sentence.

o ADJ → "Accenture": Used as an adjective in the sentence.

Top-Down Parsing Using TNG:

1. Initial state (S):

o Start with state S, which expects a sentence in the form of NP and VP.

o Transition to a new state that breaks S into NP and VP (S → NP VP).

2. NP state:

o In this state, we are expecting a noun phrase. We transition to a state where NP can
be satisfied by a noun (N), so we move to a state for nouns.

o N → "Students": This matches the first word, so we move forward to parse the verb
phrase (VP).

3. VP state:

o In the VP state, we need to parse an auxiliary verb followed by a verb (VP → AUX V
NP).

o AUX → "are": Matches the auxiliary verb, and we move to a state that expects a
verb (V).
o V → "attending": Matches the verb, so we proceed to parse the noun phrase (NP).

4. NP for VP:

o In this state, the NP expected by the VP is of the form D ADJ N N.

o D → "the": Matches the determiner.

o ADJ → "Accenture": Matches the adjective.

o N → "placement": Matches the first noun.

o N → "drive": Matches the second noun.

5. Final state:

o At this point, we have successfully parsed the sentence and reach the final state,
signifying that the sentence is grammatically correct.

Bottom-Up Parsing Using TNG:

1. Start with input tokens:

o "Students are attending the Accenture placement drive."

2. Identify the smallest units first:

o "Students" → N.

o "are" → AUX.

o "attending" → V.

o "the" → D.

o "Accenture" → ADJ.

o "placement" → N.

o "drive" → N.

3. Build up to larger constituents:

o Combine "the Accenture placement drive" → NP (D ADJ N N).

o Combine "are attending the Accenture placement drive" → VP (AUX V NP).

o Combine "Students" with the VP to form the complete sentence:


S → NP VP.

4. Final state:

o The entire sentence is parsed successfully, and we reach the final state in the TNG.
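As a rough sketch, the traversal above can be modeled as a dictionary of states with category-labeled arcs. The code below is illustrative only: the state names q0 to q7 and the flattened single-chain network are my simplification, whereas a real TNG would use separate sub-networks for NP and VP.

LEXICON = {
    "students": "N", "are": "AUX", "attending": "V",
    "the": "D", "accenture": "ADJ", "placement": "N", "drive": "N",
}

# One linear chain of states for S -> N AUX V D ADJ N N.
NETWORK = {
    "q0": {"N": "q1"},      # subject noun
    "q1": {"AUX": "q2"},    # auxiliary verb
    "q2": {"V": "q3"},      # main verb
    "q3": {"D": "q4"},      # determiner of the object NP
    "q4": {"ADJ": "q5"},    # adjective
    "q5": {"N": "q6"},      # first noun
    "q6": {"N": "q7"},      # second noun; q7 is the final state
}

def recognize(tokens, state="q0", final="q7"):
    for word in tokens:
        category = LEXICON.get(word.lower())
        if category not in NETWORK.get(state, {}):
            return False    # no arc for this category: parse fails
        state = NETWORK[state][category]
    return state == final

print(recognize("Students are attending the Accenture placement drive".split()))
# True: the traversal ends in the final state q7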

Advantages of TNG in Parsing:

• TNGs provide flexibility in parsing sentences by allowing non-deterministic transitions.

• They are more suited for natural language processing where ambiguity and variability in
sentence structures are common.
Feature Systems and Augmented Grammars

Feature systems and augmented grammars extend traditional grammars by associating syntactic
categories with additional information (features) to handle various linguistic properties more
effectively. These approaches are particularly useful in natural language processing (NLP) to deal with
phenomena like agreement (e.g., subject-verb agreement), case marking, tense, and gender.

Feature Systems

A feature system is a formal way to represent linguistic properties such as number, person, gender,
tense, etc., using attributes called features. These features are associated with different syntactic
categories (like nouns, verbs, etc.) to enforce agreement and capture grammatical constraints.

• Features: Pairs of attribute-value combinations that add descriptive properties to grammar symbols.

o Example features:

▪ Number: Singular (SG) or Plural (PL)

▪ Person: First (1), Second (2), or Third (3)

▪ Gender: Masculine (M), Feminine (F), or Neuter (N)

▪ Tense: Past, Present, Future

▪ Case: Nominative, Accusative, Dative, etc.

Example of Feature Representation:

• A noun like "students" might have features:


N[Number=PL, Person=3]

• A verb like "are" might have features:


V[AUX, Number=PL, Tense=Present]

The feature system ensures that in a sentence like "Students are attending the lecture," both the
subject "students" (plural) and the verb "are" (plural) agree in number.

Feature Structures

A feature structure is an organized set of feature-value pairs that describe the syntactic,
morphological, and semantic properties of a word or phrase. These structures are typically
represented as attribute-value matrices (AVMs).

Example Feature Structure:

For the noun phrase "the students":

NP

|-- Det: "the"

|-- N: "students"

|-- Number: PL

|-- Person: 3
For the verb phrase "are attending":

VP

|-- AUX: "are"

|-- Number: PL

|-- Tense: Present

|-- V: "attending"

|-- Aspect: Progressive

Augmented Grammars

Augmented grammars extend traditional context-free grammars (CFGs) by incorporating features


into the production rules. These features help enforce constraints like agreement between different
parts of a sentence (e.g., subject-verb agreement). By adding features, augmented grammars
become more powerful and can handle natural language more effectively than simple CFGs.

An augmented grammar associates feature structures with syntactic categories, and production rules
specify how features combine during parsing.

Example Augmented Grammar Rules:

Let’s extend a simple grammar with features for number and person:

• S → NP[Number=?n, Person=?p] VP[Number=?n, Person=?p]
A sentence consists of a noun phrase (NP) and a verb phrase (VP) that must agree in number and person.

• NP → Det N[Number=?n, Person=?p]
A noun phrase consists of a determiner and a noun, and the number and person features of the NP are derived from the noun.

• VP → AUX[Number=?n, Person=?p] V
A verb phrase consists of an auxiliary verb and a main verb, and the auxiliary must agree in
number and person with the noun phrase.

• Det → "the"
Determiners have no features for number or person.

• N → "students"[Number=PL, Person=3]
The noun "students" is plural and third person.

• AUX → "are"[Number=PL, Person=3, Tense=Present]


The auxiliary verb "are" is plural, third person, and present tense.

• V → "attending"
The main verb has no person or number agreement but follows the auxiliary in tense.

Example Sentence Parsing Using Augmented Grammar:

Consider the sentence "The students are attending the lecture."

1. Start with the production rule:


o S → NP[Number=?n, Person=?p] VP[Number=?n, Person=?p]

2. NP → Det N[Number=?n, Person=?p]:

o "The students" → Det="the" and N="students"[Number=PL, Person=3].

o The features of NP are Number=PL and Person=3.

3. VP → AUX[Number=PL, Person=3, Tense=Present] V:

o "Are attending" → AUX="are"[Number=PL, Person=3, Tense=Present],


V="attending".

4. The sentence is successfully parsed since the features match between NP and VP.
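A hedged sketch of these rules as an NLTK feature grammar follows. Assumptions: NLTK is available, NUM/PER abbreviate the Number/Person features used above, and an entry for the singular "student" is added so the failure case can be demonstrated.

import nltk

fg = nltk.grammar.FeatureGrammar.fromstring("""
S -> NP[NUM=?n, PER=?p] VP[NUM=?n, PER=?p]
NP[NUM=?n, PER=?p] -> Det N[NUM=?n, PER=?p]
VP[NUM=?n, PER=?p] -> AUX[NUM=?n, PER=?p] V
Det -> 'the'
N[NUM=pl, PER=3] -> 'students'
N[NUM=sg, PER=3] -> 'student'
AUX[NUM=pl, PER=3] -> 'are'
V -> 'attending'
""")

parser = nltk.FeatureChartParser(fg)

print(list(parser.parse("the students are attending".split())))
# one tree: the features unify throughout

print(list(parser.parse("the student are attending".split())))
# []: NUM=sg on the NP conflicts with NUM=pl on the VP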

Feature Unification

Unification is the process of combining two feature structures to ensure consistency in their values.
If two structures have compatible features, they are merged. If they have conflicting features (e.g.,
one is singular and the other is plural), unification fails, and the sentence is considered
ungrammatical.

Example of Unification:

• "The students are attending" → Unification succeeds because the NP and VP agree in
number (PL) and person (3).

• "The student are attending" → Unification fails because the NP has singular number (SG) and
the VP has plural number (PL), resulting in a grammatical error.
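A minimal sketch of unification over flat feature dictionaries is shown below. The function name and the flat dict representation are illustrative; real systems unify nested feature structures.

def unify(f1, f2):
    """Merge two feature dicts; return None if any value conflicts."""
    result = dict(f1)
    for feature, value in f2.items():
        if feature in result and result[feature] != value:
            return None               # e.g. Number=SG vs Number=PL
        result[feature] = value
    return result

np_features = {"Number": "PL", "Person": 3}      # "the students"
vp_features = {"Number": "PL", "Person": 3}      # "are attending"
print(unify(np_features, vp_features))           # {'Number': 'PL', 'Person': 3}

bad_np = {"Number": "SG", "Person": 3}           # "the student"
print(unify(bad_np, vp_features))                # None: ungrammatical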

Applications of Feature Systems and Augmented Grammars

1. Subject-Verb Agreement: Ensures that the subject and verb agree in features like number
and person.

2. Case Assignment: Handles case marking (e.g., nominative, accusative) in languages with
complex case systems.

3. Gender Agreement: Useful in languages where adjectives and verbs must agree with the
gender of the noun.

4. Tense and Aspect: Ensures consistency in tense and aspect between auxiliary verbs and main
verbs.

5. Semantic Roles: Captures roles like agent, patient, and theme, helping in more detailed
sentence analysis.

Morphological Analysis and the Lexicon

Morphological analysis is the study of the internal structure of words and the rules for word
formation in a language. It plays a critical role in natural language processing (NLP) by breaking down
words into their smallest units of meaning, called morphemes. These morphemes can be combined
in various ways to generate the complex words used in human language.

The lexicon is a structured database of the vocabulary of a language. It contains words (or lexemes)
along with their grammatical, morphological, and semantic properties. Morphological analysis often
interacts with the lexicon to determine the meaning and form of words, making it a core task in NLP
for applications like speech recognition, machine translation, and information retrieval.

Key Concepts in Morphological Analysis

1. Morpheme:

o The smallest grammatical unit in a language that has meaning.

o Types of morphemes:

▪ Free morphemes: Can stand alone as words (e.g., "book", "run").

▪ Bound morphemes: Must attach to other morphemes and cannot stand alone (e.g., prefixes like "un-", suffixes like "-ed").

2. Word Formation:

o Inflection: The process of modifying a word to express different grammatical categories like tense, number, or case. This does not change the word’s base meaning.

▪ Example: "run" → "running" (progressive tense)

▪ Example: "book" → "books" (plural form)

o Derivation: The process of creating a new word by adding affixes, which may change
the word’s meaning or part of speech.

▪ Example: "happy" → "unhappy" (prefix changes meaning)

▪ Example: "develop" → "development" (suffix changes verb to noun)

3. Types of Morphology:

o Inflectional Morphology: Focuses on changes within a word to express grammatical relationships, without changing the word's category (e.g., tense, number, or gender).

o Derivational Morphology: Involves forming new words by adding morphemes, changing the meaning or category of the original word (e.g., from a noun to a verb).

4. Compound Words:

o Words formed by combining two or more morphemes or words.

▪ Example: "classroom" (class + room), "toothpaste" (tooth + paste).

Morphological Analysis in NLP


In natural language processing, morphological analysis is necessary to process various forms of
words and understand their root meanings. The primary goal is to identify the base form of a word
(lemma) and its inflected or derived forms.

Key Tasks in Morphological Analysis:

1. Lemmatization: Identifying the base form or dictionary form of a word. For example,
lemmatizing "running" would yield the lemma "run."
2. Stemming: Reducing a word to its root form by cutting off affixes. For example, stemming
"jumps" might result in "jump." Stemming is generally more aggressive and less accurate
than lemmatization.

3. Morphological Parsing: Breaking a word down into its morphemes. For instance, "unhappily"
can be parsed into "un-" (prefix), "happy" (root), and "-ly" (suffix).
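The first two tasks can be tried out with NLTK (an assumption; the WordNet data must be downloaded first with nltk.download('wordnet')):

from nltk.stem import WordNetLemmatizer, PorterStemmer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# Lemmatization: dictionary lookup guided by part of speech.
print(lemmatizer.lemmatize("running", pos="v"))   # 'run'

# Stemming: rule-based affix stripping, cruder than lemmatization.
print(stemmer.stem("jumps"))                      # 'jump'
print(stemmer.stem("unhappily"))                  # a rough stem such as 'unhappili'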

Morphological Analysis Examples

Inflectional Morphology Example:

• Word: "running"

o Root (lemma): "run"

o Morpheme breakdown: "run" (root) + "-ing" (suffix for progressive tense)

Derivational Morphology Example:

• Word: "unhappiness"

o Root (lemma): "happy"

o Morpheme breakdown: "un-" (prefix for negation) + "happy" (root) + "-ness" (suffix
to form noun)

Compound Word Example:

• Word: "classroom"

o Morpheme breakdown: "class" (noun) + "room" (noun)

o Both "class" and "room" are free morphemes that form a compound noun.

Lexicon in NLP
The lexicon is the repository that stores the vocabulary of a language, including words and
morphemes. In NLP, the lexicon is essential for understanding word meanings, part-of-speech (POS)
tagging, and mapping between inflected forms of words and their lemmas.

1. Lexical Entries:

o A lexical entry contains the lemma (base form), part of speech (noun, verb,
adjective, etc.), and morphological features (tense, number, gender, etc.).

o For instance, an entry for "dogs" would include the lemma "dog" and features like
[Number=Plural].

2. Role of the Lexicon in Morphological Analysis:

o During morphological analysis, a word is looked up in the lexicon to identify its base
form, inflections, and any relevant grammatical features.

o For example, the lexicon helps in recognizing that "went" is the past tense of "go," a
non-obvious relationship that is critical for correct language processing.

3. Lexicon Structure:
o The lexicon includes not just words but also morphemes (both free and bound),
along with their syntactic and semantic features.

o A robust lexicon supports multilinguality, morphological complexity, and semantic disambiguation by providing additional features like word senses, verb conjugations, and noun declensions.

Morphological Parsing and the Lexicon Interaction

1. Word Lookup: The system first looks up a word in the lexicon to retrieve its base form and
associated features.

2. Feature Analysis: The features of the word (like tense, number, or case) are determined
through its morphemes.

3. Morphological Decomposition: If the word is not found directly, the system breaks it down
into its constituent morphemes to reconstruct its meaning or part of speech.

4. Lexical Disambiguation: In cases of ambiguity (e.g., "lead" as a noun vs. "lead" as a verb), the
lexicon helps disambiguate based on context or syntactic role.
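The interaction above can be sketched with a toy lexicon. The dictionary layout and the naive plural-stripping fallback are my own illustrative assumptions, not a standard format.

LEXICON = {
    "dog":  {"lemma": "dog", "pos": "N", "Number": "SG"},
    "go":   {"lemma": "go",  "pos": "V", "Tense": "Present"},
    "went": {"lemma": "go",  "pos": "V", "Tense": "Past"},   # irregular form
}

def lookup(word):
    # Step 1: direct word lookup.
    if word in LEXICON:
        return LEXICON[word]
    # Step 3: crude morphological decomposition (strip a plural "-s").
    if word.endswith("s") and word[:-1] in LEXICON:
        entry = dict(LEXICON[word[:-1]])
        entry["Number"] = "PL"
        return entry
    return None

print(lookup("went"))   # {'lemma': 'go', 'pos': 'V', 'Tense': 'Past'}
print(lookup("dogs"))   # {'lemma': 'dog', 'pos': 'N', 'Number': 'PL'}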

Applications of Morphological Analysis and the Lexicon

1. Spell Checking: Morphological analysis helps spell checkers by recognizing various inflected
forms of a word.

2. Machine Translation: Morphological analysis aids in understanding how words change based
on grammatical rules, crucial for translating between languages with different morphological
rules.

3. Part-of-Speech Tagging: Morphological analysis helps assign the correct part of speech to
words by examining their form and features.

4. Information Retrieval: Morphological analysis ensures that searches retrieve relevant documents by matching different forms of a word, e.g., searching for "run" also retrieves "running" and "ran."

5. Speech Recognition: Speech systems use morphological analysis to predict word endings or
inflected forms based on context, improving the accuracy of transcription.

Parsing with Features


Parsing with features involves extending traditional syntactic parsing to account for additional
grammatical information like number, gender, tense, and agreement. These features are essential in
ensuring that the parsed sentences are both syntactically and grammatically correct. The process
involves attaching feature structures (also known as attribute-value pairs) to syntactic categories and
production rules in a grammar.

Features in Parsing:

• Unification: During parsing, features from different parts of the sentence are unified,
ensuring agreement between them. For example, a sentence might require subject-verb
agreement in number and person.
Example: In parsing "The students are attending the lecture," the noun phrase (NP) "the students"
has the feature [Number=PL] (plural), and the verb phrase (VP) "are attending" must also have
[Number=PL].

Augmented Grammars: These grammars include not only syntactic rules but also rules for checking
features during parsing. The production rules ensure that the features are carried and checked
during the parsing process, preventing ungrammatical constructions from being accepted.

Parsing Example with Features:

• S → NP[Number=?n, Person=?p] VP[Number=?n, Person=?p]

o Both NP and VP must agree in number and person.

• NP → Det N[Number=?n, Person=?p]

o The noun determines the number and person for the entire NP.

• VP → AUX[Number=?n, Person=?p] V

o The auxiliary verb and the main verb must agree in number and person with the
subject.

Augmented Transition Networks (ATNs)


Augmented Transition Networks (ATNs) extend the transition networks described above for parsing natural language. They augment the basic finite-state machine with tests and actions on the transitions between states. This allows ATNs to handle recursion, feature agreement, and non-determinism that simple finite automata cannot express.

Key Features of ATNs:

• States: Represent parts of the sentence that are being processed.

• Arcs (Transitions): Represent movement between states based on grammar rules. Transitions
can include tests (conditions on features like number or tense) and actions (to store
information or adjust state).

• Registers: Store intermediate results during parsing, such as subject or verb, and their
associated features.

Example:

Let’s consider an ATN parsing the sentence "The students are attending":

1. State S: Initial state, expects a subject and a verb phrase.

o Transition to NP state (for the subject noun phrase).

2. State NP: Handles noun phrase parsing.

o "The students" is accepted as an NP. Number is set to PL.

o Transition to VP state (for the verb phrase).

3. State VP: Handles the verb phrase.


o "Are attending" is accepted as the verb phrase. The number (PL) agrees with the
subject NP.

ATNs allow for non-determinism, meaning multiple transitions can be tried if ambiguity arises (e.g.,
parsing "lead" as a noun or a verb).

Bayes' Rule

Bayes' Rule relates the probability of a hypothesis H given evidence E to the probability of the evidence given the hypothesis:

P(H|E) = P(E|H) × P(H) / P(E)

Example:

In word-sense disambiguation, we use Bayes' Rule to determine the most likely meaning of a word given its context.

If H represents the hypothesis that the word "bank" refers to a financial institution, and E represents the surrounding words in a sentence, Bayes' Rule helps calculate the probability that "bank" means a financial institution given the surrounding context.

Shannon's Game

Shannon's Game is based on Claude Shannon’s idea of predictive text generation. The goal is to
predict the next word or letter in a sequence based on prior text, using probabilistic models of
language.

Shannon's work introduced the idea of language as an information source with a certain amount of
uncertainty (entropy). In the game, participants try to guess the next letter or word in a sentence,
and the difficulty of the guess reflects the entropy of the language.

Example:

If given the sentence fragment: "The cat sat on the ...", Shannon's Game would predict the next
word using a probabilistic model of language. The word "mat" might have a high probability because
it frequently follows that phrase in English.
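A toy version of the game can be built from bigram counts; the tiny corpus below is invented for illustration.

from collections import Counter, defaultdict

corpus = "the cat sat on the mat the dog sat on the mat the cat ran".split()

bigrams = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams[w1][w2] += 1

def predict_next(word):
    """Estimate P(next word | word) by relative frequency."""
    counts = bigrams[word]
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

print(predict_next("the"))   # 'cat' and 'mat' are the most probable guesses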

Entropy
Entropy is a measure of the uncertainty or unpredictability in a system. In NLP, entropy quantifies the
amount of information or the degree of surprise associated with a set of linguistic data. Lower
entropy means that the next word or letter is more predictable, while higher entropy means it is
more uncertain.

Cross Entropy

Cross entropy measures the difference between two probability distributions: the true distribution
of the data and a predicted distribution. In NLP, cross entropy is used to evaluate the performance of
models like language models or classifiers by measuring how well they predict the true distribution
of words or phrases.
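Both quantities can be computed directly from their definitions, H(p) = -Σ p(x) log2 p(x) and H(p, q) = -Σ p(x) log2 q(x); the distributions below are invented for illustration.

import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def cross_entropy(p, q):
    return -sum(px * math.log2(q[x]) for x, px in p.items() if px > 0)

# "True" next-word distribution after "The cat sat on the ..."
p = {"mat": 0.5, "floor": 0.3, "roof": 0.2}
# A language model's predicted distribution.
q = {"mat": 0.4, "floor": 0.4, "roof": 0.2}

print(round(entropy(p), 3))           # 1.485 bits
print(round(cross_entropy(p, q), 3))  # 1.522 bits; always >= H(p),
                                      # equal only when q matches p exactly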

Application of Entropy and Cross Entropy in NLP:

• Entropy: Used in evaluating the uncertainty in predicting the next word or character in a
sentence.

• Cross Entropy: Used in machine learning models, particularly for evaluating classification tasks
such as word-sense disambiguation, POS tagging, or translation.
