NLP QB
Q1. Stages of NLP
Ans.
• Morphological Analysis
o Morphological analysis is the first phase of NLP, focusing on identifying morphemes, the smallest units
of a word that carry meaning and cannot be further divided. Understanding morphemes is vital for grasping
the structure of words and their relationships.
o Types of Morphemes
▪ Free Morphemes: Text elements that carry meaning independently and make sense on their own. For
example, “bat” is a free morpheme.
▪ Bound Morphemes: Elements that must be attached to free morphemes to convey meaning, as they
cannot stand alone. For instance, the suffix “-ing” is a bound morpheme, needing to be attached to a free
morpheme like “run” to form “running.”
o Importance of Morphological Analysis: Morphological analysis is crucial in NLP for several reasons:
▪ Understanding Word Structure: It helps in deciphering the composition of complex words.
▪ Predicting Word Forms: It aids in anticipating different forms of a word based on its morphemes.
▪ Improving Accuracy: It enhances the accuracy of tasks such as part-of-speech tagging, syntactic parsing,
and machine translation.
o By identifying and analyzing morphemes, the system can interpret text correctly at the most fundamental
level, laying the groundwork for more advanced NLP applications.
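As a minimal sketch of this idea, a word can be split into a free morpheme and a bound morpheme by checking it against a lexicon and suffix list (both invented here purely for illustration):

```python
# Toy morpheme segmentation: split a word into a free morpheme (stem)
# and a bound morpheme (suffix). Lexicon and suffixes are illustrative.

LEXICON = {"run", "bat", "walk", "happy"}      # free morphemes
SUFFIXES = ["ing", "ed", "ness", "s"]          # bound morphemes

def segment(word):
    """Return (free_morpheme, bound_morpheme); bound part is None
    when the word is itself a free morpheme or cannot be decomposed."""
    if word in LEXICON:
        return (word, None)
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        stem = word[: -len(suffix)]
        if stem in LEXICON:
            return (stem, "-" + suffix)
        # handle consonant doubling, e.g. "running" -> "run" + "-ing"
        if len(stem) >= 2 and stem[-1] == stem[-2] and stem[:-1] in LEXICON:
            return (stem[:-1], "-" + suffix)
    return (word, None)

print(segment("bat"))      # ('bat', None)  -- free morpheme on its own
print(segment("running"))  # ('run', '-ing')
print(segment("walked"))   # ('walk', '-ed')
```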
• Syntactic Analysis (Parsing)
o Syntactic analysis, also known as parsing, is the second phase of Natural Language Processing (NLP). This
phase is essential for understanding the structure of a sentence and assessing its grammatical correctness. It
involves analyzing the relationships between words and ensuring their logical consistency by comparing their
arrangement against standard grammatical rules.
o Parsing examines the grammatical structure and relationships within a given text. It assigns Parts-Of-Speech
(POS) tags to each word, categorizing them as nouns, verbs, adverbs, etc. This tagging is crucial for
understanding how words relate to each other syntactically and helps in avoiding ambiguity. Ambiguity arises
when a text can be interpreted in multiple ways due to words having various meanings. For example, the
word “book” can be a noun (a physical book) or a verb (the action of booking something), depending on the
sentence context.
o During parsing, each word in the sentence is assigned a POS tag to indicate its grammatical category.
Assigning POS tags correctly is crucial for understanding the sentence structure and ensuring accurate
interpretation of the text.
o By analyzing and ensuring proper syntax, NLP systems can better understand and generate human language.
This analysis helps in various applications, such as machine translation, sentiment analysis, and information
retrieval, by providing a clear structure and reducing ambiguity.
• Semantic Analysis
o Semantic Analysis is the third phase of Natural Language Processing (NLP), focusing on extracting the meaning
from text. Unlike syntactic analysis, which deals with grammatical structure, semantic analysis is concerned
with the literal and contextual meaning of words, phrases, and sentences.
o Semantic analysis aims to understand the dictionary definitions of words and their usage in context. It
determines whether the arrangement of words in a sentence makes logical sense. This phase helps in finding
context and logic by ensuring the semantic coherence of sentences.
o Key Tasks in Semantic Analysis
▪ Named Entity Recognition (NER): NER identifies and classifies entities within the text, such as names of
people, places, and organizations. These entities belong to predefined categories and are crucial for
understanding the text’s content.
▪ Word Sense Disambiguation (WSD): WSD determines the correct meaning of ambiguous words based on
context. For example, the word “bank” can refer to a financial institution or the side of a river. WSD uses
contextual clues to assign the appropriate meaning.
o Semantic analysis is essential for various NLP applications, including machine translation, information
retrieval, and question answering. By ensuring that sentences are not only grammatically correct but also
meaningful, semantic analysis enhances the accuracy and relevance of NLP systems.
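Word Sense Disambiguation can be sketched in a simplified Lesk-style way: choose the sense whose gloss words overlap most with the surrounding context. The sense glosses below are hand-written for illustration, not taken from a real sense inventory:

```python
# Toy WSD by context-word overlap (simplified Lesk-style approach).
SENSES = {
    "bank": {
        "financial institution": {"money", "deposit", "loan", "account"},
        "river side": {"river", "water", "shore", "fishing"},
    }
}

def disambiguate(word, context_words):
    """Pick the sense whose gloss words overlap most with the context."""
    context = set(context_words)
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(gloss & context)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(disambiguate("bank", ["i", "opened", "an", "account", "to", "deposit", "money"]))
# financial institution
print(disambiguate("bank", ["we", "went", "fishing", "by", "the", "river"]))
# river side
```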
• Discourse Integration
o Discourse Integration is the fourth phase of Natural Language Processing (NLP). This phase deals with
comprehending the relationship between the current sentence and earlier sentences or the larger context.
Discourse integration is crucial for contextualizing text and understanding the overall message conveyed.
o Discourse integration examines how words, phrases, and sentences relate to each other within a larger
context. It assesses the impact a word or sentence has on the structure of a text and how the combination of
sentences affects the overall meaning. This phase helps in understanding implicit references and the flow of
information across sentences.
o In conversations and texts, words and sentences often depend on preceding or following sentences for their
meaning. Understanding the context behind these words and sentences is essential to accurately interpret
their meaning.
o Discourse integration is vital for various NLP applications, such as machine translation, sentiment analysis, and
conversational agents. By understanding the relationships and context within texts, NLP systems can provide
more accurate and coherent responses.
• Pragmatic Analysis
o Pragmatic Analysis is the fifth and final phase of Natural Language Processing (NLP), focusing on interpreting
the inferred meaning of a text beyond its literal content. Human language is often complex and layered with
underlying assumptions, implications, and intentions that go beyond straightforward interpretation. This
phase aims to grasp these deeper meanings in communication.
o Pragmatic analysis goes beyond the literal meanings examined in semantic analysis, aiming to understand
what the writer or speaker truly intends to convey. In natural language, words and phrases can carry different
meanings depending on context, tone, and the situation in which they are used.
o In human communication, people often do not say exactly what they mean. For instance, the word “Hello”
can have various interpretations depending on the tone and context in which it is spoken. It could be a simple
greeting, an expression of surprise, or even a signal of anger. Thus, understanding the intended meaning
behind words and sentences is crucial.
Q2. Morphological analysis and role of FSA
Ans.
• Morphological Analysis
o Morphological analysis is a critical phase in NLP, focusing on identifying morphemes, the smallest units
of a word that carry meaning and cannot be further divided. Understanding morphemes is vital for grasping
the structure of words and their relationships.
o Types of Morphemes
▪ Free Morphemes: Text elements that carry meaning independently and make sense on their own. For
example, “bat” is a free morpheme.
▪ Bound Morphemes: Elements that must be attached to free morphemes to convey meaning, as they
cannot stand alone. For instance, the suffix “-ing” is a bound morpheme, needing to be attached to a free
morpheme like “run” to form “running.”
o Importance of Morphological Analysis: Morphological analysis is crucial in NLP for several reasons:
▪ Understanding Word Structure: It helps in deciphering the composition of complex words.
▪ Predicting Word Forms: It aids in anticipating different forms of a word based on its morphemes.
▪ Improving Accuracy: It enhances the accuracy of tasks such as part-of-speech tagging, syntactic parsing,
and machine translation.
o By identifying and analyzing morphemes, the system can interpret text correctly at the most fundamental
level, laying the groundwork for more advanced NLP applications.
• Role of Finite State Automata (FSA): A Finite State Automaton is a computational model consisting of states and
transitions that accepts or rejects strings of symbols. FSAs serve several roles in NLP:
1. Lexical Analysis: FSAs are used to model and implement lexical analyzers, which are tools that tokenize input
text by recognizing patterns such as words, numbers, and punctuation. For example, an FSA can be designed
to recognize valid identifiers in a programming language by accepting sequences of letters and digits starting
with a letter.
2. Morphological Analysis: FSAs can model the structure of words, including the application of affixes like
prefixes, suffixes, and circumfixes. For instance, an FSA can be used to recognize different forms of a word by
accepting various suffixes and prefixes while transitioning through states that represent the root word.
3. Spell Checking and Correction: FSAs can be used to model dictionaries, where each word is represented as a
path through states. This allows for efficient spell checking by verifying if a word's sequence of characters
leads to an accept state. Non-deterministic FSAs can also suggest corrections for misspelled words by
allowing small deviations from the correct path.
4. Pattern Matching: FSAs are central to regular expression engines, which are used in text search and
manipulation tasks. Regular expressions define patterns, and FSAs are used to match these patterns against
input text efficiently.
5. Syntactic Parsing: Although FSAs are more suited for lexical analysis and morphological processing, they can
be extended (as Pushdown Automata) to handle context-free grammars, which are used in syntactic parsing
of sentences.
6. Speech Recognition: In speech recognition systems, FSAs can model phoneme sequences, helping to match
spoken input with expected word patterns.
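Point 1 above can be illustrated with a small deterministic FSA that accepts identifiers (a letter followed by letters or digits). The state names are arbitrary; "ident" is the only accepting state:

```python
# A deterministic FSA accepting programming-language identifiers:
# a letter, followed by any number of letters or digits.
# States: "start" -> "ident" (accepting); anything invalid -> "reject".

def accepts_identifier(text):
    state = "start"
    for ch in text:
        if state == "start":
            state = "ident" if ch.isalpha() else "reject"
        elif state == "ident":
            state = "ident" if (ch.isalpha() or ch.isdigit()) else "reject"
        else:
            return False
    return state == "ident"

print(accepts_identifier("var1"))  # True
print(accepts_identifier("1var"))  # False: must start with a letter
print(accepts_identifier(""))      # False: empty input never leaves start
```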
Q3. Porter Stemmer Algorithm
Ans.
The Porter Stemmer algorithm is one of the most widely used algorithms for stemming in Natural Language
Processing (NLP). Stemming is the process of reducing a word to its root or base form, often by removing suffixes.
The purpose is to treat different forms of a word as the same word in text processing tasks, like search engines or
information retrieval systems.
1. Suffix Stripping: The Porter Stemmer algorithm works by applying a series of rules to strip common suffixes
from English words. For example, it might remove “ing” from “running” to get “run”.
2. Five Phases of Rules: The algorithm consists of five phases of suffix removal rules, each applied sequentially.
Each phase applies a set of rules until one of them succeeds, and then the process moves to the next phase.
3. Rules Based on Conditions: Each rule checks specific conditions in a word, such as the length of the word,
the presence of vowels, or specific sequences of consonants and vowels. These conditions help determine
whether a suffix should be removed.
4. Step-by-Step Process:
o Step 1: Deals with plurals and past participles, like removing "sses" or "ies" to convert "caresses" to
"caress" or "ponies" to "poni".
o Step 2: Handles suffixes like "ational" or "izer", reducing "relational" to "relate" and "digitizer" to
"digitize".
o Step 3: Focuses on suffixes such as "icate" or "ness", converting "triplicate" to "triplic" and
"goodness" to "good".
5. Heuristics: The algorithm is based on a set of heuristics rather than a linguistic understanding of word
morphology. This makes it fast and effective for many English words, though it may not be perfect for all
cases.
Applications:
• Search Engines: Helps in matching queries with documents by reducing different word forms to a common
base.
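The sequential, phase-by-phase suffix stripping described above can be sketched as follows. This is a heavily simplified illustration, not the real Porter algorithm: genuine Porter rules condition on the word's vowel-consonant measure, while these only check a minimum stem length:

```python
# Simplified Porter-style stemming: phases of (suffix, replacement)
# rules applied in order; within a phase, the first matching rule wins.
PHASES = [
    [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")],  # plurals
    [("ational", "ate"), ("izer", "ize")],   # e.g. relational -> relate
    [("icate", "ic"), ("ness", "")],         # e.g. goodness -> good
]

def stem(word):
    for phase in PHASES:
        for suffix, replacement in phase:
            # crude length check in place of Porter's measure condition
            if word.endswith(suffix) and len(word) - len(suffix) >= 2:
                word = word[: -len(suffix)] + replacement
                break  # move on to the next phase
    return word

print(stem("caresses"))    # caress
print(stem("ponies"))      # poni
print(stem("relational"))  # relate
print(stem("goodness"))    # good
```

Note the ("ss", "ss") rule: it exists only to stop the bare "s" rule from firing on words ending in a double s, mirroring the SS -> SS rule in the real algorithm.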
Affixes
Affixes are morphemes (the smallest units of meaning) that are attached to a base word or root to modify its
meaning or create a new word. Affixes play a crucial role in understanding word formation and morphology in various
languages.
1. Prefixes
• Definition: Affixes attached to the beginning of a base word.
• Function: They often modify the meaning of the base word by adding a specific nuance or changing its form
entirely.
• Examples: "un-" in "unhappy" (not happy); "re-" in "rewrite" (write again).
2. Suffixes
• Definition: Affixes attached to the end of a base word.
• Function: They can change the word's grammatical function, such as turning a verb into a noun or an
adjective, or indicate tense, plurality, or degree.
• Examples: "-ness" in "happiness" (noun from adjective); "-ly" in "quickly" (adverb from adjective).
3. Infixes
• Definition: Affixes inserted within a base word.
• Function: They are less common in English but are used in other languages to alter the meaning or
grammatical function of the base word.
• Examples: English doesn’t commonly use infixes, but a playful example is "un-freaking-believable," where
"freaking" is inserted to add emphasis.
4. Circumfixes
• Definition: Affixes that are attached to both the beginning and the end of a word simultaneously.
• Function: Circumfixes modify the meaning or grammatical function by surrounding the base word.
• Examples:
o In German, the past participle often uses a circumfix: "ge-" + root + "-t". For example, "ge-sag-t"
(meaning "said"), where "ge-" and "-t" are the circumfixes surrounding the root "sag."
Importance in NLP:
• Morphological Analysis: Understanding affixes helps in tasks like stemming and lemmatization, where words
are reduced to their base forms.
• Named Entity Recognition (NER): Affixes can be clues to identifying named entities (e.g., "-stan" in
"Pakistan" suggesting a country).
• Part-of-Speech Tagging: Suffixes often indicate the grammatical category of a word (e.g., "-ly" usually
signifies an adverb).
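The part-of-speech point above can be sketched with a hand-picked suffix-to-tag heuristic (the mapping below is illustrative, not a complete or reliable rule set):

```python
# Guess a word's part of speech from its suffix, as a rough heuristic.
# Order matters: more specific suffixes should be checked first.
SUFFIX_TAGS = [
    ("ly", "adverb"),       # quickly, silently
    ("ness", "noun"),       # happiness, goodness
    ("ing", "verb"),        # running, thinking
    ("able", "adjective"),  # readable, enjoyable
]

def guess_pos(word):
    for suffix, tag in SUFFIX_TAGS:
        if word.endswith(suffix):
            return tag
    return "unknown"

print(guess_pos("quickly"))    # adverb
print(guess_pos("happiness"))  # noun
print(guess_pos("dog"))        # unknown
```

Such heuristics are only clues; real taggers combine them with context, since "friendly" ends in "-ly" yet is an adjective.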
Open Class and Closed Class Words
In Natural Language Processing (NLP), words are categorized into two main classes based on their syntactic roles and
how easily new words can be added to these categories: open class words and closed class words.
Open class words are categories of words that frequently acquire new members. They are typically content words
that carry the main meaning of a sentence. This class is "open" because new words can be freely created or
borrowed from other languages and added to this category.
• Nouns: These are words that represent people, places, things, or ideas. Examples: dog, city, happiness,
computer.
• Verbs: These represent actions, states, or occurrences. Examples: run, think, eat, develop.
• Adjectives: These describe or modify nouns. Examples: happy, blue, large, fast.
• Adverbs: These modify verbs, adjectives, or other adverbs. Examples: quickly, silently, very, well.
Examples in Sentences: In "The clever dog ran quickly," the words "clever" (adjective), "dog" (noun), "ran" (verb), and
"quickly" (adverb) are all open class words.
Closed class words are categories of words that do not regularly accept new members. These are typically function
words that serve grammatical purposes rather than carrying significant meaning by themselves. The class is "closed"
because new words are rarely added to it.
• Pronouns: Words that replace nouns or noun phrases. Examples: he, she, it, they.
• Prepositions: Words that show relationships between other words. Examples: in, on, at, under.
• Conjunctions: Words that connect clauses, sentences, or words. Examples: and, but, or, because.
• Determiners: Words that introduce and specify nouns. Examples: the, a, this, those.
• Auxiliary Verbs: Words that accompany main verbs to express tense, mood, or voice. Examples: is, have, will, can.
Examples in Sentences: In "She sat on the chair because it was comfortable," the words "she" and "it" (pronouns),
"on" (preposition), and "because" (conjunction) are closed class words.
Summary:
• Open class words: Dynamic, meaning-carrying words that often see new additions (nouns, verbs, adjectives,
adverbs).
• Closed class words: Stable, function-oriented words with few new additions (pronouns, prepositions,
conjunctions, determiners, auxiliary verbs).
Minimum Edit Distance
The minimum edit distance between two strings str1 and str2 is defined as the minimum number of
insert/delete/substitute operations required to transform str1 into str2.
For example, if str1 = "ab" and str2 = "abc", then a single insert operation of character 'c' on str1 transforms str1 into
str2, so the minimum edit distance is 1.
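The standard dynamic-programming solution fills a table where dp[i][j] is the cost of transforming the first i characters of str1 into the first j characters of str2:

```python
# Minimum edit distance (Levenshtein distance) with unit costs
# for insert, delete, and substitute operations.
def min_edit_distance(str1, str2):
    m, n = len(str1), len(str2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i          # delete all of str1[:i]
    for j in range(n + 1):
        dp[0][j] = j          # insert all of str2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if str1[i - 1] == str2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]           # characters match
            else:
                dp[i][j] = 1 + min(dp[i - 1][j],      # delete
                                   dp[i][j - 1],      # insert
                                   dp[i - 1][j - 1])  # substitute
    return dp[m][n]

print(min_edit_distance("ab", "abc"))          # 1 (insert 'c')
print(min_edit_distance("kitten", "sitting"))  # 3
```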
Inflection and Derivation
Inflection is the combination of a word stem with a grammatical morpheme, usually resulting in a word of the same
class as the original stem, and usually filling some syntactic function like agreement. Inflection does not change the
lexical category (part of speech) of the word.
For example, English has the inflectional morpheme -s for marking the plural on nouns, and the inflectional
morpheme -ed for marking the past tense on verbs.
Derivation is the combination of a word stem with a grammatical morpheme, usually resulting in a word of a different
class, often with a meaning hard to predict exactly. Derivation can change the lexical category (part of speech), as in
the verb "compute" becoming the noun "computation."
Q10. POS tagging
Part-of-Speech (POS) tagging is a fundamental task in Natural Language Processing (NLP) that involves labeling each
word in a sentence with its appropriate part of speech based on its definition and context. Parts of speech include
categories like nouns, verbs, adjectives, adverbs, pronouns, conjunctions, prepositions, and more.
Here are some common POS tags used in the Penn Treebank POS tag set, which is widely used in English: NN (noun,
singular), NNS (noun, plural), VB (verb, base form), VBD (verb, past tense), JJ (adjective), RB (adverb), DT (determiner),
and IN (preposition).
Supervised Taggers:
• Rule-Based:
o Brill Tagger: This is a transformation-based learning tagger that relies on a set of predefined rules to
tag parts of speech.
• Stochastic:
o HMM (Hidden Markov Model): Uses probabilities, often employing the Viterbi algorithm for
determining the most likely sequence of tags.
o Uses n-gram Approach: This typically refers to taggers that use a probabilistic model based on
sequences of n tags (often bigrams or trigrams) to predict the tag for a word.
• Neural Network: Employs machine learning models, often deep learning-based, to predict POS tags by
learning from labeled data.
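A toy sketch of the HMM approach with Viterbi decoding follows. The probabilities are hand-set for illustration only; a real tagger estimates them from a labeled corpus:

```python
# Tiny HMM POS tagger decoded with the Viterbi algorithm.
# All probabilities below are invented for this two-word example.
TAGS = ["NOUN", "VERB"]
START = {"NOUN": 0.7, "VERB": 0.3}               # P(tag at position 0)
TRANS = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},     # P(next tag | prev tag)
         "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT = {"NOUN": {"dogs": 0.4, "bark": 0.1},      # P(word | tag)
        "VERB": {"dogs": 0.05, "bark": 0.5}}

def viterbi(words):
    # best[t] = (probability of best tag path ending in t, that path)
    best = {t: (START[t] * EMIT[t].get(words[0], 1e-6), [t]) for t in TAGS}
    for word in words[1:]:
        new_best = {}
        for t in TAGS:
            # extend the best previous path into tag t
            prob, path = max(
                (best[p][0] * TRANS[p][t] * EMIT[t].get(word, 1e-6), best[p][1])
                for p in TAGS)
            new_best[t] = (prob, path + [t])
        best = new_best
    return max(best.values())[1]   # path with the highest probability

print(viterbi(["dogs", "bark"]))   # ['NOUN', 'VERB']
```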
Unsupervised Taggers:
• Rule-Based:
o Brill Tagger: Similar to the supervised approach, this relies on predefined or learned rules but doesn't
require labeled data for training.
• Stochastic:
o Uses Baum-Welch: The Baum-Welch algorithm estimates the parameters of a hidden Markov model
from unlabeled data, and the resulting model can be adapted for POS tagging.
• Neural Network: Unsupervised neural networks attempt to learn patterns in the data without explicit labels,
often using methods like clustering.
Challenges in POS Tagging:
1. Ambiguity:
o Lexical Ambiguity: Many words can function as more than one part of speech depending on the
context. For example, the word "bank" can be a noun ("I went to the bank") or a verb ("They bank on
us").
o Contextual Ambiguity: The same sentence structure can lead to different interpretations. For
example, "He saw her duck" could mean seeing a bird (noun) or watching someone lower their head
(verb).
2. Out-of-Vocabulary Words:
o Words that are not present in the training data (like new slang or technical terms) can be difficult to
tag accurately, especially for rule-based and statistical models.
3. Complex Sentence Structures:
o Long and complex sentences with multiple clauses, conjunctions, or embedded phrases can make it
challenging to correctly tag each word due to intricate dependencies.
4. Idiomatic Expressions:
o Idioms and phrases that don't follow standard grammatical rules can confuse POS taggers. For
instance, in "kick the bucket" (meaning "to die"), "kick" is a verb, but it's not used in the literal sense.
5. Noisy and Informal Text:
o Texts like social media posts, chat messages, or transcribed speech often contain spelling errors,
abbreviations, or unconventional grammar, making POS tagging more difficult.
6. Domain-Specific Language:
o Different domains (e.g., legal, medical, technical) might use specific jargon or terminology that isn't
well-represented in general-purpose training datasets, leading to inaccurate tagging.
7. Multilingual Variation:
o Different languages have different grammatical rules, and many languages have words that can
change their part of speech more fluidly than in English, making multilingual POS tagging a complex
task.
8. Dependency on Context:
o POS tags can depend heavily on the broader context of a paragraph or even a larger text segment,
which can be challenging for models that only consider local word sequences.