NLP Soln
Stemming vs. Lemmatization
Stemming is a process that removes the last few characters from a word, often leading to incorrect meanings and spellings. For instance, stemming the word 'Caring' would return 'Car'.
Lemmatization considers the context and converts the word to its meaningful base form, which is called the lemma. For instance, lemmatizing the word 'Caring' would return 'Care'.
n-grams:
n = 1: Unigram
n = 2: Bigram
n = 3: Trigram
In general, a sequence of n tokens is an n-gram.
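For illustration, a minimal sketch (plain Python; the toy sentence is made up for this example) of how unigrams, bigrams, and trigrams are extracted from a token list:

```python
# A minimal sketch: extract n-grams from a list of tokens.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the cat sat on the mat".split()
print(ngrams(tokens, 1))  # unigrams: [('the',), ('cat',), ('sat',), ...]
print(ngrams(tokens, 2))  # bigrams:  [('the', 'cat'), ('cat', 'sat'), ...]
print(ngrams(tokens, 3))  # trigrams: [('the', 'cat', 'sat'), ...]
```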
5. What are the Difficulties/Challenges in POS Tagging?
Ans. 1. Contextual words & Phrases & Homonyms
2. Synonyms
3. Irony & Sarcasm
4. Ambiguity
5. Errors in text or speech
6. Idioms & Slang
7. Domain Specific Language.
8. Low-Resource Languages.
6. Explain Stochastic Based Tagging.
Stochastic-based tagging refers to a method in natural language processing (NLP) for
assigning tags or labels to text, often used in tasks like part-of-speech tagging, named
entity recognition, or other forms of text classification. The approach relies on
probabilistic models that use statistical methods to predict tags based on patterns
observed in training data.
Probabilistic Models: Stochastic-based tagging uses models that incorporate probability
to make predictions. These models are trained on a corpus of text where each word or
token is annotated with a tag. The model learns the likelihood of a particular tag
occurring given the context of surrounding words or tokens.
Hidden Markov Models (HMMs): One common type of stochastic model used for
tagging is the Hidden Markov Model. In HMMs, the process of tagging is modeled as a
sequence of states (tags) that generate observable outcomes (words) according to certain
probabilities. The model estimates the probability of a sequence of tags given a sequence
of words.
Training and Inference: During training, the model calculates the probabilities of
transitions between tags and the probabilities of observing specific words given particular
tags. During inference (i.e., when tagging new text), the model uses these probabilities to
predict the most likely sequence of tags for the input text.
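To make the training step concrete, here is a minimal sketch (toy hand-made corpus, plain Python; the tags and sentences are illustrative assumptions) of estimating transition and emission probabilities by counting over annotated data:

```python
# Estimate transition and emission probabilities by relative-frequency
# counting over a tiny, hand-made tagged corpus (illustrative data only).
from collections import Counter, defaultdict

tagged_corpus = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("barked", "VERB")],
]

transition_counts = defaultdict(Counter)  # tag -> counts of the following tag
emission_counts = defaultdict(Counter)    # tag -> counts of emitted words

for sentence in tagged_corpus:
    for i, (word, tag) in enumerate(sentence):
        emission_counts[tag][word] += 1
        if i + 1 < len(sentence):
            transition_counts[tag][sentence[i + 1][1]] += 1

def normalize(counter):
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

transition_probs = {tag: normalize(c) for tag, c in transition_counts.items()}
emission_probs = {tag: normalize(c) for tag, c in emission_counts.items()}

print(transition_probs["DET"])  # {'NOUN': 1.0}
print(emission_probs["NOUN"])   # {'cat': 0.5, 'dog': 0.5}
```

In practice these maximum-likelihood estimates are smoothed so that tag pairs and words never seen in training do not receive zero probability.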
Viterbi Algorithm: For HMMs, the Viterbi algorithm is often used to find the most
probable sequence of tags for a given sequence of words. This dynamic programming
algorithm efficiently computes the best tag sequence by considering all possible tag
sequences and their associated probabilities.
Conditional Random Fields (CRFs): Another popular stochastic model for tagging is
Conditional Random Fields. CRFs are used for labeling and segmenting sequential data
and are particularly useful when considering the context of an entire sequence rather than
individual elements in isolation. CRFs model the conditional probability of a label
sequence given an observation sequence, allowing for more flexible and accurate tagging.
Advantages and Limitations: Stochastic-based tagging models can handle a variety of
linguistic phenomena and capture complex patterns in data. However, they require
substantial amounts of annotated training data and can be computationally intensive.
Modern approaches, such as deep learning models, have largely supplanted traditional
stochastic models in many NLP tasks because they can learn more complex
representations from data.
7. Define Language and Knowledge.
Natural language processing (NLP) is a field of artificial intelligence and computational linguistics that focuses on the interaction between computers and human (natural) languages, enabling machines to understand, interpret, and generate human language in a way that is both meaningful and useful. Language is difficult to describe thoroughly: even if you manage to document all the words and rules of the standard version of any given language, there are complications such as dialects, slang, sarcasm, context, and how these things change over time.
8 Explain Ambiguities and its Types.
Ambiguity in language refers to situations where a word, phrase, or sentence can
be interpreted in more than one way. This can create challenges in understanding
and processing language because the intended meaning isn't clear.
Ambiguities can arise in various forms, and they can be broadly categorized into
several types:
1. Lexical Ambiguity
2. Syntactic Ambiguity
3. Semantic Ambiguity
4. Pragmatic Ambiguity
5. Anaphoric Ambiguity
6. Quantifier Ambiguity
7. Semantic Role Ambiguity
9. Explain Tokenization in Detail and List Its Types
Tokenization is a fundamental preprocessing step in natural language processing
(NLP) that involves breaking down text into smaller units, called tokens. Tokens
are typically words, but they can also be phrases, characters, or other meaningful
elements depending on the context and application. Tokenization is crucial for
various NLP tasks, such as text analysis, machine translation, and information
retrieval.
Types of Tokenization
1. Word Tokenization
2. Subword Tokenization
3. Character Tokenization
4. Sentence Tokenization
5. Phrase Tokenization
6. Tokenization with Special Rules
7. Morphological Tokenization
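As a rough illustration of word and sentence tokenization (types 1 and 4 above), here is a minimal sketch using plain regular expressions; real systems usually rely on library tokenizers such as those in NLTK or spaCy, and the sample text is made up:

```python
# Word and sentence tokenization with simple regular expressions.
import re

text = "Tokenization is a key step. It splits text into tokens!"

# Sentence tokenization: split after ., ! or ? followed by whitespace.
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word tokenization: runs of word characters, with punctuation kept as
# separate tokens.
words = re.findall(r"\w+|[^\w\s]", text)

print(sentences)  # ['Tokenization is a key step.', 'It splits text into tokens!']
print(words)      # ['Tokenization', 'is', 'a', 'key', 'step', '.', 'It', ...]
```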
10. Explain Parser and List Out its Types
In computational linguistics and natural language processing (NLP), a parser is a
tool or algorithm used to analyze and understand the structure of sentences or
other text units. Parsing involves breaking down and analyzing the syntactic
structure of text based on grammatical rules or patterns. The primary goal of
parsing is to determine how words or tokens are combined to form meaningful
sentences according to a specified grammar.
Types of Parsers
1. Top-Down Parsers: These parsers start from the root of the parse tree and
attempt to derive the input sentence by expanding grammar rules.
2. Bottom-Up Parsers: These parsers start from the input tokens and attempt to
construct the parse tree by combining tokens into larger constituents until they
match the start symbol of the grammar.
1. Lexical Ambiguity
Lexical ambiguity occurs when a word has multiple meanings. This type of ambiguity
arises at the level of individual words.
● Polysemy: When a single word has multiple related meanings. For example, the
word "bank" can refer to the side of a river or a financial institution.
● Homonymy: When a word has multiple unrelated meanings. For instance, "bat"
can refer to a flying mammal or a piece of sports equipment.
2. Syntactic Ambiguity
Syntactic ambiguity, also known as structural ambiguity, arises when a sentence can be
parsed in more than one way due to its structure.
● Ambiguous Phrase Structure: For example, "I saw the man with the telescope" can
be interpreted as either the man had the telescope or the speaker used a telescope
to see the man.
● Attachment Ambiguity: When it’s unclear which part of the sentence a modifying
phrase or clause attaches to. For example, in "He ate the cookies on the couch," it is
unclear whether "on the couch" describes where he ate or which cookies he ate.
3. Semantic Ambiguity
Semantic ambiguity occurs when the meaning of a sentence or phrase is unclear because
of the multiple meanings of words or phrases within it.
● Ambiguous Reference: For instance, in the sentence "John told Steve he needed
help," it is ambiguous whether "he" refers to John or Steve.
● Ambiguous Scope: When the scope of quantifiers or modifiers is unclear. For
example, "Every student read a book" can mean that all the students read the same
book or that each student read some (possibly different) book.
4. Pragmatic Ambiguity
Pragmatic ambiguity involves the interpretation of language based on context and world
knowledge. This type of ambiguity arises when the intended meaning depends on the
broader conversational or situational context.
● Implicature: When the speaker implies something that is not explicitly stated. For
example, "Can you pass the salt?" might pragmatically imply that the speaker
wants the salt passed, even though it is framed as a question.
● Speech Acts: Different intentions behind the same utterance. For instance, "Can
you close the window?" might be interpreted as a request rather than a question
about capability.
5. Anaphoric Ambiguity
Anaphoric ambiguity occurs when it's unclear what a pronoun or other referential
expression refers to.
● Pronoun Reference: For example, "Sarah told Emily that she would call her later"
is ambiguous regarding whether "she" refers to Sarah or Emily, and whether "her"
refers to Emily or someone else.
6. Quantifier Ambiguity
Quantifier ambiguity arises when the scope or extent of quantifiers (like "all," "some,"
"many") is unclear.
● Scope Ambiguity: For example, "All the students did not pass" can be interpreted
as meaning that none of the students passed or that not all of the students passed.
7. Semantic Role Ambiguity
This type occurs when it’s unclear which semantic role an entity is playing in a sentence.
● Agent/Patient Ambiguity: For instance, in "The chicken is ready to eat," it is
ambiguous whether the chicken is the agent (the one doing the eating) or the
patient (the thing being eaten).
The Lexicon-Free Porter Stemmer Algorithm is a specific version of the Porter stemming
algorithm that does not rely on an external lexicon or predefined dictionary of word
stems. The Porter stemmer itself is a well-known algorithm used to reduce words to their
root or base forms, which is useful in various natural language processing (NLP) tasks
such as text indexing and information retrieval.
Stemming is the process of reducing inflected or derived words to their base or root form.
For example, "running," "runner," and "runs" might all be reduced to "run." The goal is to
normalize words so that they can be treated as the same term in text analysis.
Lexicon-Free refers to the characteristic of the algorithm where it operates purely based
on rules and patterns rather than relying on a predefined list of root words or stems. This
means that the algorithm does not need to consult an external lexicon or dictionary to
perform stemming.
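For illustration, NLTK provides an implementation of the Porter stemmer; a minimal usage sketch (assuming the nltk package is installed) looks like this:

```python
# Rule-based, lexicon-free stemming with NLTK's PorterStemmer:
# the suffix-stripping rules are applied directly to each word,
# with no dictionary lookup involved.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "runs", "caresses", "relational"]:
    print(word, "->", stemmer.stem(word))
# Expected along the lines of: running -> run, runs -> run,
# caresses -> caress, relational -> relate (exact output depends on the rules).
```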
Limitations
● Over-Stemming: The algorithm may sometimes reduce words too aggressively,
leading to cases where different words with distinct meanings are reduced to the
same stem.
● Language-Specific: The rules are tailored to English and may not be suitable for
other languages without modification.
A regular expression (regex) is a pattern used to match sequences of characters in text.
Its basic building blocks include:
● Literals: Characters that represent themselves. For example, a matches the letter
'a'.
● Metacharacters: Special characters that define patterns or rules. For example, .
matches any character except a newline.
● Quantifiers: Specify the number of times a character or group should appear. For
example, * means zero or more times.
● Character Classes: Define a set of characters. For example, [abc] matches any
one of the characters 'a', 'b', or 'c'.
● Groups and Ranges: Allow for the grouping and ordering of patterns. For
example, (abc) groups 'abc' together, and [a-z] defines a range of characters
from 'a' to 'z'.
● Anchors: Define positions in the text. For example, ^ matches the start of a line,
and $ matches the end of a line.
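As a small illustration of these building blocks (the sample string is made up), using Python's re module:

```python
# Demonstrating literals, metacharacters, quantifiers, character classes,
# and anchors with Python's re module.
import re

text = "cab, cb, ccb, and abc123 appear on line 42."

print(re.findall(r"a", text))       # literal: every 'a' in the text
print(re.findall(r"c.b", text))     # metacharacter '.': 'c', any character, 'b'
print(re.findall(r"c*b", text))     # quantifier '*': zero or more 'c' before 'b'
print(re.findall(r"[abc]+", text))  # character class: runs of 'a', 'b', or 'c'
print(re.findall(r"\d{2,}", text))  # two or more digits
print(re.findall(r"^cab", text))    # anchor '^': 'cab' only at the start
```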
Regular expressions can be categorized into different types based on their syntax and
usage:
Basic Regular Expressions (BRE)
● Definition: The original and simplest form of regular expressions. They are used
in tools like grep in Unix.
● Syntax: Basic syntax includes literals, character classes, and basic quantifiers.
● Example: a.b matches 'a', followed by any character, followed by 'b'.
JavaScript Regular Expressions
● Definition: Regular expressions used in JavaScript, with syntax similar to Perl but
with some differences in functionality.
● Syntax: Includes features like global search with the g flag and case-insensitive
search with the i flag.
● Example: /\d{2,}/ matches a sequence of two or more digits.
Common regex constructs include literals (for example, abc matches the exact sequence
'abc'), quantifiers, anchors, and assertions.
A parser is a component in natural language processing (NLP) and compiler design that
analyzes the syntactic structure of input text based on a formal grammar. Its primary
function is to decompose text into its constituent parts according to grammatical rules and
to produce a structured representation, such as a parse tree or abstract syntax tree, that
reflects the syntactic structure of the text.
● Syntax Analysis: Determines whether the input text conforms to the rules of the
grammar.
● Error Detection: Identifies syntax errors or inconsistencies in the text.
● Structure Representation: Produces a parse tree or abstract syntax tree that
represents the grammatical structure of the input.
1. Top-Down Parsers
Definition: Top-down parsers start from the top of the parse tree (the start symbol of the
grammar) and attempt to derive the input sentence by expanding grammar rules
recursively, matching the input text against the structure predicted by the grammar.
Characteristics:
● Expand the leftmost non-terminal first; typical examples are recursive-descent and
predictive (LL) parsers.
● May need lookahead or backtracking to choose between alternative grammar rules.
Advantages:
● Simple and intuitive; easy to implement by hand, and the parser structure closely
mirrors the grammar.
Disadvantages:
● Cannot handle left-recursive grammar rules directly and may become inefficient if
extensive backtracking is required.
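A minimal recursive-descent (top-down) parser sketch for a toy grammar; the grammar, lexicon, and sentence below are illustrative assumptions, and a realistic parser would also handle backtracking and ambiguity:

```python
# Recursive-descent parser for the toy grammar:
#   S -> NP VP,  NP -> Det N,  VP -> V NP
lexicon = {"the": "Det", "a": "Det", "dog": "N", "ball": "N", "chased": "V"}

def expect(tokens, i, category):
    # Consume one token of the given lexical category or fail.
    if i >= len(tokens) or lexicon.get(tokens[i]) != category:
        raise SyntaxError(f"expected {category} at position {i}")
    return (category, tokens[i])

def parse_NP(tokens, i):
    det = expect(tokens, i, "Det")
    noun = expect(tokens, i + 1, "N")
    return ("NP", det, noun), i + 2

def parse_VP(tokens, i):
    verb = expect(tokens, i, "V")
    np, j = parse_NP(tokens, i + 1)
    return ("VP", verb, np), j

def parse_S(tokens, i=0):
    np, i = parse_NP(tokens, i)
    vp, i = parse_VP(tokens, i)
    return ("S", np, vp), i

tree, _ = parse_S("the dog chased a ball".split())
print(tree)
# ('S', ('NP', ('Det', 'the'), ('N', 'dog')),
#       ('VP', ('V', 'chased'), ('NP', ('Det', 'a'), ('N', 'ball'))))
```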
2. Bottom-Up Parsers
Definition: Bottom-up parsers start from the input tokens and attempt to build the parse
tree by combining tokens into larger constituents until they match the start symbol of the
grammar. They work by reducing the input text to the start symbol based on the grammar
rules.
Characteristics:
● Shift-Reduce: Uses shift actions to move tokens onto a stack and reduce actions
to replace patterns on the stack with non-terminals based on grammar rules.
● Handles Ambiguity: More effective at handling certain types of ambiguities and
complex grammars.
Advantages:
● Can handle left recursion and a broader class of grammars (e.g., LR grammars),
and is widely used in automatically generated parsers.
Disadvantages:
● Parse tables are difficult to construct by hand, and the parsing process is harder to
follow and debug than top-down parsing.
Natural Language Processing (NLP) is a multidisciplinary field that intersects computer science,
artificial intelligence (AI), and linguistics. Its goal is to enable computers to understand,
interpret, and generate human language in a meaningful way. The history and origin of NLP is a
rich tapestry that reflects the evolution of computing and language understanding technologies.
Early Foundations
● Ancient Linguistics: The study of language has roots in ancient civilizations. For
example, Panini's grammar of Sanskrit in ancient India (around 5th century BCE) was an
early example of formal language rules.
● Classical Logic: Philosophers like Aristotle and later logicians developed formal systems
of logic that laid the groundwork for computational language analysis.
● Alan Turing: Alan Turing's work on the concept of a machine that could perform tasks
requiring intelligence led to the development of the Turing Test, which assesses a
machine's ability to exhibit intelligent behavior equivalent to or indistinguishable from
that of a human.
● Early Machine Translation: The first significant NLP application was machine
translation. In 1954, the Georgetown-IBM experiment demonstrated the potential of
machine translation by translating 60 Russian sentences into English.
● Semantic Networks: Early work focused on rule-based systems and semantic networks.
For example, the work of Joseph Weizenbaum on ELIZA, a program simulating a
Rogerian psychotherapist, demonstrated that computers could engage in simple
conversations with humans.
● Syntax and Parsing: Research on syntactic parsing led to the development of formal
grammars, such as context-free grammar (CFG), which became foundational in parsing
algorithms.
● Statistical Models: The shift from rule-based to statistical methods began in the 1980s.
Researchers like Frederick Jelinek applied statistical models to speech recognition and
natural language processing, marking a departure from purely symbolic methods.
● Hidden Markov Models (HMMs): HMMs became popular for tasks such as
part-of-speech tagging and speech recognition, allowing for probabilistic modeling of
language sequences.
● Corpora and Annotation: The availability of large text corpora and annotated data
enabled the development of more sophisticated statistical models. The Penn Treebank, for
example, provided annotated corpora for syntactic parsing and other NLP tasks.
● Support Vector Machines (SVMs): SVMs and other machine learning techniques were
introduced for text classification and named entity recognition, further advancing the
field.
● Neural Networks: The rise of deep learning and neural networks revolutionized NLP.
Models like Word2Vec, developed by Tomas Mikolov and his team at Google, introduced
word embeddings that captured semantic relationships between words.
● Transformer Models: The introduction of the Transformer architecture by Vaswani et al.
in 2017 marked a significant breakthrough. Transformers, and models based on them like
BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative
Pre-trained Transformer), achieved state-of-the-art performance on a wide range of NLP
tasks.
● Large Pre-trained Models: The development of large pre-trained models such as GPT-3
by OpenAI and T5 by Google pushed the boundaries of what is possible in NLP, enabling
more advanced text generation, comprehension, and interaction capabilities.
● Ethical and Societal Implications: The field has also seen increasing focus on the ethical
implications of NLP technologies, including issues related to bias, fairness, and the
responsible use of AI.
Q6 Hidden Markov Model (HMM Viterbi) for POS Tagging.
The Hidden Markov Model (HMM) is a statistical model used for various sequence analysis
tasks, including Part-of-Speech (POS) tagging. POS tagging involves assigning parts of speech
(such as nouns, verbs, adjectives) to each word in a sentence based on the context.
The Viterbi algorithm is a dynamic programming algorithm used to find the most likely sequence
of hidden states in an HMM given a sequence of observed events. In the context of POS tagging,
the Viterbi algorithm helps determine the most likely sequence of POS tags for a given sentence.
1. States: These are the hidden states that represent POS tags in POS tagging. For example,
states might be "Noun," "Verb," "Adjective," etc.
2. Observations: These are the observed symbols, which are the words in a sentence.
3. Transition Probabilities: These represent the probability of moving from one state (POS
tag) to another. For example, the probability of a noun being followed by a verb.
4. Emission Probabilities: These represent the probability of a particular observation (word)
being emitted by a state (POS tag). For example, the probability of the word "run" being
a verb.
5. Initial Probabilities: These represent the probability of the sequence starting with a
particular state (POS tag).
The Viterbi algorithm is used to find the most probable sequence of states given a sequence of
observations. It works by using dynamic programming to efficiently compute this sequence.
Here’s how the Viterbi algorithm is applied to POS tagging:
1. Inputs and Outputs:
● Input: A sequence of words (observations), e.g., ["The", "cat", "sat", "on", "the", "mat"].
● Output: The most probable sequence of POS tags for the words.
2. Data Structures:
● Viterbi Table: A table where V[i][j] represents the highest probability of the most
likely sequence of POS tags that ends in state j (POS tag j) at position i (the i-th word
in the sequence).
● Backpointer Table: A table where B[i][j] records the state (POS tag) that maximized
the probability at V[i][j].
3. Algorithm Steps:
Initialization:
● Set initial probabilities for the first word: for each state j,
V[0][j] = Initial Probability[j] × Emission Probability[j][word_0]
● Initialize backpointer table for the first word.
Recursion:
● For each word position i from 1 to N−1, and for each state j (POS tag), compute
V[i][j] = max_k ( V[i−1][k] × Transition Probability[k][j] × Emission Probability[j][word_i] ),
where k ranges over all possible states. Update the backpointer table to record the
state k that gave the maximum probability.
Termination:
● The final step involves finding the most probable state sequence that ends in any state j
at position N−1:
Most Likely Ending State = argmax_j V[N−1][j]
● Trace back through the backpointer table to reconstruct the most probable sequence of
states.
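The steps above can be written out directly. A minimal Python sketch of Viterbi decoding with a toy hand-set model follows; the tag set, vocabulary, and probabilities are illustrative assumptions, not values taken from the text:

```python
# Viterbi decoding for HMM POS tagging over a toy model.
tags = ["Det", "Noun", "Verb"]
initial = {"Det": 0.6, "Noun": 0.3, "Verb": 0.1}
transition = {  # P(next tag | current tag)
    "Det":  {"Det": 0.05, "Noun": 0.90, "Verb": 0.05},
    "Noun": {"Det": 0.10, "Noun": 0.20, "Verb": 0.70},
    "Verb": {"Det": 0.60, "Noun": 0.30, "Verb": 0.10},
}
emission = {  # P(word | tag); words missing from a tag's dict get probability 0
    "Det":  {"the": 0.8},
    "Noun": {"cat": 0.9, "sat": 0.1},
    "Verb": {"cat": 0.1, "sat": 0.9},
}

def viterbi(words):
    # V[i][tag]: best probability of any tag sequence ending in `tag` at position i.
    V = [{t: initial[t] * emission[t].get(words[0], 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        V.append({})
        back.append({})
        for t in tags:
            # Recursion: pick the previous tag k that maximises the path probability.
            k_best, p_best = max(
                ((k, V[i - 1][k] * transition[k][t] * emission[t].get(words[i], 0.0))
                 for k in tags),
                key=lambda kv: kv[1],
            )
            V[i][t], back[i][t] = p_best, k_best
    # Termination: best final tag, then trace back through the backpointers.
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "cat", "sat"]))  # expected: ['Det', 'Noun', 'Verb']
```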
Example
Let’s go through a simplified example with three POS tags (Noun, Verb, Adjective) and a
sentence "The cat sat."
● Find the highest probability in the last column of the Viterbi table and trace back
through the backpointer table to reconstruct the most likely tag sequence.
Alternatively, the same process can be visualized as a lattice:
[Figure: A sketch of the lattice for "Janet will back the bill", showing the possible tags for
each word and highlighting the path corresponding to the correct tag sequence through the
hidden states. States (parts of speech) that have a zero probability of generating a particular
word according to the B matrix (such as the probability that the determiner DT is realized
as "Janet") are greyed out.]