NLP Unit 1 and 2
Natural Language Processing (NLP) is a field at the intersection of linguistics, computer science,
and artificial intelligence, concerned with enabling machines to understand, interpret, and generate
human language. The origin and evolution of NLP can be traced back to the 1950s and 1960s.
Here’s a brief overview of its development:
Challenges of NLP
Despite significant progress, NLP still faces several challenges:
1. Ambiguity in Language
• Lexical Ambiguity: Words often have multiple meanings depending on context. For
example, "bank" could refer to a financial institution or the side of a river.
• Syntactic Ambiguity: Sentences can have multiple interpretations due to structure. For
instance, "The man saw the woman with the telescope" could imply either the man used a
telescope to see the woman or the woman had a telescope.
• Semantic Ambiguity: Understanding the meaning of a sentence can be ambiguous due to
context or unclear references. For example, the meaning of the sentence "She is my friend’s
sister" might depend on understanding who "she" is.
2. Data-Related Challenges
• Data Scarcity: Large-scale language models require vast amounts of data to train
effectively. For certain languages, domains, or specific types of text, suitable data might be
limited or difficult to obtain.
• Quality of Data: The quality of data used to train language models is crucial. Noisy data
(e.g., typos, slang, misinformation) can negatively impact model performance. For instance,
training on unclean or biased data can cause the model to learn undesirable patterns, such as
stereotypes or misinformation.
• Out-of-Vocabulary (OOV) Words: Older models, like n-grams, often struggle with words
they haven't encountered during training. Although modern models like Word2Vec and
BERT handle this better, they still face challenges when encountering rare or unseen words,
especially in specialized fields like medical or legal domains.
3. Long-Term Dependencies
• Capturing Context Over Long Sequences: Traditional n-gram and even some early neural
models struggled with long-term dependencies in language. For example, the relationship
between words at the beginning and end of a long sentence could be difficult to capture.
While transformers and LSTMs have significantly improved the ability to capture these
dependencies, they still face challenges when handling very long contexts, especially in
memory and computational complexity.
4. Biases in Models
• Bias in Training Data: Language models tend to inherit biases present in the data they are
trained on. This includes biases related to gender, race, culture, or political affiliation. These
biases can lead to problematic outputs, such as stereotyping or unfair treatment in tasks like
sentiment analysis or content moderation.
• Mitigating Bias: Detecting and mitigating bias in NLP models is an ongoing challenge.
Techniques to reduce bias are still being researched and refined, as biased models can have
far-reaching negative effects, especially in critical applications like hiring systems, legal
systems, or healthcare.
Modern Evolution
While n-gram models were the standard for many years, they have largely been replaced by neural
language models (e.g., RNNs, LSTMs, transformers) in modern NLP. These neural models are
more flexible, can capture longer-range dependencies, and do not suffer from the sparsity issues of
n-gram models. However, n-gram models are still used in some applications where they offer
simplicity, interpretability, and efficiency.
Regular Expressions
One of the unsung successes in standardization in computer science has been the regular expression
(RE), a language for specifying text search strings. This prac- regular expression tical language is
used in every computer language, word processor, and text processing tools like the Unix tools grep
or Emacs. Formally, a regular expression is an algebraic notation for characterizing a set of strings.
They are particularly usecorpus ful for searching in texts, when we have a pattern to search for and
a corpus of texts to search through. A regular expression search function will search through the
corpus, returning all texts that match the pattern. The corpus can be a single document or a
collection. For example, the Unix command-line tool grep takes a regular expression and returns
every line of the input document that matches the expression. A search can be designed to return
every match on a line, if there are more than one, or just the first match. In the following examples
we generally underline the exact part of the pattern that matches the regular expression and show
only the first match. We’ll show regular expressions delimited by slashes but note that slashes are
not part of the regular expressions. Regular expressions come in many variants. We’ll be describing
extended regular expressions; different regular expression parsers may only recognize subsets of
these, or treat some expressions slightly differently. Using an online regular expression tester is a
handy way to test out your expressions and explore these variations.
A regular expression (often abbreviated as regex) is a sequence of characters that defines a search
pattern. Regular expressions are used for pattern matching within text, and they are widely
employed in text processing tasks. In Natural Language Processing (NLP), regular expressions
are commonly used for text cleaning, pattern matching, tokenization, and other tasks that require
searching for specific patterns in text.
2. Tokenization:
• Tokenization is the process of breaking down text into smaller units (tokens), such as
words or sentences. Regular expressions are often used for word tokenization by
matching sequences of word characters or whitespace.
6. Spelling Correction:
• Regular expressions can also be used in spelling correction tasks to identify
common misspellings and apply corrections by matching common patterns of errors.
For example, replacing common typos like "teh" with "the".
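Below is a minimal sketch of both uses above with Python's built-in re module; the tokenization pattern and the "teh" → "the" substitution are illustrative choices, not a complete tokenizer or spell-checker.
```python
import re

text = "Teh cat sat on teh mat. I love programming!"

# Crude word tokenization: runs of word characters, or single punctuation
# marks, each become a separate token.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Teh', 'cat', 'sat', 'on', 'teh', 'mat', '.', 'I', 'love', 'programming', '!']

# Pattern-based correction of a common typo: "teh" -> "the".
# \b marks word boundaries; IGNORECASE also catches "Teh" (case is not preserved).
corrected = re.sub(r"\bteh\b", "the", text, flags=re.IGNORECASE)
print(corrected)
```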
English Morphology
Morphology is the branch of linguistics concerned with the structure of words. It deals with how
words are formed from smaller units called morphemes, which are the smallest meaningful units of
language. In English morphology, morphemes are combined in various ways to form words.
Understanding English morphology is crucial for tasks like language processing, text analysis, and
machine learning applications in Natural Language Processing (NLP).
Key Concepts in English Morphology
1. Morphemes: A morpheme is the smallest unit of meaning in a language. There are two
main types of morphemes in English:
• Free Morphemes: These can stand alone as words and carry meaning independently
(e.g., "cat", "book", "run").
• Bound Morphemes: These cannot stand alone and must attach to a free morpheme
to convey meaning (e.g., "un-" in "undo", "-s" in "cats", "-ed" in "walked").
Morphemes can be further divided into:
• Roots: The core morpheme that carries the primary meaning of a word (e.g., "run" in
"running").
• Affixes: Morphemes that attach to roots to alter their meaning or grammatical
function. Affixes include:
• Prefixes: Added to the beginning of a word (e.g., "re-" in "replay").
• Suffixes: Added to the end of a word (e.g., "-ed" in "walked").
• Infixes: Inserted within a word (e.g., some informal usage like "un-freaking-
believable").
• Circumfixes: Attach to both the beginning and the end of a word (though rare
in English, an example would be the German circumfix "ge-…-t" used in past
participles).
2. Inflectional vs. Derivational Morphemes:
• Inflectional Morphemes: These morphemes do not change the fundamental
meaning of a word but instead modify its tense, number, case, or other grammatical
properties. Inflectional morphemes are bound and help in conveying grammatical
distinctions.
• Examples:
• Tense: "run" → "ran" (past tense)
• Plural: "cat" → "cats" (plural)
• Possessive: "cat" → "cat's" (possessive)
• Derivational Morphemes: These morphemes are used to create new words by
changing the meaning or part of speech of the base word.
• Examples:
• Noun to Adjective: "beauty" → "beautiful" (-ful suffix)
• Verb to Noun: "run" → "runner" (-er suffix)
3. Types of Morphemes in English:
• Simple Words: Words consisting of only one morpheme (e.g., "book").
• Complex Words: Words made up of more than one morpheme (e.g., "books"
consists of "book" + "s").
• Compound Words: These are formed by combining two or more free morphemes
(e.g., "toothbrush", "snowman").
• Derivational Words: These are formed by adding derivational affixes (e.g.,
"happiness" from "happy" + "ness").
Examples of Morphemes in English (Word = Morphemes: Meaning/Function)
• Cats = cat + s: "cat" (free morpheme) + "s" (inflectional morpheme for plural)
• Playing = play + ing: "play" (free morpheme) + "ing" (inflectional morpheme for continuous aspect)
• Unhappiness = un + happy + ness: "un" (prefix) + "happy" (root) + "ness" (suffix for noun formation)
• Runner = run + er: "run" (root) + "er" (suffix indicating a person who performs an action)
• Happily = happy + ly: "happy" (root) + "ly" (suffix for adverb formation)
Lexicon
The lexicon would include the base forms of words and their morphological features. For example:
• Lexicon Entry:
• "run" → [root: "run", verb]
• "running" → [root: "run", verb, present participle]
Rules
The rules in the FST could describe the transformations that occur during inflection, such as:
• Add "-ing" to a verb root to form the present participle: "run" → "running".
• Remove "-ing" to return to the base form: "running" → "run".
Transducer
An FST would be constructed with these elements, where:
• The input might be a word like "running".
• The output would be "run", along with the feature "present participle".
The transducer works by applying the rules in a sequence of states:
1. Initial state: Reads the word "running".
2. Rule application: The FST applies the rule for removing "-ing" (a suffix) from verbs.
3. Final state: Outputs the root "run" with its associated feature (present participle).
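The FST itself is not drawn in these notes, but the lexicon-plus-rules behaviour described above can be sketched in a few lines of Python. The two-entry lexicon and the consonant-undoubling check are simplifying assumptions for illustration; a real morphological FST encodes many more alternations.
```python
# Toy morphological analyzer in the spirit of the lexicon + rules above.
LEXICON = {"run": "verb", "walk": "verb"}   # illustrative lexicon only

def analyze(word):
    if word.endswith("ing"):
        stem = word[:-3]
        # Undo consonant doubling ("runn" -> "run") if the stripped form is unknown.
        if stem not in LEXICON and len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        if stem in LEXICON:
            return {"root": stem, "pos": LEXICON[stem], "feature": "present participle"}
    if word in LEXICON:
        return {"root": word, "pos": LEXICON[word], "feature": "base form"}
    return None

print(analyze("running"))   # {'root': 'run', 'pos': 'verb', 'feature': 'present participle'}
print(analyze("walking"))   # {'root': 'walk', 'pos': 'verb', 'feature': 'present participle'}
```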
Types of Tokenization
1. Word Tokenization:
• In word tokenization, the input text is split into individual words. This is the most
common form of tokenization.
• For example:
• Input text: "I love programming."
• Tokenized output: ["I", "love", "programming", "."]
• Punctuation marks (e.g., periods, commas) may be treated as separate tokens or
included with the words depending on the tokenizer's settings.
2. Sentence Tokenization:
• Sentence tokenization involves splitting the input text into individual sentences.
• For example:
• Input text: "I love programming. It is my passion."
• Tokenized output: ["I love programming.", "It is my
passion."]
• This type of tokenization is used when the analysis needs to focus on sentences
rather than individual words.
3. Character Tokenization:
• In character tokenization, the input text is split into individual characters, rather
than words or sentences.
• For example:
• Input text: "Hello"
• Tokenized output: ["H", "e", "l", "l", "o"]
• Character tokenization is often used in tasks like character-level language models,
spelling correction, or language modeling for morphologically rich languages.
4. Subword Tokenization:
• Subword tokenization splits words into smaller units, often at the level of
morphemes, or using machine learning-based methods like Byte Pair Encoding
(BPE), WordPiece, or SentencePiece.
• For example:
• Input text: "unhappiness"
• Tokenized output (using subword tokenization): ["un", "happiness"]
or ["un", "##happiness"] (depending on the method used).
• Subword tokenization is particularly useful in handling out-of-vocabulary (OOV)
words, such as rare or compound words, in deep learning-based NLP systems.
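A quick sketch of the first three tokenization types using Python's re module (subword tokenization is omitted because it needs a trained merge vocabulary, e.g. from BPE). The patterns are deliberately simple assumptions and would need refinement for abbreviations, quotes, and similar cases.
```python
import re

text = "I love programming. It is my passion."

# Word tokenization: words or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", text)

# Sentence tokenization: split after ., ! or ? followed by whitespace (toy rule).
sentences = re.split(r"(?<=[.!?])\s+", text)

# Character tokenization.
chars = list("Hello")

print(words)      # ['I', 'love', 'programming', '.', 'It', 'is', 'my', 'passion', '.']
print(sentences)  # ['I love programming.', 'It is my passion.']
print(chars)      # ['H', 'e', 'l', 'l', 'o']
```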
Challenges in Tokenization
1. Punctuation:
• Deciding whether punctuation marks should be included as separate tokens or
attached to the words they follow is a common challenge. For example, should "I
don't know." be tokenized as ["I", "don't", "know", "."] or ["I",
"don’t", "know."]?
2. Word Boundaries:
• Some languages, like Chinese or Japanese, do not use spaces to separate words.
Tokenizing text in these languages requires specialized algorithms that can accurately
detect word boundaries.
3. Compound Words:
• Some languages or contexts feature compound words that may need to be split or
treated as single tokens (e.g., "icecream" or "toothpaste").
4. Ambiguity:
• Tokenization may suffer from ambiguities, especially in cases where a token may
have different meanings depending on the context (e.g., "I saw her, a woman" vs. "I
saw her in the park"). The word "saw" could be a verb or noun based on context,
which complicates tokenization.
5. Hyphenated Words:
• Words connected by hyphens (e.g., "well-being" or "high-end") may need special
treatment to decide whether to tokenize them as a single word or split them.
6. Language Variability:
• The tokenization process needs to account for the diverse grammar, punctuation, and
morphology across languages. For example, tokenizing Arabic, which is written from
right to left, presents additional challenges compared to English.
3. Phonetic Algorithms
• Phonetic algorithms, like Soundex or Metaphone, map words to their phonetic
representations, helping to identify words that sound similar but may be spelled differently.
• How it works: Phonetic algorithms generate codes for words based on their pronunciation.
Misspelled words that sound like valid words can then be matched to those in the dictionary.
• Example: "fone" and "phone" would be mapped to the same phonetic code.
• Pros: Useful for detecting errors where words are spelled phonetically but incorrectly (e.g.,
homophones or regional spelling variations).
• Cons: May not handle non-phonetic errors well, and phonetic codes can sometimes match
unrelated words.
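A simplified Soundex sketch in Python (the full algorithm has extra rules for same-code letters separated by h/w versus vowels, which are ignored here). Note that because Soundex keeps the first letter, "fone" and "phone" receive different codes under this variant; matching that pair is something a Metaphone-style algorithm handles more directly.
```python
def soundex(word, length=4):
    """Simplified Soundex: keep the first letter, encode the rest by sound
    class, drop vowels/h/w/y, collapse repeated codes, pad with zeros."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits = [codes[ch] for ch in word if ch in codes]
    # Collapse runs of the same digit, e.g. "22" -> "2".
    collapsed = [d for i, d in enumerate(digits) if i == 0 or d != digits[i - 1]]
    # The first letter is kept as a letter, so drop its own digit if it leads the code.
    if collapsed and codes.get(word[0]) == collapsed[0]:
        collapsed = collapsed[1:]
    return (word[0].upper() + "".join(collapsed))[:length].ljust(length, "0")

print(soundex("smith"), soundex("smyth"))   # S530 S530 -> same code
print(soundex("phone"), soundex("fone"))    # P500 F500 -> first letter differs
```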
2. Frequency-Based Correction
• Word frequency or bigram frequency can be used to determine the most likely correction.
• How it works: Words that are more frequent in the language model are preferred. For
instance, if "teh" is detected as a misspelling, it might be corrected to "the" since "the" is
more frequent in general language use.
• Example: In the context of English text, "hte" might be corrected to "the" because "the"
appears more often in corpora.
• Pros: Works well when dealing with simple typos and common mistakes.
• Cons: May fail for rare or domain-specific terms.
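A minimal sketch of frequency-based selection: assuming some earlier step (for example an edit-distance search) has already produced candidate corrections, the most frequent candidate wins. The counts below are invented purely for illustration.
```python
from collections import Counter

# Toy corpus frequencies; in practice these come from a large corpus.
word_counts = Counter({"the": 52000, "then": 3100, "tea": 450, "ten": 900})

def pick_correction(candidates):
    """Among candidate corrections, prefer the most frequent word."""
    return max(candidates, key=lambda w: word_counts.get(w, 0))

# Suppose an earlier step proposed these candidates for the typo "teh":
print(pick_correction(["the", "ten", "tea"]))   # -> "the"
```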
Steps:
1. Initialization:
• The first row is initialized as D[0][j] = j (i.e., transforming an empty string to
the first j characters of s2 requires j insertions).
• The first column is initialized as D[i][0] = i (i.e., transforming the first i
characters of s1 to an empty string requires i deletions).
2. Matrix filling:
• For each pair of characters (s1[i-1], s2[j-1]), compute the cost for each
operation:
• Insertion: D[i][j-1] + 1
• Deletion: D[i-1][j] + 1
• Substitution: If s1[i-1] == s2[j-1], then no substitution is needed
(D[i-1][j-1]); otherwise, it is D[i-1][j-1] + 1.
• Take the minimum of these three values to determine D[i][j].
3. Result:
• The final value D[len(s1)][len(s2)] gives the Levenshtein distance.
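A direct Python implementation of the three steps above; the function name levenshtein is just a label for this sketch.
```python
def levenshtein(s1, s2):
    """Full dynamic-programming edit distance, following the steps above."""
    n, m = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and the first j chars of s2.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i                      # i deletions
    for j in range(m + 1):
        D[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i][j - 1] + 1,          # insertion
                          D[i - 1][j] + 1,          # deletion
                          D[i - 1][j - 1] + cost)   # substitution / match
    return D[n][m]

print(levenshtein("kitten", "sitting"))   # 3
```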
Example
Let’s calculate the Levenshtein distance between the words "kitten" and "sitting":
1. Initialize the matrix with the lengths of the two strings:
        ""  k  i  t  t  e  n
  ""     0  1  2  3  4  5  6
  s      1  1  2  3  4  5  6
  i      2  2  1  2  3  4  5
  t      3  3  2  1  2  3  4
  t      4  4  3  2  1  2  3
  i      5  5  4  3  2  2  3
  n      6  6  5  4  3  3  2
  g      7  7  6  5  4  4  3
2. After calculating, we find that the Levenshtein distance between "kitten" and "sitting" is 3,
because it requires three operations:
• Substitute "k" → "s"
• Substitute "e" → "i"
• Insert "g" at the end
2. Fuzzy Matching
• Levenshtein distance is often used in fuzzy matching, where exact string matches are not
required. It helps find similar strings even when they are slightly different due to
misspellings, typographical errors, or variations in text (e.g., matching user inputs with
database records).
• This is useful in applications like search engines, data deduplication, or information
retrieval.
3. Plagiarism Detection
• In plagiarism detection, Levenshtein distance helps measure the similarity between two
pieces of text. If a piece of text is a paraphrase or closely similar to another, Levenshtein
distance can help assess how much text has been copied or modified.
• Example: "The quick brown fox jumps over the lazy dog" and "A fast, dark-colored fox
leaps over a sleepy dog" will have a non-zero Levenshtein distance, suggesting textual
similarity.
4. Speech Recognition
• In speech-to-text systems, Levenshtein distance can be used to compare the output text with
the reference transcript. The edit distance tells how close the recognized speech is to the
correct transcription.
• The lower the Levenshtein distance, the more accurate the transcription.
5. Machine Translation
• In machine translation, Levenshtein distance can be used as a measure of how similar the
machine-generated translation is to the correct human translation.
• It can be used to evaluate the quality of translations by comparing the output with ground-
truth sentences.
6. Text Normalization
• Levenshtein distance can be used for text normalization tasks, such as correcting informal
spellings, slang, or abbreviations in text (e.g., converting "u" to "you" or "b4" to "before").
Time Complexity
• The time complexity of the Levenshtein distance algorithm is O(n * m), where n and m
are the lengths of the two strings being compared. This is because the algorithm requires
filling an (n+1) x (m+1) matrix, with each cell representing a state transition between
the two strings.
• Space complexity can also be reduced to O(min(n, m)) by storing only the current and
previous rows of the matrix (since only these rows are needed for calculating the next step).
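A sketch of the two-row optimization mentioned above; it returns the same distance while keeping only the previous and current rows of the matrix.
```python
def levenshtein_two_rows(s1, s2):
    """Same distance as the full-matrix version, but with O(min(n, m)) extra space."""
    if len(s1) < len(s2):                 # iterate over the longer string's rows
        s1, s2 = s2, s1
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(current[j - 1] + 1,       # insertion
                               previous[j] + 1,          # deletion
                               previous[j - 1] + cost))  # substitution / match
        previous = current
    return previous[-1]

print(levenshtein_two_rows("kitten", "sitting"))  # 3
```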
1. Perplexity
Perplexity is one of the most commonly used evaluation metrics for n-gram models. It measures
how well a probabilistic model predicts a sample. Lower perplexity indicates that the model is
better at predicting the next word in a sequence.
• Definition: Perplexity is the exponentiation of the cross-entropy of the model, and it reflects
how well the model can predict the test data.
For a given test set of size N, with true words w1,w2,…,wN, the perplexity PP is defined as:
PP = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, w_2, \dots, w_{i-1}) \right)
In simpler terms:
• It calculates the average log probability of each word in the test set, which measures
how surprised the model is by each word.
• Perplexity is the exponential of this value, which gives an interpretable measure of
how many words the model is "perplexed" by.
• Interpretation:
• A lower perplexity indicates better predictive performance (the model is less
"surprised" by the test data).
• A higher perplexity indicates poorer performance (the model struggles to predict the
next word).
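A small sketch of the perplexity computation, assuming we already have the model's probability for each test word; the probability lists are made-up values chosen only to contrast a confident and an unconfident model.
```python
import math

def perplexity(word_probs):
    """PP = exp(-(1/N) * sum(log p_i)) over the model's per-word probabilities."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A model that is fairly confident about every word has low perplexity...
print(perplexity([0.5, 0.4, 0.6, 0.5]))     # ~2.0
# ...while a model that spreads probability thinly is more "perplexed".
print(perplexity([0.05, 0.1, 0.02, 0.08]))  # ~18.8
```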
2. Cross-Entropy
Cross-entropy is closely related to perplexity and is another common metric for evaluating the
performance of n-gram models. It measures the difference between the true distribution of the data
and the distribution predicted by the model.
• Definition: Cross-entropy is defined as the negative log-likelihood of the model's predicted
probabilities of the test set words.
For a test set with N words, the cross-entropy H(P,Q) between the true distribution P and the
model’s predicted distribution Q is:
H(P, Q) = -\frac{1}{N} \sum_{i=1}^{N} \log Q(w_i)
Here, Q(w_i) is the probability assigned to the word w_i by the n-gram model.
• Interpretation:
• A lower cross-entropy indicates that the model’s predicted probabilities are close to
the actual distribution of the data, meaning better performance.
• Cross-entropy can also be viewed as a measure of surprise—if the model assigns
high probability to the correct word, it’s less surprised (lower cross-entropy).
3. Accuracy
Accuracy is another straightforward evaluation metric for n-gram models, particularly in tasks such
as speech recognition, machine translation, or text classification, where the model’s task is to
predict a sequence of words.
• Definition: Accuracy measures the proportion of correct predictions (or correctly predicted
n-grams) to the total number of predictions. It can be calculated for individual n-grams or as
an overall metric.
\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
• Interpretation:
• Accuracy is useful when comparing predicted sequences of words to the actual target
sequences.
• However, in some contexts (like language modeling), accuracy may not be the best
metric because of the sparseness of the correct n-grams.
4. BLEU (Bilingual Evaluation Understudy)
BLEU score is a metric commonly used for evaluating machine translation systems, but it can also
be used for general text generation tasks, where an n-gram model is used to generate sequences of
words.
• Definition: BLEU evaluates how many n-grams in the generated text overlap with n-grams
in reference texts. It rewards n-grams that appear in both the prediction and the reference.
\text{BLEU} = \min\left(1, \frac{\text{generated n-gram count}}{\text{reference n-gram count}}\right) \times P_n
• Pn is the precision of n-grams (e.g., bigrams, trigrams).
• BLEU applies a brevity penalty to discourage overly short generated texts that
match a reference.
• Interpretation:
• A higher BLEU score indicates better matching between the model's output and the
reference text.
• BLEU evaluates the precision of n-grams at different levels, which helps measure the
fluency and quality of the text generated by an n-gram model.
Formula:
For a unigram model, the smoothed probability P(w) for word w is calculated as:
P(w) = \frac{\text{count}(w) + 1}{\text{total words in corpus} + V}
Where:
• count(w) is the count of word w in the training corpus.
• V is the size of the vocabulary (i.e., the total number of distinct words in the corpus).
For bigrams, the probability P(w2∣w1) for a sequence of words w1,w2 is calculated as:
P(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2) + 1}{\text{count}(w_1) + V}
This approach adds 1 to the frequency of each n-gram and adjusts the denominator to account for
the new possibilities created by adding the smoothing term.
• Advantages: Simple to implement and guarantees non-zero probabilities for unseen n-
grams.
• Disadvantages: The addition of 1 might be excessive for large corpora with frequent n-
grams, causing over-smoothing.
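A minimal add-one (Laplace) bigram estimator over a toy corpus, just to show how the +1 and +V terms enter the computation; the corpus itself is an assumption for illustration.
```python
from collections import Counter

corpus = "the cat sat on the mat the cat slept".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)                       # vocabulary size

def laplace_bigram(w1, w2):
    """Add-one estimate: P(w2|w1) = (count(w1,w2) + 1) / (count(w1) + V)."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(laplace_bigram("the", "cat"))    # ~0.33, a seen bigram
print(laplace_bigram("the", "dog"))    # ~0.11, an unseen bigram, still non-zero
```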
2. Good-Turing Smoothing
Good-Turing Smoothing is a more advanced smoothing technique that estimates the probability of
unseen n-grams based on the frequency of n-grams that have appeared once in the training corpus.
It adjusts probabilities by redistributing the probability mass from observed n-grams to unseen ones.
Formula:
Let N1 be the number of n-grams that occurred once, N2 the number of n-grams that occurred
twice, and so on. The probability for unseen n-grams is given by:
P(\text{unseen}) = \frac{N_1}{N}
Where:
• N1 is the number of n-grams that appear exactly once in the training corpus.
• N is the total number of n-grams in the corpus.
For n-grams that occurred c times, the count is adjusted using the formula:
c^* = \frac{(c+1) \cdot N_{c+1}}{N_c}
and the smoothed probability is then c^*/N, where N_c is the count of n-grams that appeared c times.
• Advantages: Provides a more sophisticated estimate of probabilities for unseen n-grams
than simple additive smoothing.
• Disadvantages: Requires calculating counts of n-grams with specific frequencies (e.g., n-
grams with 1, 2, 3 occurrences), which can be computationally expensive.
3. Kneser-Ney Smoothing
Kneser-Ney Smoothing is an advanced and highly effective smoothing technique that works
particularly well for large corpora and high-order n-grams (like trigrams and beyond). It combines
discounting (reducing the probability mass of observed n-grams) with a back-off strategy (using
lower-order n-grams when higher-order n-grams are not observed).
The basic idea is to subtract a constant (discount factor) D from the count of each n-gram and
redistribute the probability mass to unseen n-grams based on their lower-order n-grams.
Formula:
The smoothed probability of a bigram P(w2∣w1) is calculated as:
P(w_2 \mid w_1) = \frac{\max(\text{count}(w_1, w_2) - D,\ 0)}{\text{count}(w_1)} + \lambda(w_1)\, P_{\text{backoff}}(w_2)
Where:
• D is a discount factor, typically between 0 and 1.
• λ is a normalizing constant.
• Pbackoff(w2) is the probability of w2 based on a lower-order model (e.g., unigram or
bigram).
For unseen bigrams, the model "backs off" to the lower-order unigram model, redistributing probability
mass.
• Advantages: Highly effective for language modeling, particularly for high-order n-grams
and large corpora. Often used in modern systems.
• Disadvantages: More complex to implement than simpler techniques like Laplace
smoothing.
4. Witten-Bell Smoothing
Witten-Bell Smoothing is another approach that focuses on adjusting the probability of unseen n-
grams using information from lower-order n-grams. This smoothing method is based on the
intuition that unseen n-grams are likely to share characteristics with n-grams that have been
observed a few times.
Formula:
The interpolated Witten-Bell probability P(w_2 | w_1) of a bigram is calculated as:
P(w_2 \mid w_1) = \frac{\text{count}(w_1, w_2) + T(w_1)\, P(w_2)}{\text{count}(w_1) + T(w_1)}
Where T(w_1) is the number of distinct word types observed after w_1 in the training data, P(w_2) is the
lower-order (unigram) estimate, and the other terms are as in the previous smoothing techniques.
• Advantages: More sophisticated than Laplace smoothing and particularly effective in
contexts like speech recognition.
• Disadvantages: More complex than Laplace and Good-Turing smoothing, requiring more
computational resources.
5. Back-off Models
Back-off models use lower-order n-grams when higher-order n-grams are not observed. In other
words, when a higher-order n-gram (like a trigram) is missing, the model "backs off" to a lower-
order n-gram (like a bigram or unigram).
• Example: For a trigram model, if the bigram "I am" has been observed but the trigram "I am
happy" has not, the model may back off to the probability of the bigram "I am", or even the
unigram "I".
1. Backoff Models
Backoff is a technique where the model defaults to a lower-order n-gram model when higher-order
n-grams are not available. This is useful in situations where a sequence of words (like a trigram) has
never been observed during training.
• Basic Idea: If a trigram like "I am happy" has never been seen, but the bigram "I am"
exists, the model can "back off" to the bigram model to estimate the probability of the next
word.
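A sketch of this back-off idea over a toy corpus; the constant back-off weight of 0.4 (in the style of "stupid backoff") is an assumption for illustration, not a properly normalized model.
```python
from collections import Counter

corpus = "i am happy i am here i am".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def backoff_score(w1, w2, w3, alpha=0.4):
    """Score for w3 given (w1, w2): use the trigram if seen, otherwise back
    off to the bigram, otherwise to the unigram."""
    if trigrams[(w1, w2, w3)] > 0:
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]
    if bigrams[(w2, w3)] > 0:
        return alpha * bigrams[(w2, w3)] / unigrams[w2]
    return alpha * alpha * unigrams[w3] / N

print(backoff_score("i", "am", "happy"))   # ~0.33, trigram "i am happy" was seen
print(backoff_score("am", "i", "am"))      # 0.4, backs off to the bigram "i am"
print(backoff_score("i", "am", "am"))      # ~0.06, backs off to the unigram "am"
```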
Advantages of Backoff:
• Simple and intuitive: It allows the model to use available lower-order n-grams when
higher-order n-grams are missing.
• Handling unseen n-grams: It helps to avoid assigning a probability of zero to unseen n-
grams.
Challenges of Backoff:
• Data sparsity: In rare cases, lower-order n-grams (e.g., bigrams or unigrams) might also be
sparse.
• Backoff weight tuning: Selecting appropriate backoff weights (denoted by λ) can be
challenging, and improper selection can degrade model performance.
2. Interpolation Models
Interpolation is another technique for smoothing n-gram probabilities, where the model combines
multiple n-gram models (e.g., unigram, bigram, trigram) by assigning weights to each. The idea is
to give each model a "vote" on the probability of a word sequence and to combine these
probabilities in a weighted manner.
• Basic Idea: Instead of completely relying on one n-gram model, the interpolated model
blends different orders of n-grams to improve robustness and account for unseen n-grams.
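A sketch of linear interpolation over the same kind of toy counts; the lambda weights (0.6, 0.3, 0.1) are arbitrary assumptions, not tuned values.
```python
from collections import Counter

corpus = "i am happy i am here i am".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
N = len(corpus)

def interpolated(w1, w2, w3, l3=0.6, l2=0.3, l1=0.1):
    """P(w3 | w1 w2) as a weighted mix of trigram, bigram and unigram estimates."""
    tri = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    bi = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    uni = unigrams[w3] / N
    return l3 * tri + l2 * bi + l1 * uni

print(interpolated("i", "am", "happy"))   # ~0.31, all three levels contribute
print(interpolated("am", "i", "am"))      # ~0.34, trigram unseen but bigram/unigram still give mass
```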
Advantages of Interpolation:
• Flexible and robust: Interpolation allows combining different models and thus improves
generalization by providing smoother estimates.
• Works well with unseen n-grams: Even if the trigram doesn't appear, the bigram or
unigram can contribute to the probability, preventing zero probabilities for unseen n-grams.
Challenges of Interpolation:
• Weight tuning: Like backoff, the weights (λ1,λ2,λ3) need to be carefully tuned to get the
best performance.
• Computational complexity: More models mean more computations, especially with higher-
order n-grams.
3. Interpolation vs Backoff
• Backoff: The model "backs off" to lower-order n-grams (e.g., trigram → bigram →
unigram) if higher-order n-grams are not observed. It’s simpler but might result in loss of
information when shifting to lower-order models.
• Interpolation: The model blends multiple n-gram models, allowing them to contribute
probabilistically. This method ensures that all models (higher-order and lower-order) have a
role in estimating probabilities, but it requires tuning the weights.
When to Use:
• Backoff is useful when you want a simple, hierarchical model that is easy to implement and
works well in many cases, especially when higher-order n-grams are sparse.
• Interpolation is ideal when you want to blend models of different orders and don’t want to
strictly rely on one model.
Example:
In a model with word classes, a trigram might be represented as:
• Original trigram: "The dog barks"
• Class-based trigram: "Det Noun Verb"
Here, the class-based trigram reduces the vocabulary size by considering general classes instead of
specific words, which is especially helpful in domains with large vocabulary sizes.
Example of a Lexicon:
A lexicon in rule-based POS tagging may contain entries like:
• "dog" → Noun
• "run" → Verb
• "quickly" → Adverb
• "is" → Verb (present tense)
Example of a Rule:
One simple rule could be:
• If a word follows a determiner (e.g., "the", "a"), it is most likely a noun:
• Rule: If the previous word is a determiner (DT) and the current word is a singular
noun (NN), tag the word as a noun.
Strengths of HMM
1. Simple and Efficient: HMMs are relatively simple and computationally efficient for
sequential data.
2. Clear Probabilistic Interpretation: The probabilistic nature of HMMs provides a clear
understanding of model uncertainty.
3. Effective for Sequential Data: HMMs perform well when the sequence has strong
Markovian dependencies and can be modeled with the assumption that current states depend
mostly on the previous state.
Limitations of HMM
1. Assumption of Independence: The Markov assumption (where the state depends only on
the previous state) may be too simplistic for many real-world problems, as many tasks
require considering broader context.
2. Limited Context: HMMs rely only on first-order dependencies (the immediate past state)
and do not handle long-range dependencies well.
3. Parameter Estimation: For tasks with complex vocabularies or large tag sets (e.g., in POS
tagging), the model may require a large amount of training data to accurately estimate
transition and emission probabilities.
Applications of HMMs
• Speech Recognition: HMMs are used to model sequences of speech sounds and recognize
spoken words.
• POS Tagging: HMMs are used to assign POS tags to words in a sentence.
• Named Entity Recognition (NER): HMMs are used to identify proper names (e.g.,
locations, organizations) in text.
• Bioinformatics: HMMs are used for gene prediction and sequence alignment in genomics.
Key Concepts
1. Entropy
• Entropy is a measure of uncertainty or unpredictability in a system. In information theory, it
quantifies the average "amount of surprise" in a set of outcomes.
• A probability distribution with higher entropy means it is more "spread out" or uncertain.
Conversely, a distribution with low entropy is more "concentrated" or deterministic.
3. Feature Functions
A MaxEnt model typically involves a set of features that capture the relevant information or
constraints about the data. These features are used to define the probability distribution over
possible outcomes.
For example, in POS tagging, the features might include:
• The current word in the sentence.
• The previous word (for capturing contextual information).
• The part of speech of the previous word.
• Word prefixes or suffixes.
The goal is to learn a model that maximizes entropy while satisfying the constraints imposed by the
features.
Unit 2
Derivations
Given the above CFG, let's derive a sentence:
• Start with S.
• S→NP VP
• NP→Det N
• Det→"the", N→"cat"
• So NP → "the cat".
• VP → V NP
• V → "chases", NP → "the dog".
• So VP → "chases the dog".
Thus, the sentence derived from the CFG is:
"The cat chases the dog."
Here, each node represents a non-terminal (e.g., S, NP, VP, etc.), and the leaves are the terminal
symbols (e.g., "the", "cat", "chases").
Advantages of CFG
1. Expressive Power: CFGs can describe a wide range of syntactic structures and are capable
of generating many natural languages.
2. Well-Established Theory: The theory behind CFGs is well-understood, and there are many
efficient algorithms for parsing and generating sentences.
3. Extensibility: CFGs can be extended to more complex grammatical frameworks, like
Extended CFGs or Tree Adjoining Grammars (TAGs), for more complex languages.
Limitations of CFG
1. Ambiguity: Many natural languages are ambiguous, and a single CFG might produce
multiple parse trees for a single sentence.
2. Limited Expressiveness: Some linguistic phenomena (such as cross-serial dependencies in
some languages) cannot be adequately captured by a CFG.
3. Inability to Capture Context Sensitivity: CFGs cannot capture dependencies that depend
on the context, such as agreement constraints or long-range dependencies.
2. Syntax Parsing
Syntax parsing is the process of analyzing a sentence structure based on a set of grammar rules. The
most common types of syntax parsing in NLP are constituency parsing (phrase structure) and
dependency parsing.
1. Constituency Parsing (Phrase Structure Grammar)
In constituency parsing, the goal is to break down a sentence into its constituent parts (such as
noun phrases, verb phrases, etc.). The grammar rules are usually represented in a Context-Free
Grammar (CFG) format.
• Example:
• Sentence: "The cat sleeps."
• Grammar rules:
• S → NP VP (Sentence → Noun Phrase + Verb Phrase)
• NP → Det N (Noun Phrase → Determiner + Noun)
• VP → V (Verb Phrase → Verb)
• Parse Tree:
S
/ \
NP VP
/ \ |
Det N V
| | |
The cat sleeps
2. Dependency Parsing
In dependency parsing, the goal is to represent the grammatical structure of a sentence in terms of
dependencies between words: each word (except the root) is attached to its head, the word that
governs it.
• Example:
• Sentence: "The cat sleeps."
• Dependency structure:
• "sleeps" (verb) is the root (main verb)
• "cat" (noun) is the subject of "sleeps"
• "The" (article) modifies "cat"
Graph representation:
sleeps
|
cat
|
The
1. Word Formation:
• Affixes:
• Prefixes: "un-" (unhappy), "re-" (rebuild)
• Suffixes: "-ing" (running), "-ed" (walked)
• Inflection: Changing a word form to express grammatical features like tense, number,
gender, etc.
• Verb inflections: "walk" → "walks" (third-person singular), "walked" (past tense),
"walking" (present participle)
• Noun inflections: "cat" → "cats" (plural)
• Derivation: Creating new words by adding prefixes or suffixes.
• "Happy" → "Happiness" (noun formation)
• "Teach" → "Teacher" (agent noun formation)
2. Stemming and Lemmatization:
• Stemming: A process that removes prefixes and suffixes from words to obtain their root
form. Example: "running" → "run".
• Lemmatization: Similar to stemming but aims to return the root word (lemma) that is a
valid word in the dictionary. Example: "better" → "good".
6. Parsing Ambiguities
Parsing can be challenging due to ambiguities in grammar. Ambiguities occur when a sentence can
have more than one interpretation or structure. This happens in:
• Lexical Ambiguity: Words have multiple meanings (e.g., "bank" can refer to a financial
institution or the side of a river).
• Syntactic Ambiguity: A sentence has multiple valid parse trees (e.g., "I saw the man with
the telescope").
Example:
Sentence: "I saw the man with the telescope."
• Interpretation 1: "I used the telescope to see the man."
• Interpretation 2: "The man I saw was holding a telescope."
Treebanks in NLP
A treebank is a large annotated corpus that provides linguistic annotations for text in the form of
syntactic structures, typically as parse trees. These trees represent the syntactic structure of
sentences, showing how words and phrases relate to each other within a sentence according to a
particular grammatical theory (such as constituency grammar or dependency grammar).
Treebanks are vital resources in Natural Language Processing (NLP) as they are used for training
and evaluating syntactic parsing models, and they help in tasks like part-of-speech (POS) tagging,
machine translation, and information extraction.
Key Concepts of Treebanks
1. Syntactic Annotation:
• Each sentence in a treebank is annotated with a syntactic structure, usually in the
form of a tree diagram.
• The tree consists of nodes (representing words or syntactic constituents) and edges
(representing grammatical relationships between them).
• Annotations often follow a specific grammatical theory (such as Phrase Structure
Grammar or Dependency Grammar).
2. Constituency vs. Dependency Parsing:
• Constituency Treebanks: The tree structure represents hierarchical constituency
relationships. Phrases are nested inside each other (e.g., noun phrases inside verb
phrases).
• Example: The sentence "The cat sleeps" would be parsed as S → NP VP
(Sentence → Noun Phrase + Verb Phrase).
• Dependency Treebanks: The tree structure represents grammatical relationships
between words, where each word is connected to another word, with one root word
governing the others.
• Example: "The cat sleeps" would be represented with "sleeps" as the root
word, and "cat" as its dependent, with "The" depending on "cat".
Types of Treebanks
1. Annotated Constituency Treebanks:
• These treebanks use constituency grammar to represent sentence structure. They
focus on hierarchically grouping words into phrases (e.g., noun phrases, verb
phrases).
• Example: Penn Treebank.
2. Annotated Dependency Treebanks:
• These treebanks use dependency grammar to represent the relationships between
words in terms of head-dependent relations.
• Example: Universal Dependencies (UD).
4. PropBank
• PropBank extends the Penn Treebank with annotations for verb arguments and rolesets. It
provides a resource for semantic role labeling, where the roles of different participants in the
event described by a verb are labeled (e.g., agent, patient).
• Example: In the sentence "John ate the pizza," "John" would be labeled as the Agent and
"pizza" as the Theme.
5. OntoNotes
• OntoNotes is a large-scale corpus that includes syntactic, semantic, and coreference
annotations.
• It builds upon the Penn Treebank and provides rich semantic annotations to improve tasks
like named entity recognition (NER), coreference resolution, and semantic role labeling.
Importance of Treebanks in NLP
1. Training and Evaluation of Parsers:
• Treebanks are essential for training syntactic parsers that learn to identify sentence
structure. These parsers are evaluated based on how accurately they can reproduce
the syntactic structures found in a treebank.
2. Cross-Linguistic Research:
• Treebanks for multiple languages allow for comparative studies of linguistic
structures across languages. The Universal Dependencies project, for example,
makes it easier to develop multilingual parsers and compare syntactic features of
different languages.
3. Semantic Role Labeling (SRL):
• Treebanks with semantic annotations (e.g., PropBank and OntoNotes) provide the
foundation for tasks like semantic role labeling, where the roles of different
participants in an action (like agents, patients, and instruments) are identified.
4. Machine Translation:
• Syntactic information from treebanks can improve machine translation by
providing structure-sensitive translation models. The parse trees from a treebank
offer a way to represent sentences in a formal, structured manner that is more easily
translated into another language.
5. Part-of-Speech Tagging and Named Entity Recognition (NER):
• Treebanks often come with part-of-speech (POS) tags and named entity
annotations that help in POS tagging, NER, and other tasks requiring accurate
word-level annotation.
In formal language theory, a normal form for a grammar is a specific way of rewriting a grammar
to conform to a certain set of rules that simplify or standardize its structure. Normal forms are used
in both Context-Free Grammars (CFGs) and Context-Sensitive Grammars (CSGs) to make
tasks such as parsing and simplification easier. These normal forms help in the design and
implementation of parsing algorithms.
This is not in Chomsky Normal Form: CNF only allows productions of the form A → BC (exactly two
non-terminals) or A → a (a single terminal), and the grammar above mixes terminals and non-terminals
in the same right-hand side.
To convert it to CNF, we would make sure that every rule follows the structure mentioned earlier.
Here, the rule S → aA is in GNF because it starts with a terminal a followed by a non-terminal A.
Similarly, A → b is a valid production in GNF.
Dependency Parsing
Dependency parsing is the process of analyzing a sentence to determine the syntactic structure by
identifying the dependency relations between words. It involves two primary tasks:
1. Identifying the head of each word.
2. Assigning dependency labels to the relationships between words.
There are two main types of dependency parsers:
1. Transition-Based Parsers:
• These parsers build the dependency tree incrementally by applying a series of
transitions that modify the state of the parser.
• They are often fast and efficient, making them ideal for real-time applications.
• Example: Shift-Reduce Parsing and Arc-Standard Parsing are examples of
transition-based parsing methods.
2. Graph-Based Parsers:
• These parsers approach parsing by considering all possible dependency relations as a
graph and choosing the most likely tree structure based on statistical models.
• They often use dynamic programming or maximum spanning tree algorithms.
• Example: Eisner's Algorithm is one of the well-known algorithms used for graph-
based parsing.
Example of Dependency Tree
Consider the sentence: "The cat chased the mouse."
The corresponding dependency tree would look like this:
        chased
       /      \
     cat      mouse
      |         |
     The       the
1. Constituency Parsing
In constituency parsing, the sentence is broken down into subgroups called constituents, which
correspond to syntactic units like noun phrases (NP), verb phrases (VP), and prepositional phrases
(PP). The parse tree produced in constituency parsing reflects these hierarchical structures, where
each node represents a phrase or word.
Key Characteristics:
• Constituents: Phrases like noun phrases (NP), verb phrases (VP), adjective phrases (ADJP),
etc.
• Hierarchy: Constituents are combined into larger constituents, forming a hierarchical
structure.
• Context-Free Grammar (CFG): Constituency parsing typically follows Context-Free
Grammar (CFG) rules, where a non-terminal symbol can expand into one or more non-
terminal symbols and terminal symbols.
2. Dependency Parsing
Dependency parsing focuses on the relationships between individual words. It identifies the head
of each word and its dependents. The parse tree produced in dependency parsing is a directed
acyclic graph (DAG) where the words are connected by directed edges that represent syntactic
dependencies.
Key Characteristics:
• Head-Dependent Structure: Each word is connected to a governing word (head), and these
dependencies represent syntactic roles.
• Directionality: The edges are directed, indicating the direction of the syntactic relationship.
• No Hierarchical Phrase Structure: Unlike constituency parsing, which is based on
hierarchical phrase structure, dependency parsing represents the structure in terms of
relationships between words.
3. Parsing Techniques
There are several approaches and algorithms for syntactic parsing, both for constituency and
dependency parsing:
a. Top-Down Parsing
• Top-down parsing starts from the root of the tree and recursively tries to expand non-
terminal symbols until it matches the sentence.
• It uses a Context-Free Grammar (CFG) and tries to match the entire sentence by
predicting the possible structure of the sentence and then checking if it fits.
• Example: Recursive Descent Parsing is a popular top-down parsing technique.
b. Bottom-Up Parsing
• Bottom-up parsing begins with the words (terminals) in the sentence and combines them to
form constituents, gradually building the sentence structure.
• This approach is generally more efficient for handling ambiguity than top-down methods.
• Example: Earley Parsing and CYK Parsing (Cocke-Younger-Kasami) are common
bottom-up parsing methods.
c. Chart Parsing
• Chart parsing uses a dynamic programming approach to build partial parse trees. It can
be used for both constituency and dependency parsing, and is particularly useful for parsing
ambiguous sentences.
• It uses a chart (a table-like structure) to store intermediate parsing results, allowing the
parser to avoid redundant work.
• Example: The CYK algorithm is widely used for CFG-based parsing.
d. Transition-Based Parsing
• In transition-based parsing, a parser builds the dependency tree incrementally by applying
a sequence of transitions that change the state of the parser.
• The transitions move from one state to another by either shifting a word from the input into
a stack or reducing a stack of words into a dependency relation.
• Example: The Arc-Standard and Arc-Eager parsing algorithms are common transition-
based parsers.
e. Graph-Based Parsing
• Graph-based parsing focuses on generating a parse tree by considering all possible
dependency relations as a graph, where the goal is to find the maximum spanning tree of
the graph.
• Example: Eisner's Algorithm is one of the popular methods for graph-based dependency
parsing.
4. Parsing Evaluation
To evaluate the performance of a syntactic parser, different metrics are used, depending on the task
and the type of parsing (constituency or dependency). Common evaluation metrics include:
• Precision: The proportion of correctly identified syntactic structures out of all identified
structures.
• Recall: The proportion of correctly identified syntactic structures out of all true structures.
• F1 Score: The harmonic mean of precision and recall.
• Exact Match: The percentage of completely correct parses (often used for dependency
parsing).
• Unlabeled Attachment Score (UAS): Measures how many words are attached correctly,
without considering the specific dependency label.
• Labeled Attachment Score (LAS): Measures how many words are attached with the
correct dependency label.
5. Applications of Syntactic Parsing
Syntactic parsing is a fundamental component in many NLP applications:
1. Machine Translation:
• Syntactic parsing helps in mapping syntactic structures between source and target
languages, making it essential for accurate machine translation, especially for
languages with different syntactic structures.
2. Information Extraction:
• By identifying syntactic relations, parsers help systems extract relevant entities
(people, organizations, locations) and relations (e.g., "person X works at company
Y").
3. Sentiment Analysis:
• Syntactic analysis allows for understanding the grammatical structure of opinions
and sentiments, aiding in detecting sentiment in sentences where the meaning
depends on the syntactic relationships.
4. Question Answering:
• In question answering systems, syntactic parsing helps the system understand the
structure of a question, enabling it to find the relevant part of a document to extract
the correct answer.
5. Speech Recognition and Understanding:
• Accurate syntactic parsing improves speech-to-text systems by ensuring that the
transcribed sentence's syntactic structure is correctly understood.
1. Lexical Ambiguity
Lexical ambiguity occurs when a single word has multiple meanings, and its meaning is not clear
from the context. This is one of the most common types of ambiguity in NLP.
• Example 1: The word "bank" can refer to:
• A financial institution (e.g., "I went to the bank to withdraw money").
• The side of a river (e.g., "The boat landed on the bank of the river").
• Example 2: The word "bat" can mean:
• A flying mammal (e.g., "The bat flew through the night sky").
• A piece of sports equipment (e.g., "He hit the ball with the bat").
2. Syntactic Ambiguity
Syntactic ambiguity arises when a sentence or phrase can have more than one syntactic structure or
interpretation. This happens when words or phrases can be grouped or parsed in multiple ways.
• Example: "I saw the man with the telescope."
• Interpretation 1: I used a telescope to see the man (the telescope is the instrument
used).
• Interpretation 2: I saw a man who had a telescope (the man has the telescope).
• Example: "She told him that she would help him."
• Interpretation 1: She promised to help him.
• Interpretation 2: She said that she would help him, but the help may not be certain.
3. Semantic Ambiguity
Semantic ambiguity occurs when a sentence or phrase has multiple possible meanings, even after
resolving syntactic structure. This type of ambiguity is concerned with the meaning of words and
sentences.
• Example: "He is looking for a bat."
• Interpretation 1: He is searching for the flying mammal.
• Interpretation 2: He is searching for the sports equipment.
• Example: "The chicken is ready to eat."
• Interpretation 1: The chicken is cooked and ready for someone to eat it.
• Interpretation 2: The chicken itself is hungry and ready to eat something.
4. Pragmatic Ambiguity
Pragmatic ambiguity arises when a sentence is ambiguous because its meaning depends on context
or the speaker's intentions. It often involves social or conversational nuances that are not directly
stated in the sentence.
• Example: "Can you pass the salt?"
• Interpretation 1: A request for the action of passing the salt.
• Interpretation 2: A question asking if the person is capable of passing the salt.
• Example: "I can't wait to see you."
• Interpretation 1: Expressing excitement about seeing the person.
• Interpretation 2: Indicating impatience and not looking forward to it.
5. Structural Ambiguity
Structural ambiguity occurs when the grammatical structure of a sentence allows for more than one
interpretation, even if the individual words are unambiguous.
• Example: "I saw the man with the telescope."
• Interpretation 1: I saw the man who was holding the telescope.
• Interpretation 2: I used the telescope to see the man.
Handling Structural Ambiguity:
Structural ambiguity can be resolved by:
• Syntactic parsing: A more detailed syntactic analysis can distinguish between different
syntactic structures.
• Disambiguation based on the surrounding context: Using nearby words or general
discourse context to resolve which structure makes more sense.
7. Scope Ambiguity
Scope ambiguity arises when the scope of an operator (e.g., quantifiers, negations, modals) is
unclear and can be interpreted in different ways.
• Example: "Every student didn't pass the exam."
• Interpretation 1: No student passed the exam (the negation applies to "pass").
• Interpretation 2: Not every student passed the exam (the negation applies to "every
student").
For the input sentence "the man saw", Earley parsing would begin by predicting possible
expansions for the sentence, then scan and match tokens, and eventually construct a valid parse tree
by using the above production rules.
Example:
• Sentence: "The quick brown fox jumped over the lazy dog."
• Shallow parse result:
• NP (Noun Phrase): "The quick brown fox"
• VP (Verb Phrase): "jumped over the lazy dog"
• NP (Noun Phrase): "the lazy dog"
Here, the sentence is divided into two noun phrases (NP) and one verb phrase (VP). Shallow
parsing focuses on identifying these larger grammatical units.
1. Rule-based Chunking:
In rule-based shallow parsing, specific grammar rules are manually crafted to identify chunks based
on patterns of words or POS tags. For example, a rule might look like:
• NP → (DT) (JJ) (NN) (where DT = determiner, JJ = adjective, NN = noun)
• This rule would identify noun phrases that start with a determiner, followed by an
adjective and then a noun.
Example:
• Input: "The quick brown fox"
• With this rule, the chunker matches the determiner "The", the adjectives "quick" and "brown",
and the noun "fox", so the whole phrase "The quick brown fox" is identified as a single noun phrase (NP).
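A sketch of the NP → (DT) (JJ) (NN) rule above using NLTK's RegexpParser, assuming NLTK is installed and the POS tags are already available from a tagger.
```python
import nltk

# The chunk rule mirrors NP -> (DT) (JJ) (NN): an optional determiner,
# any number of adjectives, then a noun.
grammar = "NP: {<DT>?<JJ>*<NN>}"
chunker = nltk.RegexpParser(grammar)

tagged = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN")]
tree = chunker.parse(tagged)
print(tree)
# (S (NP The/DT quick/JJ brown/JJ fox/NN))
```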
3. Hybrid Methods:
Hybrid methods combine rule-based and machine learning approaches. For instance, a rule-based
system might be used for initial chunk identification, and then a machine learning model can refine
the results or handle edge cases.
5. Applications of Shallow Parsing
Shallow parsing is valuable in several NLP tasks that require quick, efficient analysis of sentence-
level structures without needing to build full parse trees. Some applications include:
• Information Extraction (IE): Extracting structured data from unstructured text, such as
names, dates, and locations. Shallow parsing helps identify relevant chunks (e.g., noun
phrases or named entities).
• Named Entity Recognition (NER): Identifying and classifying entities such as people,
organizations, and locations. NER often relies on shallow parsing to identify noun phrases
that are likely to contain named entities.
• Question Answering: Shallow parsing helps break down the question into its key
components (e.g., subject, object, verb), making it easier to map the question to relevant
answers.
• Sentiment Analysis: Breaking sentences into chunks allows for better identification of
sentiment-bearing phrases or clauses (e.g., "very happy", "quite sad").
• Machine Translation: Shallow parsing can aid in translating sentence components rather
than attempting to fully translate every sentence with complex syntactic structure.
• Speech Recognition: Shallow parsing can help improve accuracy in speech-to-text systems
by chunking phrases that are common and meaningful in everyday speech.
Here, the rule S → NP VP means a sentence (S) consists of a noun phrase (NP) followed by a verb
phrase (VP).
In this PCFG:
• The production S → NP VP is assigned a probability of 0.9, meaning it's very likely that a
sentence consists of a noun phrase followed by a verb phrase.
• The rule NP → Det N has a probability of 0.8, indicating that a noun phrase is more likely
to be a determiner followed by a noun than a single noun alone.
• The sum of the probabilities for each non-terminal (e.g., NP → Det N [0.8] + NP → N
[0.2]) equals 1.
Parsing Algorithms:
The most commonly used parsing algorithms for PCFGs include:
1. CYK Parsing: This algorithm, originally designed for CFGs, can be adapted to work with
PCFGs. In CYK parsing, the chart stores both the possible non-terminal productions and
their associated probabilities. When constructing a parse tree, the parser chooses the rule
with the highest probability at each step.
2. Earley Parsing: This is another general-purpose parsing algorithm that can also be modified
to work with probabilities in PCFGs.
3. Dynamic Programming (DP) Parsing: This method uses dynamic programming to store
intermediate results (subtrees) along with their probabilities, allowing the parser to
efficiently compute the most likely parse.
4. Training a PCFG
To build a PCFG, we need to estimate the probabilities of each production rule. This is typically
done using maximum likelihood estimation (MLE), where the probability of a rule is estimated
based on its frequency in a training corpus.
Steps in Training a PCFG:
1. Corpus Parsing: First, a parsed corpus is needed. This corpus must contain sentences with
labeled syntactic structures (parse trees). Treebanks (like the Penn Treebank) are
commonly used for this purpose.
2. Counting Rule Frequencies: For each non-terminal, count how many times each of its
production rules appears in the training data.
3. Calculating Probabilities: For each non-terminal, the probability of each production rule is
computed as the relative frequency of that rule. For example, if a rule like NP → Det N
appears 80 times out of 100 NP productions in the corpus, its probability would be
P(NP → Det N) = 80/100 = 0.8.
4. Normalization: Ensure that the probabilities of all rules for a given non-terminal sum to 1.
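A small sketch of the counting-and-normalizing steps, using invented rule observations that reproduce the 80/100 example above.
```python
from collections import Counter, defaultdict

# Production rules read off a (toy) treebank, one (LHS, RHS) pair per use.
# The counts here are invented for illustration.
observed_rules = (
    [("NP", "Det N")] * 80 + [("NP", "N")] * 20 +
    [("S", "NP VP")] * 90 + [("S", "VP")] * 10
)

rule_counts = Counter(observed_rules)
lhs_totals = defaultdict(int)
for (lhs, _), c in rule_counts.items():
    lhs_totals[lhs] += c

# MLE: P(LHS -> RHS) = count(LHS -> RHS) / count(LHS)
probs = {rule: c / lhs_totals[rule[0]] for rule, c in rule_counts.items()}
print(probs[("NP", "Det N")])                                  # 0.8
print(sum(p for (lhs, _), p in probs.items() if lhs == "NP"))  # 1.0 (normalized)
```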
5. Advantages of PCFG
1. Handling Ambiguity: PCFGs help disambiguate sentences that have more than one possible
syntactic structure by assigning higher probabilities to more likely parses.
2. Statistical Foundation: By integrating probability, PCFGs provide a statistical basis for
parsing, making them suitable for tasks that require robustness and generalization over
unseen data.
3. Practicality: PCFGs are particularly useful for natural language tasks where the exact
structure is less important than finding the most likely parse. For example, in applications
like machine translation, information extraction, and speech recognition, using a PCFG
to select the most probable syntactic structure can lead to better overall performance.
6. Limitations of PCFG
1. Limited to Context-Free Grammars: PCFGs are still based on context-free grammar, which means they cannot directly model more complex syntactic dependencies, such as long-distance dependencies or agreement constraints.
2. Corpus Dependence: The accuracy of a PCFG is highly dependent on the quality and size
of the training corpus. If the corpus does not cover certain syntactic constructions, the
resulting model may perform poorly on unseen data.
3. Sparsity: In real-world language, some grammatical rules may be extremely rare or unseen
in the training data, which leads to sparse data problems. This can be mitigated by using
smoothing techniques, but it remains a challenge.
4. Inability to Capture Higher-Level Semantics: While PCFGs can model syntax, they do not capture semantic relationships or dependency structure, which may be important for tasks like semantic role labeling or question answering.
7. Applications of PCFG
• Syntactic Parsing: PCFGs are widely used in syntactic parsers because they provide a
probabilistic framework for generating the most likely syntactic structure for a given
sentence.
• Machine Translation: In phrase-based machine translation, PCFGs can help align source
and target languages by providing syntactic structure for translations.
• Information Extraction: PCFGs can help extract meaningful chunks or entities from
unstructured text by ensuring the correct grammatical structure of the chunks.
• Speech Recognition: PCFGs are used in speech recognition systems to improve parsing and
the generation of possible transcriptions based on syntactic likelihoods.
Probabilistic CYK (PCYK)
The Probabilistic CYK (PCYK) algorithm is an extension of the CYK (Cocke-Younger-Kasami)
algorithm, which is traditionally used for parsing context-free grammars (CFGs). PCYK is adapted
to work with Probabilistic Context-Free Grammars (PCFGs), where each production rule in the
grammar is associated with a probability. The goal of PCYK is to efficiently compute the most
probable parse tree for a given sentence using probabilistic information from a PCFG.
The PCYK algorithm is commonly used in syntactic parsing where we want to not just identify
any valid parse tree but the one with the highest likelihood based on the given probabilistic
grammar.
For example, suppose the grammar contains a rule A → B C with probability 0.5, and the sub-sequences w1 w2 and w3 w4 are generated by B and C respectively. Then the probability of A generating the whole span w1 w2 w3 w4 is:
P(A generates w1 w2 w3 w4) = P(B generates w1 w2) × P(C generates w3 w4) × P(A → B C)
= P(B generates w1 w2) × P(C generates w3 w4) × 0.5
1. Initialization: The table for words "the" and "dog" is initialized with probabilities:
• T[1,1]: "the" can be generated by Det with probability 0.9.
• T[2,2]: "dog" can be generated by N with probability 0.8.
2. Building larger spans:
• For span [1,2] ("the dog"), we check the possible rules:
• NP → Det N combines the two cells, giving probability 0.9 × 0.8 = 0.72 for "the dog" (the probability of the rule NP → Det N itself is taken to be 1 in this toy example).
3. Final Parse:
• We find that NP → Det N is the most likely parse for the sequence "the dog" with
probability 0.72.
• The final parse tree is constructed using this rule.
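A from-scratch sketch of the PCYK table-filling procedure is shown below. The grammar is a toy grammar in Chomsky Normal Form, and its rule probabilities are illustrative assumptions, chosen so that the span "the dog" scores 0.9 × 0.8 = 0.72 as in the worked example above (i.e. NP → Det N is given probability 1).
# A minimal probabilistic CYK sketch over a toy grammar in Chomsky Normal Form.

# Lexical (preterminal) rules: (non-terminal, word) -> probability.
lexical = {
    ("Det", "the"): 0.9,
    ("N", "dog"): 0.8,
    ("N", "cat"): 0.2,
    ("V", "chased"): 1.0,
}

# Binary rules: (parent, (left child, right child)) -> probability.
binary = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "N")): 1.0,
    ("VP", ("V", "NP")): 1.0,
}

def pcyk(words):
    n = len(words)
    # table[i][j] maps a non-terminal A to the best probability of A deriving words[i:j].
    table = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    # Initialization: spans of length 1 are filled from the lexical rules.
    for i, w in enumerate(words):
        for (nt, word), p in lexical.items():
            if word == w and p > table[i][i + 1].get(nt, 0.0):
                table[i][i + 1][nt] = p
    # Build longer spans bottom-up, keeping the highest-probability derivation.
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):                       # split point
                for (parent, (left, right)), p in binary.items():
                    score = (p
                             * table[i][k].get(left, 0.0)
                             * table[k][j].get(right, 0.0))
                    if score > table[i][j].get(parent, 0.0):
                        table[i][j][parent] = score
    return table[0][n]

print(pcyk(["the", "dog"]))                          # {'NP': 0.72} (up to rounding)
print(pcyk(["the", "dog", "chased", "the", "cat"]))  # {'S': 0.1296} (up to rounding)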
5. Complexity of PCYK
The time complexity of the CYK algorithm is O(n³) in the sentence length n (more precisely, O(n³ · |G|), where |G| is the number of grammar rules). For the Probabilistic CYK (PCYK), the time complexity remains O(n³), since the algorithm still needs to process all possible sub-sequences of the input sentence.
However, instead of just recording non-terminals, we also need to keep track of probabilities, which introduces additional bookkeeping but does not change the overall asymptotic complexity.
• Space Complexity: The chart has one cell per sub-sequence, i.e. O(n²) cells, and each cell stores a probability for every non-terminal, so the space complexity is O(n² · |N|), where |N| is the number of non-terminals.
6. Advantages of PCYK
• Probabilistic Information: PCYK allows you to choose the most likely parse based on
probabilistic rules, which is particularly useful when there are multiple valid parses for a
sentence.
• Efficient Parsing: Despite the added complexity of probabilities, PCYK remains efficient with O(n³) time complexity, which is feasible for many real-world sentences.
• Improved Accuracy: By incorporating probabilities, PCYK typically outperforms
traditional CYK parsing in tasks where probabilistic decisions are needed (e.g.,
disambiguation).
In this case:
• S → NP VP means that a sentence (S) is likely to consist of a noun phrase (NP) followed
by a verb phrase (VP), with a probability of 0.9.
• N → "dog" means that a noun (N) is most likely the word "dog" with a probability of 0.8.
Thus, lexicalized grammar means including specific word forms as part of the grammar rules
themselves.
Example of PCFG-L:
S → NP VP [0.9]
NP → Det "dog" [0.7]
NP → Det "cat" [0.3]
VP → V NP [0.6]
VP → V [0.4]
Det → "the" [0.9]
V → "chased" [0.5]
V → "barked" [0.5]
Here, the rule NP → Det "dog" [0.7] means that an NP can consist of a determiner followed by the specific word "dog", with probability 0.7. The probability of the NP subtree covering "the dog" is then the product of this rule's probability (0.7) and the probability of Det → "the" (0.9), i.e. 0.7 × 0.9 = 0.63.
• The sentence "the dog chased the cat" would be parsed with the rules and probabilities
reflecting both the syntactic structure and the lexical choices made (e.g., "dog" for N,
"chased" for V).
5. Advantages of PCFG-L
• More Specific Syntactic Structures: Lexicalizing the grammar enables it to capture more
specific syntactic structures that are dependent on certain words. This is especially helpful
for words that have multiple meanings or syntactic functions (e.g., a word like "bark" can be
a noun or a verb, and its usage impacts the parse tree).
• Improved Disambiguation: By incorporating the lexical items, the grammar helps disambiguate sentences. For instance, in "The dog barked," the intransitive verb "barked" favors the VP → V rule, whereas a transitive verb such as "chased" favors VP → V NP.
• Word-Specific Coverage: Lexicalized rules allow the grammar to model specific word choices directly. The trade-off is that the number of rules grows sharply, so rare word combinations suffer from data sparsity, and smoothing or back-off to unlexicalized rules is usually needed in practice.
• More Accurate Parsing: The use of word-specific probabilities in PCFG-L makes it
possible to select the most probable parse tree, improving the overall accuracy of the
parsing process, especially in ambiguous or complex sentences.
7. Applications of PCFG-L
• Syntactic Parsing: PCFG-L is used in syntactic parsers to generate the most probable
syntactic structures for sentences, incorporating both syntactic structure and lexical
information.
• Machine Translation: In statistical machine translation, lexicalized grammars help
capture word-specific translation rules, improving the translation quality by using context-
sensitive translations.
• Speech Recognition: PCFG-L models help improve speech recognition systems by
considering both the syntax and the lexical context of the spoken words.
• Information Extraction: In tasks like information extraction, PCFG-L can help accurately
parse text and identify relevant entities or relationships by using lexicalized syntactic rules.
Unification and Feature Structures
In a feature structure, each feature (e.g., Category, Number, Person) has an associated value (e.g., Noun, Plural, Third).
Example of unification:
• Feature structure 1:
[Category: Noun, Number: Singular]
• Feature structure 2:
[Number: Singular, Person: Third]
When these two feature structures are unified, the result will be:
[Category: Noun, Number: Singular, Person: Third]
If the second feature structure were instead:
[Number: Plural]
the unification would fail, because the Number features are in conflict (Singular vs. Plural).
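One convenient way to experiment with such feature structures is NLTK's FeatStruct class, whose unify method behaves as described above; the snippet below reproduces both the successful and the failing case (feature names taken from the example).
# Feature-structure unification with NLTK's FeatStruct.
import nltk

fs1 = nltk.FeatStruct(Category="Noun", Number="Singular")
fs2 = nltk.FeatStruct(Number="Singular", Person="Third")
print(fs1.unify(fs2))   # merged structure with Category, Number and Person

fs3 = nltk.FeatStruct(Number="Plural")
print(fs1.unify(fs3))   # None: the Number values conflict (Singular vs. Plural)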
Unification is particularly useful in head-driven parsing models (such as HPSG), where the head
of a phrase is associated with a feature structure, and the other parts of the phrase must unify with it.
Feature structures can also encode semantic information: for example, a feature structure for the verb "give" captures the semantic roles (the giver, the thing given, and the recipient) played by the participants in the action it describes.
1. What is Unification?
Unification is the process of merging two feature structures that share compatible values for their
features while ensuring consistency. The result of unification is a single feature structure that
integrates the information from both input structures.
If the two feature structures are incompatible (i.e., they have conflicting values for any feature),
unification fails.
3. Unification Process
The unification process involves combining two feature structures, merging their features and
values, and ensuring that no contradictions arise. Here's how unification works step-by-step:
1. Compare Features: For each feature in the first structure, check if the second structure
contains the same feature. If it does, proceed to compare the values.
2. Merge Values:
• If both structures assign the same atomic value to a feature, the feature is retained with that value.
• If the values are complex structures (e.g., other feature structures), recursively unify them.
• If a feature appears in only one of the structures, it is simply carried over into the result.
3. Detect Inconsistencies: If two feature structures assign different values to the same feature,
unification fails. For instance, if one structure assigns Number: Singular and the other
assigns Number: Plural, unification cannot proceed because the values contradict each
other.
4. Result: If unification is successful, the resulting structure contains all the features and
values from both input structures, merged in a consistent way.
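These steps can also be written as a small recursive function over nested Python dictionaries. This is only a sketch of the core idea; it ignores re-entrancy, variables, and the other machinery of full unification grammars.
# A from-scratch sketch of unification over feature structures represented as
# nested dictionaries. Returns the merged structure, or None if unification fails.
def unify(fs1, fs2):
    result = dict(fs1)                       # start from a copy of the first structure
    for feature, value2 in fs2.items():
        if feature not in result:
            result[feature] = value2         # feature only in the second structure: carry it over
            continue
        value1 = result[feature]
        if isinstance(value1, dict) and isinstance(value2, dict):
            merged = unify(value1, value2)   # complex values: unify recursively
            if merged is None:
                return None
            result[feature] = merged
        elif value1 != value2:
            return None                      # conflicting atomic values: failure
    return result

print(unify({"Category": "Noun", "Number": "Singular"},
            {"Number": "Singular", "Person": "Third"}))
# {'Category': 'Noun', 'Number': 'Singular', 'Person': 'Third'}

print(unify({"Number": "Singular", "Gender": "Masculine"},
            {"Number": "Plural", "Gender": "Feminine"}))
# None: the Number (and Gender) values conflict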
4. Example of Unification
Let's look at a simple example to understand how unification works in practice. Consider the
following two feature structures:
Feature Structure 1 (FS1):
[Category: Noun, Number: Singular, Person: Third]
Feature Structure 2 (FS2):
[Number: Singular, Person: Third]
Unification Step-by-Step:
1. Compare the Features: Both structures contain the features Person and Number.
2. Merge Values:
• The value of Person in both structures is Third, which is identical, so this feature
can be retained.
• The value of Number in both structures is Singular, so this feature is also
compatible and retained.
• The first structure has an additional feature, Category: Noun. This feature is
retained, as it does not conflict with any feature in the second structure.
The unification is successful, and the resulting structure, [Category: Noun, Number: Singular, Person: Third], retains all the features and values from both input structures.
5. Example of Unification Failure
Now, let's consider a case where unification fails due to conflicting values.
Feature Structure 1 (FS1):
[Number: Singular, Gender: Masculine]
Feature Structure 2 (FS2):
[Number: Plural, Gender: Feminine]
Unification Step-by-Step:
1. Compare the Features:
• The features Number and Gender exist in both structures.
2. Merge Values:
• The value of Number in FS1 is Singular, while in FS2 it is Plural. These
values conflict, so unification cannot proceed.
• The value of Gender is Masculine in FS1 and Feminine in FS2, which are
also conflicting.
Result:
Unification fails because the values for both the Number and Gender features are contradictory.
6. Example of Recursive Unification
Now consider two feature structures whose Modifiers feature contains nested values:
Feature Structure 1 (FS1):
[Category: Noun, Number: Singular, Modifiers: ["big"]]
Feature Structure 2 (FS2):
[Category: Noun, Modifiers: ["small"]]
Unification Step-by-Step:
1. Compare the Features:
• Both structures contain Category: Noun, so this feature is consistent.
• Both structures contain Modifiers, but each has different values: ["big"] in
FS1 and ["small"] in FS2.
2. Unify the Modifiers:
• The Modifiers values are lists of adjectives, and these lists can be unified by
combining their elements.
3. Result: The unified feature structure combines the adjectives in the Modifiers list:
[Category: Noun, Number: Singular, Modifiers: [Adjective: "big", Adjective: "small"]]
This example shows how recursive unification can merge complex feature structures.
7. Applications of Unification
Unification is used in many natural language processing tasks, particularly in grammar formalisms
that represent linguistic information as feature structures:
• Parsing: In unification-based parsing models (e.g., HPSG, LFG), unification is used to
combine syntactic features as the parser processes a sentence, ensuring that the sentence’s
syntactic structure is consistent with the grammar.
• Morphological Analysis: Unification is used in morphological analyzers to combine
feature structures representing word forms, helping identify inflections, stems, and
derivations.
• Semantic Interpretation: In semantics, unification is used to merge feature structures
representing meanings, enabling the system to combine information from different parts of a
sentence (e.g., subject, verb, object).
• Machine Translation: Unification-based grammars help translate sentences by ensuring that
syntactic and semantic features are compatible across languages.
• Information Extraction: Unification can help combine different pieces of extracted
information into a unified representation, such as identifying entities and their roles in a
sentence.