Natural Language Processing Internal 1
Morphology is the study of word structure: how words are formed using morphemes (the smallest
units of meaning).
Two main types:
➔ Inflectional (changes grammatical form, keeps the core meaning and part of speech): e.g., "talk" → "talked"
➔ Derivational (changes meaning or part of speech): e.g., "happy" → "happiness"
Used in NLP tasks like:
➔ Part-of-speech tagging
➔ Machine translation
➔ Named Entity Recognition (NER)
➔ Text-to-speech synthesis
➔ Spell checking & grammar tools
1. Rule-Based Models
➔ Use handwritten rules made by language experts.
➔ Based on linguistic knowledge.
➔ Follow if-then patterns for analyzing word forms.
Pros:
➔ Transparent and explainable.
➔ Work well for regular, low-morphology languages (e.g., English).
Cons:
➔ Hard to scale.
➔ Not flexible for languages with complex/irregular forms.
2. Statistical Models
➔ Learn morphology from annotated corpora using probabilistic methods.
Common techniques:
➔ Hidden Markov Models (HMMs)
➔ Conditional Random Fields (CRFs)
Pros:
➔ Better performance than rule-based.
➔ Can generalize to unseen words.
Cons:
➔ Need large annotated datasets.
➔ Limited understanding of deep structure.
3. Neural Models
➔ Use deep learning (e.g., RNNs, LSTMs, Transformers).
➔ Can model complex and irregular morphology.
How they work:
➔ Input words as sequences of characters or subwords.
➔ Learn patterns using multiple layers of neurons.
Pros:
➔ State-of-the-art results.
➔ Work well for morphologically rich languages (Arabic, Turkish, Finnish).
Cons:
➔ Need lots of data and compute power.
➔ Hard to interpret.
4. Dictionary Lookup
➔ Uses a lexicon (dictionary) of known words and their forms.
➔ When a word is found in text, its properties (POS, base form, etc.) are retrieved from the
dictionary.
Pros:
➔ Fast and simple.
➔ Effective for regular and low-morphology languages.
Cons:
➔ Fails with out-of-vocabulary or irregular words.
➔ Not suitable for complex languages.
Enhancements:
1. Lemmatization: Reduce word to its lemma (base form).
Example: "better" → "good", "running" → "run"
2. Stemming: Chop off affixes to get the stem.
Example: "running", "runner" → "run"
3. Morphological Analysis: Segment words into prefix + root + suffix and extract
grammatical features.
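These enhancements can be tried directly with NLTK. A minimal sketch, assuming NLTK is installed (`pip install nltk`) and the WordNet data has been downloaded with `nltk.download("wordnet")`; exact outputs can vary with the NLTK version:

```python
# Stemming vs. lemmatization with NLTK.
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("running"))               # run   (affix chopped off)
print(stemmer.stem("connection"))            # connect
print(lemmatizer.lemmatize("running", "v"))  # run   (verb lemma)
print(lemmatizer.lemmatize("better", "a"))   # good  (adjective lemma)
```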
5. Finite-State Morphology
➔ Uses finite-state automata and finite-state transducers (FSTs).
➔ Works by defining rules as transitions in a state machine.
Functions:
➔ Analysis: Break a word into morphemes.
➔ Generation: Build word forms from morphemes + features.
Pros:
➔ Very efficient and fast.
➔ Works well for highly regular languages like Finnish, Turkish.
➔ Easy to visualize and debug.
Cons:
➔ Less effective for irregular or unpredictable word forms.
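A minimal sketch of the analysis and generation functions in finite-state style, using a toy lexicon and a single regular-plural rule; real systems compile such rules into transducers with toolkits such as HFST or foma:

```python
# Toy finite-state-style morphology for regular English plurals.
STEMS = {"cat", "dog", "book"}

def analyze(surface):
    """Analysis: break a word form into stem + features."""
    if surface in STEMS:
        return surface + "+N+Sg"
    if surface.endswith("s") and surface[:-1] in STEMS:
        return surface[:-1] + "+N+Pl"
    return None  # unknown or irregular form

def generate(analysis):
    """Generation: build a word form from stem + features."""
    stem, _, number = analysis.split("+")
    return stem if number == "Sg" else stem + "s"

print(analyze("cats"))        # cat+N+Pl
print(generate("dog+N+Pl"))   # dogs
```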
6. Unification-Based Morphology
➔ Based on feature structures and constraints (e.g., number, gender, tense).
➔ Uses unification (matching of features) to process word forms.
Pros:
➔ Handles complex and irregular forms.
➔ Good for rich feature languages (Arabic, German).
➔ Modular and extensible.
Cons:
➔ Computationally heavy (slow).
➔ Complex implementation.
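A minimal sketch of unification over flat feature structures, using plain dictionaries; real unification-based systems use typed, possibly nested feature structures, but the core idea of merging compatible features and failing on conflicts is the same:

```python
# Toy feature-structure unification.
def unify(fs1, fs2):
    result = dict(fs1)
    for key, value in fs2.items():
        if key in result and result[key] != value:
            return None          # conflicting values -> unification fails
        result[key] = value
    return result

# Compatible structures merge their information ...
print(unify({"pos": "Det", "num": "pl"}, {"num": "pl", "gender": "neut"}))
# {'pos': 'Det', 'num': 'pl', 'gender': 'neut'}

# ... but conflicting number features fail to unify.
print(unify({"num": "sg"}, {"num": "pl"}))
# None
```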
7. Functional Morphology
➔ Focuses on meaning and function of word forms in context.
➔ Based on usage-based and cognitive linguistics.
➔ Ties morphology to discourse, semantics, and communication.
Pros:
➔ Great for analyzing real-world language use.
➔ Works well with corpus-based NLP, sentiment analysis, etc.
➔ Can adapt to context and speaker intent.
Cons:
➔ Needs large corpora.
➔ Less formal structure.
➔ Can be hard to generalize.
8. Morphology Induction
➔ Unsupervised learning approach – no labeled data needed.
➔ Learns word structure from raw text using patterns/statistics.
How it works:
➔ Segments words into subword units.
➔ Finds morphemes based on frequency, patterns, etc.
➔ Uses clustering, probabilistic models, or neural methods.
Pros:
➔ Useful for low-resource or under-studied languages.
➔ Language-independent.
Cons:
➔ May not be very accurate.
➔ Results are often hard to interpret.
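A minimal sketch of one popular induction strategy, byte-pair-encoding-style merging of frequent symbol pairs over a toy word list; the corpus and number of merges are illustrative only:

```python
from collections import Counter

# Unsupervised subword induction: repeatedly merge the most frequent adjacent
# symbol pair, so frequent substrings (often morphemes) emerge from raw text.
corpus = ["low", "lower", "lowest", "talked", "walked", "worked"]

def learn_merges(words, num_merges=5):
    vocab = Counter(tuple(w) for w in words)    # each word as a character sequence
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges, vocab

merges, vocab = learn_merges(corpus)
print(merges)        # early merges such as ('l', 'o') and ('lo', 'w')
print(list(vocab))   # words segmented into the learned subword units
```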
Note: Any five models are enough
Types of Classifiers
1. Rule-based – manually written rules
2. Statistical – uses language models and probabilities
3. Features may be simple (short word sequences) or complex (word embeddings)
Perceptron Classifier (Example)
A linear model using:
1. Vectors for features and weights
2. Dot product to score and choose the best class
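A minimal sketch of such a perceptron-style scorer; the classes, feature names, and weight values below are invented for illustration:

```python
import numpy as np

# Each class has a weight vector; a feature vector is scored by a dot product
# and the highest-scoring class is chosen.
classes = ["NOUN", "VERB"]
weights = {
    "NOUN": np.array([1.5, -0.5, 0.2]),   # weights for [ends_in_s, follows_to, is_capitalized]
    "VERB": np.array([-0.3, 2.0, -1.0]),
}

def predict(features):
    scores = {c: float(np.dot(weights[c], features)) for c in classes}
    return max(scores, key=scores.get), scores

features = np.array([0.0, 1.0, 0.0])      # e.g. the word follows "to"
print(predict(features))                  # ('VERB', {'NOUN': -0.5, 'VERB': 2.0})
```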
➔ Japanese: "私は日本語が好きです" ("I like Japanese") → mixes scripts (here Kanji and
Hiragana; Japanese also uses Katakana) and allows multiple possible segmentations.
➔ Thai: No spaces between words, no capitalization clues, and complex character
combinations make segmentation very difficult.
➔ Vietnamese: Although it uses a Latin-based script, words often consist of multiple syllables
with diacritics, making boundary detection tricky.
1. Rule-Based Methods
➔ Rule-based models use manually created grammar rules and dictionaries specific to each
language.
➔ These models often rely on:
(a) Lists of common words
(b) Morphological analysis (prefixes, suffixes)
(c) Handwritten heuristics about syllable or character patterns
➔ Example: In Chinese, if certain character sequences frequently occur together, they are
considered a word.
➔ Advantages: Fast and interpretable.
➔ Disadvantages: Hard to maintain, language-specific, and brittle to exceptions.
2. Statistical Methods
➔ These models use large annotated corpora to learn the probability of character or syllable
sequences forming valid words.
➔ Techniques include:
(a) N-gram models: Predict the likelihood of a sequence based on frequency.
(b) Hidden Markov Models (HMMs): Model sequences with hidden states
representing words.
(c) Maximum Entropy Models: Combine multiple contextual features for prediction.
➔ Statistical models can infer the most likely segmentation by maximizing the probability of a
sequence of words.
➔ Advantages: More flexible than rule-based; adapts to variations in language.
➔ Disadvantages: Needs large labeled datasets; struggles with rare words or phrases.
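A minimal sketch of the "maximize the probability of a word sequence" idea described above, using unigram word probabilities and dynamic programming over split points; the probability values are illustrative only:

```python
import math

# Choose the segmentation of an unsegmented string that maximizes the product
# of unigram word probabilities (equivalently, the sum of log-probabilities).
word_probs = {"the": 0.20, "cat": 0.10, "chased": 0.05, "mouse": 0.08, "them": 0.02}

def segment(text, max_word_len=6):
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)   # best[i] = (log-prob, split point) for text[:i]
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            if word in word_probs and best[j][0] > -math.inf:
                score = best[j][0] + math.log(word_probs[word])
                if score > best[i][0]:
                    best[i] = (score, j)
    words, i = [], n                        # backtrack to recover word boundaries
    while i > 0:
        j = best[i][1]
        words.append(text[j:i])
        i = j
    return list(reversed(words))

print(segment("thecatchasedthemouse"))   # ['the', 'cat', 'chased', 'the', 'mouse']
```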
3. Neural Network Methods (Deep Learning)
➔ Neural approaches have recently become dominant for word segmentation tasks.
➔ Techniques include:
(a) Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks
(LSTMs): Model sequences while remembering long-range dependencies.
(b) Convolutional Neural Networks (CNNs): Extract local patterns in character
sequences.
(c) Transformer-based Models (like BERT): Pre-trained deep models that understand
context bidirectionally and predict segment boundaries.
(d) Sequence Labeling: Frame segmentation as a classification problem where each
character is labeled with its position in a word, e.g., "Begin", "Inside", "End", or
"Single" (a BIES/BMES scheme), or with simpler BIO-style tags.
➔ Advantages: High accuracy, adaptable to complex scripts and contexts.
➔ Disadvantages: Requires large amounts of annotated data and computational resources.
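A minimal sketch of the sequence-labeling view: given one tag per character in a BIES/BMES-style scheme, recover the words. In a real system the tags would come from an LSTM/CNN/Transformer tagger; here they are supplied by hand:

```python
# Decode per-character tags (B = begin, I = inside, E = end, S = single-character
# word) into a word segmentation.
def tags_to_words(chars, tags):
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            words.append(ch)
        elif tag == "B":
            current = ch
        elif tag == "I":
            current += ch
        else:                     # "E": close the current word
            words.append(current + ch)
            current = ""
    return words

chars = list("我喜欢日语")                 # "I like Japanese" in Chinese characters
tags  = ["S", "B", "E", "B", "E"]          # 我 | 喜欢 | 日语
print(tags_to_words(chars, tags))          # ['我', '喜欢', '日语']
```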
1. Rule-Based Approaches
➔ Effective when:
• The document or language structure is simple.
• The rules governing word boundaries are clear and consistent.
➔ Struggles when:
• The language or document structure is complex or highly ambiguous.
• Significant manual effort is needed to craft, maintain, and update the rules.
• It faces difficulty adapting to new, unseen text patterns.
2. Statistical Approaches
➔ Effective when:
• There is a large amount of labeled data available for training.
• The document structure is relatively consistent across examples.
• Probabilistic models can accurately estimate the likelihood of different word
boundary patterns.
➔ Struggles when:
• Novel or rare document structures appear that were not well-represented in the
training data.
• Performance can degrade on out-of-domain texts or low-frequency word
combinations.
We build up the chart from the smallest spans (individual words) to larger spans (phrases and
whole sentence):
1. Lexical rules fill spans of one word each — like (0,1): Det → the.
2. Using grammar, we combine adjacent spans to form larger phrases:
1. (0,2) gets NP by combining (0,1) + (1,2)
2. (2,5) gets VP by combining V and NP
3. (0,5) becomes S by combining NP and VP
The topmost chart cell (0,5) has S → NP VP, meaning the whole sentence has been successfully
parsed.
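The span-combining procedure above can be written as a small CKY-style chart parser. The toy CNF grammar and sentence below are assumptions chosen so that the chart reproduces the (0,2) NP, (2,5) VP, and (0,5) S cells just described:

```python
from collections import defaultdict

# Toy grammar in Chomsky Normal Form: binary rules map a pair of child
# categories to a parent category; lexical rules map a word to its category.
binary_rules = {("NP", "VP"): "S", ("Det", "N"): "NP", ("V", "NP"): "VP"}
lexical_rules = {"the": "Det", "cat": "N", "mouse": "N", "chased": "V"}

def cky_parse(words):
    n = len(words)
    chart = defaultdict(set)           # chart[(i, j)] = categories covering words[i:j]
    for i, w in enumerate(words):      # 1. lexical rules fill spans of one word
        chart[(i, i + 1)].add(lexical_rules[w])
    for length in range(2, n + 1):     # 2. combine adjacent spans into larger phrases
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point between the two child spans
                for left in chart[(i, k)]:
                    for right in chart[(k, j)]:
                        parent = binary_rules.get((left, right))
                        if parent:
                            chart[(i, j)].add(parent)
    return chart

chart = cky_parse("the cat chased the mouse".split())
print(chart[(0, 2)])   # {'NP'}
print(chart[(2, 5)])   # {'VP'}
print(chart[(0, 5)])   # {'S'}  -> the whole sentence has been parsed
```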
4. Dependency Parsing and MST (Minimum Spanning Tree)
1. It represents the grammatical structure of a sentence as a directed graph over the words
themselves (rather than the nested constituents used in phrase-structure parsing).
2. Each word is a node in the graph.
3. Edges connect words to show grammatical dependencies, like subject-of, object-of, etc.
4. The resulting graph is a Directed Acyclic Graph (DAG) and typically forms a tree called
the dependency tree.
Why Use MST (Minimum Spanning Tree)?
We can imagine all possible syntactic dependencies between words as edges in a complete
graph. But only one subset of those edges forms the most likely parse of the sentence: the correct
tree.
To find this tree:
1. Each edge is assigned a score (how likely that dependency is)
2. The goal is to find the tree with the highest total score
This is exactly what spanning-tree algorithms do (here we find the maximum-scoring spanning tree rather than a minimum one).
Chu-Liu/Edmonds Algorithm Steps (for directed MST)
1. Create a directed graph with:
1. Nodes = words
2. Edges = possible dependencies
3. Edge weights = scores based on statistical or neural models
2. Choose a root node, usually the main verb (e.g., “chased”)
3. Assign scores to all edges using features (POS tags, distance, embeddings)
4. Apply MST algorithm:
1. For each node (except root), choose one incoming edge with max score
2. If there’s a cycle, resolve it by removing the lowest-weight edge in the cycle
3. Repeat until we get a tree with no cycles
5. Label the edges in the final tree with syntactic roles:
1. e.g., nsubj(chased, cat) or obj(chased, mouse)
Example: MST Dependency Parse for
"The cat chased the mouse"
Suppose we assign a score to each possible head–dependent relation; choosing the highest-scoring
head for each word gives us the tree:
ROOT
|
chased
/ \
cat mouse
| |
the the
And corresponding dependency edges:
1. root(ROOT, chased)
2. nsubj(chased, cat)
3. det(cat, the)
4. obj(chased, mouse)
5. det(mouse, the)
This is the MST (maximum-scoring tree) that forms a valid dependency parse.
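A minimal sketch of the core greedy step (each non-root word picks its highest-scoring head); the full cycle contraction of Chu-Liu/Edmonds is omitted because the greedy choice already yields a tree here, and the edge scores are invented for illustration:

```python
# Greedy best-incoming-edge selection for the example sentence. The two
# determiners are distinguished as the_1 (before "cat") and the_2 (before "mouse").
scores = {
    # (head, dependent): score
    ("ROOT", "chased"): 10,
    ("chased", "cat"): 8,
    ("chased", "mouse"): 7,
    ("cat", "the_1"): 5,
    ("mouse", "the_2"): 5,
    ("cat", "chased"): 2,        # lower-scoring alternative edges
    ("mouse", "cat"): 1,
}

words = ["chased", "cat", "mouse", "the_1", "the_2"]

def greedy_heads(words, scores):
    tree = {}
    for dep in words:                          # each word needs exactly one head
        candidates = {h: s for (h, d), s in scores.items() if d == dep}
        tree[dep] = max(candidates, key=candidates.get)
    return tree

print(greedy_heads(words, scores))
# {'chased': 'ROOT', 'cat': 'chased', 'mouse': 'chased', 'the_1': 'cat', 'the_2': 'mouse'}
```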
Structure
➔ Each node: A phrase or word category (e.g., NP = Noun Phrase, VP = Verb Phrase).
Example
Sentence: “The cat sat on the mat”
Phrase structure representation:
(S
  (NP (DT The) (NN cat))
  (VP (VBD sat)
    (PP (IN on)
      (NP (DT the) (NN mat)))))
➔ Sentence (S) = Noun Phrase (NP) + Verb Phrase (VP)
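A minimal sketch of loading and inspecting this bracketed representation with NLTK's Tree class (assumes `pip install nltk`):

```python
from nltk import Tree

# Parse the bracketed phrase-structure string shown above.
parse = Tree.fromstring("""
(S
  (NP (DT The) (NN cat))
  (VP (VBD sat)
    (PP (IN on)
      (NP (DT the) (NN mat)))))
""")

print(parse.label())                       # S
print([child.label() for child in parse])  # ['NP', 'VP']
print(parse.leaves())                      # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```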
Applications
1. Text-to-Speech systems
2. Natural Language Understanding (NLU)
3. Machine Translation
4. Syntax-based Language Generation
1. Ambiguity
➔ Homonyms:
Words spelled and pronounced the same but have different meanings.
Example: "bank" → financial institution / river bank
➔ Polysemy:
Words with multiple related meanings.
Example: "book" → physical object or act of reserving
➔ Syntactic Ambiguity:
Sentences with multiple valid parses.
Example: "I saw her duck" (bird or action?)
➔ Cultural/Linguistic Ambiguity:
Idioms or slang confusing NLP systems.
Example: "kick the bucket" (means "to die")
➔ Solutions:
Contextual embeddings, part-of-speech tagging, syntactic parsing, large datasets.
2. Morphology
➔ Languages have complex rules for word formation.
➔ Words change to show tense, number, gender, etc.
➔ Example: "run," "ran," "running," "runner"
➔ Solutions:
Morphological analyzers, lemmatization, morphological tagging.
3. Word Order
➔ The position of words affects meaning.
➔ In free-word-order languages, order varies without changing meaning.
Example: Russian, Hindi.
➔ Solutions:
Syntax parsing, dependency parsing.
4. Informal Language
➔ Slang, colloquialisms, emojis, abbreviations common in casual texts.
Example: "LOL", "gonna", "brb", "u r gr8"
➔ Solutions:
Text normalization, preprocessing techniques, social-media-trained models.
7. Language-Specific Challenges
➔ Every language has unique structures, rules, and exceptions.
Example: English methods may fail for Korean, Arabic, Finnish.
➔ Solutions:
Language-specific tools, multilingual models like mBERT, XLM-R.
8. Domain-Specific Challenges
➔ General NLP models struggle with specific domains (medicine, law, tech).
➔ Example: "virus" in medicine vs. cybersecurity.
➔ Solutions:
Fine-tuning models on domain-specific corpora.
9. Irregularity
➔ Languages have irregular verbs, plurals, inflections.
➔ Examples:
"go → went" (not "goed")
"child → children" (not "childs")
➔ Solutions:
Rule-based systems + ML models for spotting irregular patterns.
10. Productivity
➔ Languages constantly create new words using prefixes, suffixes, compounding.
Examples:
"happy → unhappy"
"smart + phone → smartphone"
➔ Solutions:
Morphological analysis tools, subword-aware models.
1. Rule-Based Methods:
◦ For example, a rule might say that if there is a period followed by a space, it marks the
end of a sentence unless it’s part of an abbreviation.
2. Machine Learning Methods:
◦ These methods use algorithms that learn from data. The system is trained on documents
with labeled sentence boundaries and learns to recognize patterns.
◦ For example, it might look at the length of a sentence, punctuation marks, or the part of
speech of the last word to predict where a sentence ends.
3. Hybrid Methods:
◦ These combine both rule-based and machine learning methods.
◦ For example, the system might first use rules to find most of the sentence boundaries
and then apply machine learning to correct any mistakes or handle special cases.
Some tools and techniques used in sentence boundary detection include:
1. Regular Expressions:
◦ These are patterns that help find specific character sequences, like periods followed by
spaces, to mark the end of a sentence.
2. Hidden Markov Models (HMMs):
◦ These models look at the probabilities of different sentence-ending markers to predict
where sentences are likely to end.
3. Deep Learning Models:
◦ These are advanced neural networks that can learn complex patterns from large amounts
of data and are very effective at detecting sentence boundaries.
◦ Accurately detecting sentence boundaries is key to many NLP tasks.
◦ It helps systems understand and process text more accurately, which leads to better
summarization, information extraction, and other language-related tasks.
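A minimal sketch of a rule-based splitter of the kind described above: split after sentence-final punctuation unless the token is in a small abbreviation list; the abbreviation list is illustrative only:

```python
import re

# Rule-based sentence boundary detection with an abbreviation exception.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "etc.", "e.g.", "i.e."}

def split_sentences(text):
    sentences, current = [], []
    for token in text.split(" "):
        current.append(token)
        ends_with_punct = re.search(r"[.!?]$", token)
        if ends_with_punct and token not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences

print(split_sentences("Dr. Smith gave a talk. It went well! Was it recorded?"))
# ['Dr. Smith gave a talk.', 'It went well!', 'Was it recorded?']
```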
Disadvantages of NLP
Despite its benefits, NLP also faces some limitations:
1. It sometimes struggles to fully capture or represent context, leading to misinterpretations.
2. NLP systems can be unpredictable, especially in handling complex or ambiguous language
inputs.
3. Using NLP interfaces may require additional keystrokes compared to simpler input methods.
4. Most NLP systems are domain-specific and cannot easily adapt to new topics or tasks
without retraining or modification.
Sentiment Analysis
➔ Sentiment Analysis, also called opinion mining, is a technique used to evaluate the
emotional tone behind a body of text. By assigning values such as positive, negative, or
neutral, it helps identify the mood or emotional state of the sender (happy, sad, angry, etc.).
➔ Sentiment analysis combines techniques from NLP and statistical analysis and is widely
used to understand public opinion on websites, social media, and customer feedback
platforms.
LUNAR
LUNAR is a classical example of a Natural Language database interface system developed using
Woods' Procedural Semantics. It was designed to translate complex natural language expressions
into database queries and successfully handled approximately 78% of user requests without errors.
Semantic Parsing
➔ Semantic parsing is the process of automatically translating natural language utterances into
formal meaning representations that computers can understand and execute. For instance, a
geographical information system might use a semantic parser to interpret a user query like,
"What is the highest mountain in Europe?" into a structured database query.
➔ The working of semantic parsing involves mapping natural language inputs to machine-
understandable logical forms. These logical forms are executable against a real-world
environment or knowledge base to yield a response, known as the denotation.
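A minimal, hypothetical sketch of this pipeline for the example query: the utterance is mapped to a logical form, which is executed against a toy knowledge base to produce the denotation; the knowledge base, logical-form notation, and function names are invented for illustration:

```python
# Toy knowledge base (facts invented for illustration).
knowledge_base = {
    "Mont Blanc":   {"type": "mountain", "in": "Europe", "elevation": 4808},
    "Mount Elbrus": {"type": "mountain", "in": "Europe", "elevation": 5642},
    "Ben Nevis":    {"type": "mountain", "in": "Europe", "elevation": 1345},
}

# Logical form for "What is the highest mountain in Europe?":
#   argmax(x : mountain(x) AND in(x, Europe), elevation(x))
def execute(kb, region):
    candidates = {name: e for name, e in kb.items()
                  if e["type"] == "mountain" and e["in"] == region}
    return max(candidates, key=lambda name: candidates[name]["elevation"])

print(execute(knowledge_base, "Europe"))   # the denotation, e.g. 'Mount Elbrus'
```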
Note: This information is completely based on the textbook reference. The actual context may vary depending on the specific question asked in the exam. Please ensure you understand
the concepts thoroughly and apply them appropriately based on the question requirements.