NLP Lecture 3

Lexical analysis is a critical component of natural language processing (NLP) that focuses on understanding word structure and morphology, essential for tasks such as machine translation and text processing. It involves various techniques like tokenization, lemmatization, and stemming, while also facing challenges with irregular morphology and complex word formations. Advances in lexical analysis include the integration of finite state morphology and paradigm-based models to improve accuracy and handle diverse linguistic structures.


Lexical Analysis in NLP
Presented by: Dr. Esraa Abdalla
Introduction to Lexical Analysis
Words are fundamental units in natural
language texts.
Lexical analysis aims to understand
word structure and morphology.
Words can be seen as strings or
abstract objects representing sets of
strings.
Example: 'delivers' relates to the
lemma 'DELIVER'.
Introduction to Lexical Analysis
• Lexical analysis is the study of word structure in
NLP.

• Words can have different morphological forms.

• Used in machine translation, search engines, and text processing.
Importance of Lexical Analysis

• Allows better understanding of text structure and meaning.
• Helps in syntactic parsing and POS tagging.
• Essential for languages with rich morphology.
Lexical Analysis Tasks

• Tokenization: splitting text into words.
• Lemmatization: mapping words to base forms.
• Morphological parsing: understanding word components.
• Stemming: reducing words to root forms.
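
The difference between these tasks is easier to see on a concrete sentence. The sketch below is only an illustration: the lemma table, the suffix list, and the function names are assumptions invented here, not part of the lecture or of any library.

```python
import re

# Hypothetical toy resources, invented for this illustration.
LEMMA_TABLE = {"ran": "run", "mice": "mouse", "delivers": "deliver"}
SUFFIXES = ("ing", "ed", "s")  # tried in this order


def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def stem(word):
    """Crude stemmer: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatize(word):
    """Look the word up in the lemma table; fall back to the crude stemmer."""
    return LEMMA_TABLE.get(word, stem(word))


tokens = tokenize("The courier delivers parcels; the mice ran.")
print(tokens)
print([stem(t) for t in tokens])       # 'mice' and 'ran' are left unchanged
print([lemmatize(t) for t in tokens])  # 'mice' -> 'mouse', 'ran' -> 'run'
```

Stemming only strips affixes, so it cannot relate 'ran' to 'run'; lemmatization needs lexical knowledge, here a small lookup table.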
Key Challenges in Lexical Analysis

• Handling irregular word formations.

• Dealing with multiple languages and grammar structures.

• Improving accuracy in morphological parsing.


What is Finite State Morphology?

• Uses Finite State Transducers (FSTs) to analyze word structure.
• Helps in recognizing and generating word forms.
• Commonly used in NLP applications.
Finite State Morphology (FSM)
▪FSM uses finite state transducers
(FSTs) to map between different levels
of representation.
▪FSTs are efficient and can handle both
parsing and generation.
▪Used to model morphology
▪Example: Mapping 'delivers' to
'DELIVER + {3rd, Sg, Present}'.
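
The 'delivers' mapping can be sketched as a small deterministic transducer. This is a minimal sketch under stated assumptions: the arc layout, the helper names `build_transitions` and `transduce`, and the `{Past}` tag are invented for illustration, and only the suffixes -s and -ed are covered.

```python
def build_transitions(stem):
    """Build (state, input_char) -> (next_state, output) arcs for one verb stem."""
    arcs = {}
    for i, ch in enumerate(stem):
        arcs[(i, ch)] = (i + 1, ch.upper())    # copy stem letters as the lemma
    end = len(stem)
    arcs[(end, "s")] = (end + 1, " + {3rd, Sg, Present}")
    arcs[(end, "e")] = (end + 2, "")           # first letter of '-ed', no output yet
    arcs[(end + 2, "d")] = (end + 3, " + {Past}")
    finals = {end, end + 1, end + 3}           # bare stem, stem+s, stem+ed
    return arcs, finals


def transduce(surface, arcs, finals):
    """Run the transducer; return the analysis string, or None if rejected."""
    state, output = 0, []
    for ch in surface:
        if (state, ch) not in arcs:
            return None
        state, out = arcs[(state, ch)]
        output.append(out)
    return "".join(output) if state in finals else None


arcs, finals = build_transitions("deliver")
print(transduce("delivers", arcs, finals))   # DELIVER + {3rd, Sg, Present}
print(transduce("delivered", arcs, finals))  # DELIVER + {Past}
print(transduce("delivere", arcs, finals))   # None: stops in a non-final state
```

Each step reads one surface character and emits a piece of the lexical-level string, which is what distinguishes a transducer from a plain acceptor.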
FSM - Morphonology
▪Morphonology deals with phonological changes at morpheme
boundaries.
▪Example: Plural affixation in English: the suffix is pronounced /s/ in 'cats' but /z/ in 'dogs'.
▪FSTs handle these changes through rule-governed mappings.
▪Efficiently manages orthographic variations and phonological
rules.
FSM - Morphotactics

Morphotactics involves the ordering of morphemes in a word.
FSTs model the morphotactic rules of a language.
Example: Turkish verb inflection with ordered affixes.
FSM captures the structure of morphologically complex words.
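
Morpheme ordering can be checked with a plain finite-state acceptor over slot labels. The sketch below assumes a heavily simplified Turkish verb template (root, optional negation, tense, optional person); the slot inventory and names are illustrative assumptions, not a full description of Turkish.

```python
# Allowed slot-to-slot transitions: ROOT -> (NEG) -> TENSE -> (PERSON).
# This slot inventory is a simplification made for the example.
NEXT_SLOTS = {
    "START": {"ROOT"},
    "ROOT": {"NEG", "TENSE"},
    "NEG": {"TENSE"},
    "TENSE": {"PERSON"},
    "PERSON": set(),
}
FINAL_SLOTS = {"TENSE", "PERSON"}


def is_well_formed(slots):
    """Accept a sequence of slot labels only if every transition is licensed."""
    state = "START"
    for slot in slots:
        if slot not in NEXT_SLOTS[state]:
            return False
        state = slot
    return state in FINAL_SLOTS


# gel-me-di-m 'I did not come': root + negation + past + 1st person singular
print(is_well_formed(["ROOT", "NEG", "TENSE", "PERSON"]))  # True
# Negation after tense violates the ordering
print(is_well_formed(["ROOT", "TENSE", "NEG"]))            # False
```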
How FSM Works

• Uses states and transitions to represent morphological rules.
• Example: 'deliver' → 'delivers', 'delivered'.
• Efficient for handling regular morphology.
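
The generation direction ('deliver' → 'delivers', 'delivered') can be sketched with an explicit arc list. The state names, the tag symbols such as `+3SgPres`, and the `generate` function are assumptions made for this example.

```python
# Each arc is (source_state, lexical_symbol, surface_string, target_state).
# States and tag names are illustrative assumptions.
ARCS = [
    ("start", "deliver", "deliver", "stem"),
    ("stem", "+Base", "", "end"),
    ("stem", "+3SgPres", "s", "end"),
    ("stem", "+Past", "ed", "end"),
]


def generate(lexical_symbols):
    """Follow arcs whose lexical side matches the input; emit the surface side."""
    state, surface = "start", ""
    for symbol in lexical_symbols:
        for src, lex, surf, dst in ARCS:
            if src == state and lex == symbol:
                state, surface = dst, surface + surf
                break
        else:
            return None  # no arc accepts this symbol from the current state
    return surface if state == "end" else None


print(generate(["deliver", "+3SgPres"]))  # delivers
print(generate(["deliver", "+Past"]))     # delivered
```

Because every arc pairs a lexical symbol with a surface string, the same table could be read in the opposite direction for analysis.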
Advantages of FSM

• Simple and computationally efficient.
• Works well with regular word formations.
• Can be easily implemented in NLP systems.
FSM and Morphonology
• Deals with phonological changes in words.
• Example: 'glass' → 'glasses' (adding -es).
• FSM handles such changes systematically.
FSM in Machine Translation
• Helps translate words accurately.
• Example:
- English 'going' → Spanish 'yendo'.
• Maintains grammatical correctness.
FSM in Search Engines
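• Search engines such as Google use stemming to match different forms of a query word.
• Example:
- Searching 'run' also finds 'running'; matching the irregular 'ran' requires lemmatization or an exception list, because suffix stripping alone cannot relate it to 'run'.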
• Enhances search accuracy.
FSM Limitations
• Struggles with irregular morphology.
• Does not handle infixation or non-concatenative morphology
well.
• Requires additional rules for complex cases.
Example: English Pluralization
• FSM rule: If a word ends in 's', 'x', 'z', 'ch', or 'sh', add 'es'; otherwise add 's'.
• Examples:
- cat → cats
- fox → foxes
• Handles many cases but struggles with exceptions.
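
A spelling rule of this kind fits in a few lines. The sketch assumes the '-es' trigger is a sibilant spelling (s, x, z, ch, sh), an assumption added here so that the 'fox' example is covered; the function name is hypothetical.

```python
# Endings that trigger '-es'; this list is an assumption made for the example.
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def pluralize(noun):
    """Regular English plural spelling: '-es' after a sibilant, '-s' otherwise."""
    if noun.endswith(SIBILANT_ENDINGS):
        return noun + "es"
    return noun + "s"


for noun in ("cat", "fox", "glass", "mouse"):
    print(noun, "->", pluralize(noun))
# The last line prints 'mouse -> mouses': the rule has no notion of exceptions.
```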
What is Difficult Morphology?

• Some languages do not follow simple prefix/suffix rules.

• Example: irregular verb forms (sing → sang).

• Complex word formation in many languages.


Irregular Morphology

• Examples:
- English: go → went
- German: laufen → lief
• FSM struggles to handle such cases.
Infixation in Morphology
• Some languages insert affixes inside words.
• Example (Tagalog): sulat → sumulat (write).
• FSM requires additional rules to handle this.
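
The 'sulat' → 'sumulat' pattern can be approximated by inserting the infix before the first vowel of the root. This is a simplified sketch: the vowel set and the function name are assumptions, and real Tagalog infixation has further conditions that are not modelled here.

```python
VOWELS = "aeiou"


def infix_um(root):
    """Insert the infix 'um' before the first vowel of the root."""
    for i, ch in enumerate(root):
        if ch in VOWELS:
            return root[:i] + "um" + root[i:]
    return root  # no vowel found: leave the root unchanged


print(infix_um("sulat"))  # sumulat
```

The affix lands inside the stem rather than at an edge, which is why a purely suffix- or prefix-based rule set needs extra machinery for it.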
Non-Concatenative Morphology
• Arabic & Hebrew use root-and-pattern systems.
• Example:
- ktb (root) → kitab (book), katib (writer)
• Requires multi-layered processing.
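
Root-and-pattern formation can be sketched as filling numbered slots in a vowel template with the root consonants. The template notation (digits for consonant positions) and the function name are assumptions made for this illustration.

```python
def interdigitate(root, template):
    """Replace digit slots 1..n in the template with the root consonants."""
    return "".join(
        root[int(ch) - 1] if ch.isdigit() else ch
        for ch in template
    )


print(interdigitate("ktb", "1i2a3"))  # kitab (book)
print(interdigitate("ktb", "1a2i3"))  # katib (writer)
```

The root and the vowel pattern are two separate layers that are interleaved, not concatenated.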
Case Study: Russian Noun
Declensions
• Russian nouns change form based on case and number.
• Example:
- karta (map, nominative singular)
- karty (maps, nominative plural; the same form also serves as genitive singular)
• Requires complex analysis.
Problems with FSM for Complex
Morphology
• Handles non-adjacent (discontinuous) changes within a word poorly.
• Needs additional models for complex word structures.
• Limited in languages with infixes or internal changes.
Possible Solutions

• Combining FSM with rule-based systems.
• Using machine learning models for word analysis.
• Developing hybrid models for better accuracy.
What is Paradigm-Based Lexical
Analysis?
• Words are stored in tables (paradigms) instead of simple affix
rules.
• Helps capture word irregularities.
• Used in hierarchical models.
Why Use Paradigm-Based
Models?
• Captures exceptions and irregularities.
• More flexible than FSM.
• Helps in handling word formation rules.
Comparison: FSM vs Paradigm
Approach
• FSM:
- Good for regular words
- Struggles with exceptions
• Paradigm:
- Better for irregular forms
- Captures relationships between words.
Difficult Morphology

Some languages present challenges for FSM due to non-isomorphic and non-contiguous morphology.
Examples: Multiple affixes, zero affixes, infixation, and root-and-template morphology.
Infixation and root-and-template morphology require complex FSTs.
FSM handles these by recasting problems as linear ones.
Paradigm-Based Lexical Analysis
▪Views word structure in terms of paradigms.
▪Each cell in the paradigm represents a unique combination of
morphosyntactic features.
▪Captures generalizations and exceptions through inheritance
hierarchies and default mechanisms.
▪Handles difficult morphology more naturally than FSM.
▪Example: Russian noun inflection classes.
Paradigm-Based - Inheritance

Inheritance hierarchies capture shared features across inflectional classes.
Default inheritance allows for efficient representation of regular and irregular forms.
Example: Russian nouns sharing case and number features.
Overriding defaults to handle exceptions and semi-regular forms.
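
Default inheritance with overrides can be sketched using Python's `collections.ChainMap`, which looks a key up in the most specific table first and falls back to the shared one. The endings are transliterated and deliberately partial, and the class names are hypothetical labels chosen for this example.

```python
from collections import ChainMap

# Defaults shared by all nouns (transliterated endings, partial on purpose).
NOUN = {("nom", "pl"): "y", ("dat", "pl"): "am"}

# Each class lists only what it adds or overrides; lookup goes left to right.
CLASS_ZAKON = ChainMap({("nom", "sg"): "", ("gen", "sg"): "a"}, NOUN)
CLASS_KARTA = ChainMap({("nom", "sg"): "a", ("gen", "sg"): "y"}, NOUN)
CLASS_VINO = ChainMap(
    {("nom", "sg"): "o", ("gen", "sg"): "a", ("nom", "pl"): "a"}, NOUN
)


def inflect(stem, noun_class, case, number):
    """Attach the ending stored in (or inherited by) the paradigm cell."""
    return stem + noun_class[(case, number)]


print(inflect("kart", CLASS_KARTA, "nom", "pl"))   # karty: inherited default
print(inflect("zakon", CLASS_ZAKON, "dat", "pl"))  # zakonam: inherited default
print(inflect("vin", CLASS_VINO, "nom", "pl"))     # vina: class-level override
```

Shared endings are stated once at the top of the hierarchy, and a class only records the cells where it departs from the default.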
Handling Exceptions with
Paradigms
• Some words have unique forms.
• Example:
- English: mouse → mice (not mouses)
• Paradigm models store exceptions effectively.
Combining Paradigms with FSM
• Some NLP systems use both approaches.
• FSM for simple rules, paradigms for complex cases.
• Improves accuracy in word analysis.
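
One common way to combine the two ideas is to consult a table of stored exceptions first and apply the regular rule only when no entry exists. The sketch below assumes a tiny hand-written exception table and a simplified spelling rule; both are illustrative, not a description of any particular system.

```python
# Stored exceptions (paradigm-style lookup); entries chosen for illustration.
IRREGULAR_PLURALS = {"mouse": "mice", "child": "children", "foot": "feet"}


def regular_plural(noun):
    """Rule-based fallback: '-es' after a sibilant spelling, '-s' otherwise."""
    if noun.endswith(("s", "x", "z", "ch", "sh")):
        return noun + "es"
    return noun + "s"


def plural(noun):
    """Exception table first, regular rule second."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    return regular_plural(noun)


for noun in ("mouse", "fox", "cat"):
    print(noun, "->", plural(noun))  # mice, foxes, cats
```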
Future of Paradigm-Based Models
• AI and machine learning are improving word analysis.
• Hybrid models integrating FSM and paradigms.
• Better handling of multi-language morphology.
Advances in Lexical Analysis
• Deep learning is improving morphological analysis.
• Neural networks can learn complex word patterns.
Hybrid Approaches
• Combining FSM and Paradigm models for better results.
• Example:
- AI systems using rule-based + machine learning models.
Challenges in Lexical Analysis
• Handling low-resource languages.
• Building more robust models for diverse languages.
Why Lexical Analysis Matters
• Helps in machine translation, search engines, and AI
applications.
• Improves text understanding and generation.
Applications of Lexical Analysis
Machine Translation (MT): Mapping between source and target
language morphological structures.
Information Retrieval (IR): Aids in stemming and generating
search terms.
Text Preprocessing: Used for syntactic analysis and tokenization.
Example: Tokenization in languages without clear word
boundaries.
Challenges in Lexical Analysis
Handling morphologically rich languages.
Dealing with ambiguity in morphological analysis.
Integrating rule-based and statistical methods.
Example: Ambiguity in Russian case and number forms.
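
The ambiguity can be made concrete by looking a surface form up in reverse. The sketch assumes a transliterated and partial set of endings for a noun like 'karta' (map); the table and the function name are invented for this example.

```python
# Partial, transliterated endings for a 'karta'-type noun (illustrative only).
ENDINGS = {
    ("nom", "sg"): "a",
    ("gen", "sg"): "y",
    ("acc", "sg"): "u",
    ("nom", "pl"): "y",
}


def analyses(form, stem="kart"):
    """Return every (case, number) cell whose ending yields the given form."""
    return sorted(
        cell for cell, ending in ENDINGS.items() if stem + ending == form
    )


print(analyses("karty"))  # [('gen', 'sg'), ('nom', 'pl')]
print(analyses("kartu"))  # [('acc', 'sg')]
```

A single surface form maps to more than one cell, so the analyzer has to return all candidates and leave the choice to context.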
Future Directions in Lexical
Analysis
Enhancing FSM with paradigm-based insights.
Developing more robust statistical models for morphological
analysis.
Exploring the role of lexical analysis in multilingual NLP
applications.
Integrating symbolic and statistical approaches for better
performance.
Summary of Key Points
• Lexical analysis is crucial for NLP.
• FSM works well for regular words, but has limitations.
• Paradigm-based models are more flexible.
Final Thoughts
• Ongoing research is improving lexical analysis techniques.
• The future lies in combining rule-based and AI models.
• Questions?
Thank You!
References
Indurkhya, Nitin, and Fred J. Damerau. Handbook of natural
language processing. Chapman and Hall/CRC, 2010.
