Module 3: Morphology Morphological Parsing With Finite State
CSE4022
3 Real-world Applications
Morphemes, Morphology
2 types of Morphemes
Stem (roots)
Affixes: Prefix, Suffix, Infix, Circumfix
2 types of Morphology
1 Inflectional Morphology
2 Derivational Morphology
Stemming vs Lemmatization
Popular stemming algorithms
1 Porter Stemmer - class nltk.stem.porter.PorterStemmer
2 LancasterStemmer - class nltk.stem.lancaster.LancasterStemmer
3 Snowball Stemmer - class nltk.stem.snowball.SnowballStemmer
4 RegexpStemmer - class nltk.stem.regexp.RegexpStemmer
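To make the stemming idea concrete, here is a minimal sketch of suffix stripping in plain Python, in the spirit of a regular-expression stemmer. The function name simple_regexp_stem, the suffix pattern, and the min_len parameter are assumptions made for illustration; this is not NLTK's implementation. It also shows that a stem need not be a dictionary word (for example "merg" or "citi"), which is one way stemming differs from lemmatization.

```python
import re

# Hypothetical suffix pattern, chosen only for illustration.
SUFFIXES = re.compile(r"(ing|ed|ly|es|s)$")


def simple_regexp_stem(word, min_len=4):
    """Strip one matching suffix; words shorter than min_len are left unchanged."""
    if len(word) < min_len:
        return word
    return SUFFIXES.sub("", word, count=1)


if __name__ == "__main__":
    for w in ["cats", "walked", "merging", "quickly", "cities", "is"]:
        print(w, "->", simple_regexp_stem(w))
```

A rule set this crude both over-strips and under-strips, which is why the algorithms listed above use ordered rule phases and extra conditions on the remaining stem.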
September 11, 2022
Dr. Durgesh Kumar | Lecture-02 | NLP | CSE4022
Recap from previous lectures
Given a word, find its root word and morphological features such as:
+ N (noun)
+ V (Verb)
+ A (Adjective)
+ R (Adverb)
+ SG (Singular)
+ PL (Plural)
+ PRES-PART (Present Participle) - ends in -ing
+ PAST-PART (Past Participle)
Morphological Parsing Examples
1 cats cat + N + PL
2 cat cat + N + SG
3 cities city + N + PL
4 geese goose + N + PL
5 goose (goose + N + SG) or (goose + V)
6 gooses goose + V + 3SG
7 merging merge + V + PRES-PART
8 caught (catch + V + PAST-PART) or (catch + V + PAST)
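The examples above can be reproduced with a small lookup-based sketch. Everything here is an assumption made for illustration: the word lists, the function name parse, and the three suffix rules. It is a toy, not a finite-state transducer, but it shows how an irregular-form table and regular suffix rules combine, and why some words receive more than one analysis.

```python
# Toy lexicon: every entry is an illustrative assumption.
REG_NOUNS = {"cat", "city", "dog"}       # nouns that take a regular plural
IRREG_SG_NOUNS = {"goose"}               # singular nouns with an irregular plural
VERB_STEMS = {"merge", "goose", "walk", "catch"}
IRREGULAR = {
    "geese": ["goose + N + PL"],
    "caught": ["catch + V + PAST-PART", "catch + V + PAST"],
}


def parse(word):
    """Return every analysis the toy rules license for `word`."""
    analyses = list(IRREGULAR.get(word, []))
    if word in REG_NOUNS or word in IRREG_SG_NOUNS:
        analyses.append(f"{word} + N + SG")
    if word in VERB_STEMS:
        analyses.append(f"{word} + V")

    # Regular -s endings: y -> ies, -es, and plain -s.
    candidates = []
    if word.endswith("ies"):
        candidates.append(word[:-3] + "y")
    if word.endswith("es"):
        candidates.append(word[:-2])
    if word.endswith("s"):
        candidates.append(word[:-1])
    for stem in candidates:
        if stem in REG_NOUNS:
            analyses.append(f"{stem} + N + PL")
        if stem in VERB_STEMS:
            analyses.append(f"{stem} + V + 3SG")

    # -ing, optionally restoring a dropped final e (merging -> merge).
    if word.endswith("ing"):
        base = word[:-3]
        for stem in (base, base + "e"):
            if stem in VERB_STEMS:
                analyses.append(f"{stem} + V + PRES-PART")
    return analyses


if __name__ == "__main__":
    for w in ["cats", "cities", "geese", "goose", "gooses", "merging", "caught"]:
        print(w, "->", parse(w))
```

Keeping "goose" out of the regular-noun set is what prevents "gooses" from being analysed as a plural noun, so only the verb reading survives.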
1 dogs
2 dog
3 walked
4 running
5 beautiful
6 quickly
1 dogs dog + N + PL
2 dog dog + N + SG
3 walked (walk + V + PAST-PART) or (walk + V + PAST)
4 running run + V + PRES-PART
5 beautiful beauty + A (Adjective) + ful
6 quickly quick + R (Adverb) + ly
Lexicon
Lexicon: a repository of words.
Question: How should a lexicon be stored?
1 Naive way: store an explicit list of every word of the language,
including abbreviations (M.D., DK, P.M.) and proper names
(Tashi Phuntsho, Balaji P J, Ishan Yadav, Soumya Jha, Shah
Tanmay Biren, ...).
Cons of the naive way: inconvenient and, in practice, impossible.
2 Efficient way: store a list of the stems and affixes of the language,
together with a representation of the morphotactics
that tells how they fit together.
irreg-past-verb-form:
  catch ⇒ caught
  eat ⇒ ate
  eat ⇒ eaten
reg-verb-stem:
  walk ⇒ walk
  fry ⇒ fry
  talk ⇒ talk
  impeach ⇒ impeach
preterite (-ed), simple past tense:
  walk ⇒ walked
  fry ⇒ fried
  talk ⇒ talked
  impeach ⇒ impeached
Bleed ⇒
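The "efficient way" of storing a lexicon can be sketched as stem and affix classes plus a small finite-state automaton that encodes the morphotactics. The state names (q0, q1, q2), the function names, and the choice to work on already-segmented morphemes are assumptions made for illustration. Spelling changes such as fry + -ed giving "fried" are deliberately left out; they need separate orthographic rules.

```python
# Stem and affix classes from the tables above (a small illustrative subset).
LEXICON = {
    "reg-verb-stem": {"walk", "fry", "talk", "impeach"},
    "irreg-past-verb-form": {"caught", "ate", "eaten"},
    "preterite": {"-ed"},
}

# Morphotactics as a finite-state automaton:
# (current state, morpheme class) -> next state. State names are hypothetical.
TRANSITIONS = {
    ("q0", "reg-verb-stem"): "q1",
    ("q0", "irreg-past-verb-form"): "q2",
    ("q1", "preterite"): "q2",
}
ACCEPTING = {"q1", "q2"}  # a bare regular stem (q1) is also a valid word


def classes_of(morpheme):
    """Return the names of all lexicon classes that contain `morpheme`."""
    return [name for name, members in LEXICON.items() if morpheme in members]


def accepts(morphemes):
    """Check whether a sequence of morphemes obeys the morphotactics."""
    states = {"q0"}
    for m in morphemes:
        states = {
            TRANSITIONS[(s, c)]
            for s in states
            for c in classes_of(m)
            if (s, c) in TRANSITIONS
        }
        if not states:
            return False
    return bool(states & ACCEPTING)


if __name__ == "__main__":
    for seq in (["walk"], ["walk", "-ed"], ["caught"], ["caught", "-ed"], ["-ed"]):
        print(seq, "->", accepts(seq))
```

Adding a new regular verb only requires one new entry in the reg-verb-stem class; the automaton itself stays unchanged, which is the main advantage over listing every word form.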
TASK:
1 Download the raw and processed forms of the Enron email dataset from
link
2 Write code to count the total number of words (tokens) and the number of
unique words in the processed dataset:
  without removing stop words and without word normalization
  (stemming/lemmatization)
  after removing stop words
  after removing stop words and stemming
  after removing stop words and lemmatization.
3 Analyze the 100 most frequent and the least frequent (rare) words.
Deadline: 10/09/2022 (Saturday), before 5 PM
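A possible starting point for the counting step is sketched below using only the Python standard library. The sample sentence, the tiny stop-word set, and the function names are assumptions made for illustration; they stand in for the processed dataset and for a full stop-word list. A stemmer or lemmatizer would be applied to each token before counting in the last two configurations.

```python
import re
from collections import Counter

# Tiny hypothetical stop-word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}


def tokenize(text):
    """Lowercase the text and extract runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_stats(tokens):
    """Return (total tokens, number of unique words, frequency table)."""
    freq = Counter(tokens)
    return len(tokens), len(freq), freq


# Stand-in for one processed email body (made-up text).
sample = "The meeting is moved to Monday and the report is due Monday"

tokens = tokenize(sample)
total, unique, freq = count_stats(tokens)

filtered = [t for t in tokens if t not in STOP_WORDS]
f_total, f_unique, f_freq = count_stats(filtered)

if __name__ == "__main__":
    print("all tokens:", total, "unique:", unique)
    print("without stop words:", f_total, "unique:", f_unique)
    print("most frequent:", f_freq.most_common(3))
    print("least frequent:", f_freq.most_common()[:-4:-1])
```

Counter.most_common(n) gives the n most frequent words; slicing the full most_common() list from the end gives the rare ones.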