
Module 3: Morphology Morphological Parsing With Finite State

The document discusses morphological parsing using finite state automata. It provides examples of morphological parsing of words into stems and morphological features. It also describes how to build a morphological parser using a lexicon, morphotactics and spelling rules which can be modeled using finite state automata.


Natural Language Processing

CSE4022

Lecture-12: Morphology - Module 3


Morphological Parsing with Finite State

Dr. Durgesh Kumar


Assistant Professor, SCOPE, VIT Vellore
Table of contents

1 Recap Module-3: Morphology
2 Morphological Parsing using Finite State Automata
3 Real-world Applications

September 11, 2022
Dr. Durgesh Kumar | Lecture-02 | NLP | CSE4022
Recap from previous lectures

Morphemes, Morphology
2 types of Morphemes
Stem (root)
Affixes: Prefix, Suffix, Infix, Circumfix
2 types of Morphology
1 Inflectional Morphology
2 Derivational Morphology

Stemming vs Lemmatization
Popular stemming algorithms
1 Porter Stemmer - class nltk.stem.porter.PorterStemmer
2 Lancaster Stemmer - class nltk.stem.lancaster.LancasterStemmer
3 Snowball Stemmer - class nltk.stem.snowball.SnowballStemmer
4 Regexp Stemmer - class nltk.stem.regexp.RegexpStemmer
Recap from previous lectures

Document representation and pre-processing for various NLP tasks

1 Traditional approach
2 Embedding-based approach
Traditional approach: entries in the document-term matrix
binary (presence or absence of a word)
term frequency (tf)
term frequency-inverse document frequency (tf-idf)
Feature selection
Matrix-factorization-based approaches: EVD, SVD, LSI, pLSI
Probabilistic based approaches: Information Gain (IG), Normalized
Mutual Information (NMI)
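
The three kinds of document-term matrix entries listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the toy corpus is made up, and the weighting tf * log(N / df) is only one of several common tf-idf variants (libraries often add smoothing or normalization).

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "free offer click now",
    "meeting schedule for monday",
    "free free meeting offer",
]

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
df = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}

def binary_row(tokens):
    """1 if the word is present in the document, else 0."""
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

def tf_row(tokens):
    """Raw term frequency: number of occurrences in the document."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def tfidf_row(tokens):
    """tf * log(N / df), an unsmoothed tf-idf variant (assumption)."""
    counts = Counter(tokens)
    return [counts[w] * math.log(n_docs / df[w]) for w in vocab]

binary_matrix = [binary_row(t) for t in tokenized]
tf_matrix = [tf_row(t) for t in tokenized]
tfidf_matrix = [tfidf_row(t) for t in tokenized]
```

Each row is one document and each column one vocabulary word, so the number of columns grows with the vocabulary; this is where feature selection and the curse of dimensionality come in.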

Recap from previous lectures

Important job-interview question: Curse of dimensionality


Nature paper
My Great learning Blog
deepai.org blog
Ungraded assignment: Spam email classification
For a practical understanding of word tokenization,
document representation, and feature selection using spam email

Morphological Parsing

Given a word, find its root word and morphological features such as:
+ N (Noun)
+ V (Verb)
+ A (Adjective)
+ R (Adverb)
+ SG (Singular)
+ PL (Plural)
+ PRES-PART (Present Participle), ending with -ing
+ PAST-PART (Past Participle)
Morphological Parsing Examples

Table: Morphological Parsing examples

S.No  Input    Morphological Parse Output
1     cats     cat + N + PL
2     cat      cat + N + SG
3     cities   city + N + PL
4     geese    goose + N + PL
5     goose    (goose + N + SG) or (goose + V)
6     gooses   goose + V + 3SG
7     merging  merge + V + PRES-PART
8     caught   (catch + V + PAST-PART) or (catch + V + PAST)

First column: input words.
Second column: stem followed by its morphological features.
Morphological Parsing Examples (Contd.)

Table: Morphological Parsing examples

S.No  Input      Morphological Parse Output
1     dogs       dog + N + PL
2     dog        dog + N + SG
3     walked     (walk + V + PAST) or (walk + V + PAST-PART)
4     running    run + V + PRES-PART
5     beautiful  beauty + ful + A (Adjective)
6     quickly    quick + ly + R (Adverb)

How to build Morphological Parser?

We need the following to build a morphological parser:

1 Lexicon: list of stems and affixes, plus basic information about them
(e.g., noun stem or verb stem)
2 Morphotactics: model of morpheme ordering (which classes of
morphemes follow which other classes)
The plural morpheme follows a noun rather than preceding it, e.g., cats,
dogs
3 Orthographic or spelling rules: spelling changes that occur in a word
when two morphemes combine
y => ie (City + s => Cities)
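
The three ingredients above can be combined into a very small parser. The sketch below is only an illustration under stated assumptions: the lexicon entries, the `MORPHOTACTICS` table, and the helper names `undo_spelling` and `parse` are hypothetical, and only the single spelling rule y => ie is handled.

```python
# 1. Lexicon: stems and their class (toy entries, chosen for illustration).
LEXICON = {"cat": "N", "dog": "N", "city": "N"}

# 2. Morphotactics: which suffix may follow which stem class, and the
#    feature it contributes. The empty suffix is the bare (singular) form.
MORPHOTACTICS = {"N": {"": "SG", "s": "PL"}}

def undo_spelling(surface):
    """3. Spelling rules, applied in reverse: yield (stem, suffix) candidates."""
    yield surface, ""                    # bare stem, no suffix
    if surface.endswith("ies"):
        yield surface[:-3] + "y", "s"    # cities -> city + s  (y => ie)
    if surface.endswith("s"):
        yield surface[:-1], "s"          # cats -> cat + s

def parse(surface):
    """Return every parse licensed by the lexicon and the morphotactics."""
    parses = []
    for stem, suffix in undo_spelling(surface.lower()):
        stem_class = LEXICON.get(stem)
        if stem_class is None:
            continue                     # candidate stem is not in the lexicon
        feature = MORPHOTACTICS[stem_class].get(suffix)
        if feature is not None:
            parses.append(f"{stem} + {stem_class} + {feature}")
    return parses

print(parse("cats"))    # ['cat + N + PL']
print(parse("cities"))  # ['city + N + PL']
print(parse("cat"))     # ['cat + N + SG']
```

The sketch overgenerates: it would also accept the misspelling "citys" as city + N + PL, because the plain -s split is not blocked for stems ending in y. A fuller treatment would state the spelling rules more precisely.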

Lexicon and Morphotactics

Lexicon
Lexicon: a repository of words.
Question: How to store the lexicon?

1 Naive way: store an explicit list of every word of a language,
including abbreviations (M.D., DK, P.M.) and proper names
(Tashi Phuntsho, Balaji P J, Ishan Yadav, Soumya Jha, Shah
Tanmay Biren, ...)

Cons of the naive way: inconvenient and, in practice, impossible, since
new words, names, and abbreviations keep entering the language.
2 Efficient way: store a list of the stems and affixes of the language,
together with a representation of the morphotactics
that tells how they fit together.

How to model morphotactics?
One of the simplest ways to model morphotactics is with a Finite State
Automaton (FSA). For English nominal inflection, the lexicon is split into:
1 regular nouns: take the regular -s plural (e.g., cat, dog, kid).
Assumption: the plural only adds -s as a suffix; exceptions such as fox
(plural foxes) are ignored for now.
2 irregular singular nouns, e.g., goose, mouse
3 irregular plural nouns, e.g., geese, mice

Figure: Finite State Automata (FSA) for English nominal inflection.
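
Since the figure itself is not reproduced here, the following sketch assumes the usual three-state layout for this automaton: a regular noun leads from q0 to q1, the plural -s leads from q1 to q2, and irregular singular or plural forms lead from q0 straight to q2. State names and the `accepts` helper are illustrative, and the input is assumed to be already split into morphemes.

```python
# Morpheme classes from the slide (toy lexicon).
NOUN_CLASSES = {
    "reg-noun": {"cat", "dog", "kid"},
    "irreg-sg-noun": {"goose", "mouse"},
    "irreg-pl-noun": {"geese", "mice"},
}

# Transition table: (state, morpheme class) -> next state.
# The state layout is an assumption; the original figure is not available.
TRANSITIONS = {
    ("q0", "reg-noun"): "q1",
    ("q0", "irreg-sg-noun"): "q2",
    ("q0", "irreg-pl-noun"): "q2",
    ("q1", "plural-s"): "q2",
}
FINAL_STATES = {"q1", "q2"}

def classify(morpheme):
    """Map a morpheme to its class label, or None if it is unknown."""
    if morpheme == "s":
        return "plural-s"
    for name, members in NOUN_CLASSES.items():
        if morpheme in members:
            return name
    return None

def accepts(morphemes):
    """Run the FSA over a sequence of morphemes, e.g. ['cat', 's']."""
    state = "q0"
    for morpheme in morphemes:
        state = TRANSITIONS.get((state, classify(morpheme)))
        if state is None:
            return False          # no arc for this morpheme: reject
    return state in FINAL_STATES

print(accepts(["cat", "s"]))    # True
print(accepts(["geese"]))       # True
print(accepts(["goose", "s"]))  # False: irregular nouns take no plural -s here
```

Keeping the automaton at the level of morpheme classes means that new nouns only require a lexicon update, not new states.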

Verbal Inflection - FSA

irreg-past-verb-form
catch =⇒ caught
eat =⇒ ate
eat =⇒ eaten (past participle)

reg-verb-stem
walk =⇒ walk
fry =⇒ fry
talk =⇒ talk
impeach =⇒ impeach

preterite (-ed): simple past tense
walk =⇒ walked
fry =⇒ fried
talk =⇒ talked
impeach =⇒ impeached

Figure: FSA for English Verbal Inflection.
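
An FSA can be used for generation as well as recognition. The sketch below enumerates every morpheme sequence accepted by a verbal-inflection automaton. The arc layout is an assumption based on the common textbook version of this figure (regular stems take -ed, -ing and -s; irregular stems take only -ing and -s; irregular past forms take no suffix), and all names are illustrative.

```python
# Morpheme classes (toy lexicon built from the slide examples).
MORPHEMES = {
    "reg-verb-stem": ["walk", "fry", "talk", "impeach"],
    "irreg-verb-stem": ["catch", "eat"],
    "irreg-past-verb-form": ["caught", "ate", "eaten"],
    "past (-ed)": ["ed"],
    "pres-part (-ing)": ["ing"],
    "3sg (-s)": ["s"],
}

# Arcs: state -> list of (morpheme class, next state). Layout is assumed.
ARCS = {
    "q0": [("reg-verb-stem", "q1"),
           ("irreg-verb-stem", "q2"),
           ("irreg-past-verb-form", "q3")],
    "q1": [("past (-ed)", "q3"), ("pres-part (-ing)", "q3"), ("3sg (-s)", "q3")],
    "q2": [("pres-part (-ing)", "q3"), ("3sg (-s)", "q3")],
    "q3": [],
}
FINAL_STATES = {"q1", "q2", "q3"}

def generate(state="q0", prefix=()):
    """Yield every accepted morpheme sequence reachable from `state`."""
    if state in FINAL_STATES:
        yield prefix
    for label, next_state in ARCS[state]:
        for morpheme in MORPHEMES[label]:
            yield from generate(next_state, prefix + (morpheme,))

for form in generate():
    print(" + ".join(form))
```

The output is at the morpheme level, so fry + ed still needs the spelling rules to surface as fried; the automaton only guarantees that sequences such as catch + ed or caught + s are never produced.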

Verbal Inflection
Table: Verbal inflection examples

S.No  Stem   Simple Past  Past Participle  3rd Singular (-s/-es/-ies)  Present Prog (-ing)
1     walk   walked       walked           walks                       walking
2     eat    ate          eaten            eats                        eating
3     arise  arose        arisen           arises                      arising
4     fry    fried        fried            fries                       frying
5     begin  began        begun            begins                      beginning

bleed =⇒ bleed, bled, bled, bleeds, bleeding
buy =⇒ buy, bought, bought, buys, buying
abide =⇒ abide, abode, abode, abides, abiding
Derivational Morphology - Adjective Inflection - FSA - Antworth's Proposal-1

big =⇒ big, bigger, biggest
cool =⇒ cool, cooler, coolest
clear =⇒ clear, clearer, clearest, clearly, unclear, unclearly
happy =⇒ happy, happier, happiest, unhappy, unhappier,
unhappiest, unhappily

Figure: FSA for English adjective morphology - Antworth's proposal-1.

Adjective Inflection - FSA - Antworth's Proposal-2

happy =⇒ happy, happier, happiest, unhappy, unhappier,
unhappiest, unhappily

Figure: FSA for English adjective morphology - Antworth's proposal-2.

Derivational Morphology - FSA

Figure: FSA for another fragment of English derivational morphology.

Character-level FSA for a few English nouns with their inflections
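
A character-level automaton can be obtained by expanding each morpheme-class arc into one arc per letter. The sketch below does this with a trie-like construction; the noun lists reuse the earlier slide examples, and the function names are illustrative assumptions rather than the construction shown in the missing figure.

```python
def build_char_fsa(reg_nouns, irreg_sg, irreg_pl):
    """Expand the noun classes into a letter-by-letter automaton.

    Returns (transitions, finals): transitions maps (state, char) to the
    next state, states are integers, and 0 is the start state.
    """
    transitions = {}
    finals = set()

    def add_path(word):
        state = 0
        for ch in word:
            key = (state, ch)
            if key not in transitions:
                # Every new arc creates exactly one new state.
                transitions[key] = len(transitions) + 1
            state = transitions[key]
        return state

    for noun in reg_nouns:
        finals.add(add_path(noun))         # singular: cat
        finals.add(add_path(noun + "s"))   # plural via the -s arc: cats
    for word in list(irreg_sg) + list(irreg_pl):
        finals.add(add_path(word))         # goose, geese, ... (no -s arc)
    return transitions, finals

def accepts(word, transitions, finals):
    """Follow one arc per character; accept only if we stop in a final state."""
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in finals

fsa = build_char_fsa(["cat", "dog", "kid"], ["goose", "mouse"], ["geese", "mice"])
print(accepts("cats", *fsa))    # True
print(accepts("geese", *fsa))   # True
print(accepts("gooses", *fsa))  # False
```

Words with a common prefix, such as goose and geese, share their initial states, which is what keeps the character-level automaton compact.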

Real-world Application Examples - Ungraded Assignments

1 Text Classification: classify an email as SPAM or non-spam
using the Enron email dataset.
2 Web Crawling and Scraping - for advanced learners
Develop a web crawler and scraper to

Text Classification - SPAM Email Classification

Problem Statement: Classify an email as SPAM or non-SPAM (HAM)
Datasets:
Enron email dataset: contains data from about 150 users, mostly
senior management of Enron, organized into folders. The corpus
contains a total of about 0.5M messages.

SPAM Email Classification using the ENRON dataset

TASK:
1 Download the raw and processed forms of the Enron email dataset from
link
2 Write code to count the total number of words (tokens) and the number of
unique words in the processed dataset:
without removing stop words and without word normalization
(stemming/lemmatization)
after removing stop words
after removing stop words and stemming
after removing stop words and lemmatization.

3 Analyze the 100 most frequent and least frequent (rare) words.
Deadline 10/09/2022 (Saturday) before 5 PM
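
As a starting point for task 2, the counting logic can be sketched independently of the dataset. Everything here is a stand-in chosen so the sketch runs without downloads: `emails` is a hypothetical list of message strings (loading the Enron files is not shown), `STOP_WORDS` is a tiny toy list, and `crude_stem` is a deliberately simplistic placeholder, not the Porter algorithm. In practice one would plug in a real stop-word list and a stemmer or lemmatizer, for example from NLTK.

```python
import re
from collections import Counter

# Toy stop-word list (assumption); replace with a real list in practice.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "for", "on"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def crude_stem(token):
    """Placeholder normalizer: strips a plural -s, then -ing or -ed."""
    if token.endswith("s") and len(token) > 3:
        token = token[:-1]
    for suffix in ("ing", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def count_tokens(emails, remove_stop_words=False, normalize=None):
    """Count tokens over all emails, with optional filtering and normalization."""
    counts = Counter()
    for text in emails:
        tokens = tokenize(text)
        if remove_stop_words:
            tokens = [t for t in tokens if t not in STOP_WORDS]
        if normalize is not None:
            tokens = [normalize(t) for t in tokens]
        counts.update(tokens)
    return counts

# Hypothetical sample messages standing in for the processed dataset.
emails = ["The meeting is on Monday",
          "Meetings are scheduled for the trading desks"]

raw = count_tokens(emails)
no_stop = count_tokens(emails, remove_stop_words=True)
stemmed = count_tokens(emails, remove_stop_words=True, normalize=crude_stem)

for name, counts in [("raw", raw), ("no stop words", no_stop), ("stemmed", stemmed)]:
    print(name, "tokens:", sum(counts.values()), "unique:", len(counts))

most_frequent = stemmed.most_common(100)
least_frequent = sorted(stemmed.items(), key=lambda kv: kv[1])[:100]
```

Passing a lemmatizer function as `normalize` covers the lemmatization variant of the task with the same code path.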