Module2 NLP BAD613B Notes

The document covers Word Level Analysis and Syntactic Analysis in Natural Language Processing, focusing on techniques such as regular expressions, finite-state automata, and morphological parsing. It details methods for part-of-speech tagging, error detection, and the structure of words, including morphemes and their classifications. Additionally, it explains the concepts of deterministic and non-deterministic finite automata and their application in recognizing patterns in language.


Natural Language Processing [BAD613B]

Module – 2

Word Level Analysis & Syntactic Analysis


Word Level Analysis: Regular Expressions, Finite-State Automata, Morphological Parsing,
Spelling Error Detection and Correction, Words and Word Classes, Part-of Speech Tagging.

Syntactic Analysis: Context-Free Grammar, Constituency, Top-down and Bottom-up Parsing,


CYK Parsing.
Textbook 1: Ch. 3, Ch. 4.

Word Level Analysis


1. Introduction

Processing carried out at word level, including methods for characterizing word sequences,
identifying morphological variants, detecting and correcting misspelled words, and identifying
correct part-of-speech of a word.

1.1 The part-of-speech tagging methods:


1. Rule-based (linguistic).
2. Stochastic (data-driven).
3. Hybrid.
1.2 Regular expressions: standard notations for describing text patterns.
1.3 Implementing regular expressions using finite-state automata (FSA): applications
in speech recognition and synthesis, spell checking, and information extraction.
1.4 Detecting and correcting errors.

2. Regular Expressions (regexes)


• Pattern-matching standard for string parsing and replacement.
• Powerful way to find and replace strings that take a defined format.
• They are useful tools for the design of language compilers.
• Used in NLP for tokenization, describing lexicons, morphological analysis, etc.
• Perl was the first language that provided integrated support for regular expressions.
o It uses a slash “/” around each regular expression;
• A regular expression is an algebraic formula whose value is a pattern consisting of a set of strings,
called the language of the expression. Example: /a/

Dept. of CSE-DS, RNSIT Dr. Mahantesh K 1



Some simple regular expressions; each pattern matches its first instance in the example text:

Regular expression Example patterns


/book/ The world is a book, and those who do not travel read only one page.
/book/ Reporters, who do not read the stylebook, should not criticize their
editors.
/face/ Not everything that is faced can be changed. But nothing can be
changed until it is faced.
/a/ Reason, Observation, and Experience-the Holy Trinity of Science.

2.1 Character Classes

Characters are grouped by square brackets, matching one character from the class. For
example, /[abcd]/ matches a, b, c, or d, and /[0123456789]/ matches any digit. A dash specifies
a range, like /[5-9]/ or /[m-p]/. The caret at the start of a class negates the match, as in /[^x]/,
which matches any character except x. The caret is interpreted literally elsewhere.
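These character-class behaviours can be sketched with Python's re module (used here only as an illustrative stand-in for the Perl-style /…/ notation above):

```python
import re

# /[abcd]/ matches any one of a, b, c, or d; /[0123456789]/ matches any digit.
first_digit = re.search(r"[0123456789]", "Jan 2006").group()
in_range = re.search(r"[m-p]", "lamp").group()      # a dash specifies a range
# A caret at the start of a class negates it: any character except x.
not_x = re.search(r"[^x]", "xxy").group()
# Elsewhere the caret is literal: /[a^]/ matches 'a' or '^'.
literal_caret = re.search(r"[a^]", "^ has different uses.").group()
```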

Use of square brackets


RE Match Example patterns Matched
[abc] Match any of a, b, and c 'Refresher course will start
tomorrow'
[A-Z] Match any character between A and Z (ASCII order) 'The course will end on Jan. 10,
2006'
[^A-Z] Match any character other than an uppercase letter 'TREC Conference'

[^abc] Match anything other than a, b, and c 'TREC Conference'


[+*?.] Match any of +, *, ?, or the dot. '3 +2 = 5'
[a^] Match a or ^ ‘^ has different uses.’

• Regular expressions are case-sensitive (e.g., /s/ matches 's', not 'S').
• Use square brackets to handle case differences, like /[sS]/.
o /[sS]ana/ matches 'sana' or 'Sana'.
• The question mark (?) makes the previous character optional (e.g., /supernovas?/).
• The * allows zero or more occurrences (e.g., /b*/).
• /[ab]*/ matches zero or more occurrences of 'a' or 'b'.
• The + specifies one or more occurrences (e.g., /a+/).
• /[0-9]+/ matches a sequence of one or more digits.
• The caret (^) anchors the match at the start, and $ at the end of a line.
o /^The nature\.$/ matches a line consisting of exactly the text "The nature."
• The dot (.) is a wildcard matching any single character (e.g., /./).
o Expression /.at/ matches with any of the string cat, bat, rat, gat, kat, mat, etc.
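The quantifiers, anchors, and wildcard above can be demonstrated with Python's re module (again as an illustrative stand-in for the Perl-style notation):

```python
import re

# ? makes the preceding character optional: both forms match.
opt_plural = [bool(re.fullmatch(r"supernovas?", w))
              for w in ("supernova", "supernovas")]
# + requires one or more occurrences: a run of digits.
one_or_more = re.search(r"[0-9]+", "page 123 of 456").group()
# ^ and $ anchor the pattern to the start and end of the line.
anchored = bool(re.search(r"^The nature\.$", "The nature."))
# . is a wildcard for any single character: /.at/ matches cat, bat, rat, ...
wildcard = re.findall(r".at", "cat bat rat")
```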


Special characters

RE Description
. The dot matches any single character.
\n Matches a new line character (or CR+LF combination).
\t Matches a tab (ASCII 9).
\d Matches a digit [0-9].
\D Matches a non-digit.
\w Matches an alphanumeric character.
\W Matches a non-alphanumeric character.
\s Matches a whitespace character.
\S Matches a non-whitespace character.
\ Use \ to escape special characters. For example, \. matches a dot, \* matches a *,
and \\ matches a backslash.

• The wildcard symbol can count characters, e.g., /.....berry/ matches ten-letter strings
ending in "berry".
• This matches "strawberry", "sugarberry", but not "blueberry" or "hackberry".
• To search for "Tanveer" or "Siddiqui", use the disjunction operator (|), e.g.,
"Tanveer|Siddiqui".
• The pipe symbol matches either of the two patterns.
• Sequences take precedence over disjunction, so parentheses are needed to group patterns.
• Enclosing patterns in parentheses allows disjunction to apply correctly.
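Because sequencing binds tighter than |, parentheses are needed to scope a disjunction; a small Python sketch (the gupp(y|ies) pattern is an added illustration, not from the notes):

```python
import re

# The pipe alone disjoins the two whole sequences.
names = re.findall(r"Tanveer|Siddiqui", "Tanveer and Siddiqui")
# Parentheses limit the disjunction to the alternative endings only.
forms = re.findall(r"gupp(?:y|ies)", "one guppy, two guppies")
```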

Example: Suppose we need to check if a string is an email address or not.

^[A-Za-z0-9_\.-]+@[A-Za-z0-9_\.-]+[A-Za-z0-9_][A-Za-z0-9_]$


Pattern Description
^[A-Za-z0-9_\.-]+ Match a positive number of acceptable characters at the start of
the string.
@ Match the @ sign.
[A-Za-z0-9_\.-]+ Match any domain name, including a dot.
[A-Za-z0-9_][A-Za-z0-9_]$ Match two final acceptable characters, neither a dot. This ensures that
the email address ends with .xx, .xxx, .xxxx, etc.
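A cleaned-up rendering of this pattern in Python (a rough sketch of the idea, not a complete email validator):

```python
import re

# Acceptable characters, an @, a domain part, and two final characters
# that may not be dots (so the address ends in .xx, .xxx, ...).
EMAIL = re.compile(r"^[A-Za-z0-9_.\-]+@[A-Za-z0-9_.\-]+[A-Za-z0-9_][A-Za-z0-9_]$")

accepted = bool(EMAIL.match("user.name@example.com"))
rejected = bool(EMAIL.match("user@domain."))     # ends with a dot
```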

• The language of regular expressions is similar to formulas of Boolean logic.


• Regular languages may be encoded as finite state networks.
• A regular expression can contain symbol pairs, e.g., /a:b/, which represents a relation between
two strings.
• Regular languages can be encoded using finite-state automata (FSA), making it easier to
manipulate and process regular languages, which in turn aids in handling more complex
languages like context-free languages.


3. Finite-State Automata
• Game analogy: Consider a board game with pieces and dice or a wheel to generate random
numbers; players rearrange pieces based on the number. There is no skill or choice; the
game is driven entirely by random numbers.
• States: The game progresses through various states, starting from the initial state (beginning
positions of pieces) to the final state (winning positions).
• Machine Analogy: A machine with input, memory, processor, and output follows a similar
process: it starts in an initial state, changes to the next state based on the input, and eventually
reaches a final state or gets stuck if the next state is undefined.
• Finite Automaton: This model, with finite states and input symbols, describes a machine that
automatically changes states based on the input, and it’s deterministic, meaning the next state is
fully determined by the current state and input.

A finite automaton has the following properties:


1. A finite set of states, one of which is designated the initial or start state, and one or more of which are
designated as the final states.
2. A finite alphabet set, ∑, consisting of input symbols.
3. A finite set of transitions that specify for each state and each symbol of the input alphabet, the state to
which it next goes.
This finite-state automaton is shown as a directed graph, called transition diagram.

A deterministic finite-state automaton (DFA)

Let ∑ = {a, b, c}, the set of states = {q0, q1, q2, q3, q4} with q0 being the start state and q4 the final state,
we have the following rules of transition:
1. From state q0 and with input a, go to state q1.
2. From state q1 and with input b, go to state q2.
3. From state q1 and with input c go to state q3.
4. From state q2 and with input b, go to state q4.
5. From state q3 and with input b, go to state q4.

A finite automaton can be deterministic or non-deterministic.


Deterministic Automata:

• The nodes in this diagram correspond to the states, and the arcs to transitions.


• The arcs are labelled with inputs.


• The final state is represented by a double circle.
• There is exactly one transition leading out of each state. Hence, this automaton is deterministic.
• Any regular expression can be represented by a finite automaton and the language of any finite
automaton can be described by a regular expression.
• A deterministic finite-state automaton (DFA) is defined as a 5-tuple (Q, Σ, δ, S, F), where Q
is a set of states, Σ is an alphabet, S is the start state, F ⊆ Q is a set of final states, and δ is a
transition function.
• The transition function δ defines a mapping from Q × Σ to Q. That is, for each state q and symbol a,
there is at most one transition possible.

Non-Deterministic Automata:

• For each state, there can be more than one transition on a given symbol, each leading to a different
state.
• This is shown in Figure, where there are two possible transitions from state q0 on input symbol
a.
• The transition function of a non-deterministic finite-state automaton (NFA) maps Q × (Σ ∪ {ε})
to the power set of Q; that is, each transition leads to a set of possible next states.

Non-deterministic finite-state automaton (NFA)


How it Works for Regex – NLP?

• A path is a sequence of transitions beginning with the start state.


• A path leading to one of the final states is a successful path.
• The FSAs encode regular languages.
• The language that an FSA encodes is the set of strings that can be formed by concatenating the
symbols along each successful path.

Example:
1. Consider the deterministic automaton described in the example above and the input "ac".

• We start with state q0 and input symbol a and will go to state


q1.
• The next input symbol is c, we go to state q3.
• No more input is left and we have not reached the final state.
• Hence, the string ac is not recognized by the automaton.


2. Now, consider the input “acb”


• We start in state q0 and, on input a, go to state q1.
• The next input symbol is c, so we go to state q3.
• The next input symbol is b, which leads to state q4.
• No more input is left and we have reached the final state.
• The string acb is a word of the language defined by the automaton.

State-transition table

• The rows in this table represent states and the columns correspond to input.
• The entries in the table represent the transition corresponding to a given state-input pair.
• A ɸ entry indicates a missing transition.
• This table contains all the information needed by the FSA.

Input
State a b c
Start: q0 q1 ɸ ɸ
q1 ɸ q2 q3
q2 ɸ q4 ɸ
q3 ɸ q4 ɸ
Final: q4 ɸ ɸ ɸ
Deterministic finite-state automaton (DFA) and its state-transition table
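The transition table drives a direct simulation of this DFA; a minimal Python sketch (the dictionary encoding is mine):

```python
# Missing (ɸ) entries are simply absent from the dictionary: the
# automaton gets stuck and the string is rejected.
DELTA = {("q0", "a"): "q1",
         ("q1", "b"): "q2", ("q1", "c"): "q3",
         ("q2", "b"): "q4",
         ("q3", "b"): "q4"}
FINAL = {"q4"}

def accepts(string):
    state = "q0"
    for symbol in string:
        state = DELTA.get((state, symbol))
        if state is None:          # no transition defined: stuck
            return False
    return state in FINAL

acb_ok = accepts("acb")   # q0 -a-> q1 -c-> q3 -b-> q4, a final state
ac_ok = accepts("ac")     # stops in q3, which is not final
```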
Example

• Consider a language consisting of all strings containing only a’s and b’s and ending with baa.
• We can specify this language by the regular expression→ /(a|b)*baa$/.
• The NFA implementing this regular expression is shown & state-transition table for the NFA is
as shown below.
Input
State a b
Start: q0 {q0} {q0, q1}
q1 {q2} ɸ
q2 {q3} ɸ
Final: q3 ɸ ɸ

NFA for /(a|b)*baa$/ State transition table

An NFA can be converted to an equivalent DFA and vice versa.

DFA for /(a|b)*baa$/
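Non-determinism can be simulated directly by tracking the set of states the NFA may occupy, exactly as in the state-transition table above; a sketch:

```python
# Transition table of the NFA for /(a|b)*baa$/; values are state sets.
DELTA = {("q0", "a"): {"q0"}, ("q0", "b"): {"q0", "q1"},
         ("q1", "a"): {"q2"},
         ("q2", "a"): {"q3"}}
FINAL = {"q3"}

def nfa_accepts(string):
    current = {"q0"}                     # all states reachable so far
    for symbol in string:
        nxt = set()
        for state in current:
            nxt |= DELTA.get((state, symbol), set())
        current = nxt
    return bool(current & FINAL)         # successful if any path ends final

ends_in_baa = nfa_accepts("abbaa")
no_baa = nfa_accepts("baab")
```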


4. Morphological Parsing
• Morphology is a sub-discipline of linguistics.
• It studies word structure and the formation of words from smaller units (morphemes).
• The goal of morphological parsing is to discover the morphemes that build a given word.
• A morphological parser should be able to tell us that the word 'eggs' is the plural form of the noun
stem 'egg'.
Example:
The word 'bread' consists of a single morpheme.
'eggs' consists of two morphemes: egg and -s
4.1 Two Broad classes of Morphemes:
1. Stems – Main morpheme, contains the central meaning.
2. Affixes – modify the meaning given by the stem.
o Affixes are divided into prefix, suffix, infix, and circumfix.
1. Prefix - morphemes which appear before a stem. (un-happy, be-waqt)
2. Suffix - morphemes applied to the end. (ghodha-on, gurramu-lu, bidr-s, शीतलता)
3. Infixes - morphemes that appear inside a stem.
• English slang word "abso-bloody-lutely." The morpheme "-bloody-" is
inserted into the stem "absolutely" to emphasize the meaning.
4. Circumfixes - morphemes that may be applied to beginning & end of the stem.
• German word - gespielt (played) → ge+spiel+t
Spiel – play (stem)
4.2 Three main ways of word formation: Inflection, Derivation, and Compounding
Inflection: a root word combined with a grammatical morpheme to yield a word of the same class as the
original stem.
Ex. play (verb)+ ed (suffix) = Played (inflected form – past-tense)
Derivation: a root word combined with a grammatical morpheme to yield a word belonging to a different
class.

Ex. Compute (verb)+ion=Computation (noun).

Care (noun)+ ful (suffix)= careful (adjective).

Compounding: The process of merging two or more words to form a new word.

Ex. Personal computer, desktop, overlook.

Morphological analysis and generation deal with inflection, derivation and compounding process in
word formation and essential to many NLP applications:
1. Spelling corrections to machine translations.
2. In Information retrieval – to identify the presence of a query word in a document in spite of
different morphological variants.


4.3 Morphological parsing:


It converts inflected words into their canonical form (lemma) with syntactical and morphological tags
(e.g., tense, gender, number).
Morphological generation reverses this process, and both parsing and generation rely on a dictionary
of valid lemmas and inflection paradigms for correct word forms.
A morphological parser uses following information sources:
1. Lexicon: A lexicon lists stems and affixes together with basic information about them.
2. Morphotactics: The ordering among the morphemes that constitute a word; it describes the way
morphemes are arranged or attached to each other. Ex. Rest-less-ness is a valid word, but Rest-
ness-less is not.
3. Orthographic rules: Spelling rules that specify the changes that occur when two given
morphemes combine. Ex. 'easy' to 'easier' and not to 'easyer'. (y → ier spelling rule)

Morphological analysis can be avoided if an exhaustive lexicon is available that lists features for all the
word-forms of all the roots.

Ex. A sample lexicon entry:


Word form Category Root Gender Number Person
Ghodhaa Noun GhoDaa Masculine Singular 3rd

Ghodhii -do- -do- feminine -do- -do-

Ghodhon -do- -do- Masculine plural -do-

Ghodhe -do- -do- -do- -do- -do-

Limitations in Lexical entry:


1. It puts a heavy demand on memory.
2. Fails to show the relationship between different roots having similar word-forms.
3. Number of possible word-forms may be theoretically infinite (complex languages like Turkish).
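The lexicon-lookup approach, and its first and third limitations, can be sketched with a plain dictionary (entries and feature names follow the sample table above; any form not listed simply fails):

```python
# A toy exhaustive lexicon: every word-form carries its full features.
LEXICON = {
    "ghodhaa": {"category": "Noun", "root": "GhoDaa", "gender": "masculine", "number": "singular"},
    "ghodhii": {"category": "Noun", "root": "GhoDaa", "gender": "feminine", "number": "singular"},
    "ghodhon": {"category": "Noun", "root": "GhoDaa", "gender": "masculine", "number": "plural"},
}

def analyse(word_form):
    # Returns None for any form the lexicon fails to list: with a
    # theoretically unbounded number of word-forms, gaps are inevitable.
    return LEXICON.get(word_form.lower())

root = analyse("Ghodhon")["root"]
unseen = analyse("ghodhiyaan")        # a hypothetical unlisted form
```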

4.4 Stemmers:
• The simplest morphological systems
• Collapse morphological variations of a given word (word-forms) to one lemma or stem.
• Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:
o ier → y (e.g., earlier → early)
o ing → ε (e.g., playing → play)
• Stemming algorithms work in two steps:
(i) Suffix removal: This step removes predefined endings from words.
(ii) Recoding: This step adds predefined endings to the output of the first step.
• Two widely used stemming algorithms have been developed by Lovins (1968) and Porter (1980).


o Lovins's stemmer performs suffix removal and recoding sequentially.
e.g. earlier → first removes -ier, then recodes to early
o Porter's stemmer performs suffix removal and recoding simultaneously.
e.g. ational → ate
transforms a word such as 'rotational' into 'rotate'.
Limitations:
• It is difficult to use stemming with morphologically rich languages.
• E.g. Transformation of the word 'organization' into 'organ'
• It reduces only suffixes and prefixes; Compound words are not reduced. E.g. “toothbrush” or
“snowball” can’t be broken.
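A minimal rewrite-rule stemmer in Python, sketching the suffix-removal and recoding steps described above (the three rules are illustrative; they are not Lovins's or Porter's actual rule sets):

```python
# Each rule rewrites a suffix; an empty replacement means plain removal.
RULES = [("ier", "y"),   # earlier -> early (removal plus recoding)
         ("ing", ""),    # playing -> play
         ("s", "")]      # books -> book

def stem(word):
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word           # no rule applies: the word is its own stem

early = stem("earlier")
play = stem("playing")
```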
A more efficient two-level morphological model – Koskenniemi (1983)
• Morphological parsing is viewed as a mapping from the surface level into morpheme and feature
sequences on the lexical level.
• The surface level represents the actual spelling of the word.
• The lexical level represents the concatenation of its constituent morphemes.
e.g. 'playing' is represented at the
surface level by its actual spelling, and at the
lexical level by 'play' followed by the morphological information +V +PP

Surface Level → p l a y i n g
Lexical Level → p l a y +V +PP

Similarly, 'books' is represented in the lexical form as 'book + N + PL'


This model is usually implemented with a kind of finite-state automata, called finite-state transducer
(FST).
Finite-state transducer (FST)
• FST maps an input word to its morphological components (root, affixes, etc.) and can also
generate the possible forms of a word based on its root and morphological rules.
• An FST can be thought of as a two-state automaton, which recognizes or generates a pair of
strings.
E.g. Walking
Analysis (Decomposition):
The analyzer uses a transducer that:
• Identifies the base form ("walk") from the surface form ("walking").
• Recognizes the suffix ("-ing") and removes it.
Generation (Synthesis):
The generator uses another transducer that:


• Identifies the base form ("walk") and applies the appropriate suffix to generate different surface
forms, like "walked" or "walking".
A finite-state transducer is a 6-tuple (Σ1, Σ2, Q, δ, S, F), where Q is the set of states, S is the initial state,
F ⊆ Q is the set of final states, Σ1 is the input alphabet, Σ2 is the output alphabet, and δ is a function
mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to the power set of Q:

δ: Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) → 2^Q
Thus, an FST is similar to an NFA except that transitions are made on strings rather than on symbols
and, in addition, they have outputs. FSTs encode regular relations between regular languages, with the
upper language on the top and the lower language on the bottom. For a transducer T and string s, T(s)
represents the set of strings in the relation. FSTs are closed under union, concatenation, composition, and
Kleene closure, but not under intersection or complementation.

Two-step morphological parser

Two-level morphology using FSTs involves analyzing surface forms in two steps.

Fig. Two-step morphological parser

Step1: Words are split into morphemes, considering spelling rules and possible splits (e.g., "boxe + s" or
"box + s").

Step2: The output is a concatenation of stems and affixes, with multiple representations possible for each
word.

We need to build two transducers: one that maps the surface form to the intermediate form and another
that maps the intermediate form to the lexical form.
A transducer maps the surface form "lesser" to its comparative form, where ɛ represents the empty string.
This bi-directional FST can be used for both analysis (surface to base) and generation (base to surface).


FST-based morphological parser for singular and plural nouns in English

• The plural form of regular nouns usually ends with -s or -es (though a final -s does not necessarily
signal a plural: class, miss, bus).
• One of the required translations is the deletion of the 'e' when introducing a morpheme boundary.
o E.g. Boxes, This deletion is usually required for words ending in xes, ses, zes.
• This is done by below transducer – Mapping English nouns to the intermediate form:

Bird+s

Box+e+s

Quiz+e+s

Mapping English nouns to the intermediate form

• The next step is to develop a transducer that does the mapping from the intermediate level to the
lexical level. The input to transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
• Regular noun stem + s, e.g., bird + s
• Singular irregular noun stem, e.g., goose
• Plural irregular noun stem, e.g., geese
• In the first case, the transducer has to map all symbols of the stem to themselves and then output
N and sg.
• In the second case, it has to map all symbols of the stem to themselves, but then output N and
map the s to PL.
• In the third case, it has to do the same as in the first case.
• Finally, in the fourth case, the transducer has to map the irregular plural noun stem to the
corresponding singular stem (e.g., geese to goose) and then it should add N and PL.

Transducer for Step 2

The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer encoding a lexicon.
The transducer implementing the lexicon maps the individual regular and irregular noun stems to their
correct noun stem, replacing labels like regular noun form, etc.


This lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose in the
following way:
g:g e:o e:o s:s e:e
Mapping for the regular surface form of bird is b:b i:i r:r d:d. Representing pairs like a:a with a single
letter, these two representations are reduced to g e:o e:o s e and b i r d respectively.
Composing this transducer with the previous one, we get a single two-level transducer with one input
tape and one output tape. This maps plural nouns into the stem plus the morphological marker + pl and
singular nouns into the stem plus the morpheme + sg. Thus a surface word form birds will be mapped to
bird + N + pl as follows.
b:b i:i r:r d:d ε:+N s:+pl
Each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the
feature +pl. Figure shows the resulting composed transducer.

A transducer mapping nouns to their stem and morphological features
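The composed transducer's behaviour on "birds" can be replayed from its pair sequence (a sketch that walks the listed pairs rather than a general FST implementation; ε is written as the empty string):

```python
# The pair sequence b:b i:i r:r d:d ε:+N s:+pl from the text.
PAIRS = [("b", "b"), ("i", "i"), ("r", "r"), ("d", "d"),
         ("", "+N"),              # ε on the input side inserts +N
         ("s", "+pl")]

def transduce(surface, pairs):
    output, pos = [], 0
    for upper, lower in pairs:
        if upper:                              # consume one surface symbol
            if pos >= len(surface) or surface[pos] != upper:
                return None                    # the pairs do not apply
            pos += 1
        output.append(lower)
    return "".join(output) if pos == len(surface) else None

lexical = transduce("birds", PAIRS)            # bird + N + pl
```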

5. Spelling Error Detection and Correction

Typing mistakes: single character omission, insertion, substitution, and reversal are the most common
typing mistakes.

• Omission: When a single character is missed - 'concpt'


• Insertion error: Presence of an extra character in a word - 'error' is misspelled as 'errorn'
• Substitution error: When a wrong letter is typed in place of the right one - 'errpr'
• Reversal: A situation in which the sequence of characters is reversed - 'aer' instead of 'are'.

OCR errors: Usually grouped into five classes: substitution, multi-substitution (or framing), space
deletion, space insertion, and failures.
Substitution errors: Caused due to visual similarity such as c→e, 1→l, r→n.
The same is true for multi-substitution, e.g., m→rn.
Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
Solution: These errors can be corrected using 'context' or by using linguistic structures.

Phonetic errors:

• Speech recognition matches spoken utterances to a dictionary of phonemes.


• Spelling errors are often phonetic, with incorrect words sounding like correct ones.
• Phonetic errors are harder to correct due to more complex distortions.
• Phonetic variations are common in translation

Spelling errors: are classified as non-word or real-word errors.

• Non-word error:
o Word that does not appear in a given lexicon or is not a valid orthographic word form.
o The two main techniques to find non-word errors: n-gram analysis and dictionary lookup.
• Real-word error:
o It occurs due to typographical mistakes or spelling errors.
o E.g. piece for peace or meat for meet.
o May cause local syntactic errors, global syntactic errors, semantic errors, or errors at
discourse or pragmatic levels.
o Impossible to decide that a word is wrong without some contextual information

Spelling correction: consists of detecting and correcting errors. Error detection is the process of finding
misspelled words and error correction is the process of suggesting correct words to a misspelled one.
These sub-problems are addressed in two ways:

1. Isolated-error detection and correction


2. Context-dependent error detection and correction

Isolated-error detection and correction: Each word is checked separately, independent of its context.

• It requires the existence of a lexicon containing all correct words.


• Would take a long time to compile and occupy a lot of space.
• It is impossible to list all the correct words of highly productive languages.

Context-dependent error detection and correction methods: Utilize the context of a word. This requires
grammatical analysis and is thus more complex and language-dependent. The list of candidate words must
first be obtained using an isolated-word method before making a selection depending on the context.

The spelling correction algorithm:

Broadly categorized by Kukich (1992) as follows:

Minimum edit distance The minimum edit distance between two strings is the minimum number of
operations (insertions, deletions, or substitutions) required to transform one string into another.

Similarity key techniques The basic idea in a similarity key technique is to change a given string into a
key such that similar strings will change into the same key.


n-gram based techniques n-gram techniques usually require a large corpus or dictionary as training data,
so that an n-gram table of possible combinations of letters can be compiled. In case of real-word error
detection, we calculate the likelihood of one character following another and use this information to find
possible correct word candidates.
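A toy sketch of n-gram based non-word detection: bigrams absent from a (deliberately tiny, illustrative) training vocabulary flag likely errors.

```python
corpus = ["the", "concept", "correct", "words", "letter"]

def bigrams(word):
    return [a + b for a, b in zip(word, word[1:])]

# Table of letter combinations observed in the training data.
known = {bg for word in corpus for bg in bigrams(word)}

def suspicious(word):
    # Bigrams never seen in training suggest a non-word.
    return [bg for bg in bigrams(word) if bg not in known]

flagged = suspicious("concpt")     # the omission error from Section 5
clean = suspicious("correct")
```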

Neural nets These have the ability to do associative recall based on incomplete and noisy data. They can
be trained to adapt to specific spelling error patterns. Note: They are computationally expensive.

Rule-based techniques In a rule-based technique, a set of rules (heuristics) derived from knowledge of
a common spelling error pattern is used to transform misspelled words into valid words.

5.1 Minimum Edit Distance:

The minimum edit distance is the minimum number of insertions, deletions, and substitutions required to
change one string into another.

For example, the minimum edit distance between 'tutor' and 'tumour' is 2: We substitute 'm' for 't' and
insert 'u' before 'r'.

Edit distance can be viewed as a string alignment problem. By aligning two strings, we can measure the
degree to which they match. There may be more than one possible alignment between two strings.

Alignment 1:

t u t o - r
t u m o u r
The best possible alignment corresponds to the minimum edit distance between the strings. The alignment
shown here, between tutor and tumour, has a distance of 2.

A dash in the upper string indicates insertion. A substitution occurs when the two alignment symbols do
not match (shown in bold).

The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation;
the distance here is therefore 2.

Alignment 2:
Another possible alignment for these sequences is
t u t - o - r
t u - m o u r
which has a cost of 3.
Dynamic programming algorithms can be quite useful for finding minimum edit distance between two
sequences. (table-driven approach to solve problems by combining solutions to sub-problems).


The dynamic programming algorithm for minimum edit distance is implemented by creating an edit
distance matrix.

• This matrix has one row for each symbol in the source string and one column for each matrix in
the target string.
• The (i, j)th cell in this matrix represents the distance between the first i character of the source
and the first j character of the target string.
• Each cell can be computed as a simple function of its surrounding cells. Thus, by starting at the
beginning of the matrix, it is possible to fill each entry iteratively.
• The value in each cell is computed in terms of three possible paths.

• The substitution cost will be 0 if the ith character in the source matches the jth character in the target.
• The minimum edit distance algorithm is shown below.

Input: Two strings, X and Y

Output: The minimum edit distance between X and Y
m ← length(X)
n ← length(Y)
for i = 0 to m do
dist[i, 0] ← i
for j = 0 to n do
dist[0, j] ← j
for i = 1 to m do
for j = 1 to n do
dist[i, j] ← min{ dist[i-1, j] + insert_cost,
dist[i-1, j-1] + subst_cost(Xi, Yj),
dist[i, j-1] + delete_cost }
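A direct Python rendering of this algorithm with unit costs, reproducing the tutor/tumour distance of 2:

```python
def min_edit_distance(source, target):
    m, n = len(source), len(target)
    # dist[i][j]: distance between the first i characters of source
    # and the first j characters of target.
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i                       # delete all i source characters
    for j in range(n + 1):
        dist[0][j] = j                       # insert all j target characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            subst = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,           # insert_cost
                             dist[i - 1][j - 1] + subst,   # subst_cost
                             dist[i][j - 1] + 1)           # delete_cost
    return dist[m][n]

distance = min_edit_distance("tutor", "tumour")
```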

• How the algorithm computes the minimum edit distance between tutor and tumour is shown in
table.
# t u m o u r
# 0 1 2 3 4 5 6
t 1 0 1 2 3 4 5
u 2 1 0 1 2 3 4
t 3 2 1 1 2 3 4
o 4 3 2 2 1 2 3
r 5 4 3 3 2 2 2

Minimum edit distance algorithms are also useful for determining accuracy in speech recognition
systems.


6.Words & Word Classes

• Words are classified into categories called part-of-speech.


• These are sometimes called word classes or lexical categories.
• These lexical categories are usually defined by their syntactic and morphological behaviours.
• The most common lexical categories are nouns and verbs. Other lexical categories include
adjectives, adverbs, prepositions, and conjunctions.

NN noun Student, chair, proof, mechanism

VB verb Study, increase, produce
JJ adjective Large, high, tall, few
RB adverb Carefully, slowly, uniformly
IN preposition in, on, to, of
PRP pronoun I, me, they
DET determiner the, a, an, this, those

Table shows some of the word classes in English. Lexical categories and their properties vary from
language to language.

Word classes are further categorized as open and closed word classes.

• Open word classes constantly acquire new members while closed word classes do not (or only
infrequently do so).
• Nouns, verbs (except auxiliary verbs), adjectives, and adverbs are open word
classes.

e.g. computer, happiness, dog, run, think, discover, beautiful, large, happy, quickly, very, easily

• Prepositions, auxiliary verbs, pronouns, determiners, conjunctions, and interjections are closed word classes.
e.g. in, on, under, between, he, she, it, they, the, a, some, this, and, but, or, because, oh, wow,
ouch

Dept. of CSE-DS, RNSIT Dr. Mahantesh K