Module2 NLP BAD613B Notes
Module – 2
Processing carried out at word level, including methods for characterizing word sequences,
identifying morphological variants, detecting and correcting misspelled words, and identifying
correct part-of-speech of a word.
Some simple regular expressions: First instance of each match is underlined in table
Characters are grouped by square brackets, matching one character from the class. For
example, /[abcd]/ matches a, b, c, or d, and /[0123456789]/ matches any digit. A dash specifies
a range, like /[5-9]/ or /[m-p]/. The caret at the start of a class negates the match, as in /[^x]/,
which matches any character except x. The caret is interpreted literally elsewhere.
• Regular expressions are case-sensitive (e.g., /s/ matches 's', not 'S').
• Use square brackets to handle case differences, like /[sS]/.
o /[sS]ana/ matches 'sana' or 'Sana'.
• The question mark (?) makes the previous character optional (e.g., /supernovas?/).
• The * allows zero or more occurrences (e.g., /b*/).
• /[ab]*/ matches zero or more occurrences of 'a' or 'b'.
• The + specifies one or more occurrences (e.g., /a+/).
• /[0-9]+/ matches a sequence of one or more digits.
• The caret (^) anchors the match at the start, and $ at the end of a line.
o /^The nature\.$/ matches only a line consisting exactly of the text 'The nature.'
• The dot (.) is a wildcard matching any single character (e.g., /./).
o Expression /.at/ matches with any of the string cat, bat, rat, gat, kat, mat, etc.
Special characters
RE Description
. The dot matches any single character.
\n Matches a new line character (or CR+LF combination).
\t Matches a tab (ASCII 9).
\d Matches a digit [0-9].
\D Matches a non-digit.
\w Matches an alphanumeric character.
\W Matches a non-alphanumeric character.
\s Matches a whitespace character.
\S Matches a non-whitespace character.
\ Use \ to escape special characters. For example, \. matches a dot, \* matches a *,
and \\ matches a backslash.
• The wildcard symbol can count characters, e.g., /.....berry/ matches ten-letter strings
ending in "berry".
• This matches "strawberry", "sugarberry", but not "blueberry" or "hackberry".
• To search for "Tanveer" or "Siddiqui", use the disjunction operator (|), e.g.,
"Tanveer|Siddiqui".
• The pipe symbol matches either of the two patterns.
• Sequences take precedence over disjunction, so parentheses are needed to group patterns.
• Enclosing patterns in parentheses allows disjunction to apply correctly.
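The operators above can be checked directly with Python's re module; the test strings below are illustrative examples, not taken from the original table.

```python
import re

# Square brackets handle case differences.
re.findall(r"[sS]ana", "sana and Sana")          # ['sana', 'Sana']

# ? makes the previous character optional; + requires one or more.
re.findall(r"supernovas?", "supernova, supernovas")
re.findall(r"[0-9]+", "room 42, floor 7")        # ['42', '7']

# Five wildcards before "berry" force a ten-letter match.
bool(re.fullmatch(r".....berry", "strawberry"))  # True
bool(re.fullmatch(r".....berry", "blueberry"))   # False (nine letters)

# Disjunction with the pipe operator.
re.findall(r"Tanveer|Siddiqui", "Tanveer Siddiqui")
```

Note that fullmatch is used for the /.....berry/ test: with findall over running text, the dot can match a space, which would wrongly pick up " blue" + "berry".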
3. Finite-State Automata
• Game Description: The game involves a board with pieces, dice or a wheel to generate random
numbers, and players rearranging pieces based on the number. There’s no skill or choice; the
game is entirely based on random numbers.
• States: The game progresses through various states, starting from the initial state (beginning
positions of pieces) to the final state (winning positions).
• Machine Analogy: A machine with input, memory, processor, and output follows a similar
process: it starts in an initial state, changes to the next state based on the input, and eventually
reaches a final state or gets stuck if the next state is undefined.
• Finite Automaton: This model, with finite states and input symbols, describes a machine that
automatically changes states based on the input, and it’s deterministic, meaning the next state is
fully determined by the current state and input.
Let ∑ = {a, b, c}, the set of states = {q0, q1, q2, q3, q4} with q0 being the start state and q4 the final state,
we have the following rules of transition:
1. From state q0 and with input a, go to state q1.
2. From state q1 and with input b, go to state q2.
3. From state q1 and with input c go to state q3.
4. From state q2 and with input b, go to state q4.
5. From state q3 and with input b, go to state q4.
• The nodes in this diagram correspond to the states, and the arcs to transitions.
Non-Deterministic Automata:
• For each state, there can be more than one transition on a given symbol, each leading to a different
state.
• This is shown in the figure, where there are two possible transitions from state q0 on input symbol a.
• The transition function of a non-deterministic finite-state automaton (NFA) maps Q × (Σ ∪ {ε}) to 2^Q, i.e., each state-input pair maps to a subset of Q.
Example:
1. Consider the deterministic automaton described in the above example and the input 'ac'. On input a the automaton moves from q0 to q1, and on input c from q1 to q3. The input is then exhausted in the non-final state q3, so the string 'ac' is rejected.
State-transition table
• The rows in this table represent states and the columns correspond to input.
• The entries in the table represent the transition corresponding to a given state-input pair.
• A ɸ entry indicates a missing transition.
• This table contains all the information needed by the FSA.
Input
State a b c
Start: q0 q1 ɸ ɸ
q1 ɸ q2 q3
q2 ɸ q4 ɸ
q3 ɸ q4 ɸ
Final: q4 ɸ ɸ ɸ
Deterministic finite-state automaton (DFA) and the state-transition table of the DFA
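The state-transition table can be encoded directly as a dictionary, giving a table-driven DFA simulator. This is a minimal sketch using the state names from the example; a missing (state, symbol) key plays the role of a ɸ entry.

```python
# Transition rules 1-5 from the example, encoded as a dictionary.
delta = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q1", "c"): "q3",
    ("q2", "b"): "q4",
    ("q3", "b"): "q4",
}

def accepts(string, start="q0", finals=("q4",)):
    state = start
    for symbol in string:
        if (state, symbol) not in delta:
            return False              # the automaton gets stuck (ɸ entry)
        state = delta[(state, symbol)]
    return state in finals

accepts("abb")  # True:  q0 -a-> q1 -b-> q2 -b-> q4 (final)
accepts("ac")   # False: halts in q3, which is not final
```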
Example
• Consider a language consisting of all strings containing only a’s and b’s and ending with baa.
• We can specify this language by the regular expression→ /(a|b)*baa$/.
• The NFA implementing this regular expression is shown & state-transition table for the NFA is
as shown below.
Input
State a b
Start: q0 {q0} {q0, q1}
q1 {q2} ɸ
q2 {q3} ɸ
Final: q3 ɸ ɸ
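The NFA table above can be simulated by tracking the *set* of states the machine could be in; this is a small sketch, with ɸ entries simply absent from the dictionary.

```python
# NFA for /(a|b)*baa/ from the state-transition table.
delta = {
    ("q0", "a"): {"q0"},
    ("q0", "b"): {"q0", "q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q3"},
}

def accepts(string, start="q0", finals=frozenset({"q3"})):
    states = {start}
    for symbol in string:
        # Follow every possible transition from every current state.
        states = set().union(*(delta.get((s, symbol), set()) for s in states))
    return bool(states & finals)

accepts("abbaa")  # True:  the string ends with baa
accepts("baab")   # False
```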
4. Morphological Parsing
• Morphology is a sub-discipline of linguistics.
• It studies word structure and the formation of words from smaller units (morphemes).
• The goal of morphological parsing is to discover the morphemes that build a given word.
• A morphological parser should be able to tell us that the word 'eggs' is the plural form of the noun
stem 'egg'.
Example:
The word 'bread' consists of a single morpheme.
'eggs' consists of two morphemes: the stem egg and the suffix -s
4.1 Two Broad classes of Morphemes:
1. Stems – Main morpheme, contains the central meaning.
2. Affixes – modify the meaning given by the stem.
o Affixes are divided into prefix, suffix, infix, and circumfix.
1. Prefix - morphemes which appear before a stem. (un-happy, be-waqt)
2. Suffix - morphemes applied to the end. (ghodha-on, gurramu-lu, bird-s, शीतलता)
3. Infixes - morphemes that appear inside a stem.
• English slang word "abso-bloody-lutely." The morpheme "-bloody-" is
inserted into the stem "absolutely" to emphasize the meaning.
4. Circumfixes - morphemes that may be applied to beginning & end of the stem.
• German word - gespielt (played) → ge+spiel+t
Spiel – play (stem)
4.2 Three main ways of word formation: Inflection, Derivation, and Compounding
Inflection: a root word combined with a grammatical morpheme to yield a word of the same class as the
original stem.
Ex. play (verb)+ ed (suffix) = Played (inflected form – past-tense)
Derivation: a root word combined with a grammatical morpheme to yield a word belonging to a different
class.
Compounding: The process of merging two or more words to form a new word.
Morphological analysis and generation deal with the inflection, derivation, and compounding processes in
word formation and are essential to many NLP applications:
1. In applications ranging from spelling correction to machine translation.
2. In Information retrieval – to identify the presence of a query word in a document in spite of
different morphological variants.
Morphological analysis can be avoided if an exhaustive lexicon is available that lists features for all the
word-forms of all the roots.
4.4 Stemmers:
• The simplest morphological systems
• Collapse morphological variations of a given word (word-forms) to one lemma or stem.
• Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:
o ier → y (e.g., earlier → early)
o ing → ε (e.g., playing → play)
• Stemming algorithms work in two steps:
(i) Suffix removal: This step removes predefined endings from words.
(ii) Recoding: This step adds predefined endings to the output of the first step.
• Two widely used stemming algorithms have been developed by Lovins (1968) and Porter (1980).
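The two-step scheme can be sketched as a tiny rule-based stemmer. The first two rewrite rules are the ones quoted above; the ies/s rules are illustrative additions, and the rule's replacement string stands in for the recoding step. Real stemmers (Lovins 1968, Porter 1980) use much larger, ordered rule sets.

```python
# Rewrite rules of the form suffix -> replacement.
suffix_rules = [("ier", "y"), ("ing", ""), ("ies", "y"), ("s", "")]

def stem(word):
    # Step (i): suffix removal via the first matching rewrite rule;
    # the replacement doubles as a very small recoding step (ii).
    for suffix, replacement in suffix_rules:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + replacement
    return word

stem("earlier")  # 'early'
stem("playing")  # 'play'
```

The length check is a crude guard against stemming short words like "sing" down to nothing; production stemmers use measure-based conditions instead.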
Surface Level → p l a y i n g
Lexical Level → p l a y +V +PP
• A morphological generator identifies the base form ("walk") and applies the appropriate suffix to generate different surface forms, like "walked" or "walking".
A finite-state transducer is a 6-tuple (Σ1, Σ2, Q, δ, S, F), where Q is a set of states, S is the initial state,
F ⊆ Q is a set of final states, Σ1 is the input alphabet, Σ2 is the output alphabet, and δ is a function
mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to 2^Q, i.e., each triple maps to a subset of Q:
δ: Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) → 2^Q
Thus, an FST is similar to an NFA except that transitions are made on strings rather than on single symbols
and, in addition, they have outputs. FSTs encode regular relations between regular languages, with the
upper language on the top and the lower language on the bottom. For a transducer T and string s, T(s)
represents the set of strings in the relation. FSTs are closed under union, concatenation, composition, and
Kleene closure, but not under intersection or complementation.
Two-level morphology using FSTs involves analyzing surface forms in two steps.
Step1: Words are split into morphemes, considering spelling rules and possible splits (e.g., "boxe + s" or
"box + s").
Step2: The output is a concatenation of stems and affixes, with multiple representations possible for each
word.
We need to build two transducers: one that maps the surface form to the intermediate form and another
that maps the intermediate form to the lexical form.
A transducer maps the surface form "lesser" to its comparative form, where ɛ represents the empty string.
This bi-directional FST can be used for both analysis (surface to base) and generation (base to surface).
• The plural form of regular nouns usually ends with -s or -es (though a word ending in -s is not necessarily a plural – class, miss, bus).
• One of the required translations is the deletion of the 'e' when introducing a morpheme boundary.
o E.g. Boxes, This deletion is usually required for words ending in xes, ses, zes.
• This is done by below transducer – Mapping English nouns to the intermediate form:
Bird+s
Box+e+s
Quiz+e+s
• The next step is to develop a transducer that does the mapping from the intermediate level to the
lexical level. The input to transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
• Regular noun stem + s, e.g., bird + s
• Singular irregular noun stem, e.g., goose
• Plural irregular noun stem, e.g., geese
• In the first case, the transducer has to map all symbols of the stem to themselves and then output
N and sg.
• In the second case, it has to map all symbols of the stem to themselves, but then output N and
replaces PL with s.
• In the third case, it has to do the same as in the first case.
• Finally, in the fourth case, the transducer has to map the irregular plural noun stem to the
corresponding singular stem (e.g., geese to goose) and then it should add N and PL.
The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer encoding a lexicon.
The transducer implementing the lexicon maps the individual regular and irregular noun stems to their
correct noun stem, replacing labels like regular noun form, etc.
This lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose in the
following way:
g:g e:o e:o s:s e:e
Mapping for the regular surface form of bird is b:b i:i r:r d:d. Representing pairs like a:a with a single
letter, these two representations are reduced to g e:o e:o s e and b i r d respectively.
Composing this transducer with the previous one, we get a single two-level transducer with one input
tape and one output tape. This maps plural nouns into the stem plus the morphological marker + pl and
singular nouns into the stem plus the morpheme + sg. Thus a surface word form birds will be mapped to
bird + N + pl as follows.
b:b i:i r:r d:d ε:+N s:+pl
Each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the morphological
feature +pl. Figure shows the resulting composed transducer.
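The symbol-pair notation can be mimicked with lists of (surface, lexical) pairs. This is a toy sketch of reading the two tapes of such a transduction: the pair lists restate the geese/birds mappings above, while the helper functions are illustrative, not a real FST library.

```python
# Symbol pairs (surface, lexical); 'ε' marks the empty string on a tape.
geese = [("g", "g"), ("e", "o"), ("e", "o"), ("s", "s"), ("e", "e")]
birds = [("b", "b"), ("i", "i"), ("r", "r"), ("d", "d"),
         ("ε", "+N"), ("s", "+pl")]

def surface(pairs):
    """Read the surface (upper) tape, skipping ε."""
    return "".join(s for s, _ in pairs if s != "ε")

def lexical(pairs):
    """Read the lexical (lower) tape, skipping ε."""
    return "".join(l for _, l in pairs if l != "ε")

surface(geese), lexical(geese)  # ('geese', 'goose')
surface(birds), lexical(birds)  # ('birds', 'bird+N+pl')
```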
Typing mistakes: single character omission, insertion, substitution, and reversal are the most common
typing mistakes.
OCR errors: Usually grouped into five classes: substitution, multi-substitution (or framing), space
deletion, space insertion, and failure.
Substitution errors: Caused due to visual similarity such as c→e, 1→l, r→n.
The same is true for multi-substitution, e.g., m→rn.
Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
Solution: These errors can be corrected using 'context' or by using linguistic structures.
Phonetic errors:
• Spelling errors are often phonetic, with incorrect words sounding like correct ones.
• Phonetic errors are harder to correct due to more complex distortions.
• Phonetic variations are common in translation
• Non-word error:
o Word that does not appear in a given lexicon or is not a valid orthographic word form.
o The two main techniques to find non-word errors: n-gram analysis and dictionary lookup.
• Real-word error:
o It occurs due to typographical mistakes or spelling errors.
o E.g. piece for peace or meat for meet.
o May cause local syntactic errors, global syntactic errors, semantic errors, or errors at
discourse or pragmatic levels.
o Impossible to decide that a word is wrong without some contextual information
Spelling correction: consists of detecting and correcting errors. Error detection is the process of finding
misspelled words and error correction is the process of suggesting correct words to a misspelled one.
These sub-problems are addressed in two ways:
Isolated-error detection and correction: Each word is checked separately, independent of its context.
Context dependent error detection and correction methods: Utilize the context of a word. This requires
grammatical analysis and is thus more complex and language dependent. The list of candidate words must
first be obtained using an isolated-word method before making a selection depending on the context.
Minimum edit distance The minimum edit distance between two strings is the minimum number of
operations (insertions, deletions, or substitutions) required to transform one string into another.
Similarity key techniques The basic idea in a similarity key technique is to change a given string into a
key such that similar strings will change into the same key.
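Soundex is the best-known similarity key. The sketch below is a simplified version (it omits the h/w separator rule of full Soundex) but shows the idea: similar-sounding strings collapse to the same key.

```python
def soundex(word):
    """Simplified Soundex: keep the first letter, code the rest as digits."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    key, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")       # vowels and h/w/y get no digit
        if digit and digit != prev:     # collapse adjacent identical codes
            key += digit
        prev = digit
    return (key + "000")[:4]            # pad/truncate to letter + 3 digits

soundex("Robert")  # 'R163'
soundex("Rupert")  # 'R163' -- same key for similar-sounding names
```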
n-gram based techniques n-gram techniques usually require a large corpus or dictionary as training data,
so that an n-gram table of possible combinations of letters can be compiled. In case of real-word error
detection, we calculate the likelihood of one character following another and use this information to find
possible correct word candidates.
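A hedged sketch of letter-bigram analysis for non-word detection: the tiny word list below stands in for the large training dictionary the text mentions, and any word containing an unattested bigram is flagged.

```python
# Toy word list standing in for a large training dictionary (an assumption).
words = ["the", "then", "than", "bird", "birds", "box", "boxes"]
bigrams = {w[i:i + 2] for w in words for i in range(len(w) - 1)}

def unseen_bigrams(word):
    """Return the letter bigrams of `word` never seen in the training data."""
    return [word[i:i + 2] for i in range(len(word) - 1)
            if word[i:i + 2] not in bigrams]

unseen_bigrams("bird")  # [] -- all bigrams attested, word looks plausible
unseen_bigrams("bxrd")  # ['bx', 'xr'] -- flagged as a likely non-word
```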
Neural nets These have the ability to do associative recall based on incomplete and noisy data. They can
be trained to adapt to specific spelling error patterns. Note: They are computationally expensive.
Rule-based techniques In a rule-based technique, a set of rules (heuristics) derived from knowledge of
a common spelling error pattern is used to transform misspelled words into valid words.
The minimum edit distance is the number of insertions, deletions, and substitutions required to change
one string into another.
For example, the minimum edit distance between 'tutor' and 'tumour' is 2: We substitute 'm' for 't' and
insert 'u' before 'r'.
Edit distance can be viewed as a string alignment problem. By aligning two strings, we can measure the
degree to which they match. There may be more than one possible alignment between two strings.
Alignment 1:
t u t o - r
t u m o u r
The best possible alignment corresponds to the minimum edit distance between the strings. The alignment
shown here, between tutor and tumour, has a distance of 2.
A dash in the upper string indicates insertion. A substitution occurs when the two alignment symbols do
not match (shown in bold).
The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation;
the distance here is therefore 2.
Alignment 2:
Another possible alignment for these sequences is
t u t - o - r
t u - m o u r
which has a cost of 3.
Dynamic programming algorithms can be quite useful for finding minimum edit distance between two
sequences. (table-driven approach to solve problems by combining solutions to sub-problems).
The dynamic programming algorithm for minimum edit distance is implemented by creating an edit
distance matrix.
• This matrix has one row for each symbol in the source string and one column for each symbol in
the target string.
• The (i, j)th cell in this matrix represents the distance between the first i character of the source
and the first j character of the target string.
• Each cell can be computed as a simple function of its surrounding cells. Thus, by starting at the
beginning of the matrix, it is possible to fill each entry iteratively.
• The value in each cell is computed in terms of three possible paths.
• The substitution cost will be 0 if the ith character in the source matches the jth character in the target.
• The minimum edit distance algorithm is shown below.
• How the algorithm computes the minimum edit distance between tutor and tumour is shown in
table.
# t u m o u r
# 0 1 2 3 4 5 6
t 1 0 1 2 3 4 5
u 2 1 0 1 2 3 4
t 3 2 1 1 2 3 4
o 4 3 2 2 1 2 3
r 5 4 3 3 2 2 2
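The algorithm described above can be sketched as a dynamic-programming routine with unit costs; the values it computes for tutor/tumour agree with the table.

```python
def min_edit_distance(source, target):
    """Levenshtein distance via dynamic programming, unit costs."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                              # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j                              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[n][m]

min_edit_distance("tutor", "tumour")  # 2
```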
Minimum edit distance algorithms are also useful for determining accuracy in speech recognition
systems.
Table shows some of the word classes in English. Lexical categories and their properties vary from
language to language.
Word classes are further categorized as open and closed word classes.
• Open word classes constantly acquire new members while closed word classes do not (or only
infrequently do so).
• Nouns, verbs (except auxiliary verbs), adjectives, adverbs, and interjections are open word
classes.
e.g. computer, happiness, dog, run, think, discover, beautiful, large, happy, quickly, very, easily, oh, wow, ouch
• Prepositions, auxiliary verbs, determiners, pronouns, and conjunctions are closed word classes.
e.g. in, on, under, between, he, she, it, they, the, a, some, this, and, but, or, because