Module2 NLP BAD613B Notes
Module – 2
Processing carried out at word level, including methods for characterizing word sequences,
identifying morphological variants, detecting and correcting misspelled words, and identifying
correct part-of-speech of a word.
Some simple regular expressions: First instance of each match is underlined in table
Characters are grouped by square brackets, matching one character from the class. For
example, /[abcd]/ matches a, b, c, or d, and /[0123456789]/ matches any digit. A dash specifies
a range, like /[5-9]/ or /[m-p]/. The caret at the start of a class negates the match, as in /[^x]/,
which matches any character except x. The caret is interpreted literally elsewhere.
• Regular expressions are case-sensitive (e.g., /s/ matches 's', not 'S').
• Use square brackets to handle case differences, like /[sS]/.
o /[sS]ana/ matches 'sana' or 'Sana'.
• The question mark (?) makes the previous character optional (e.g., /supernovas?/).
• The * allows zero or more occurrences (e.g., /b*/).
• /[ab]*/ matches zero or more occurrences of 'a' or 'b'.
• The + specifies one or more occurrences (e.g., /a+/).
• /[0-9]+/ matches a sequence of one or more digits.
• The caret (^) anchors the match at the start, and $ at the end of a line.
o /^The nature\.$/ matches only a line consisting exactly of the text 'The nature.'
• The dot (.) is a wildcard matching any single character (e.g., /./).
o Expression /.at/ matches with any of the string cat, bat, rat, gat, kat, mat, etc.
Special characters
RE Description
. The dot matches any single character.
\n Matches a new line character (or CR+LF combination).
\t Matches a tab (ASCII 9).
\d Matches a digit [0-9].
\D Matches a non-digit.
\w Matches an alphanumeric character.
\W Matches a non-alphanumeric character.
\s Matches a whitespace character.
\S Matches a non-whitespace character.
\ Use \ to escape special characters. For example, \. matches a dot, \* matches a *,
and \\ matches a backslash.
• The wildcard symbol can count characters, e.g., /.....berry/ matches ten-letter strings
ending in "berry".
• This matches "strawberry", "sugarberry", but not "blueberry" or "hackberry".
• To search for "Tanveer" or "Siddiqui", use the disjunction operator (|), e.g.,
"Tanveer|Siddiqui".
• The pipe symbol matches either of the two patterns.
• Sequences take precedence over disjunction, so parentheses are needed to group patterns.
• Enclosing patterns in parentheses allows disjunction to apply correctly.
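The operators above can be checked directly with Python's re module; the test strings below are illustrative examples, not taken from the original table.

```python
import re

# Square brackets handle case differences.
re.findall(r"[sS]ana", "sana and Sana")          # ['sana', 'Sana']

# ? makes the previous character optional; + requires one or more.
re.findall(r"supernovas?", "supernova, supernovas")
re.findall(r"[0-9]+", "room 42, floor 7")        # ['42', '7']

# Five wildcards before "berry" force a ten-letter match.
bool(re.fullmatch(r".....berry", "strawberry"))  # True
bool(re.fullmatch(r".....berry", "blueberry"))   # False (nine letters)

# Disjunction with the pipe operator.
re.findall(r"Tanveer|Siddiqui", "Tanveer Siddiqui")
```

Note that fullmatch is used for the /.....berry/ test: with findall over running text, the dot can match a space, which would wrongly pick up " blue" + "berry".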
3. Finite-State Automata
• Game Description: The game involves a board with pieces, dice or a wheel to generate random
numbers, and players rearranging pieces based on the number. There’s no skill or choice; the
game is entirely based on random numbers.
• States: The game progresses through various states, starting from the initial state (beginning
positions of pieces) to the final state (winning positions).
• Machine Analogy: A machine with input, memory, processor, and output follows a similar
process: it starts in an initial state, changes to the next state based on the input, and eventually
reaches a final state or gets stuck if the next state is undefined.
• Finite Automaton: This model, with finite states and input symbols, describes a machine that
automatically changes states based on the input, and it’s deterministic, meaning the next state is
fully determined by the current state and input.
Let ∑ = {a, b, c}, the set of states = {q0, q1, q2, q3, q4} with q0 being the start state and q4 the final state,
we have the following rules of transition:
1. From state q0 and with input a, go to state q1.
2. From state q1 and with input b, go to state q2.
3. From state q1 and with input c go to state q3.
4. From state q2 and with input b, go to state q4.
5. From state q3 and with input b, go to state q4.
• The nodes in this diagram correspond to the states, and the arcs to transitions.
Non-Deterministic Automata:
• For each state, there can be more than one transition on a given symbol, each leading to a different
state.
• This is shown in the figure, where there are two possible transitions from state q0 on input symbol a.
• The transition function of a non-deterministic finite-state automaton (NFA) maps Q × (Σ ∪ {ε}) to 2^Q, i.e., each state-input pair maps to a subset of Q.
Example:
1. Consider the deterministic automaton described in the above example and the input 'ac'. On input a the automaton moves from q0 to q1, and on input c from q1 to q3. The input is then exhausted in the non-final state q3, so the string 'ac' is rejected.
State-transition table
• The rows in this table represent states and the columns correspond to input.
• The entries in the table represent the transition corresponding to a given state-input pair.
• A ɸ entry indicates a missing transition.
• This table contains all the information needed by the FSA.
Input
State a b c
Start: q0 q1 ɸ ɸ
q1 ɸ q2 q3
q2 ɸ q4 ɸ
q3 ɸ q4 ɸ
Final: q4 ɸ ɸ ɸ
Deterministic finite-state automaton (DFA) and the state-transition table of the DFA
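The state-transition table can be encoded directly as a dictionary, giving a table-driven DFA simulator. This is a minimal sketch using the state names from the example; a missing (state, symbol) key plays the role of a ɸ entry.

```python
# Transition rules 1-5 from the example, encoded as a dictionary.
delta = {
    ("q0", "a"): "q1",
    ("q1", "b"): "q2",
    ("q1", "c"): "q3",
    ("q2", "b"): "q4",
    ("q3", "b"): "q4",
}

def accepts(string, start="q0", finals=("q4",)):
    state = start
    for symbol in string:
        if (state, symbol) not in delta:
            return False              # the automaton gets stuck (ɸ entry)
        state = delta[(state, symbol)]
    return state in finals

accepts("abb")  # True:  q0 -a-> q1 -b-> q2 -b-> q4 (final)
accepts("ac")   # False: halts in q3, which is not final
```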
Example
• Consider a language consisting of all strings containing only a’s and b’s and ending with baa.
• We can specify this language by the regular expression→ /(a|b)*baa$/.
• The NFA implementing this regular expression is shown & state-transition table for the NFA is
as shown below.
Input
State a b
Start: q0 {q0} {q0, q1}
q1 {q2} ɸ
q2 {q3} ɸ
Final: q3 ɸ ɸ
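The NFA table above can be simulated by tracking the *set* of states the machine could be in; this is a small sketch, with ɸ entries simply absent from the dictionary.

```python
# NFA for /(a|b)*baa/ from the state-transition table.
delta = {
    ("q0", "a"): {"q0"},
    ("q0", "b"): {"q0", "q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q3"},
}

def accepts(string, start="q0", finals=frozenset({"q3"})):
    states = {start}
    for symbol in string:
        # Follow every possible transition from every current state.
        states = set().union(*(delta.get((s, symbol), set()) for s in states))
    return bool(states & finals)

accepts("abbaa")  # True:  the string ends with baa
accepts("baab")   # False
```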
4. Morphological Parsing
• Morphology is a sub-discipline of linguistics.
• It studies word structure and the formation of words from smaller units (morphemes).
• The goal of morphological parsing is to discover the morphemes that build a given word.
• A morphological parser should be able to tell us that the word 'eggs' is the plural form of the noun
stem 'egg'.
Example:
The word 'bread' consists of a single morpheme.
'eggs' consists of two morphemes: the stem egg and the suffix -s
4.1 Two Broad classes of Morphemes:
1. Stems – Main morpheme, contains the central meaning.
2. Affixes – modify the meaning given by the stem.
o Affixes are divided into prefix, suffix, infix, and circumfix.
1. Prefix - morphemes which appear before a stem. (un-happy, be-waqt)
2. Suffix - morphemes applied to the end. (ghodha-on, gurramu-lu, bird-s, शीतलता)
3. Infixes - morphemes that appear inside a stem.
• English slang word "abso-bloody-lutely." The morpheme "-bloody-" is
inserted into the stem "absolutely" to emphasize the meaning.
4. Circumfixes - morphemes that may be applied to beginning & end of the stem.
• German word - gespielt (played) → ge+spiel+t
Spiel – play (stem)
4.2 Three main ways of word formation: Inflection, Derivation, and Compounding
Inflection: a root word combined with a grammatical morpheme to yield a word of the same class as the
original stem.
Ex. play (verb)+ ed (suffix) = Played (inflected form – past-tense)
Derivation: a root word combined with a grammatical morpheme to yield a word belonging to a different
class.
Compounding: The process of merging two or more words to form a new word.
Morphological analysis and generation deal with the inflection, derivation, and compounding processes in
word formation and are essential to many NLP applications:
1. In applications ranging from spelling correction to machine translation.
2. In Information retrieval – to identify the presence of a query word in a document in spite of
different morphological variants.
Morphological analysis can be avoided if an exhaustive lexicon is available that lists features for all the
word-forms of all the roots.
4.4 Stemmers:
• The simplest morphological systems
• Collapse morphological variations of a given word (word-forms) to one lemma or stem.
• Stemmers do not use a lexicon; instead, they make use of rewrite rules of the form:
o ier → y (e.g., earlier → early)
o ing → ε (e.g., playing → play)
• Stemming algorithms work in two steps:
(i) Suffix removal: This step removes predefined endings from words.
(ii) Recoding: This step adds predefined endings to the output of the first step.
• Two widely used stemming algorithms have been developed by Lovins (1968) and Porter (1980).
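The two-step scheme can be sketched as a tiny rule-based stemmer. The first two rewrite rules are the ones quoted above; the ies/s rules are illustrative additions, and the rule's replacement string stands in for the recoding step. Real stemmers (Lovins 1968, Porter 1980) use much larger, ordered rule sets.

```python
# Rewrite rules of the form suffix -> replacement.
suffix_rules = [("ier", "y"), ("ing", ""), ("ies", "y"), ("s", "")]

def stem(word):
    # Step (i): suffix removal via the first matching rewrite rule;
    # the replacement doubles as a very small recoding step (ii).
    for suffix, replacement in suffix_rules:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[:-len(suffix)] + replacement
    return word

stem("earlier")  # 'early'
stem("playing")  # 'play'
```

The length check is a crude guard against stemming short words like "sing" down to nothing; production stemmers use measure-based conditions instead.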
Surface Level → p l a y i n g
Lexical Level → p l a y +V +PP
• A morphological generator identifies the base form ("walk") and applies the appropriate suffix to generate different surface forms, like "walked" or "walking".
A finite-state transducer is a 6-tuple (Σ1, Σ2, Q, δ, S, F), where Q is a set of states, S is the initial state,
F ⊆ Q is a set of final states, Σ1 is the input alphabet, Σ2 is the output alphabet, and δ is a function
mapping Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) to 2^Q, i.e., each triple maps to a subset of Q:
δ: Q × (Σ1 ∪ {ε}) × (Σ2 ∪ {ε}) → 2^Q
Thus, an FST is similar to an NFA except that transitions are made on strings rather than on single symbols
and, in addition, they have outputs. FSTs encode regular relations between regular languages, with the
upper language on the top and the lower language on the bottom. For a transducer T and string s, T(s)
represents the set of strings in the relation. FSTs are closed under union, concatenation, composition, and
Kleene closure, but not under intersection or complementation.
Two-level morphology using FSTs involves analyzing surface forms in two steps.
Step1: Words are split into morphemes, considering spelling rules and possible splits (e.g., "boxe + s" or
"box + s").
Step2: The output is a concatenation of stems and affixes, with multiple representations possible for each
word.
We need to build two transducers: one that maps the surface form to the intermediate form and another
that maps the intermediate form to the lexical form.
A transducer maps the surface form "lesser" to its comparative form, where ɛ represents the empty string.
This bi-directional FST can be used for both analysis (surface to base) and generation (base to surface).
• The plural form of regular nouns usually ends with -s or -es (though a word ending in -s is not necessarily a plural – class, miss, bus).
• One of the required translations is the deletion of the 'e' when introducing a morpheme boundary.
o E.g. Boxes, This deletion is usually required for words ending in xes, ses, zes.
• This is done by below transducer – Mapping English nouns to the intermediate form:
Bird+s
Box+e+s
Quiz+e+s
• The next step is to develop a transducer that does the mapping from the intermediate level to the
lexical level. The input to transducer has one of the following forms:
• Regular noun stem, e.g., bird, cat
• Regular noun stem + s, e.g., bird + s
• Singular irregular noun stem, e.g., goose
• Plural irregular noun stem, e.g., geese
• In the first case, the transducer has to map all symbols of the stem to themselves and then output
N and sg.
• In the second case, it has to map all symbols of the stem to themselves, but then output N and
replaces PL with s.
• In the third case, it has to do the same as in the first case.
• Finally, in the fourth case, the transducer has to map the irregular plural noun stem to the
corresponding singular stem (e.g., geese to goose) and then it should add N and PL.
The mapping from State 1 to State 2, 3, or 4 is carried out with the help of a transducer encoding a lexicon.
The transducer implementing the lexicon maps the individual regular and irregular noun stems to their
correct noun stem, replacing labels like regular noun form, etc.
This lexicon maps the surface form geese, which is an irregular noun, to its correct stem goose in the
following way:
g:g e:o e:o s:s e:e
Mapping for the regular surface form of bird is b:b i:i r:r d:d. Representing pairs like a:a with a single
letter, these two representations are reduced to g e:o e:o s e and b i r d respectively.
Composing this transducer with the previous one, we get a single two-level transducer with one input
tape and one output tape. This maps plural nouns into the stem plus the morphological marker + pl and
singular nouns into the stem plus the morpheme + sg. Thus a surface word form birds will be mapped to
bird + N + pl as follows.
b:b i:i r:r d:d ε:+N s:+pl
Each letter maps to itself, while ε maps to the morphological feature +N, and s maps to the morphological
feature +pl. Figure shows the resulting composed transducer.
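The symbol-pair notation can be mimicked with lists of (surface, lexical) pairs. This is a toy sketch of reading the two tapes of such a transduction: the pair lists restate the geese/birds mappings above, while the helper functions are illustrative, not a real FST library.

```python
# Symbol pairs (surface, lexical); 'ε' marks the empty string on a tape.
geese = [("g", "g"), ("e", "o"), ("e", "o"), ("s", "s"), ("e", "e")]
birds = [("b", "b"), ("i", "i"), ("r", "r"), ("d", "d"),
         ("ε", "+N"), ("s", "+pl")]

def surface(pairs):
    """Read the surface (upper) tape, skipping ε."""
    return "".join(s for s, _ in pairs if s != "ε")

def lexical(pairs):
    """Read the lexical (lower) tape, skipping ε."""
    return "".join(l for _, l in pairs if l != "ε")

surface(geese), lexical(geese)  # ('geese', 'goose')
surface(birds), lexical(birds)  # ('birds', 'bird+N+pl')
```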
Typing mistakes: single character omission, insertion, substitution, and reversal are the most common
typing mistakes.
OCR errors: Usually grouped into five classes: substitution, multi-substitution (or framing), space
deletion, space insertion, and failure.
Substitution errors: Caused due to visual similarity such as c→e, 1→l, r→n.
The same is true for multi-substitution, e.g., m→rn.
Failure occurs when the OCR algorithm fails to select a letter with sufficient accuracy.
Solution: These errors can be corrected using 'context' or by using linguistic structures.
Phonetic errors:
• Spelling errors are often phonetic, with incorrect words sounding like correct ones.
• Phonetic errors are harder to correct due to more complex distortions.
• Phonetic variations are common in translation
• Non-word error:
o Word that does not appear in a given lexicon or is not a valid orthographic word form.
o The two main techniques to find non-word errors: n-gram analysis and dictionary lookup.
• Real-word error:
o It occurs due to typographical mistakes or spelling errors.
o E.g. piece for peace or meat for meet.
o May cause local syntactic errors, global syntactic errors, semantic errors, or errors at
discourse or pragmatic levels.
o Impossible to decide that a word is wrong without some contextual information
Spelling correction: consists of detecting and correcting errors. Error detection is the process of finding
misspelled words and error correction is the process of suggesting correct words to a misspelled one.
These sub-problems are addressed in two ways:
Isolated-error detection and correction: Each word is checked separately, independent of its context.
Context dependent error detection and correction methods: Utilize the context of a word. This requires
grammatical analysis and is thus more complex and language dependent. The list of candidate words must
first be obtained using an isolated-word method before making a selection depending on the context.
Minimum edit distance The minimum edit distance between two strings is the minimum number of
operations (insertions, deletions, or substitutions) required to transform one string into another.
Similarity key techniques The basic idea in a similarity key technique is to change a given string into a
key such that similar strings will change into the same key.
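Soundex is the best-known similarity key. The sketch below is a simplified version (it omits the h/w separator rule of full Soundex) but shows the idea: similar-sounding strings collapse to the same key.

```python
def soundex(word):
    """Simplified Soundex: keep the first letter, code the rest as digits."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    key, prev = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")       # vowels and h/w/y get no digit
        if digit and digit != prev:     # collapse adjacent identical codes
            key += digit
        prev = digit
    return (key + "000")[:4]            # pad/truncate to letter + 3 digits

soundex("Robert")  # 'R163'
soundex("Rupert")  # 'R163' -- same key for similar-sounding names
```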
n-gram based techniques n-gram techniques usually require a large corpus or dictionary as training data,
so that an n-gram table of possible combinations of letters can be compiled. In case of real-word error
detection, we calculate the likelihood of one character following another and use this information to find
possible correct word candidates.
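A hedged sketch of letter-bigram analysis for non-word detection: the tiny word list below stands in for the large training dictionary the text mentions, and any word containing an unattested bigram is flagged.

```python
# Toy word list standing in for a large training dictionary (an assumption).
words = ["the", "then", "than", "bird", "birds", "box", "boxes"]
bigrams = {w[i:i + 2] for w in words for i in range(len(w) - 1)}

def unseen_bigrams(word):
    """Return the letter bigrams of `word` never seen in the training data."""
    return [word[i:i + 2] for i in range(len(word) - 1)
            if word[i:i + 2] not in bigrams]

unseen_bigrams("bird")  # [] -- all bigrams attested, word looks plausible
unseen_bigrams("bxrd")  # ['bx', 'xr'] -- flagged as a likely non-word
```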
Neural nets These have the ability to do associative recall based on incomplete and noisy data. They can
be trained to adapt to specific spelling error patterns. Note: They are computationally expensive.
Rule-based techniques In a rule-based technique, a set of rules (heuristics) derived from knowledge of
a common spelling error pattern is used to transform misspelled words into valid words.
The minimum edit distance is the number of insertions, deletions, and substitutions required to change
one string into another.
For example, the minimum edit distance between 'tutor' and 'tumour' is 2: We substitute 'm' for 't' and
insert 'u' before 'r'.
Edit distance can be viewed as a string alignment problem. By aligning two strings, we can measure the
degree to which they match. There may be more than one possible alignment between two strings.
Alignment 1:
t u t o - r
t u m o u r
The best possible alignment corresponds to the minimum edit distance between the strings. The alignment
shown here, between tutor and tumour, has a distance of 2.
A dash in the upper string indicates insertion. A substitution occurs when the two alignment symbols do
not match (shown in bold).
The Levenshtein distance between two sequences is obtained by assigning a unit cost to each operation;
the distance here is therefore 2.
Alignment 2:
Another possible alignment for these sequences is
t u t - o - r
t u - m o u r
which has a cost of 3.
Dynamic programming algorithms can be quite useful for finding minimum edit distance between two
sequences. (table-driven approach to solve problems by combining solutions to sub-problems).
The dynamic programming algorithm for minimum edit distance is implemented by creating an edit
distance matrix.
• This matrix has one row for each symbol in the source string and one column for each symbol in
the target string.
• The (i, j)th cell in this matrix represents the distance between the first i character of the source
and the first j character of the target string.
• Each cell can be computed as a simple function of its surrounding cells. Thus, by starting at the
beginning of the matrix, it is possible to fill each entry iteratively.
• The value in each cell is computed in terms of three possible paths.
• The substitution cost will be 0 if the ith character in the source matches the jth character in the target.
• The minimum edit distance algorithm is shown below.
• How the algorithm computes the minimum edit distance between tutor and tumour is shown in
table.
# t u m o u r
# 0 1 2 3 4 5 6
t 1 0 1 2 3 4 5
u 2 1 0 1 2 3 4
t 3 2 1 1 2 3 4
o 4 3 2 2 1 2 3
r 5 4 3 3 2 2 2
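The algorithm described above can be sketched as a dynamic-programming routine with unit costs; the values it computes for tutor/tumour agree with the table.

```python
def min_edit_distance(source, target):
    """Levenshtein distance via dynamic programming, unit costs."""
    n, m = len(source), len(target)
    # dist[i][j] = distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i                              # i deletions
    for j in range(1, m + 1):
        dist[0][j] = j                              # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + sub)  # substitution
    return dist[n][m]

min_edit_distance("tutor", "tumour")  # 2
```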
Minimum edit distance algorithms are also useful for determining accuracy in speech recognition
systems.
Table shows some of the word classes in English. Lexical categories and their properties vary from
language to language.
Word classes are further categorized as open and closed word classes.
• Open word classes constantly acquire new members while closed word classes do not (or only
infrequently do so).
• Nouns, verbs (except auxiliary verbs), adjectives, adverbs, and interjections are open word
classes.
e.g. computer, happiness, dog, run, think, discover, beautiful, large, happy, quickly, very, easily, oh, wow, ouch
• Prepositions, auxiliary verbs, determiners, pronouns, and conjunctions are closed word classes.
e.g. in, on, under, between, he, she, it, they, the, a, some, this, and, but, or, because