Word Level Analysis
NLP involves different levels and complexities of processing, so one way to
analyse natural language text is by breaking it down into its constituent units.
Regular Expressions:
● Regular expressions are a convenient way of describing word patterns in many
text applications.
● Regular expressions, or regexes for short, are a pattern-matching
standard for string parsing and replacement.
● They are a powerful way to find and replace strings that follow a defined
format.
● We can use regular expressions to parse dates, URLs, email addresses, etc. (see the sketch below).
● They are useful tools in the design of language compilers and have been used in
NLP for tokenization, describing lexicons, morphological analysis, etc.
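As a rough illustration, the sketch below uses Python's re module to pull email addresses, dates, and URLs out of free text; the patterns are simplified assumptions for illustration, not full validators.

import re

text = "Contact alice@example.com by 2024-05-01 or visit https://example.org"

# Simplified patterns (illustrative assumptions, not full RFC validators)
email_re = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")
date_re = re.compile(r"\d{4}-\d{2}-\d{2}")   # ISO-style dates only
url_re = re.compile(r"https?://\S+")

print(email_re.findall(text))   # ['alice@example.com']
print(date_re.findall(text))    # ['2024-05-01']
print(url_re.findall(text))     # ['https://example.org']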
Consider: supernovas?
→ accepts supernova and supernovas; the question mark makes the preceding character
optional.
Consider b*: specifies zero or more occurrences of b (* means zero or more
occurrences), so the empty string matches, as does any number of b's.
Example matching strings: "" (the empty string), b, bb, bbb
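A minimal sketch of these two quantifiers using Python's re module (fullmatch tests whether the whole string fits the pattern):

import re

# '?' makes the preceding character optional
for w in ["supernova", "supernovas", "supernovae"]:
    print(w, bool(re.fullmatch(r"supernovas?", w)))
# supernova True, supernovas True, supernovae False

# '*' matches zero or more occurrences, including the empty string
for s in ["", "b", "bbb", "ba"]:
    print(repr(s), bool(re.fullmatch(r"b*", s)))
# '' True, 'b' True, 'bbb' True, 'ba' False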
Complex regular expressions can be built up from simpler ones by means of regular
expression operators.
● caret (^) → matches at the beginning of the string
● dollar ($) → matches at the end of the string
Example: to search for a line containing exactly 'The nature.' → ^The nature\.$
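A short sketch of the two anchors, again with Python's re module:

import re

pattern = re.compile(r"^The nature\.$")   # ^ anchors the start, $ the end, \. escapes the dot
print(bool(pattern.match("The nature.")))           # True
print(bool(pattern.match("The nature. And more")))  # False: $ forbids trailing text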
Derivation: combines a word stem with a grammatical morpheme to yield a word of a
different class, for example computation (noun) from compute (verb).
Formation of a noun from a verb is called nominalization.
Compounding: the process of merging two or more words to form a new word, e.g. bed + room → bedroom.
In linguistic morphology, inflection (or inflexion) is a process of word formation in
which a word is modified to express different grammatical categories such as tense, case,
voice, aspect, person, number, gender, mood, animacy, and definiteness.
Applications of morphological parsing:
● Spelling correction
● Machine translation
● Information retrieval: morphological analysis helps in identifying the presence of a
query word in a document.
Morphological parsing takes as input the inflected surface form of a word and
produces as output its canonical form (or lemma) together with a set of tags
showing its syntactic category and morphological characteristics,
e.g. part of speech and/or inflectional properties (gender, number, person, tense, etc.).
Morphological generation is the inverse of morphological parsing.
Both analysis and generation rely on two sources of information (sketched below):
1) a dictionary of valid lemmas of the language
2) a set of inflection paradigms.
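A minimal sketch combining both sources, assuming a toy lemma dictionary and one small suffix paradigm (the names and tags are illustrative, not from any standard toolkit):

# Toy lemma dictionary and inflection paradigm (illustrative assumptions)
LEMMAS = {"play": "V", "walk": "V", "bird": "N"}
SUFFIX_TAGS = {"ing": "+PP", "ed": "+PAST", "s": "+PL"}

def parse(surface):
    """Map an inflected surface form to (lemma, tags), or None if unknown."""
    for suffix, tag in SUFFIX_TAGS.items():
        stem = surface[:-len(suffix)]
        if surface.endswith(suffix) and stem in LEMMAS:
            return stem, LEMMAS[stem] + tag
    if surface in LEMMAS:            # uninflected form
        return surface, LEMMAS[surface]
    return None

print(parse("playing"))   # ('play', 'V+PP')
print(parse("birds"))     # ('bird', 'N+PL')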
Morphological parsing uses the following information sources:
● Lexicon: the lexicon lists stems and affixes together with basic information about them.
E.g. stem: play, suffix: -ing → playing
● Morphotactics: deals with the ordering of morphemes; it describes the way morphemes
are arranged or attached to each other.
E.g. happy → un + happy + ness → unhappiness
● Orthographic rules: spelling rules that specify the changes that occur when two given
morphemes combine (see the sketch after this list).
Example: the change from easy to easier, and not easyer.
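A minimal sketch of one such spelling rule, assuming a simple "consonant + y becomes i before a suffix" convention (purely illustrative):

VOWELS = set("aeiou")

def attach_suffix(stem, suffix):
    """Toy orthographic rule: a consonant + 'y' ending becomes 'i' before a suffix."""
    if len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS:
        stem = stem[:-1] + "i"
    return stem + suffix

print(attach_suffix("easy", "er"))     # easier (not easyer)
print(attach_suffix("happy", "ness"))  # happiness
print(attach_suffix("play", "ing"))    # playing (rule does not fire after a vowel)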
Morphological analysis can be avoided if an exhaustive lexicon is available that lists
the features of all word forms of all roots.
Limitations of this approach:
● Heavy demand on memory.
● An exhaustive lexicon fails to show the relationship between different roots having
similar word forms.
● For morphologically complex languages, the number of possible word forms may be
theoretically infinite.
These limitations make morphological parsing necessary.
Stemmers are the simplest morphological systems, but stemmers are not perfect.
A stemming algorithm works in two steps (sketched below):
1. Suffix removal: removes predefined suffixes.
2. Recoding: adds a precoded ending to the output of the first step.
Two widely used stemming algorithms are the Lovins stemmer and the Porter stemmer.
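A minimal sketch of this two-step scheme (the suffix list and recoding table are tiny illustrative assumptions, not the actual Lovins or Porter rules):

# Step 1: predefined suffixes to strip (longest listed first; illustrative)
SUFFIXES = ["ational", "ization", "ing", "ed", "s"]
# Step 2: recoding rules applied to the stripped stem (illustrative)
RECODE = {"comput": "compute", "retriev": "retrieve"}

def stem(word):
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            word = word[:-len(suf)]    # suffix removal
            break
    return RECODE.get(word, word)      # recoding

print(stem("computing"))   # compute
print(stem("retrieved"))   # retrieve
print(stem("cats"))        # cat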
The two-level morphological model proposed by Koskenniemi can be used for highly inflected
languages.
Morphological parsing can be viewed as a mapping from the surface level into morpheme and
feature sequences on the lexical level.
playing → surface form
play +V +PP → lexical form
(stem) (morphological information telling us it is the present participle form of a verb)
Surface level → the actual word
Lexical level → the concatenation of its constituent morphemes
Step 1 maps the surface level to an intermediate level (see the sketch below):
● Plural forms of regular nouns end with s or es (though certain words, like miss,
end with s but are not plural).
Example: buses, prizes, foxes
● Consider boxes: its stem is box and it ends with es.
● In such a case the 'e' should be deleted when introducing the morpheme
boundary.
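A minimal sketch of this surface-to-intermediate step, assuming ^ marks a morpheme boundary and # the word end (a simplification of the usual two-level treatment):

def to_intermediate(surface):
    """Toy surface-to-intermediate mapping for regular plural nouns."""
    for stem_end in ("x", "s", "z", "ch", "sh"):
        if surface.endswith(stem_end + "es"):
            return surface[:-2] + "^s#"   # delete the inserted 'e' at the boundary
    if surface.endswith("s"):
        return surface[:-1] + "^s#"
    return surface + "#"

print(to_intermediate("boxes"))  # box^s#
print(to_intermediate("birds"))  # bird^s#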
Step 2: develop a transducer that does the mapping from the intermediate level to the lexical
level.
● Regular noun → bird → outputs +N and, reading nothing further, marks it as singular (+SG)
● Irregular singular noun → goose → outputs +N and, reading nothing further, marks it as singular (+SG)
● Irregular plural noun → geese → outputs +N and, reading nothing further, marks it as plural (+PL)
● The mapping from state 1 to state 4 is carried out with the help of a transducer encoding a
lexicon, which maps individual regular and irregular noun stems to their correct noun
stem, replacing labels like the irregular plural form (geese → goose).
Single two-level transducer: the two steps can be combined into one two-level transducer (roughly illustrated below).
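As a rough illustration of what the lexicon transducer computes, a plain dictionary can stand in for its states and arcs (the entries are illustrative assumptions):

# Toy stand-in for the lexicon transducer: intermediate form -> lexical form
NOUN_LEXICON = {
    "bird#":   "bird+N+SG",
    "bird^s#": "bird+N+PL",
    "goose#":  "goose+N+SG",
    "geese#":  "goose+N+PL",   # irregular plural mapped back to its stem
}

def to_lexical(intermediate):
    return NOUN_LEXICON.get(intermediate, "unknown")

print(to_lexical("geese#"))    # goose+N+PL
print(to_lexical("bird^s#"))   # bird+N+PL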
The chances of generating
● single-character omission
● insertion
● substitution
errors are higher with OCR: such automatic reading devices introduce all of the above
errors, but not reversal (transposition).
OCR errors are grouped into five categories:
● substitution
● multiple substitution
● space deletion
● insertion
● failures
Substitution errors are mostly due to visual similarity (c → e, 1 → l, etc.).
These errors can be corrected using context or linguistic structure.
Spelling errors are mainly phonetic: the misspelled word is pronounced the same
way as the correct word, which makes such errors hard to set right.
Two categories of spelling errors:
● Non-word errors (the erroneous string does not appear in the lexicon or is not a valid
orthographic word form); these are handled well by current systems.
● Real-word errors. Two main solution techniques: n-grams and dictionary lookup.
Context-dependent error detection and correction methods use the context; they therefore
require grammatical analysis and are language dependent.
● Even with context-dependent methods, a list of candidate words must first be identified
using an isolated-word method before a selection is made based on context.
Various spelling correction algorithms:
1. Minimum edit distance: the minimum number of operations required to transform one string
into another (the most widely used technique).
Source: "kitten"
Target: "sitting"
To transform "kitten" into "sitting", we can follow these steps: substitute k → s (sitten),
substitute e → i (sittin), and insert g (sitting), giving a minimum edit distance of 3.
2. Similarity key techniques: change the given string into a key such that similar strings
change into the same key (see the sketch below).
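A classic similarity key of this kind is Soundex; the sketch below is a simplified version (special cases of the full standard, such as h and w, are omitted):

# Soundex-style digit classes for consonants
CODES = {c: d for d, letters in enumerate(
    ["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1) for c in letters}

def soundex(word):
    word = word.lower()
    key = word[0].upper()
    prev = CODES.get(word[0], 0)
    for c in word[1:]:
        code = CODES.get(c, 0)
        if code and code != prev:    # skip vowels and repeated codes
            key += str(code)
        prev = code
    return (key + "000")[:4]         # pad or truncate to 4 characters

print(soundex("Robert"), soundex("Rupert"))  # R163 R163: similar strings share a key
print(soundex("smith"), soundex("smyth"))    # S530 S530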
3. N-gram based techniques: can be used for both non-word and real-word error
detection.
Non-word errors: in English, certain letter bigrams and trigrams occur rarely or never,
for example the trigram qst and the bigram qd (a lexicon can be used to derive the valid
n-grams; see the sketch below).
Real-word errors: a corpus is used, together with the likelihood of letter occurrences,
to find possible correct words.
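A minimal sketch of non-word detection via letter bigrams, assuming the valid bigrams are harvested from a lexicon (the tiny word list is illustrative):

# Harvest the letter bigrams attested in a (tiny, illustrative) lexicon
LEXICON = ["quest", "squad", "question", "quick", "band", "hand"]
VALID_BIGRAMS = {w[i:i+2] for w in LEXICON for i in range(len(w) - 1)}

def suspicious_bigrams(word):
    """Return the letter bigrams of `word` never seen in the lexicon."""
    return [word[i:i+2] for i in range(len(word) - 1)
            if word[i:i+2] not in VALID_BIGRAMS]

print(suspicious_bigrams("qd"))     # ['qd']: likely a non-word
print(suspicious_bigrams("quest"))  # []: all bigrams attested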
4. Neural nets: these have the ability to do associative recall based on
incomplete and noisy data. They can be trained to adapt to specific
spelling error patterns.
Drawback: they are computationally expensive.
5. Rule-based techniques: a set of rules derived from knowledge
of common spelling error patterns is used to transform
misspelled words into valid words.
Example: the many errors generated by typing ue
instead of eu can be captured as a rule.
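A minimal sketch of such a rule table, checking each rewrite against a lexicon (the rules and words are illustrative assumptions):

# Rewrite rules derived from common typing-error patterns (illustrative)
RULES = [("ue", "eu"), ("teh", "the"), ("recieve", "receive")]
LEXICON = {"amateur", "the", "receive", "queue"}

def rule_based_correct(word):
    """Apply each rule and keep the rewrites that land in the lexicon."""
    candidates = set()
    for wrong, right in RULES:
        if wrong in word:
            fixed = word.replace(wrong, right)
            if fixed in LEXICON:
                candidates.add(fixed)
    return candidates

print(rule_based_correct("amatuer"))   # {'amateur'}
print(rule_based_correct("recieve"))   # {'receive'}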
Minimum Edit Distance:
The number of insertions, deletions and substitutions required to change one string into
another; the minimum such number is the minimum edit distance.
Example: consider "tutor" and "tumour".
t u t o _ r
t u m o u r
Here t → m is a substitution and u is an inserted character, so the cost of this alignment is 2.
Edit distance can be viewed as a string alignment problem, and there can be more than one
possible alignment; the edit distance is the minimum cost over all alignments, which in
this case is 2.
● The edit distance between two strings can be represented as a binary
function, ed, which maps two strings to their edit distance.
● ed is symmetric: for any strings s and t, ed(s, t) = ed(t, s).
● A dynamic programming approach can be used to compute the edit distance
between two sequences.
● Dynamic programming refers to a class of algorithms that apply a
table-driven approach, solving problems by combining the solutions to
subproblems.
Dynamic programming algorithm for minimum edit distance between two strings:
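A standard table-driven sketch in Python, assuming unit cost for insertion, deletion, and substitution:

def min_edit_distance(source, target):
    """Levenshtein distance via a (len(source)+1) x (len(target)+1) DP table."""
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                     # delete all of source's first i chars
    for j in range(n + 1):
        d[0][j] = j                     # insert all of target's first j chars
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if source[i-1] == target[j-1] else 1
            d[i][j] = min(d[i-1][j] + 1,        # deletion
                          d[i][j-1] + 1,        # insertion
                          d[i-1][j-1] + sub)    # substitution or match
    return d[m][n]

print(min_edit_distance("kitten", "sitting"))  # 3
print(min_edit_distance("tutor", "tumour"))    # 2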