
Word Level Analysis

NLP involves different levels and complexities of processing. One way to
analyse natural language text is by breaking it down into its constituent units.

Regular Expressions:
● Regular expressions are a convenient way of describing words in many
text applications.
● Regular expressions, or regexes for short, are a pattern-matching
standard for string parsing and replacement.
● They are a powerful way to find and replace strings that match a defined
format.
● We can use regular expressions to parse dates, URLs, email addresses, etc.

● They are useful tools in the design of language compilers and have been used in
NLP for tokenization, describing lexicons, morphological analysis, etc.

● A regular expression is an algebraic formula whose value is a pattern consisting
of a set of strings, called the language of the expression.

● Regular expressions (regex or regexp) are extremely useful for extracting information
from any text by searching for one or more matches of a specific search pattern (i.e. a
specific sequence of ASCII or Unicode characters).

● Fields of application range from validation to parsing/replacing strings,
passing through translating data to other formats and web scraping.
/a/ denotes the set containing the string 'a'; it matches all occurrences of 'a'.
/supernova/ denotes the set containing the string 'supernova' and nothing else.
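A minimal sketch of such pattern matching with Python's re module; the sample text and the date/email patterns are illustrative assumptions, not patterns from the slides.

import re

text = "A supernova was observed on 2023-04-01; contact astro@example.org."

# Literal pattern: matches every occurrence of 'supernova'
print(re.findall(r"supernova", text))                   # ['supernova']

# Simple (illustrative) patterns for a date and an email address
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))           # ['2023-04-01']
print(re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]+", text))  # ['astro@example.org']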
Character classes
Characters are grouped by putting them between square brackets.
● [abc] will match any one of a, b, or c.
● [5-9] is used to specify a range, so any digit between 5 and 9 is matched.
● [^x] matches any character except x (only if ^ appears as the first symbol inside the
brackets; otherwise it is just the caret symbol).
● Regular expressions are case sensitive.
[s]ana matches only sana and not Sana
→ to resolve this, [sS]ana can be used.
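A small illustrative check of these character classes with Python's re module (the sample strings are assumptions chosen for demonstration):

import re

print(re.findall(r"[abc]", "cab rides"))         # ['c', 'a', 'b']
print(re.findall(r"[5-9]", "room 4821, 57"))     # ['8', '5', '7']
print(re.findall(r"[^x]", "xox"))                # ['o']  (every character except x)
print(re.findall(r"[sS]ana", "sana and Sana"))   # ['sana', 'Sana']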

Consider: supernovas?
→ accepts supernova and supernovas; the question mark makes the previous character
optional.

Consider b*: specifies zero or more occurrences of b (* specifies zero or more
occurrences), hence the empty string is also matched, as is any number of b's.
Example of matching strings: "" (empty string), b, bb, bbb
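A brief sketch of the ? and * operators in Python (the sample strings are assumptions):

import re

# ? makes the preceding character optional
print(re.findall(r"supernovas?", "one supernova, two supernovas"))
# ['supernova', 'supernovas']

# * matches zero or more of the preceding character; fullmatch shows that
# the empty string is also accepted
print(bool(re.fullmatch(r"b*", "")))     # True
print(bool(re.fullmatch(r"b*", "bbb")))  # True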
Complex regular expressions can be built up from simpler ones by means of regular
expression operators.
● caret (^) → used to match at the beginning of the string
● dollar ($) → used to match at the end of the string
Example: to search for a line containing exactly 'The nature.' → ^The nature\.$

( . ) is the wildcard character; it matches any single character, so it can be used for
counting characters.
/...berry/ matches any three characters followed by 'berry'. Ex: twoberry

To apply the disjunction (|) operator to a specific pattern, we need to enclose it in parentheses.
Example: to match blackberry or blackberries, if parentheses are not used, black berry|berries
matches either 'black berry' or just 'berries';
black(berry|berries) is the right one.
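An illustrative sketch of the anchors, the wildcard and disjunction in Python (the test strings are assumptions):

import re

print(bool(re.search(r"^The nature\.$", "The nature.")))  # True: the whole line matches
print(re.findall(r"...berry", "twoberry strawberry"))     # ['twoberry', 'rawberry']
print(re.findall(r"black(?:berry|berries)", "blackberry and blackberries"))
# ['blackberry', 'blackberries']  ((?:...) groups without capturing)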
Special characters used in regular expressions
Finite state Automata
A finite automaton has the following properties:
1. A finite set of states, one of which is designated as the initial or start
state, and one or more of which are designated as final states.
2. A finite alphabet set, 𝚺, containing input symbols.
3. A finite set of transitions that specify, for each state and each symbol of the input
alphabet, the state to which it next goes.

A finite automaton can be deterministic or non-deterministic.
In the case of a non-deterministic automaton, multiple transitions are possible for the
same input.
● DFA and NFA are both mathematical models used to recognize patterns in strings of
symbols; a DFA is deterministic and follows a single path for each input, while an NFA is
non-deterministic and can explore multiple paths simultaneously.
Finite state automata have been used in a wide variety of areas
including linguistics, electrical engineering, computer science,
mathematics and logic.
They are important to all of computational linguistics and have been
used as a mathematical device to implement regular expressions.
A deterministic finite state automaton (DFA) is defined as a 5-tuple (𝚺, Q, ẟ, S, F), where
𝚺 → the set of input symbols
Q → the set of states
ẟ → the transition function
S → the start state
F → the set of final states

Any regular expression can be represented by a finite automaton, and
the language of any finite automaton can be described by a regular
expression.
For the automaton shown in the figure, 'ac' is a string that is not accepted.
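A minimal sketch of a DFA simulator in Python; the toy automaton here (its states, alphabet and transitions) is an assumed example accepting strings of a's followed by at least one b, not the automaton in the slides' figure.

# Toy DFA accepting the language a*b+ (illustrative assumption)
start = "q0"
finals = {"q1"}
delta = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "b"): "q1",
}

def accepts(s: str) -> bool:
    # Follow the transition function; reject if no transition is defined
    state = start
    for ch in s:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in finals

print(accepts("aab"))  # True
print(accepts("ac"))   # False: 'ac' is not accepted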
Morphological parsing
Morphology is a sub-discipline of linguistics. It is the study of word structure and the
formation of words from smaller units (morphemes).
Parsing: taking an input and producing some kind of structure out of it.

Understanding morphology is important for understanding the syntactic and semantic
properties of words.
● Morphemes are the smallest meaning-bearing units in a language.
Example: bread contains a single morpheme.
Eggs contains two morphemes:
the morpheme egg and the morpheme -s.
● A morphological parser should be able to tell us that the word eggs is the
plural form of egg.
Two broad classes of morphemes:

● Stem (the main morpheme)
● Affixes (modify the meaning of the stem)
Affixes can be prefixes, suffixes, infixes and circumfixes.
❖ prefix → a morpheme which appears before the stem (ex: unable)
❖ suffix → a morpheme which appears after the stem (ex: respectful)
❖ infix → a morpheme which appears inside the stem (ex: geese, choose)
❖ circumfix → morphemes that appear both before and after the stem (ex:
inhumanity)
Three main ways of word formation
● Inflection
● Derivation
● Compounding
Inflection: a root word is combined with a grammatical morpheme to yield a word of the same
class (example: pass - passed - passing).

Derivation: combines a word stem with a grammatical morpheme to yield a word of a different
class, example: computation (noun) from compute (verb).
Formation of a noun from a verb is called nominalization.
Compounding: the process of merging two or more words to form a new word, e.g. bed + room
→ bedroom.
In linguistic morphology, inflection (or inflexion) is a process of word formation in
which a word is modified to express different grammatical categories such as tense, case,
voice, aspect, person, number, gender, mood, animacy, and definiteness.
Applications of Morphological parsing
● Spelling correction
● Machine translation
● In information retrieval, morphological analysis helps in identifying the presence of a
query word in a document.
Morphological parsing takes as input the inflected surface form of a word and
produces output consisting of the canonical form (or lemma) of the word and a set of tags
showing its syntactic category and morphological characteristics,
e.g. part of speech and/or inflectional properties (gender, number, person, tense, etc.).
Morphological generation is the inverse of morphological parsing.
Both analysis and generation rely on two sources of information:
1) A dictionary of valid lemmas of the language
2) A set of inflection paradigms.
Morphological parsing uses the following information sources:
● Lexicon: the lexicon lists stems and affixes together with basic information about them.
Eg: stem play + suffix -ing → playing
● Morphotactics: deals with the ordering of morphemes. It describes the way morphemes
are arranged or attach to each other.
Eg: happy → unhappiness
● Orthographic rules: spelling rules that specify the changes that occur when two given
morphemes combine.
Example: easy changes to easier and not easyer.
Morphological analysis can be avoided if an exhaustive lexicon is available that lists the
features of all word forms of all roots.
Limitations of this approach:
● Heavy demand on memory.
● An exhaustive lexicon fails to show the relationship between different roots having
similar word-forms.
● For morphologically complex languages, the number of possible word-forms may be
theoretically infinite.
These limitations make morphological parsing necessary.
Stemmers are the simplest morphological systems, but stemmers are not perfect.
A stemming algorithm works in two steps:
1. Suffix removal: removes predefined suffixes.
2. Recoding: adds a predefined ending to the output of the first step.
Two widely used stemming algorithms are the Lovins stemmer and the Porter stemmer.
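A minimal sketch of the two-step suffix-removal-plus-recoding idea described above; the tiny suffix list and recoding table are illustrative assumptions, far simpler than the actual Lovins or Porter rules.

# Illustrative two-step stemmer: (1) strip a known suffix, (2) recode the ending.
SUFFIXES = ["ing", "es", "ed", "s"]   # assumed, tiny suffix list
RECODE = {"studi": "study"}           # assumed recoding table

def stem(word: str) -> str:
    word = word.lower()
    # Step 1: suffix removal
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            word = word[: -len(suf)]
            break
    # Step 2: recoding (rewrite the ending if a rule applies)
    return RECODE.get(word, word)

print(stem("playing"))   # play
print(stem("boxes"))     # box
print(stem("studies"))   # study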

The two-level morphological model proposed by Koskenniemi can be used for highly inflected
languages.
Morphological parsing can be viewed as a mapping from the surface level into morpheme and
feature sequences on the lexical level.
Playing → surface form
Play + V + PP → lexical form
(stem) (morphological information that tells us it is the present participle form of the verb)
Surface level → the actual word
Lexical level → the concatenation of its constituent morphemes

● This model is usually implemented with a kind of finite-state automaton called a finite-
state transducer (FST). A transducer maps one set of symbols to
another.
● An FST can be thought of as a two-tape automaton, which recognizes or
generates a pair of strings.
● An FST passes over the input string, consuming the input symbols on the tape
it traverses and producing output symbols.

A finite-state transducer is a 6-tuple (Q, 𝚺1, 𝚺2, ẟ, S, F), where
Q → the set of states the FST can be in
𝚺1 → the input alphabet
𝚺2 → the output alphabet
ẟ → the transition function, mapping Q × (𝚺1 ∪ {ε}) to a subset of Q
S → the start state
F → the set of final states
An FST is similar to an NFA except that transitions are made on strings
rather than on single symbols and, in addition, they have outputs.

The figure shows a simple transducer that accepts two input strings.
FSAs encode regular languages, whereas FSTs encode regular relations.
Getting from the surface form of a word to its morphological analysis is done in
two steps:
1. Split the word into its possible components (this step considers
spelling rules).
Example: bird+s → where + indicates the morpheme boundary.
There are two possible ways of splitting up boxes:
● boxe+s (assumes box as the stem, with e included due to a
spelling rule)
● box+s (assumes box as the stem and s as the suffix)
The output of this step is a concatenation of stems and affixes.
2. The lexicon is used to look up the categories of the stems and the meaning of
the affixes.
So, box+s is mapped to box+N+PL.

In the case of the split boxe+s, through the lexicon we will be able to find out that boxe is
not a legal word.
For spelling variations like spouses and parses, orthographic rules are
used.

Example of an orthographic rule:
Insert e after -s, -z, -x, -ch, -sh and before the s (box → boxes, dish → dishes).

Each of these steps can be implemented using transducers.
Hence we require two transducers:
1. One that maps the surface form to the intermediate form.
2. One that maps the intermediate form to the lexical form.
Transducers can be either deterministic or non-deterministic.
Every NFA has an equivalent DFA, but not every non-deterministic
transducer has an equivalent deterministic transducer.
FST-based morphological parsing for singular and plural nouns in English

● The plural forms of regular words end with s or es (certain words like miss
end with s but are not plurals).
Example: buses, prizes, foxes
● Consider boxes: the stem of this word is box and it ends with es.
● In such a case, 'e' should be deleted when introducing the morpheme
boundary.
Step 2: develop a transducer that does the mapping from the intermediate level to the lexical
level.

● Regular noun form → bird → Noun; reads nothing further and indicates it as singular.
● Irregular singular noun → goose → Noun; reads nothing further and indicates it as singular.
● Irregular plural noun → geese → Noun; reads nothing further and indicates it as plural (PL).
● The mapping from state 1 to state 4 is carried out with the help of a transducer encoding a
lexicon, which maps individual regular and irregular noun stems to their correct noun
stem, replacing irregular forms with the regular stem (geese to goose).
Single two-level transducer

The same transducer can be used for both analysis and generation.
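A minimal sketch of the surface-to-lexical mapping, using a lookup table rather than a real FST; the tiny lexicon and the tag names (+N, +SG, +PL) follow the slides' examples, while everything else is an assumption.

# Illustrative two-level mapping: surface form <-> lexical form.
# A real implementation would compose finite-state transducers; here a
# dictionary stands in for the transducer.
ANALYSES = {
    "bird":  "bird+N+SG",
    "birds": "bird+N+PL",
    "box":   "box+N+SG",
    "boxes": "box+N+PL",    # orthographic rule: e inserted after x before s
    "goose": "goose+N+SG",
    "geese": "goose+N+PL",  # irregular plural mapped to its stem
}
GENERATION = {lexical: surface for surface, lexical in ANALYSES.items()}

def analyse(surface: str) -> str:
    return ANALYSES.get(surface, "unknown")

def generate(lexical: str) -> str:
    return GENERATION.get(lexical, "unknown")

print(analyse("geese"))        # goose+N+PL
print(generate("box+N+PL"))    # boxes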


Spelling Error Detection and Correction
Sources of spelling errors:
● Single character omission (charcter instead of character)
● Insertion (errorn instead of error)
● Substitution (errpr instead of error)
● Reversal (transposition) (aer instead of are)

OCR and similar automatic reading devices are more likely to generate
● single character omission,
● insertion, and
● substitution errors,
but they do not introduce reversal errors.
OCR errors are grouped into 5 categories:
● Substitution
● Multiple substitution
● Space deletion
● Insertion
● Failures
Substitution errors are mainly due to visual similarity (c → e, 1 → l, etc.).
These errors can be corrected using context or linguistic structures.

Many spelling errors are phonetic: the misspelled word is pronounced in the same
way as the correct word; such errors are hard to set right.
Two categories of spelling errors:
● Non-word errors (the erroneous string does not appear in the lexicon, i.e. it is not a valid
orthographic word form); these are the errors most commonly resolved.
● Real-word errors. Solution: two main techniques, n-gram analysis and dictionary lookup.

Spelling correction consists of detecting and correcting errors.

● Error detection: the process of finding misspelled words.
● Error correction: the process of suggesting correct words for a misspelled word.

These subproblems are addressed in two ways:
1. Isolated error detection and correction (each word is checked independently of its context).
2. Context-dependent error detection and correction.
A simple way of correcting spelling errors is to make use of a lexicon.
The problems associated with this are:
● The existence of a lexicon containing all correct words: such a lexicon would take a long
time to compile and would occupy a lot of space.
● In the case of highly productive languages, it is impossible to list all the correct words of
such languages.
● A spelling error may itself be a word that is available in the lexicon (this is called a real-
word error, e.g. when theses is written in place of these).
● The larger the lexicon, the more chances there are of an error going undetected.

Context-dependent error detection and correction methods use the context and hence
require grammatical analysis, which makes them language dependent.
● Even in the case of context-dependent methods, a list of candidate words has to be identified
first using an isolated-word method before making the selection based on context.
Various spelling correction algorithms:
1. Minimum edit distance: the minimum number of operations required to transform one string
into another (the most widely used technique).
Source: "kitten"
Target: "sitting"
To transform "kitten" into "sitting", we can follow these steps: substitute k with s (sitten),
substitute e with i (sittin), and insert g at the end (sitting), giving a minimum edit
distance of 3.
2. Similarity key techniques: change the given string into a key such that similar strings
map to the same key.
3. N-gram based techniques: can be used for both non-word and real-word error
detection.
Non-word errors: in English, certain bigrams and trigrams of letters rarely or never
occur, for example the trigram qst and the bigram qd (a lexicon can be used).
Real-word errors: a corpus is used, with the likelihood of letter occurrences, to find
possible correct words.
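A rough sketch of non-word error detection with letter bigrams; the set of valid bigrams would normally be derived from a lexicon or corpus, so the tiny table here is purely an assumption for illustration.

# Flag a word as a likely non-word if it contains a letter bigram that never
# occurs in the (assumed) set of valid English bigrams.
VALID_BIGRAMS = {"st", "tr", "ri", "in", "ng", "qu", "ue", "th", "he", "er"}

def has_invalid_bigram(word: str) -> bool:
    word = word.lower()
    return any(word[i:i+2] not in VALID_BIGRAMS for i in range(len(word) - 1))

print(has_invalid_bigram("string"))  # False: every bigram is attested
print(has_invalid_bigram("qd"))      # True: 'qd' never occurs in English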
4. Neural nets: these have the ability to do associative recall based on
incomplete and noisy data. They can be trained to adapt to specific
spelling error patterns.
Drawback: they are computationally expensive.
5. Rule-based techniques: a set of rules derived from knowledge
of common spelling error patterns is used to transform
misspelled words into valid words.
Example: many errors are generated by typing mistakes such as ue
typed instead of eu; this can be written as a rule.
Minimum Edit Distance:
The number of insertions, deletions and substitutions required to change one string into
another; the smallest such number is the minimum edit distance.
Example: consider "tutor" and "tumour"
t u t o _ r
t u m o u r (substitute t → m, insert u), so the distance is 2.
Edit distance can be viewed as a string alignment problem.
There can be more than one possible alignment, for example:
t u t _ o _ r
t u _ m o u r (one deletion and two insertions, cost 3); in this case
we consider the minimum distance over all alignments, which is 2.
● The edit distance between two strings can be represented as a binary
function, ed, which maps two strings to their edit distance.
● ed is symmetric: for any strings s and t, ed(s,t) = ed(t,s).
● A dynamic programming approach can be used to compute the edit distance
between two sequences.
● Dynamic programming refers to a class of algorithms that apply a
table-driven approach to solve problems by combining solutions to
subproblems.
Dynamic programming algorithm for Minimum Edit Distance

● The dynamic programming algorithm is implemented by creating an edit distance
matrix.
● The matrix has one row for each symbol in the source string and one column for each
symbol in the target string.
● The (i,j)-th cell in the matrix represents the distance between the first i characters of the
source string and the first j characters of the target string.
● Each cell can be computed as a simple function of its surrounding cells.
● The value in each cell is computed in terms of three possible paths: deletion, insertion,
and substitution.
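A minimal sketch of the dynamic programming computation, assuming unit costs for insertion, deletion, and substitution (as in the Levenshtein distance):

def min_edit_distance(source: str, target: str) -> int:
    # Fill an (len(source)+1) x (len(target)+1) distance matrix bottom-up.
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                               # i deletions
    for j in range(1, m + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + sub_cost)  # substitution / match
    return d[n][m]

print(min_edit_distance("kitten", "sitting"))  # 3
print(min_edit_distance("tutor", "tumour"))    # 2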
Words and Word Classes
● Words are classified into categories called parts of speech. These are also
called word classes or lexical categories.
● Lexical categories are defined by their syntactic and morphological
behaviour.
● The most common lexical categories are nouns and verbs.
● Lexical categories vary from language to language.
● Word classes are categorized as open and closed word classes.

Open word classes: nouns, verbs, adjectives, adverbs, etc.
Closed word classes: prepositions, determiners, pronouns, conjunctions, interjections, etc.
Parts of speech tagging
● It is the process of assigning a part of speech to each word in a sentence.

Input: a sequence of words in natural language and a specified tag set → tagging algorithm
Tagging algorithm → part of speech for each word (output)

● In tagging we try to determine the correct lexical category of each word in a
sentence. However, no tagger is efficient enough to identify the correct lexical
category of every word in a sentence in every case.
● Tag set: the collection of tags used by a particular tagger.
● Tag sets differ in how they define categories and how finely they divide words into
categories. In a certain tagset, both eat and eats are assigned the same tag, verb;
other tagsets might assign them distinct tags.
Most tagsets capture morpho-syntactic information such as singular/plural, number, gender,
tense, etc.
● The Penn Treebank tag set contains 45 tags.
● The C7 tagset uses 164 tags.
● TOSCA-ICE uses 270 tags.
● TESS uses 200 tags.
● The larger the tagset, the more complicated the task of tagging becomes and the more
manual correction it requires. A bigger tagset can be used for a morphologically rich language.
● POS tagging is an early stage of text processing in many NLP applications including
speech synthesis, machine translation, information retrieval and information extraction.
● Tagging is not as complex as parsing (a complete parse tree is not built).
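A brief example of POS tagging with NLTK, which uses the Penn Treebank tag set; this assumes NLTK is installed and its tokenizer and tagger models have been downloaded.

import nltk

# One-time downloads (tokenizer and pre-trained tagger models)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The birds eat seeds in the garden."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('birds', 'NNS'), ('eat', 'VBP'), ('seeds', 'NNS'),
#       ('in', 'IN'), ('the', 'DT'), ('garden', 'NN'), ('.', '.')]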
