
Word Level Analysis

NLP involves different levels and complexities of processing. One way to
analyse natural language text is by breaking it down into its constituent units.

Regular Expressions:
● Regular expressions are a convenient way of describing words in many
text applications.
● Regular expressions, or regexes for short, are a pattern-matching
standard for string parsing and replacement.
● They are a powerful way to find and replace strings that match a defined
format.
● We can use regular expressions to parse dates, URLs, email addresses, etc.

● They are useful tools in the design of language compilers and have been used in
NLP for tokenization, describing lexicons, morphological analysis, etc.

● A regular expression is an algebraic formula whose value is a pattern consisting
of a set of strings, called the language of the expression.

● Regular expressions (regex or regexp) are extremely useful for extracting information
from any text by searching for one or more matches of a specific search pattern (i.e. a
specific sequence of ASCII or Unicode characters).

● Fields of application range from validation to parsing/replacing strings,
passing through translating data to other formats and web scraping.
/a/ denotes the set containing the string 'a'; it matches all occurrences of 'a'.
/supernova/ denotes the set containing the string 'supernova' and nothing else.
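A minimal sketch of such pattern matching with Python's re module; the sample text and the date/email patterns are illustrative assumptions, not patterns from the slides.

import re

text = "A supernova was observed on 2023-04-01; contact astro@example.org."

# Literal pattern: matches every occurrence of 'supernova'
print(re.findall(r"supernova", text))                   # ['supernova']

# Simple (illustrative) patterns for a date and an email address
print(re.findall(r"\d{4}-\d{2}-\d{2}", text))           # ['2023-04-01']
print(re.findall(r"[\w.+-]+@[\w-]+\.[A-Za-z]+", text))  # ['astro@example.org']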
Character classes
Characters are grouped by putting them between square brackets.
● [abc] will match any one of a, b, or c.
● [5-9] is used to specify a range, so any digit between 5 and 9 is matched.
● [^x] matches any character except x (only if ^ appears as the first symbol inside the
brackets; otherwise it is just the caret symbol).
● Regular expressions are case sensitive.
[s]ana matches only sana and not Sana
→ to resolve this, [sS]ana can be used.
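A small illustrative check of these character classes with Python's re module (the sample strings are assumptions chosen for demonstration):

import re

print(re.findall(r"[abc]", "cab rides"))         # ['c', 'a', 'b']
print(re.findall(r"[5-9]", "room 4821, 57"))     # ['8', '5', '7']
print(re.findall(r"[^x]", "xox"))                # ['o']  (every character except x)
print(re.findall(r"[sS]ana", "sana and Sana"))   # ['sana', 'Sana']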

Consider: supernovas?
→ accepts supernova and supernovas; the question mark makes the previous character
optional.

Consider b*: specifies zero or more occurrences of b (* specifies zero or more
occurrences), hence the empty string is also matched, as is any number of b's.
Example of matching strings: "" (empty string), b, bb, bbb
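A brief sketch of the ? and * operators in Python (the sample strings are assumptions):

import re

# ? makes the preceding character optional
print(re.findall(r"supernovas?", "one supernova, two supernovas"))
# ['supernova', 'supernovas']

# * matches zero or more of the preceding character; fullmatch shows that
# the empty string is also accepted
print(bool(re.fullmatch(r"b*", "")))     # True
print(bool(re.fullmatch(r"b*", "bbb")))  # True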
Complex regular expressions can be built up from simpler ones by means of regular
expression operators.
● caret (^) → used to match at the beginning of the string
● dollar ($) → used to match at the end of the string
Example: to search for a line containing exactly 'The nature.' → ^The nature\.$

( . ) is the wildcard character; it matches any single character, so it can be used for
counting characters.
/...berry/ matches any three characters followed by 'berry'. Ex: twoberry

To apply the disjunction (|) operator to a specific pattern, we need to enclose it in parentheses.
Example: to match blackberry or blackberries, if parentheses are not used, black berry|berries
matches either 'black berry' or just 'berries';
black(berry|berries) is the right one.
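An illustrative sketch of the anchors, the wildcard and disjunction in Python (the test strings are assumptions):

import re

print(bool(re.search(r"^The nature\.$", "The nature.")))  # True: the whole line matches
print(re.findall(r"...berry", "twoberry strawberry"))     # ['twoberry', 'rawberry']
print(re.findall(r"black(?:berry|berries)", "blackberry and blackberries"))
# ['blackberry', 'blackberries']  ((?:...) groups without capturing)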
Special characters used in regular expressions
Finite state Automata
A finite automaton has the following properties:
1. A finite set of states, one of which is designated as the initial or start
state, and one or more of which are designated as final states.
2. A finite alphabet set, 𝚺, containing input symbols.
3. A finite set of transitions that specify, for each state and each symbol of the input
alphabet, the state to which it next goes.

A finite automaton can be deterministic or non-deterministic.
In the case of a non-deterministic automaton, multiple transitions are possible for the
same input.
● DFA and NFA are both mathematical models used to recognize patterns in strings of
symbols; a DFA is deterministic and follows a single path for each input, while an NFA is
non-deterministic and can explore multiple paths simultaneously.
Finite state automata have been used in a wide variety of areas
including linguistics, electrical engineering, computer science,
mathematics and logic.
They are important to all of computational linguistics and have been
used as a mathematical device to implement regular expressions.
A deterministic finite state automaton (DFA) is defined as a 5-tuple (𝚺, Q, ẟ, S, F), where
𝚺 → the set of input symbols
Q → the set of states
ẟ → the transition function
S → the start state
F → the set of final states

Any regular expression can be represented by a finite automaton, and
the language of any finite automaton can be described by a regular
expression.
For the automaton shown in the figure, 'ac' is a string that is not accepted.
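A minimal sketch of a DFA simulator in Python; the toy automaton here (its states, alphabet and transitions) is an assumed example accepting strings of a's followed by at least one b, not the automaton in the slides' figure.

# Toy DFA accepting the language a*b+ (illustrative assumption)
start = "q0"
finals = {"q1"}
delta = {
    ("q0", "a"): "q0",
    ("q0", "b"): "q1",
    ("q1", "b"): "q1",
}

def accepts(s: str) -> bool:
    # Follow the transition function; reject if no transition is defined
    state = start
    for ch in s:
        if (state, ch) not in delta:
            return False
        state = delta[(state, ch)]
    return state in finals

print(accepts("aab"))  # True
print(accepts("ac"))   # False: 'ac' is not accepted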
Morphological parsing
Morphology is a sub-discipline of linguistics. It is the study of word structure and the
formation of words from smaller units (morphemes).
Parsing: taking an input and producing some kind of structure out of it.

Understanding morphology is important for understanding the syntactic and semantic
properties of words.
● Morphemes are the smallest meaning-bearing units in a language.
Example: bread contains a single morpheme.
Eggs contains two morphemes:
the morpheme egg and the morpheme -s.
● A morphological parser should be able to tell us that the word eggs is the
plural form of egg.
Two broad classes of morphemes:

● Stem (the main morpheme)
● Affixes (modify the meaning of the stem)
Affixes can be prefixes, suffixes, infixes and circumfixes.
❖ prefix → a morpheme which appears before the stem (ex: unable)
❖ suffix → a morpheme which appears after the stem (ex: respectful)
❖ infix → a morpheme which appears inside the stem (ex: geese, choose)
❖ circumfix → morphemes that appear both before and after the stem (ex:
inhumanity)
Three main ways of word formation
● Inflection
● Derivation
● Compounding
Inflection: a root word is combined with a grammatical morpheme to yield a word of the same
class (example: pass - passed - passing).

Derivation: combines a word stem with a grammatical morpheme to yield a word of a different
class, example: computation (noun) from compute (verb).
Formation of a noun from a verb is called nominalization.
Compounding: the process of merging two or more words to form a new word, e.g. bed + room
→ bedroom.
In linguistic morphology, inflection (or inflexion) is a process of word formation in
which a word is modified to express different grammatical categories such as tense, case,
voice, aspect, person, number, gender, mood, animacy, and definiteness.
Applications of Morphological parsing
● Spelling correction
● Machine translation
● In information retrieval, morphological analysis helps in identifying the presence of a
query word in a document.
Morphological parsing takes as input the inflected surface form of a word and
produces output consisting of the canonical form (or lemma) of the word and a set of tags
showing its syntactic category and morphological characteristics,
e.g. part of speech and/or inflectional properties (gender, number, person, tense, etc.).
Morphological generation is the inverse of morphological parsing.
Both analysis and generation rely on two sources of information:
1) A dictionary of valid lemmas of the language
2) A set of inflection paradigms.
Morphological parsing uses the following information sources:
● Lexicon: the lexicon lists stems and affixes together with basic information about them.
Eg: stem play + suffix -ing → playing
● Morphotactics: deals with the ordering of morphemes. It describes the way morphemes
are arranged or attach to each other.
Eg: happy → unhappiness
● Orthographic rules: spelling rules that specify the changes that occur when two given
morphemes combine.
Example: easy changes to easier and not easyer.
Morphological analysis can be avoided if an exhaustive lexicon is available that lists the
features of all word forms of all roots.
Limitations of this approach:
● Heavy demand on memory.
● An exhaustive lexicon fails to show the relationship between different roots having
similar word-forms.
● For morphologically complex languages, the number of possible word-forms may be
theoretically infinite.
These limitations make morphological parsing necessary.
Stemmers are the simplest morphological systems, but stemmers are not perfect.
A stemming algorithm works in two steps:
1. Suffix removal: removes predefined suffixes.
2. Recoding: adds a predefined ending to the output of the first step.
Two widely used stemming algorithms are the Lovins stemmer and the Porter stemmer.
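A minimal sketch of the two-step suffix-removal-plus-recoding idea described above; the tiny suffix list and recoding table are illustrative assumptions, far simpler than the actual Lovins or Porter rules.

# Illustrative two-step stemmer: (1) strip a known suffix, (2) recode the ending.
SUFFIXES = ["ing", "es", "ed", "s"]   # assumed, tiny suffix list
RECODE = {"studi": "study"}           # assumed recoding table

def stem(word: str) -> str:
    word = word.lower()
    # Step 1: suffix removal
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            word = word[: -len(suf)]
            break
    # Step 2: recoding (rewrite the ending if a rule applies)
    return RECODE.get(word, word)

print(stem("playing"))   # play
print(stem("boxes"))     # box
print(stem("studies"))   # study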

The two-level morphological model proposed by Koskenniemi can be used for highly inflected
languages.
Morphological parsing can be viewed as a mapping from the surface level into morpheme and
feature sequences on the lexical level.
Playing → surface form
Play + V + PP → lexical form
(stem) (morphological information that tells us it is the present participle form of the verb)
Surface level → the actual word
Lexical level → the concatenation of its constituent morphemes

● This model is usually implemented with a kind of finite-state automaton called a finite-
state transducer (FST). A transducer maps one set of symbols to
another.
● An FST can be thought of as a two-tape automaton, which recognizes or
generates a pair of strings.
● An FST passes over the input string, consuming the input symbols on the tape
it traverses and producing output symbols.

A finite-state transducer is a 6-tuple (Q, 𝚺1, 𝚺2, ẟ, S, F), where
Q → the set of states the FST can be in
𝚺1 → the input alphabet
𝚺2 → the output alphabet
ẟ → the transition function, mapping Q × (𝚺1 ∪ {ε}) to a subset of Q
S → the start state
F → the set of final states
An FST is similar to an NFA except that transitions are made on strings
rather than on single symbols and, in addition, they have outputs.

The figure shows a simple transducer that accepts two input strings.
FSAs encode regular languages, whereas FSTs encode regular relations.
Getting from the surface form of a word to its morphological analysis is done in
two steps:
1. Split the word into its possible components (this step considers
spelling rules).
Example: bird+s → where + indicates the morpheme boundary.
There are two possible ways of splitting up boxes:
● boxe+s (assumes box as the stem, with e included due to a
spelling rule)
● box+s (assumes box as the stem and s as the suffix)
The output of this step is a concatenation of stems and affixes.
2. The lexicon is used to look up the categories of the stems and the meaning of
the affixes.
So, box+s is mapped to box+N+PL.

In the case of the split boxe+s, through the lexicon we will be able to find out that boxe is
not a legal word.
For spelling variations like spouses and parses, orthographic rules are
used.

Example of an orthographic rule:
Insert e after -s, -z, -x, -ch, -sh and before the s (box → boxes, dish → dishes).

Each of these steps can be implemented using transducers.
Hence we require two transducers:
1. One that maps the surface form to the intermediate form.
2. One that maps the intermediate form to the lexical form.
Transducers can be either deterministic or non-deterministic.
Every NFA has an equivalent DFA, but not every non-deterministic
transducer has an equivalent deterministic transducer.
FST-based morphological parsing for singular and plural nouns in English

● The plural forms of regular words end with s or es (certain words like miss
end with s but are not plurals).
Example: buses, prizes, foxes
● Consider boxes: the stem of this word is box and it ends with es.
● In such a case, 'e' should be deleted when introducing the morpheme
boundary.
Step 2: develop a transducer that does the mapping from the intermediate level to the lexical
level.

● Regular noun form → bird → Noun; reads nothing further and indicates it as singular.
● Irregular singular noun → goose → Noun; reads nothing further and indicates it as singular.
● Irregular plural noun → geese → Noun; reads nothing further and indicates it as plural (PL).
● The mapping from state 1 to state 4 is carried out with the help of a transducer encoding a
lexicon, which maps individual regular and irregular noun stems to their correct noun
stem, replacing irregular forms with the regular stem (geese to goose).
Single two-level transducer

The same transducer can be used for both analysis and generation.
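A minimal sketch of the surface-to-lexical mapping, using a lookup table rather than a real FST; the tiny lexicon and the tag names (+N, +SG, +PL) follow the slides' examples, while everything else is an assumption.

# Illustrative two-level mapping: surface form <-> lexical form.
# A real implementation would compose finite-state transducers; here a
# dictionary stands in for the transducer.
ANALYSES = {
    "bird":  "bird+N+SG",
    "birds": "bird+N+PL",
    "box":   "box+N+SG",
    "boxes": "box+N+PL",    # orthographic rule: e inserted after x before s
    "goose": "goose+N+SG",
    "geese": "goose+N+PL",  # irregular plural mapped to its stem
}
GENERATION = {lexical: surface for surface, lexical in ANALYSES.items()}

def analyse(surface: str) -> str:
    return ANALYSES.get(surface, "unknown")

def generate(lexical: str) -> str:
    return GENERATION.get(lexical, "unknown")

print(analyse("geese"))        # goose+N+PL
print(generate("box+N+PL"))    # boxes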


Spelling Error Detection and Correction
Sources of spelling errors:
● Single character omission (charcter instead of character)
● Insertion (errorn instead of error)
● Substitution (errpr instead of error)
● Reversal (transposition) (aer instead of are)

OCR and similar automatic reading devices are more likely to generate
● single character omission,
● insertion, and
● substitution errors,
but they do not introduce reversal errors.
OCR errors are grouped into 5 categories:
● Substitution
● Multiple substitution
● Space deletion
● Insertion
● Failures
Substitution errors are mainly due to visual similarity (c → e, 1 → l, etc.).
These errors can be corrected using context or linguistic structures.

Many spelling errors are phonetic: the misspelled word is pronounced in the same
way as the correct word; such errors are hard to set right.
Two categories of spelling errors:
● Non-word errors (the erroneous string does not appear in the lexicon, i.e. it is not a valid
orthographic word form); these are the errors most commonly resolved.
● Real-word errors. Solution: two main techniques, n-gram analysis and dictionary lookup.

Spelling correction consists of detecting and correcting errors.

● Error detection: the process of finding misspelled words.
● Error correction: the process of suggesting correct words for a misspelled word.

These subproblems are addressed in two ways:
1. Isolated error detection and correction (each word is checked independently of its context).
2. Context-dependent error detection and correction.
A simple way of correcting spelling errors is to make use of a lexicon.
The problems associated with this are:
● The existence of a lexicon containing all correct words: such a lexicon would take a long
time to compile and would occupy a lot of space.
● In the case of highly productive languages, it is impossible to list all the correct words of
such languages.
● A spelling error may itself be a word that is available in the lexicon (this is called a real-
word error, e.g. when theses is written in place of these).
● The larger the lexicon, the more chances there are of an error going undetected.

Context-dependent error detection and correction methods use the context and hence
require grammatical analysis, which makes them language dependent.
● Even in the case of context-dependent methods, a list of candidate words has to be identified
first using an isolated-word method before making the selection based on context.
Various spelling correction algorithms:
1. Minimum edit distance: the minimum number of operations required to transform one string
into another (the most widely used technique).
Source: "kitten"
Target: "sitting"
To transform "kitten" into "sitting", we can follow these steps: substitute k with s (sitten),
substitute e with i (sittin), and insert g at the end (sitting), giving a minimum edit
distance of 3.
2. Similarity key techniques: change the given string into a key such that similar strings
map to the same key.
3. N-gram based techniques: can be used for both non-word and real-word error
detection.
Non-word errors: in English, certain bigrams and trigrams of letters rarely or never
occur, for example the trigram qst and the bigram qd (a lexicon can be used).
Real-word errors: a corpus is used, with the likelihood of letter occurrences, to find
possible correct words.
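A rough sketch of non-word error detection with letter bigrams; the set of valid bigrams would normally be derived from a lexicon or corpus, so the tiny table here is purely an assumption for illustration.

# Flag a word as a likely non-word if it contains a letter bigram that never
# occurs in the (assumed) set of valid English bigrams.
VALID_BIGRAMS = {"st", "tr", "ri", "in", "ng", "qu", "ue", "th", "he", "er"}

def has_invalid_bigram(word: str) -> bool:
    word = word.lower()
    return any(word[i:i+2] not in VALID_BIGRAMS for i in range(len(word) - 1))

print(has_invalid_bigram("string"))  # False: every bigram is attested
print(has_invalid_bigram("qd"))      # True: 'qd' never occurs in English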
4. Neural nets: these have the ability to do associative recall based on
incomplete and noisy data. They can be trained to adapt to specific
spelling error patterns.
Drawback: they are computationally expensive.
5. Rule-based techniques: a set of rules derived from knowledge
of common spelling error patterns is used to transform
misspelled words into valid words.
Example: many errors are generated by typing mistakes such as ue
typed instead of eu; this can be written as a rule.
Minimum Edit Distance:
The number of insertions, deletions and substitutions required to change one string into
another; the smallest such number is the minimum edit distance.
Example: consider "tutor" and "tumour"
t u t o _ r
t u m o u r (substitute t → m, insert u), so the distance is 2.
Edit distance can be viewed as a string alignment problem.
There can be more than one possible alignment, for example:
t u t _ o _ r
t u _ m o u r (one deletion and two insertions, cost 3); in this case
we consider the minimum distance over all alignments, which is 2.
● The edit distance between two strings can be represented as a binary
function, ed, which maps two strings to their edit distance.
● ed is symmetric: for any strings s and t, ed(s,t) = ed(t,s).
● A dynamic programming approach can be used to compute the edit distance
between two sequences.
● Dynamic programming refers to a class of algorithms that apply a
table-driven approach to solve problems by combining solutions to
subproblems.
Dynamic programming algorithm for Minimum Edit Distance

● The dynamic programming algorithm is implemented by creating an edit distance
matrix.
● The matrix has one row for each symbol in the source string and one column for each
symbol in the target string.
● The (i,j)-th cell in the matrix represents the distance between the first i characters of the
source string and the first j characters of the target string.
● Each cell can be computed as a simple function of its surrounding cells.
● The value in each cell is computed in terms of three possible paths: deletion, insertion,
and substitution.
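A minimal sketch of the dynamic programming computation, assuming unit costs for insertion, deletion, and substitution (as in the Levenshtein distance):

def min_edit_distance(source: str, target: str) -> int:
    # Fill an (len(source)+1) x (len(target)+1) distance matrix bottom-up.
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                               # i deletions
    for j in range(1, m + 1):
        d[0][j] = j                               # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub_cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,             # deletion
                          d[i][j - 1] + 1,             # insertion
                          d[i - 1][j - 1] + sub_cost)  # substitution / match
    return d[n][m]

print(min_edit_distance("kitten", "sitting"))  # 3
print(min_edit_distance("tutor", "tumour"))    # 2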
Words and Word Classes
● Words are classified into categories called parts of speech. These are also
called word classes or lexical categories.
● Lexical categories are defined by their syntactic and morphological
behaviour.
● The most common lexical categories are nouns and verbs.
● Lexical categories vary from language to language.
● Word classes are categorized as open and closed word classes.

Open word classes: nouns, verbs, adjectives, adverbs, etc.
Closed word classes: prepositions, determiners, pronouns, conjunctions, interjections, etc.
Parts of speech tagging
● It is the process of assigning a part of speech to each word in a sentence.

Input: a sequence of words in natural language and a specified tag set → tagging algorithm
Tagging algorithm → part of speech for each word (output)

● In tagging we try to determine the correct lexical category of each word in a
sentence. However, no tagger is efficient enough to identify the correct lexical
category of every word in a sentence in every case.
● Tag set: the collection of tags used by a particular tagger.
● Tag sets differ in how they define categories and how finely they divide words into
categories. In a certain tagset, both eat and eats are assigned the same tag, verb;
other tagsets might assign them distinct tags.
Most tagsets capture morpho-syntactic information such as singular/plural, number, gender,
tense, etc.
● The Penn Treebank tag set contains 45 tags.
● The C7 tagset uses 164 tags.
● TOSCA-ICE uses 270 tags.
● TESS uses 200 tags.
● The larger the tagset, the more complicated the task of tagging becomes and the more
manual correction it requires. A bigger tagset can be used for a morphologically rich language.
● POS tagging is an early stage of text processing in many NLP applications including
speech synthesis, machine translation, information retrieval and information extraction.
● Tagging is not as complex as parsing (a complete parse tree is not built).
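A brief example of POS tagging with NLTK, which uses the Penn Treebank tag set; this assumes NLTK is installed and its tokenizer and tagger models have been downloaded.

import nltk

# One-time downloads (tokenizer and pre-trained tagger models)
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The birds eat seeds in the garden."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('The', 'DT'), ('birds', 'NNS'), ('eat', 'VBP'), ('seeds', 'NNS'),
#       ('in', 'IN'), ('the', 'DT'), ('garden', 'NN'), ('.', '.')]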
