0% found this document useful (0 votes)
7 views

unit1

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views

unit1

Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 30

Finite Automata in Natural Language Processing

Finite State Automata

The term automata, derived from the Greek word “αὐτόματα” meaning “self-
acting”, is the plural of automaton which may be defined as an abstract self-
propelled computing device that follows a predetermined sequence of operations
automatically.

An automaton having a finite number of states is called a Finite Automaton (FA)


or Finite State automata (FSA).

Mathematically, an automaton can be represented by 5-tuple (Q, Σ, δ,


q0, F), where −
► Q is a finite set of states.
► Σ is a finite set of symbols, called the alphabet of the automaton
► δ is the transition function
► q0 is the initial state from where any input is processed (q0 ∈ Q)
► F is a set of final state/states of Q (F ⊆ Q)

Types of Finite State Automation (FSA)

Finite state automation is of two types.

Deterministic Finite Automation (DFA)

It may be defined as the type of finite automation wherein, for every input symbol
we can determine the state to which the machine will move. It has a finite number
of states which is why the machine is called Deterministic Finite Automaton
(DFA).

Mathematically, a DFA can be represented by 5-tuple (Q, Σ, δ, q0,


F), where −
► Q is a finite set of states.
► Σ is a finite set of symbols, called the alphabet of the automaton.
► δ is the transition function where δ: Q × Σ → Q .
► q0 is the initial state from where any input is processed (q0 ∈ Q).
► F is a set of final state/states of Q (F ⊆ Q).

Whereas graphically, a DFA can be represented by diagraphs called


state diagrams where −
► The states are represented by vertices.
► The transitions are shown by labeled arcs.
► The initial state is represented by an empty incoming arc.
► The final state is represented by double circle.

Example of DFA —

Non-deterministic Finite Automation (NDFA)

It may be defined as the type of finite automation where for every input symbol we
cannot determine the state to which the machine will move i.e. the machine can
move to any combination of the states. It has a finite number of states which is
why the machine is called Non-deterministic Finite Automation (NDFA).

Mathematically, NDFA can be represented by 5-tuple (Q, Σ, δ, q0, F),


where −
► Q is a finite set of states.
► Σ is a finite set of symbols, called the alphabet of the automaton.
► δ is the transition function where δ: Q × Σ → 2 Q.
► q0 is the initial state from where any input is processed (q0 ∈ Q).
► F is a set of final state/states of Q (F ⊆ Q).

Whereas graphically (same as DFA), a NDFA can be represented by


diagraphs called state diagrams where −
► The states are represented by vertices.
► The transitions are shown by labeled arcs.
► The initial state is represented by an empty incoming arc.
► The final state is represented by double circle.

Example of NDFA —
Finite Automata in NLP

Finite State Transducer

A finite-state transducer (FST) is a finite-state machine with two memory tapes,


following the terminology for Turing machines: an input tape and an output tape.
This contrasts with an ordinary finite-state automaton, which has a single tape. An
FST is a type of finite-state automaton (FSA) that maps between two sets of
symbols. An FST is more general than an FSA. An FSA defines a formal language
by defining a set of accepted strings, while an FST defines relations between sets
of strings.

Language Recognizer

There are many tasks that need language recognizing mechanism. For example,
spelling checker, morphological analysis, language identification etc.. Finite state
machine are quite useful as a language recognizer. For a given word, a NFA can
be designed easily that recognize the word. For example, NFA for the words ‘boy’
and ‘bat’ is shown in the Figure below. Similarly for every word a NFA can be
designed and the different NFA’s can be combined to form spelling checker or
dictionary compilation for a language.

NFA for the words ‘boy’ and ‘bat’

For each category of words, we can form a separate NFA and then combine them
using transitions. For example, nouns and their plural can be recognized through
one NFA and verbs and their different forms can be recognized through another
NFA and finally, the two NFA can be combined. The figure below shows the NFA
for some words and their morphological variations.
NFA for some words and their morphological variations

A Simple Machine that can laugh

A finite state generator is a simple computing machine that outputs a sequence of


symbols.

It starts in some start state and then tries to reach a final state by making transitions
from one state to another. Every time it makes such a transition
it emits (or writes or generates) a symbol.
So, what does the generator in the pictures say? It laughs. It generates sequences of
symbols of the form ha! or haha! or hahaha! or hahahaha! and so on. Why does it
behave like that? Well, it first has to make a transition emitting h. The state that it
reaches through this transition is not a final state. So, it has to keep on going
emitting an a. Here, it has two possibilities: it can either follow the ! arrow,
emitting ! and then stopping in the final state (but remember, it can’t look ahead to
see that it would reach a final state with the ! transition) or it can follow the h
arrow emitting an h and going back to the state where it just came from.

Finite state generators can be thought of as directed graphs. And in fact finite state
generators are usually drawn as directed graphs. Here is our laughing machine as
we will from now on draw finite state generators:

Morphological parsing and generation

Morphological parsing is the process of producing a lexical structure of a word,


that is, breaking a word into stem and affixes and labeling the morphemes with
category label. For example the word ‘books’ can be parsed as book+s and further
as book + N+PL, the word ‘went’ can be parsed as go+V+PAST etc. generation is
the reverse process of parsing, that is combining the lexical form of a word to
produce the word. For example, box+N+Pl generates the word ‘boxes’. Finite state
transducers are quite useful in morphological parsing. Let us consider the lexicon
form of regular inflectional ‘girl +N +PL’. Figure below shows the transducer
which convert the lexicon form ‘girl+N+PL’ into the word ‘girls’. Let us assume
that x represents the word ‘girl’ for simplification purpose as for every regular
singular noun the input and output will remain same in a transducer, that’s why a
variable x can be used for a noun. The word ‘girl’ can be replaced by any other
regular noun like ‘boy’.

Generation of word from its lexical form through a transducer

Here simple regular nouns have been considered on exemplarily basis,


orthographic rules can be applied for irregular inflections and derivations and
NFAs can be designed for every word and its variations in a certain language.

In other sense, we can say that morphology is the study of −


► The formation of words.
► The origin of the words.
► Grammatical forms of the words.
► Use of prefixes and suffixes in the formation of words.
► How parts-of-speech (PoS) of a language are formed.

Finite state automata for recognition

The approach to spelling rules that is described here involves the use of finite state
transducers (FSTs). Rather than jumping straight into this, we will briefly consider
the simpler finite state automata and how they can be used in a simple recogniser.
Suppose we want to recognise dates (just day and month pairs) written in the
format day/month. The day and the month may be expressed as one or two digits
(e.g. 11/2, 1/12 etc). This format corresponds to the following simple FSA, where
each character corresponds to one transition:
This is a non-deterministic FSA: for instance, an input starting with the digit 3 will
move the FSA to both state 2 and state 3. This corresponds to a local ambiguity:
i.e., one that will be resolved by subsequent context. By convention, there must be
no ‘left over’ characters when the system is in the final state.

You might also like