
2.chapter3 - Regular Expressions and Automata

Uploaded by

Minh Mai Ngọc

Natural Language Processing

AC3110E

Chapter 3: Regular Expressions and Automata
Lecturer: PhD. DO Thi Ngoc Diep
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Approaches of NLP

• Symbolic NLP (1950s – early 1990s)
• collections of complex, hand-written rules
• the computer emulates NLP tasks by applying those rules to the data it confronts
• the rules for manipulating symbols are hand-coded and coupled with a dictionary lookup
• Statistical NLP (1990s–2010s)
• introduction of machine learning algorithms for language processing
• driven by the increase in computational power and the enormous amount of data available
• Neural NLP (present)
• since the 2010s, deep neural network-style machine learning methods (featuring many hidden layers)
• achieve state-of-the-art results in many natural language tasks

3.1 Finite-State Automata (FSA)

• Example: a list of words: baa!, baaa!, baaaa!, baaaaa!, ...
• How to model these words?
• How to model a sequence of characters?
• How to model a sequence of states?

• Automaton
• A directed graph
• Vertices (nodes) = states, including a start state and final (accepting) states
• Arcs (links) = transitions
• Can also be represented as a state-transition table
• Used for recognizing strings/words (accept or reject)


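• A finite automaton is defined by five parameters:
• Q = {q0, q1, q2, ..., qN−1}: a finite set of N states
• Σ: a finite input alphabet of symbols
• q0: the start state
• F: the set of final states, F ⊆ Q
• δ(q,i): the transition function or transition matrix between states. Given a state q ∈ Q and an input symbol i ∈ Σ, δ(q,i) returns a new state q′ ∈ Q.
• Example (automaton for the words baa!, baaa!, ...):
• Q = {q0,q1,q2,q3,q4}
• Σ = {a,b,!}
• F = {q4}
• δ(q,i): defined by the state-transition table [table not reproduced]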
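The five parameters above can be turned directly into a table-driven recognizer. The following is a minimal sketch, not the lecture's own code: the transition table `DELTA` is an assumption (a common five-state automaton for baa!, baaa!, ...), because the table on the slide did not survive, and the name `d_recognize` is hypothetical.

```python
# Sketch of table-driven recognition with a deterministic FSA.
# ASSUMPTION: DELTA is a reconstruction for the words baa!, baaa!, ...;
# the original state-transition table is not available.
START = 0                 # q0
FINAL = {4}               # F = {q4}
DELTA = {                 # delta(q, i) -> next state
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,          # loop: any number of extra a's
    (3, "!"): 4,
}


def d_recognize(tape: str) -> bool:
    """Return True if the automaton accepts the whole input string."""
    state = START
    for symbol in tape:
        key = (state, symbol)
        if key not in DELTA:      # no transition defined: reject
            return False
        state = DELTA[key]
    return state in FINAL         # accept only if we end in a final state
```

A missing entry in the table plays the role of a "fail" state, so strings such as "ba!" or "baa" are rejected.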


• Deterministic FSA (DFSA)
• behavior is fully determined by the current state and the input symbol
• Non-deterministic FSA (NFSA)
• may have more than one possible next state for a given input (decision points)
• may have ε-transitions: arcs that are followed without consuming an input symbol
• for any NFSA, there is an exactly equivalent DFSA


• Given an FSA, how do we check whether a string is accepted? With a non-deterministic FSA, the recognizer may follow the wrong path at a choice point. Solutions:
• Look-ahead: look ahead in the input to decide which path to take.
• Parallelism: whenever we come to a choice point, look at every alternative path in parallel.
• Backup (backtracking): at a choice point, put a marker to record the position in the input and the state the automaton was in. If it turns out that the wrong choice was taken, back up and try another path.
• These solutions are suitable for small problems only.


• Recognizing strings ⇔ search:
• explore all the possible paths
• each iteration selects a partial path to explore and keeps track of any remaining unexplored partial paths
• State-space search algorithms
• create a space of possible solutions
• explore this space to return an answer: "accept" or "reject"
• unexplored partial paths are kept on an agenda, implemented as a stack (depth-first search) or a queue (breadth-first search)
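The search view of recognition can be sketched with an explicit agenda of (state, input position) pairs. This is an illustrative sketch under stated assumptions: the non-deterministic transition table `NDELTA` is invented for the example (it has a choice point in state 2 and no ε-transitions), and `nd_recognize` is a hypothetical name.

```python
from collections import deque

# ASSUMPTION: an illustrative non-deterministic automaton for baa!, baaa!, ...
# Each entry maps (state, symbol) to a SET of possible next states.
NDELTA = {
    (0, "b"): {1},
    (1, "a"): {2},
    (2, "a"): {2, 3},     # choice point: stay in 2 or move on to 3
    (3, "!"): {4},
}
ND_FINAL = {4}


def nd_recognize(tape: str, depth_first: bool = True) -> bool:
    """State-space search over (state, position) pairs.

    The agenda holds unexplored partial paths; it is used as a stack
    (depth-first) or as a queue (breadth-first).
    """
    agenda = deque([(0, 0)])                  # start state, start of input
    while agenda:
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(tape):                  # input consumed
            if state in ND_FINAL:
                return True                   # accept
            continue                          # dead end, try another path
        for nxt in sorted(NDELTA.get((state, tape[pos]), ())):
            agenda.append((nxt, pos + 1))     # remember every alternative
    return False                              # agenda exhausted: reject
```

Both search orders return the same answer; they differ only in the order in which alternative paths are explored.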

Finite State Transducers

• A finite-state transducer (FST)
• is like an automaton, but it performs actions as it consumes an input string
• converts an input string into an output string
• The state transitions are labeled with
• the symbols that cause the state transition
• an action to perform as the transition is made
• An FST has both an input string and an output string.


• An FST is defined by seven parameters:
• Q = {q0, q1, q2, ..., qN−1}: a finite set of N states
• Σ: a finite set corresponding to the input alphabet
• ∆: a finite set corresponding to the output alphabet
• q0: the start state
• F: the set of final states, F ⊆ Q
• δ(q,w): the transition function or transition matrix between states
• σ(q,w): the output function
• Given a state q ∈ Q and a string w ∈ Σ∗, σ(q,w) gives a set of output strings, each string o ∈ ∆∗
• Properties
• Inversion: the inversion of a transducer T is written T⁻¹
• If T maps from I to O, T⁻¹ maps from O to I.
• Composition:
• (T1 ◦ T2)(S) = T2(T1(S))
• If T1 maps from I1 to O1 and T2 maps from O1 to O2, then T1 ◦ T2 maps from I1 to O2.
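Inversion and composition can be illustrated on a deliberately small case. The sketch below assumes a deterministic transducer that reads one input symbol and writes one output symbol per transition (no ε, no ambiguity), which is much narrower than the general definition; the dictionaries `T1` and `T2` and the function names are hypothetical.

```python
# Toy FST: delta maps (state, input symbol) -> (next state, output symbol).
# ASSUMPTION: deterministic, one symbol in / one symbol out per transition.
T1 = {"start": 0, "final": {0},
      "delta": {(0, "a"): (0, "x"), (0, "b"): (0, "y")}}
T2 = {"start": 0, "final": {0},
      "delta": {(0, "x"): (0, "1"), (0, "y"): (0, "2")}}


def transduce(fst, tape):
    """Return the output string for tape, or None if tape is not accepted."""
    state, out = fst["start"], []
    for symbol in tape:
        key = (state, symbol)
        if key not in fst["delta"]:
            return None
        state, written = fst["delta"][key]
        out.append(written)
    return "".join(out) if state in fst["final"] else None


def invert(fst):
    """Swap input and output labels: if T maps I to O, the result maps O to I.

    Only valid here when the swapped table is still deterministic.
    """
    delta = {(q, o): (nxt, i) for (q, i), (nxt, o) in fst["delta"].items()}
    return {"start": fst["start"], "final": set(fst["final"]), "delta": delta}


def compose_apply(t1, t2, tape):
    """Apply T1 then T2, i.e. (T1 o T2)(S) = T2(T1(S))."""
    middle = transduce(t1, tape)
    return None if middle is None else transduce(t2, middle)
```

For example, `transduce(T1, "abba")` yields "xyyx", `transduce(invert(T1), "xyyx")` yields "abba", and `compose_apply(T1, T2, "ab")` yields "12". A full implementation would build the composed machine itself rather than chaining the two applications.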


• Sequential transducers:
• a subtype of transducers that are deterministic on their input
• the transitions out of each state are determined by the state and the input symbol
• can have epsilon symbols in the output string, but not in the input

• Subsequential transducers:
• generate an additional output string at the final states
• a transducer that generates p additional output strings at the final states is called a p-subsequential transducer
• useful for handling a finite amount of ambiguity

3.2 Regular Languages

• Formal languages
• use algebra and set theory to define languages as sets of sequences of symbols
• each word is composed of symbols (letters or tokens) from a finite set called an alphabet
• the set of all words over an alphabet Σ is denoted by Σ*
• the empty word is denoted by ε
• a formal language L over an alphabet Σ is a subset of Σ*
• Example: L = {a, b, ab, cba}

• Regular language
• a particular kind of formal language
• its words are described by a finite-state automaton
• used to model parts of a natural language (parts of the phonology, morphology, or syntax)

• Example: the language of an automaton m that accepts baa!, baaa!, ... is L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaa!, ...}


• The collection of regular languages over Σ is defined as follows:
• 1. The empty language ∅ is a regular language
• 2. ∀a ∈ Σ, the singleton language {a} is a regular language
• 3. If L1 and L2 are regular languages, then so are:
• (a) L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
• (b) L1 ∪ L2, the union of L1 and L2
• (c) L1*, the Kleene closure of L1
• Properties
• Intersection: if L1 and L2 are regular languages, then so is L1 ∩ L2
• Difference: if L1 and L2 are regular languages, then so is L1 − L2
• Complementation: if L1 is a regular language, then so is Σ* − L1, where Σ* is the set of all possible strings formed from the alphabet Σ
• Reversal: if L1 is a regular language, then so is L1^R, the set of reversals of all the strings in L1
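The operations in this definition can be tried out on small finite languages represented as Python sets of strings. This is only an illustration: the Kleene closure is infinite, so the sketch truncates it at a maximum word length, and the function names are hypothetical.

```python
def concat(l1, l2):
    """L1 . L2 = {xy | x in L1, y in L2}."""
    return {x + y for x in l1 for y in l2}


def kleene(lang, max_len):
    """Words of L* up to max_len characters (L* itself is infinite)."""
    result = {""}                 # the empty word is always in L*
    frontier = {""}
    while frontier:
        longer = {w for w in concat(frontier, lang) if len(w) <= max_len}
        frontier = longer - result            # keep only new words
        result |= frontier
    return result


def reverse(lang):
    """L^R: the reversals of all the strings in L."""
    return {w[::-1] for w in lang}


L1 = {"a", "b"}
L2 = {"c"}
union = L1 | L2                   # {"a", "b", "c"}
```

Here `concat(L1, L2)` gives {"ac", "bc"} and `kleene({"ab"}, 4)` gives {"", "ab", "abab"}.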

Examples

• Draw an FSA for the words for English numbers 1–99

• Draw an FSA for simple dollars and cents amounts

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. 2nd edition, Prentice Hall.
Example

• Label the automaton so that it accepts strings such as

Examples

• Which of the following strings are accepted by the automaton?

Answers (in the order in which the strings are listed): Accept, Reject, Accept, Reject, Accept, Accept, Accept

• Describe the language of this automaton (in text).

Strings start with any number (including zero) of repetitions of the pair of characters "xy".
Next comes either "xz" or "a". After "xz", the string is done. After "a", the string
continues with any number (including zero) of "b" and ends with a single "c".

3.3 Regular expressions

• A drawn automaton takes up space and is not easily used as program input.
• A regular language can be described using a regular expression
• Any regular expression can be implemented as an FSA (except REs that use the memory feature, i.e. references back to earlier groups)
• Any FSA can be described with a regular expression

• Example: the automaton described above corresponds to the regular expression /(xy)*((xz)|(ab*c))/
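A regular expression can be checked against the textual description of the language by testing strings that should be accepted and strings that should be rejected. The sketch below uses Python's `re.fullmatch`; the test strings are illustrative choices, not taken from the slides. Because `|` has the lowest precedence, the alternation needs its own group for the expression to mean "(xy)* followed by either xz or ab*c".

```python
import re

GROUPED = r"(xy)*((xz)|(ab*c))"     # (xy)* followed by xz OR ab*c
UNGROUPED = r"(xy)*(xz)|(ab*c)"     # parsed as ((xy)*(xz)) OR (ab*c)


def accepts(pattern: str, string: str) -> bool:
    """True if the WHOLE string is in the language of the pattern."""
    return re.fullmatch(pattern, string) is not None


accepted = ["xz", "xyxz", "ac", "abbbc", "xyxyac"]
rejected = ["xy", "xza", "ab", "xyxzc"]

grouped_ok = (all(accepts(GROUPED, s) for s in accepted)
              and not any(accepts(GROUPED, s) for s in rejected))

# Without the outer group, "xy" can no longer precede the ab*c branch.
differs_on_xyac = accepts(GROUPED, "xyac") and not accepts(UNGROUPED, "xyac")
```

Both `grouped_ok` and `differs_on_xyac` evaluate to `True`.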


• A regular expression is an algebraic notation for describing text patterns.
• Used to specify a simple class of strings.
• Important throughout computer science and linguistics for specifying text search strings
• Regular expression search requires
• a pattern: what we want to search for
• a corpus of texts to search through
• The search function searches through the corpus and returns the texts that contain the pattern
• either the first match
• or all matches
• Regular expressions are a powerful tool for pattern matching
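The search setting described above (a pattern, a corpus, first match versus all matches) can be sketched with Python's standard `re` module. The corpus and the helper names are made up for illustration.

```python
import re

# Hypothetical mini-corpus; any list of strings would do.
corpus = [
    "The woodchuck chucked wood.",
    "No rodents here.",
    "Woodchucks and more woodchucks.",
]
pattern = re.compile(r"[wW]oodchucks?")


def texts_containing(pat, texts):
    """Return the texts of the corpus that contain the pattern."""
    return [t for t in texts if pat.search(t)]


def first_match(pat, text):
    """Return the first matching substring, or None."""
    m = pat.search(text)
    return m.group(0) if m else None


def all_matches(pat, text):
    """Return every non-overlapping matching substring, left to right."""
    return pat.findall(text)
```

On this corpus, `texts_containing(pattern, corpus)` returns the first and third texts, `first_match(pattern, corpus[2])` returns "Woodchucks", and `all_matches(pattern, corpus[2])` returns both occurrences.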

Basic RE Patterns in Perl syntax

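• Exact match of a sequence of characters: / / (the pattern is written between the slashes; matching is case sensitive)

• Choice of characters: [ ] (matches any one of the characters inside the brackets)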


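• Range: [ - ] (matches any one character in the range, e.g. /[0-9]/ matches a single digit)

• Exclusion (cannot be): [^ ] (a caret right after the open bracket negates the set, e.g. /[^a-z]/ matches any character that is not a lowercase letter)

• Match any single character (except a carriage return): the wildcard .

• \. means a literal period, not the wildcard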
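The basic patterns above behave the same way in Python's `re` module, which follows Perl-style syntax for these operators. The example text and variable names below are invented for illustration.

```python
import re

text = "Chapter 1: Down the Rabbit Hole. The end"

choice = re.findall(r"[Tt]he", text)            # choice of characters
digits = re.findall(r"[0-9]", text)             # range
not_letter_or_space = re.findall(r"[^a-zA-Z ]", text)   # exclusion

# The wildcard . matches any single character (here: i, u, or ').
wildcard = re.findall(r"beg.n", "begin begun beg'n bgn")

# An escaped period matches only a literal "."
literal_period = re.findall(r"end\.", "The end. Or the end?")
any_after_end = re.findall(r"end.", "The end. Or the end?")
```

Here `choice` is `['the', 'The']`, `not_letter_or_space` is `['1', ':', '.']`, and `literal_period` contains only `'end.'`, while `any_after_end` also picks up `'end?'`.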


• “zero or one” instance of the previous character: ?

• “zero or more” occurrences of the previous character or expression: * (Kleene star)
• /a*/ vs /aa*/ vs /[ab]*/
• “one or more” occurrences: + (Kleene plus)
• /[0-9]+/; /baaa*!/ or, equivalently, /baa+!/
• “any string of characters”: /.*/
• Anchors
• Start of line: /^ /
• End of line: / $/
• Example: /^The dog\.$/ matches a line that contains only the phrase “The dog.”
• \b matches a word boundary
• Does /\bthe\b/ match “the” or “other”?
• \B matches a non-boundary
• Does /\b99\b/ match “There are 99 bottles”, “$99 each” or “There are 299 bottles”?
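The quantifier and anchor questions above can be checked experimentally. The following sketch uses Python's `re` module, whose behavior for these operators matches the Perl syntax used here; the helper names `whole` and `found` are hypothetical.

```python
import re


def whole(pattern, s):
    """True if the pattern matches the entire string."""
    return re.fullmatch(pattern, s) is not None


def found(pattern, s):
    """True if the pattern matches somewhere in the string."""
    return re.search(pattern, s) is not None


# Kleene star and plus
star_matches_empty = whole(r"a*", "")            # zero a's is fine
plus_form_needs_one = not whole(r"aa*", "")      # at least one a
same_language = all(
    whole(r"baaa*!", s) == whole(r"baa+!", s)
    for s in ["ba!", "baa!", "baaaa!", "b!", "baa"]
)

# Anchors
only_the_dog = found(r"^The dog\.$", "The dog.")
not_only_the_dog = not found(r"^The dog\.$", "The dog. It barks.")

# Word boundaries
the_in_the = found(r"\bthe\b", "the")            # True
the_in_other = found(r"\bthe\b", "other")        # False
n99 = [found(r"\b99\b", s)
       for s in ["There are 99 bottles", "$99 each", "There are 299 bottles"]]
```

`n99` comes out as `[True, True, False]`: there is a word boundary between "$" and "9", but none between "2" and "9".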

Disjunction, Grouping, and Precedence

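• Disjunction of strings: / | /
• /cat|dog/ matches either cat or dog.
• Grouping: ( )
• makes the enclosed pattern act like a single character for the neighboring operators
• /gupp(y|ies)/ matches both guppy and guppies
• Operator precedence (from highest to lowest)
• 1. Parentheses: ( )
• 2. Counters: * + ? { }
• 3. Sequences and anchors: the ^my end$
• 4. Disjunction: |
• /the*/ vs. /the|any/: counters bind tighter than sequences, so /the*/ matches theeeee but not thethe; sequences bind tighter than disjunction, so /the|any/ matches the or any but not theny

• Examples:
• Prices: /\$[0-9]+(\.[0-9][0-9])?\b/ (the dollar sign is escaped so that it is not read as the end-of-line anchor, and there is no leading \b because there is no word boundary between a space and $)
• Processor speed: /\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
• Disk space: /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/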
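The three longer patterns can be exercised on a few sample phrases. This is a sketch with Python's `re` module; the sample sentences are invented, and the dollar sign in the price pattern is written as `\$` because Python (like Perl) treats a bare `$` as the end-of-line anchor.

```python
import re

PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?\b")
SPEED = re.compile(r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b")
DISK = re.compile(r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b")


def first(pat, s):
    """Return the first matching substring, or None."""
    m = pat.search(s)
    return m.group(0) if m else None


price = first(PRICE, "It costs $199.99 today")
speed = first(SPEED, "a 500 MHz processor")
disk = first(DISK, "only 5.5 Gb free")
no_disk = first(DISK, "500 GB")      # the pattern only allows "Gb"

# Precedence: the counter applies to the last character only.
the_star = re.fullmatch(r"the*", "theee") is not None
the_twice = re.fullmatch(r"the*", "thethe") is not None
```

`the_star` is `True` and `the_twice` is `False`, which illustrates why grouping is needed to repeat a whole sequence.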

Other Operators

Greedy Matching

• Greedy matching
• Strings are matched from left to right, one character at a time.
• The quantifiers +, * and ? match as many characters as they can without preventing the rest of the regular expression from matching.
• Examples:
• How does /A+A+/ match the string “AA”?
• How does /X+X+/ match the string “XXXX”?
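One way to see how greedy quantifiers share the input is to wrap each quantified part in a capturing group and inspect what it consumed. The parentheses are an addition made only for this demonstration; the sketch uses Python's `re` module, which is greedy by default like Perl.

```python
import re

# Each group shows how many characters its quantifier ended up with.
aa = re.fullmatch(r"(A+)(A+)", "AA")
xxxx = re.fullmatch(r"(X+)(X+)", "XXXX")

aa_split = aa.groups()        # the first A+ must leave one A for the second
xxxx_split = xxxx.groups()    # the first X+ takes as much as it can
```

`aa_split` is `('A', 'A')` and `xxxx_split` is `('XXX', 'X')`: the first quantifier grabs everything, then gives back just enough characters for the rest of the expression to match.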

Substitution

• Perl registers (numbered memories): use the number operator \# (\1, \2, ...) to refer back to what the earlier parenthesized ( ) groups matched
• /the (.*)er they were, the \1er they will be/
• /the (.*)er they (.*), the \1er they \2/
• Non-capturing group: (?: )
• /(?:some|a few) (people|cats) like some \1/
• Perl substitution operator: s/regexp1/new/
• s/([0-9]+)/<\1>/ puts angle brackets around a matched integer, e.g. 35 becomes <35>

• Search “aaaaaaaaaaaaXaaa” for substrings that match /(a*)X\1/
• Result: “aaaXaaa”
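The register (back-reference) and substitution examples can be reproduced with Python's `re` module, which uses the same `\1` notation. Note one difference that is an artifact of the translation: `re.sub` replaces every match unless `count=1` is given, whereas Perl's `s///` without the `g` flag replaces only the first one. The sample sentences are invented.

```python
import re

# \1 must repeat exactly what the first group matched.
same = re.search(r"the (.*)er they were, the \1er they will be",
                 "the bigger they were, the bigger they will be")
different = re.search(r"the (.*)er they were, the \1er they will be",
                      "the bigger they were, the faster they will be")

# (?: ) groups without capturing, so \1 refers to (people|cats).
noncap_ok = re.search(r"(?:some|a few) (people|cats) like some \1",
                      "a few cats like some cats")
noncap_bad = re.search(r"(?:some|a few) (people|cats) like some \1",
                       "some people like some cats")

# Substitution: count=1 mirrors Perl's s/([0-9]+)/<\1>/ without /g.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes", count=1)

# Back-reference search from the exercise.
hit = re.search(r"(a*)X\1", "aaaaaaaaaaaaXaaa")
hit_text, hit_span = hit.group(0), hit.span()
```

`hit_text` is `'aaaXaaa'` with span `(9, 16)`: matches starting further left fail because fewer than that many a's follow the X.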

Application of RE

• Simple natural-language understanding programs: ELIZA (Weizenbaum, 1966)
• Simple pattern-based methods
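A pattern-based program of this kind can be approximated with an ordered list of regular-expression substitutions. The sketch below is not Weizenbaum's program: the three rules, the fallback reply, and the function name `respond` are illustrative assumptions in the general style of ELIZA-like rules.

```python
import re

# ASSUMPTION: illustrative rules only; the first matching rule wins.
RULES = [
    (r".*\bI'?m (depressed|sad)\b.*", r"I am sorry to hear you are \1"),
    (r".*\ball\b.*", "In what way?"),
    (r".*\balways\b.*", "Can you think of a specific example?"),
]


def respond(utterance: str) -> str:
    """Return the reply of the first rule whose pattern matches."""
    for pattern, template in RULES:
        match = re.match(pattern, utterance, flags=re.IGNORECASE)
        if match:
            # expand() fills \1 with the text captured by the first group
            return match.expand(template)
    return "Please go on."          # fallback when no rule applies
```

For example, `respond("I'm sad today")` returns "I am sorry to hear you are sad", and `respond("Men are all alike")` returns "In what way?".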
