2.chapter3 - Regular Expressions and Automata
AC3110E
Chapter 3: Regular Expressions and Automata
Lecturer: PhD. DO Thi Ngoc Diep
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Approaches of NLP
3.1 Finite-State Automata (FSA)
• Automaton
• A directed graph
• Vertices/nodes = States (a start state and one or more final/accepting states)
• Arcs (links) = Transitions
• Can equivalently be represented as a state-transition table
• Used for recognizing (accepting or rejecting) strings/words
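As a minimal sketch of how a state-transition table drives recognition, the following Python snippet encodes a small deterministic automaton and walks it over an input word. The automaton itself is an assumption chosen for illustration (the sheep language b a a+ ! from Jurafsky and Martin), and the names `TRANSITIONS` and `recognize` are hypothetical.

```python
# Deterministic FSA for the sheep language b a a+ ! (an assumed example).
# States are integers; the table maps (state, symbol) -> next state.
START_STATE = 0
FINAL_STATES = {4}
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,  # self-loop: any number of additional a's
    (3, "!"): 4,
}


def recognize(word):
    """Return True if the automaton accepts `word`, False otherwise."""
    state = START_STATE
    for symbol in word:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no arc for this symbol: reject
        state = TRANSITIONS[key]
    # Accept only if the whole input was consumed in a final state.
    return state in FINAL_STATES


for w in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
    print(w, recognize(w))
```

A missing table entry plays the role of a "fail" cell: the word is rejected as soon as no transition applies.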
• ε-transitions
Finite State Transducers
• Sequential transducers:
• a subtype of transducers that are deterministic on their input
• the transitions out of each state are determined by the state and the input symbol
• can have epsilon symbols in the output string, but not in the input
• Subsequential transducers:
• generate an additional output string at the final states
• a p-subsequential transducer can generate up to p additional output strings at the final states
• useful for handling a finite amount of ambiguity
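The properties above can be sketched in a few lines of Python. The toy transducer below is an assumption made for illustration, not an example from the lecture: it rewrites every "ab" as "x" over the input alphabet {a, b}. It is deterministic on its input, uses an empty output string as epsilon while it holds back an "a", and flushes that pending "a" through a final output string. The names `TRANSITIONS`, `FINAL_OUTPUTS`, and `transduce` are hypothetical.

```python
# (state, input symbol) -> (next state, output string).
# An empty output string plays the role of an epsilon output.
TRANSITIONS = {
    (0, "a"): (1, ""),   # hold back the 'a' until the next symbol is seen
    (0, "b"): (0, "b"),
    (1, "a"): (1, "a"),  # emit the held 'a', then hold the new one
    (1, "b"): (0, "x"),  # the held 'a' plus this 'b' is rewritten as 'x'
}

# Final states and their additional output strings.
# A list with up to p entries per state gives a p-subsequential transducer;
# here every state has exactly one entry (p = 1).
FINAL_OUTPUTS = {0: [""], 1: ["a"]}


def transduce(word, start=0):
    """Return the list of output strings for `word` ([] if it is rejected)."""
    state, emitted = start, []
    for symbol in word:
        if (state, symbol) not in TRANSITIONS:
            return []
        state, output = TRANSITIONS[(state, symbol)]
        emitted.append(output)
    prefix = "".join(emitted)
    # Append each final output string attached to the state we ended in.
    return [prefix + tail for tail in FINAL_OUTPUTS.get(state, [])]


print(transduce("aab"))
print(transduce("ba"))
```

Storing several strings for one final state would make the same function return several outputs for one input, which is how a finite amount of ambiguity can be represented.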
3.2 Regular Languages
• Formal Languages
• Use algebra and set theory to define languages as sets of sequences of symbols
• Each word is composed of symbols, letters, or tokens from a finite set called an alphabet
• The set of all words over an alphabet Σ is denoted by Σ*
• The empty word is denoted by ε
• A formal language L over an alphabet Σ is a subset of Σ*
• Example: L = {a, b, ab, cba}
• Regular language
• a particular kind of formal language
• its words are exactly the strings accepted by a finite-state automaton
• used to model parts of a natural language (parts of the phonology, morphology, or syntax)
• Example (sheep language): L(m) = {baa!, baaa!, baaaa!, baaaaa!, baaaaaa!, ...}
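To make the relation between Σ*, a language L ⊆ Σ*, and the sheep language concrete, here is a small Python sketch. It enumerates a finite slice of Σ* (all words up to a chosen length) and keeps the words that belong to the sheep language. The alphabet {a, b, !}, the length bound 6, and the helper name `words_up_to` are assumptions made for illustration; membership is tested with the pattern `baa+!`.

```python
import re
from itertools import product


def words_up_to(alphabet, max_len):
    """Yield every word over `alphabet` of length 0..max_len.

    Sigma* itself is infinite, so only a finite slice is enumerated.
    The first word yielded is the empty word.
    """
    for n in range(max_len + 1):
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)


ALPHABET = ["a", "b", "!"]
SHEEP = re.compile(r"baa+!")  # b, then at least two a's, then !

sigma_star_slice = list(words_up_to(ALPHABET, 6))
sheep_words = [w for w in sigma_star_slice if SHEEP.fullmatch(w)]

print(len(sigma_star_slice))  # 1093 words of length 0..6
print(sheep_words)            # ['baa!', 'baaa!', 'baaaa!']
```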
Examples
Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.
3.3 Regular expressions
Basic RE Patterns in Perl syntax
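• Exact match: / / (e.g. /woodchuck/ matches the string woodchuck; matching is case sensitive)
• Choice of characters: [ ] (e.g. /[wW]oodchuck/ matches Woodchuck or woodchuck)
• Range: [ - ] (e.g. /[0-9]/ matches any single digit)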
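The three basic patterns can be tried out with Python's `re` module, whose syntax for these constructs follows the Perl conventions. This is only an illustrative sketch; the sample sentence and variable names are assumptions.

```python
import re

text = "Woodchucks ran past 3 woodchucks."

# Exact match: case sensitive, so the capitalized word is not found.
exact = re.findall(r"woodchuck", text)

# Choice of characters: either w or W in the first position.
choice = re.findall(r"[wW]oodchuck", text)

# Range: any single digit, any single uppercase letter.
digits = re.findall(r"[0-9]", text)
capitals = re.findall(r"[A-Z]", text)

print(exact)     # ['woodchuck']
print(choice)    # ['Woodchuck', 'woodchuck']
print(digits)    # ['3']
print(capitals)  # ['W']
```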
Disjunction, Grouping, and Precedence
• Disjunction of strings: / | /
• /cat|dog/ matches either cat or dog.
• Grouping: ( )
• makes the enclosed pattern act like a single character for the neighboring operators
• /gupp(y|ies)/ matches both guppy and guppies
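• Operator precedence (from highest to lowest)
• Parentheses: ( )
• Counters: * + ? { }
• Sequences and anchors: the ^my end$
• Disjunction: |
• Examples:
• Price: /(^|\W)\$[0-9]+(\.[0-9][0-9])?\b/
• Clock speed: /\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
• Disk space: /\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/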
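The effect of disjunction, grouping, and precedence can be checked with a short Python sketch. The test sentences and the helper name `first_match` are assumptions made for illustration. The dollar sign in the price pattern is escaped here so that it matches a literal "$" rather than acting as an end-of-line anchor, and the pattern starts directly at the "$" because there is no word boundary between a space and a dollar sign.

```python
import re


def first_match(pattern, text):
    """Return the text of the first match of `pattern` in `text`, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Disjunction has the lowest precedence: guppy|ies means "guppy" OR "ies".
print(re.fullmatch(r"guppy|ies", "guppies"))           # None
print(re.fullmatch(r"gupp(y|ies)", "guppies") is not None)  # True

# Patterns for prices, clock speeds, and disk sizes.
PRICE = r"\$[0-9]+(\.[0-9][0-9])?\b"
CLOCK = r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b"
DISK = r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b"

print(first_match(PRICE, "It costs $199.99 today"))    # $199.99
print(first_match(CLOCK, "a 3 GHz processor"))         # 3 GHz
print(first_match(DISK, "with 500 Gigabytes of disk")) # 500 Gigabytes
```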
Other Operators
Greedy Matching
• Greedy Matching
• Strings are matched from left to right one character at a time.
• The +, *, and ? quantifiers match as many characters as they can without preventing the rest of the regular expression from matching.
• Examples:
• How does /A+A+/ match the string “AA”?
• How does /X+X+/ match the string “XXXX”?
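One way to answer the two questions is to wrap each quantified part in a capturing group and inspect what each group consumed. The sketch below uses Python's `re` module, which applies the same greedy, backtracking strategy for these patterns; the helper name `split_groups` is hypothetical.

```python
import re


def split_groups(pattern, text):
    """Return the text captured by each group, or None if there is no full match."""
    m = re.fullmatch(pattern, text)
    return m.groups() if m else None


# The first A+ initially grabs "AA", leaving nothing for the second A+,
# so it gives back one character.
print(split_groups(r"(A+)(A+)", "AA"))    # ('A', 'A')

# The first X+ grabs all four X's, then gives back exactly one so that
# the second X+ can match.
print(split_groups(r"(X+)(X+)", "XXXX"))  # ('XXX', 'X')
```

In both cases the first quantifier keeps as much as it can while still letting the rest of the expression succeed.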
Substitution
Application of RE