2.chapter3 - Regular Expressions and Automata

Uploaded by

Minh Mai Ngọc


Natural Language Processing

AC3110E

Chapter 3: Regular Expressions and Automata
Lecturer: PhD. DO Thi Ngoc Diep
SCHOOL OF ELECTRICAL AND ELECTRONIC ENGINEERING
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY
Approaches of NLP

• Symbolic NLP (1950s – early 1990s)
• collections of complex, hand-written rules
• the computer emulates NLP tasks by applying those rules to the data it confronts
• hand-coded rules for manipulating symbols, coupled with dictionary lookup
• Statistical NLP (1990s–2010s)
• introduction of machine learning algorithms for language processing
• driven by the increase in computational power and the enormous amount of data available
• Neural NLP (present)
• since the 2010s, deep neural network-style machine learning methods (featuring many hidden layers)
• achieve state-of-the-art results in many natural language tasks

3.1 Finite-State Automata (FSA)

• Example: a list of words: baa!, baaa!, baaaa!, baaaaa!, ...
• How to model these words?
• How to model a sequence of characters?
• How to model a sequence of states?

• Automaton
• A directed graph
• Vertices/nodes = states, including a start state and final (accepting) states
• Arcs (links) = transitions
• Can also be represented as a state-transition table
• Used for recognizing (accepting/rejecting) strings/words

3.1 Finite-State Automata (FSA)

• A finite automaton is defined by five parameters:
• Q = {q0, q1, q2, ..., qN−1}: a finite set of N states
• Σ: a finite input alphabet of symbols
• q0: the start state
• F: the set of final states, F ⊆ Q
• δ(q,i): the transition function or transition matrix between states
• Example (the words baa!, baaa!, baaaa!, ...):
• Q = {q0,q1,q2,q3,q4}
• Σ = {a,b,!}
• F = {q4}
• δ(q,i): given by the state-transition table
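
The five parameters can be sketched as a small table-driven recognizer in Python. The transition table below is an assumption: the slide's table is not reproduced in the text, so it is reconstructed from the example language baa!, baaa!, ... and the given Q, Σ and F. The names `TRANSITIONS` and `d_recognize` are illustrative only.

```python
# Deterministic FSA for baa!, baaa!, baaaa!, ...
# ASSUMPTION: this table is reconstructed from the example language,
# not copied from the slide.
TRANSITIONS = {
    ("q0", "b"): "q1",
    ("q1", "a"): "q2",
    ("q2", "a"): "q3",
    ("q3", "a"): "q3",   # loop: any number of extra a's
    ("q3", "!"): "q4",
}
START = "q0"
FINAL = {"q4"}


def d_recognize(tape, transitions=TRANSITIONS, start=START, final=FINAL):
    """Return True if the DFSA accepts the whole input string."""
    state = start
    for symbol in tape:
        # A missing table entry means there is no legal transition: reject.
        state = transitions.get((state, symbol))
        if state is None:
            return False
    # Accept only if the input ends in a final state.
    return state in final


if __name__ == "__main__":
    for word in ["baa!", "baaaa!", "ba!", "baa", "abaa!"]:
        print(word, d_recognize(word))
```

Because the machine is deterministic, one table lookup per input symbol is enough; no choice points arise.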

3.1 Finite-State Automata (FSA)

• Deterministic FSA (DFSA)
• behavior is fully determined by the current state and the input symbol
• Non-deterministic FSA (NFSA)
• may have more than one possible next state for a given input (decision points)
• may have ε-transitions: arcs that are taken without consuming any input
• for any NFSA, there is an exactly equivalent DFSA

3.1 Finite-State Automata (FSA)

Given an FSA, check whether a string is accepted or not. Solutions:

+ Look-ahead
+ Parallelism
+ Backtracking

• Look-ahead: look ahead in the input to decide which path to take.
• Parallelism: whenever we come to a choice point, follow every alternative path in parallel.
• Backtracking (backup): at a choice point, put a marker recording where we were in the input and what state the automaton was in. If it turns out that we took the wrong choice, back up and try another path.

FOR SMALL PROBLEMS!
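
The parallelism strategy can be sketched by tracking the set of all states the automaton could be in after each symbol. The non-deterministic table below (with a choice point at q2) is an assumed example for the baa! language, and `parallel_recognize` is a hypothetical name; ε-transitions are left out to keep the sketch short.

```python
# Non-deterministic FSA for baa!, baaa!, ... (ASSUMED example table).
# At q2, reading "a" may either stay in q2 or move on to q3.
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},   # choice point
    ("q3", "!"): {"q4"},
}


def parallel_recognize(tape, transitions=NFSA, start="q0",
                       final=frozenset({"q4"})):
    """Follow every alternative path at once by keeping a set of states."""
    current = {start}
    for symbol in tape:
        current = {
            nxt
            for state in current
            for nxt in transitions.get((state, symbol), set())
        }
        if not current:          # every path died: reject early
            return False
    # Accept if at least one surviving path ends in a final state.
    return bool(current & final)


if __name__ == "__main__":
    for word in ["baa!", "baaa!", "ba!", "baa"]:
        print(word, parallel_recognize(word))
```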

3.1 Finite-State Automata (FSA)

• Recognizing strings ⇔ search:
• explore all the possible paths
• on each iteration, select a partial path to explore and keep track of any remaining unexplored partial paths
• State-space search algorithms
• create a space of possible solutions
• explore this space to return an answer: "accept" or "reject"
• unexplored partial paths are kept on an agenda, implemented as a stack or a queue
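
A minimal sketch of recognition as state-space search, assuming a small non-deterministic automaton for the baa! language without ε-transitions. Each agenda item is a search state (machine state, input position). Taking items from the end of the agenda makes it a stack (depth-first search); taking them from the front makes it a queue (breadth-first search). The table and the name `nd_recognize` are illustrative assumptions.

```python
from collections import deque

# ASSUMED example NFSA for baa!, baaa!, ... with a choice point at q2.
NFSA = {
    ("q0", "b"): {"q1"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q2", "q3"},
    ("q3", "!"): {"q4"},
}


def nd_recognize(tape, transitions=NFSA, start="q0",
                 final=frozenset({"q4"}), depth_first=True):
    """Search the space of (state, position) pairs for an accepting path."""
    agenda = deque([(start, 0)])            # unexplored partial paths
    while agenda:
        # Stack (LIFO) gives depth-first search, queue (FIFO) breadth-first.
        state, pos = agenda.pop() if depth_first else agenda.popleft()
        if pos == len(tape):
            if state in final:
                return True                  # accept
            continue                         # dead end, try another path
        for nxt in transitions.get((state, tape[pos]), ()):
            agenda.append((nxt, pos + 1))
    return False                             # agenda exhausted: reject


if __name__ == "__main__":
    print(nd_recognize("baaa!"), nd_recognize("baaa!", depth_first=False))
```

Both strategies return the same answer here; they differ only in the order in which partial paths are explored.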

Finite State Transducers

• A finite-state transducer (FST)

• like an automaton but it performs actions as it consumes an input string
• converts an input string into an output string
• The state transitions are labeled with
• The symbols that cause the state transition
• An action to perform as the transition is made
• An FST has both an input string and an output string.
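
As a rough illustration, a transducer whose arcs carry both an input symbol and an output string can be sketched as a dictionary. The toy machine below (it swaps a and b) and the names `ARCS` and `transduce` are hypothetical, and the sketch covers only the deterministic case.

```python
# Toy transducer (HYPOTHETICAL example): swaps the letters a and b.
# Each arc maps (state, input symbol) -> (next state, output string).
ARCS = {
    ("q0", "a"): ("q0", "b"),
    ("q0", "b"): ("q0", "a"),
}


def transduce(text, arcs=ARCS, start="q0", final=frozenset({"q0"})):
    """Consume the input string and build the output string arc by arc.

    Returns None if the input is not accepted by the machine.
    """
    state, output = start, []
    for symbol in text:
        if (state, symbol) not in arcs:
            return None                    # no transition: reject
        state, out = arcs[(state, symbol)]
        output.append(out)                 # the action performed on the arc
    return "".join(output) if state in final else None


if __name__ == "__main__":
    print(transduce("abba"))   # swaps every letter
    print(transduce("abc"))    # "c" has no arc, so the input is rejected
```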

Finite State Transducers

• An FST is defined by 7 parameters:
• Q = {q0, q1, q2, ..., qN−1}: a finite set of N states
• Σ: a finite set corresponding to the input alphabet
• ∆: a finite set corresponding to the output alphabet
• q0: the start state
• F: the set of final states, F ⊆ Q
• δ(q,w): the transition function or transition matrix between states
• σ(q,w): the output function
• Given a state q ∈ Q and a string w ∈ Σ∗, σ(q,w) gives a set of output strings, each string o ∈ ∆∗
• Properties
• Inversion: the inversion of a transducer T is written T−1
• If T maps from I to O, T−1 maps from O to I.
• Composition:
• T1 ◦ T2(S) = T2(T1(S))
• If T1 maps from I1 to O1 and T2 maps from O1 to O2, then T1 ◦ T2 maps from I1 to O2.
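
Inversion and composition can be illustrated on the relations that transducers compute, rather than on the machines themselves. In this sketch a relation is a finite set of (input, output) string pairs; the example pairs and the function names `invert` and `compose` are made up for illustration.

```python
def invert(relation):
    """T^-1: swap the input and output side of every pair."""
    return {(o, i) for (i, o) in relation}


def compose(t1, t2):
    """T1 o T2: feed every output of T1 into T2, i.e. T2(T1(S))."""
    return {(i, o2) for (i, o1) in t1 for (mid, o2) in t2 if o1 == mid}


# HYPOTHETICAL example relations (not from the slides).
T1 = {("cat", "cat+N+SG"), ("cats", "cat+N+PL")}
T2 = {("cat+N+SG", "NOUN"), ("cat+N+PL", "NOUN")}

if __name__ == "__main__":
    print(sorted(invert(T1)))       # maps analyses back to surface words
    print(sorted(compose(T1, T2)))  # maps surface words directly to NOUN
```

Composing at the level of finite relations is only a stand-in for composing the machines themselves, which requires building a new transducer over pairs of states.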

Finite State Transducers

• Sequential transducers:
• a subtype of transducers that are deterministic on their input
• the transitions out of each state are determined by the state and the input symbol
• can have epsilon symbols in the output string, but not in the input

• Subsequential transducers:
• generate an additional output string at the final states
• generates p additional output strings at the final states → p-subsequential transducer
• useful for handling a finite amount of ambiguity

3.2 Regular Languages

• Formal Languages
• Use algebra and set theory to define languages as sets of sequences of symbols
• Each word is composed of symbols, letters, or tokens from a finite set called an alphabet
• The set of all words over an alphabet Σ is denoted by Σ*
• The empty word is denoted by ε
• A formal language L over an alphabet Σ is a subset of Σ*
• Example: L = {a, b, ab, cba}

• Regular language
• a particular kind of formal language
• words are described by finite-state automata
• used to model part of a natural language (parts of the phonology,
morphology, or syntax)

L(m) = {baa!,baaa!,baaaa!,baaaaa!,baaaaaa!,...}

3.2 Regular Languages

• The collection of regular languages over Σ:
• 1. The empty language ∅ is a regular language
• 2. ∀a ∈ Σ, the singleton language {a} is a regular language
• 3. If L1 and L2 are regular languages, then so are:
• (a) L1 · L2 = {xy | x ∈ L1, y ∈ L2}, the concatenation of L1 and L2
• (b) L1 ∪ L2, the union of L1 and L2
• (c) L1∗, the Kleene closure of L1
• Properties
• if L1 and L2 are regular languages, then so is L1 ∩ L2
• if L1 and L2 are regular languages, then so is L1 − L2
• If L1 is a regular language, then so is Σ∗ − L1
• If L1 is a regular language, then so is L1^R
• Σ∗: set of all possible strings formed from the alphabet Σ
• L1^R: set of reversals of all the strings in L1
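
These operations can be tried out on small finite languages represented as Python sets of strings. This is only a sketch: the Kleene closure is infinite, so the hypothetical `kleene` helper below stops after a chosen number of repetitions, and the sample languages are made up.

```python
def concat(l1, l2):
    """L1 . L2 = {xy | x in L1, y in L2}."""
    return {x + y for x in l1 for y in l2}


def kleene(lang, max_repeats):
    """Finite approximation of L*: at most max_repeats concatenations."""
    result = {""}            # zero repetitions gives the empty word
    layer = {""}
    for _ in range(max_repeats):
        layer = concat(layer, lang)
        result |= layer
    return result


def reverse(lang):
    """L^R: the reversal of every string in L."""
    return {w[::-1] for w in lang}


# HYPOTHETICAL sample languages.
L1 = {"a", "ab"}
L2 = {"b"}

if __name__ == "__main__":
    print(sorted(concat(L1, L2)))     # concatenation
    print(sorted(L1 | L2))            # union
    print(sorted(L1 & L2))            # intersection
    print(sorted(L1 - L2))            # difference
    print(sorted(kleene({"ab"}, 2)))  # truncated Kleene closure
    print(sorted(reverse(L1)))        # reversal
```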

Examples

• Draw an FSA for the words for English numbers 1–99

• Draw an FSA for the simple dollars and cents.

Jurafsky, Daniel, and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, 2nd edition.
Example

• Label the automaton so that it accepts strings such as

Examples

• Which of the following strings are accepted by the automaton?

Accept Reject Accept Reject Accept Accept Accept

• Describe the language of this automaton (in text)?

Strings start with any number (including zero) of repetitions of the pair of characters "xy".
Next comes an "xz" or an "a". If "xz", the string is done. If "a", the string
continues with any number (including zero) of "b" and ends with a single "c".

3.3 Regular expressions

• Automaton: takes up space and is not easily used as program input.

• A regular language can be described using a regular expression
• Any regular expression can be implemented as an FSA (except REs that use the memory feature)
• Any FSA can be described with a regular expression

→ Regular expression for the previous example: /(xy)*((xz)|(ab*c))/
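
The language described in the previous example (any number of "xy" pairs, then either "xz" or an "a" followed by any number of "b" and a single "c") can be checked with Python's `re` module. The pattern `(xy)*(xz|ab*c)` is written here from that textual description, and the test strings are made up.

```python
import re

# Pattern written from the textual description of the automaton.
PATTERN = re.compile(r"(xy)*(xz|ab*c)")


def accepts(string):
    """True if the whole string belongs to the described language."""
    return PATTERN.fullmatch(string) is not None


if __name__ == "__main__":
    for s in ["xz", "xyxz", "ac", "xyxyabbbc", "xy", "xzxz", "abb"]:
        print(s, accepts(s))
```

`fullmatch` is used because an automaton accepts or rejects the whole string, whereas `search` would also succeed on a matching substring.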

3.3 Regular expressions

• A regular expression is an algebraic notation for describing text patterns.
• Used to specify simple classes of strings.
• Important throughout computer science and linguistics for specifying text search strings
• Regular expression search
• A pattern: what we want to search for
• A corpus of texts to search through
• The search function searches through the corpus and returns the texts that contain the pattern
• the first match
• all matches
• Regular expressions are a powerful tool for pattern matching
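
A minimal sketch of regular expression search in Python, using the standard `re` module instead of Perl. The three-sentence corpus and the pattern are made-up examples; `re.search` returns the first match and `re.findall` returns all matches.

```python
import re

# HYPOTHETICAL toy corpus and pattern.
corpus = [
    "The woodchuck chucked wood.",
    "No rodents here.",
    "Two woodchucks met a Woodchuck.",
]
pattern = re.compile(r"[wW]oodchucks?")


def first_match(text):
    """Return the first substring matching the pattern, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


def all_matches(text):
    """Return every non-overlapping match, left to right."""
    return pattern.findall(text)


# Texts of the corpus that contain the pattern at least once.
matching_texts = [text for text in corpus if pattern.search(text)]

if __name__ == "__main__":
    print(first_match(corpus[0]))
    print(all_matches(corpus[2]))
    print(matching_texts)
```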

Basic RE Patterns in Perl syntax

• Exact match: //

• Choice of characters: []

Daniel Jurafsky and James H. Martin. Speech and Language Processing.

Basic RE Patterns in Perl syntax

• Range: [ - ]

• Exclusions (cannot be): [^ ]

• Match any single character (except a carriage return): . wildcard

• \. means a literal period, not the wildcard

Basic RE Patterns in Perl syntax

• “zero or one” instance of the previous character: ?
• “zero or more” occurrences of the previous character: * (Kleene star)
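
The basic patterns can be tried out in Python, whose `re` syntax agrees with Perl for these operators. Note that in Python the wildcard `.` matches any character except a newline. The helper `first` and all test strings are illustrative assumptions.

```python
import re


def first(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


if __name__ == "__main__":
    print(first(r"woodchucks", "links to woodchucks and lemurs"))  # exact match
    print(first(r"[wW]oodchuck", "Woodchuck"))     # choice of characters
    print(first(r"[0-9]", "Chapter 3: automata"))  # range
    print(first(r"[^A-Z]", "Oyfn"))                # exclusion
    print(first(r"beg.n", "I begin"))              # wildcard
    print(first(r"1\.5", "105"))                   # escaped period: no match
    print(first(r"colou?r", "colour"))             # zero or one "u"
```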

Disjunction, Grouping, and Precedence

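• Disjunction of strings: / | /
• /cat|dog/ matches either cat or dog.
• Grouping: ( )
• makes the enclosed pattern act like a single character
• /gupp(y|ies)/ matches both guppy and guppies
• Operator precedence (from highest to lowest): parentheses ( ); counters * + ? {}; sequences and anchors; disjunction |
• /the*/ matches theeeee but not thethe; /the|any/ matches the or any, but not theny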

/\$[0-9]+(\.[0-9][0-9])?\b/
/\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b/
/\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b/
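
Precedence, grouping, and the price and size patterns can be checked with a short Python sketch. In the price pattern the dollar sign is escaped, because an unescaped `$` is the end-of-line anchor, and no `\b` is placed before it, because `$` is not a word character. The test sentences are made up.

```python
import re


def find(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = re.search(pattern, text)
    return match.group(0) if match else None


# Counters bind tighter than sequences: the star applies to "e" only.
star_on_e = re.fullmatch(r"the*", "theee") is not None
star_on_group = re.fullmatch(r"(the)*", "thethe") is not None
no_group = re.fullmatch(r"the*", "thethe") is None

# Sequences bind tighter than disjunction: guppy|ies means "guppy" or "ies".
grouped = re.fullmatch(r"gupp(y|ies)", "guppies") is not None
ungrouped = re.fullmatch(r"guppy|ies", "guppies") is None

PRICE = r"\$[0-9]+(\.[0-9][0-9])?\b"
SPEED = r"\b[0-9]+ *(MHz|[Mm]egahertz|GHz|[Gg]igahertz)\b"
DISK = r"\b[0-9]+(\.[0-9]+)? *(Gb|[Gg]igabytes?)\b"

if __name__ == "__main__":
    print(star_on_e, star_on_group, no_group, grouped, ungrouped)
    print(find(PRICE, "It costs $199.99 today"))
    print(find(SPEED, "a 500 MHz CPU"))
    print(find(DISK, "with 5.5 Gb of disk"))
```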

Other Operators

Greedy Matching

• Greedy Matching
• Strings are matched from left to right, one character at a time.
• The +, *, and ? quantifiers match as many characters as they can without preventing the rest of the regular expression from matching.
• Examples:
• How does /A+A+/ match the string “AA”?
• How does /X+X+/ match the string “XXXX”?
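
The two questions can be explored in Python by putting each quantified part in its own group and inspecting what it captured. The parentheses are added only to expose the split; the helper name `split_groups` is hypothetical.

```python
import re


def split_groups(pattern, text):
    """Return what each parenthesized part matched, or None if no match."""
    match = re.fullmatch(pattern, text)
    return match.groups() if match else None


if __name__ == "__main__":
    # The first A+ would like both characters, but must leave one
    # for the second A+, otherwise the whole expression could not match.
    print(split_groups(r"(A+)(A+)", "AA"))
    # The first X+ takes as much as it can while still leaving one X.
    print(split_groups(r"(X+)(X+)", "XXXX"))
    # A single character cannot satisfy two "one or more" quantifiers.
    print(split_groups(r"(A+)(A+)", "A"))
```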

Substitution

• Perl registers (numbered memories): use the number operator \# (\1, \2, ...) to refer back to the previous parenthesized () groups
• /the (.*)er they were, the \1er they will be/
• /the (.*)er they (.*), the \1er they \2/
• Non-capturing group: (?: )
• /(?:some|a few) (people|cats) like some \1/
• Perl substitution operator: s/regexp1/new/
• s/([0-9]+)/<\1>/

• Search “aaaaaaaaaaaaXaaa” for substrings that match /(a*)X\1/
• Result: “aaaXaaa”
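
Back-references and substitution carry over to Python, where `re.sub(pattern, replacement, text)` plays the role of Perl's `s/regexp1/new/`. The test sentences below are made up; the patterns are the ones from this slide.

```python
import re

BIGGER = re.compile(r"the (.*)er they were, the \1er they will be")
LIKES = re.compile(r"(?:some|a few) (people|cats) like some \1")


def matches(pattern, text):
    """True if the pattern is found somewhere in the text."""
    return pattern.search(text) is not None


def bracket_numbers(text):
    """Python counterpart of s/([0-9]+)/<\\1>/ applied to every number."""
    return re.sub(r"([0-9]+)", r"<\1>", text)


if __name__ == "__main__":
    print(matches(BIGGER, "the bigger they were, the bigger they will be"))
    print(matches(BIGGER, "the bigger they were, the faster they will be"))
    # (?: ) does not create a register, so \1 refers to (people|cats).
    print(matches(LIKES, "some cats like some cats"))
    print(matches(LIKES, "some cats like some people"))
    print(bracket_numbers("the 35 boxes"))
    print(re.search(r"(a*)X\1", "aaaaaaaaaaaaXaaa").group(0))
```

In the last search, \1 must repeat exactly what (a*) captured, so only three of the leading a's can take part in the match.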

Application of RE

• Simple natural-language understanding programs: ELIZA (Weizenbaum, 1966).
• Simple pattern-based methods