Lect2 Regex
Lecture #2
Andrew McCallum
Andrew McCallum, UMass Amherst, including material from Chris Manning and Jason Eisner
Automatically discovering English versus Japanese word order by grammar induction.
Neural network learners go through the same periods of mistakes on irregular verbs as children do.
...and others.
Ed Hovy's thing?
A Language
Some sentences in the language
  The man took the book. (From [Chomsky, 1956], his first context-free parse tree.)
  The purple giraffe hopped through the clouds.
  This sentence is false.
Some sentences not in the language
  *The girl, the sidewalk, the chalk, drew.
  *Backwards is sentence this.
  *loDvaD tlhIngan Hol ghojmoH be.
Chomsky Hierarchy
Type 0 languages (Turing-equivalent)
  Rewrite rules a → b
Context-sensitive languages
  Rewrite rules aXb → acb
Context-free languages
  Rewrite rules X → a
where X, a, b as above
[Figure: the language classes of the hierarchy, labeled TAGs, PSGs, and FSAs.]
Terminals:
m, o
Rules:
S mX X oY Yo Y
Start symbol:
S
In the language: ba!, baa!, baaaaa! Not in the language: ba, b!, ab!, bbaaa!, alibaba!
Finite-state automaton
  s1 -b-> s2 -a-> s3 -!-> s4, with a loop on a at s3
  (double circle indicates accept state)
Regular expression
  baa*!
Recognizer
A recognizer for a language is a program that takes as input a string W and answers yes if W is a sentence in the language, and no otherwise.
We can think of this as a machine that emits only two possible responses to its input.
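A minimal sketch of such a recognizer, assuming the example language is the one listed earlier (ba!, baa!, baaaaa!, ...) and that the pattern baa*! describes it. The name `recognize` is hypothetical, chosen only for this illustration.

```python
import re

# Assumed pattern for the example language: b, one or more a's, then !
SHEEP = re.compile(r"baa*!")

def recognize(w: str) -> bool:
    """Answer yes (True) if w is a sentence in the language, no (False) otherwise."""
    # fullmatch requires the whole string to match, not just a prefix or substring.
    return SHEEP.fullmatch(w) is not None

for w in ["ba!", "baaaaa!", "ba", "alibaba!"]:
    print(w, recognize(w))
```

Using `fullmatch` rather than `search` matters here: `search` would answer yes for alibaba!, because it contains ba! as a substring.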
Finite-state Automata
machinery for accepting
Regular Expressions
a way to type the automata
[Figure: automaton with states q0 to q3 and arcs including a and !, shown with a state marker and an input tape.]
We will later return to a probabilistic version of this with Hidden Markov Models!
Transition table (rows are states, columns are input symbols, - means no transition)

  State   b   a   !
  0       1   -   -
  1       -   2   -
  2       -   2   3
  3       -   -   -
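The table can drive a recognizer directly. The sketch below assumes state 0 is the start state and state 3 is the only accept state (consistent with the strings listed as in the language); the names `TABLE`, `ACCEPT`, and `d_recognize` are hypothetical.

```python
# TABLE[state][symbol] -> next state; a missing entry means "no transition".
TABLE = {
    0: {"b": 1},
    1: {"a": 2},
    2: {"a": 2, "!": 3},
    3: {},
}
ACCEPT = {3}  # assumption: state 3 is the accept state

def d_recognize(tape: str) -> bool:
    """Walk the tape one symbol at a time, following the table."""
    state = 0
    for symbol in tape:
        next_state = TABLE[state].get(symbol)
        if next_state is None:
            return False  # no legal move: reject
        state = next_state
    # Accept only if the tape is exhausted in an accept state.
    return state in ACCEPT
```

Because each (state, symbol) pair has at most one next state, the machine never has to make a choice and runs in time linear in the length of the tape.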
Concatenation Disjunction
abc a ad b bbd
Kleene star
The empty string
Regular expressions / finite-state automata are closed under these operations.
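As an illustration of closure, the sketch below builds larger patterns from smaller ones with Python's `re` module. The helper names `concat`, `union`, `star`, and `matches` are hypothetical; the point is only that the result of each operation is again a regular expression.

```python
import re

def concat(r1: str, r2: str) -> str:
    # Concatenation: a string from r1 followed by a string from r2.
    return f"(?:{r1})(?:{r2})"

def union(r1: str, r2: str) -> str:
    # Disjunction: a string from r1 or a string from r2.
    return f"(?:{r1})|(?:{r2})"

def star(r: str) -> str:
    # Kleene star: zero or more repetitions, so the empty string always matches.
    return f"(?:{r})*"

def matches(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text) is not None

print(matches(concat("ab", "c"), "abc"))
print(matches(union("a", "bbd"), "bbd"))
print(matches(star("ab"), ""))
print(matches(star(union("a", "b")), "abba"))
```

The non-capturing groups `(?:...)` keep operator precedence intact when patterns are nested.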
Meta-characters
Special characters that allow you to combine REs in various ways.
Example: a denotes a; a* denotes the empty string, or a, or aa, or aaa, or ...
Character
  Concat: characters written in sequence
  Alternatives: |
  disjunc.: [ ]
  negation: [^ ]
  wildcard char: .
Loops & skips
  one or more: +
  zero or one: ?
Aliases (shorthand)
Examples
  \d+ dollars    matches: 3 dollars, 50 dollars, 982 dollars
  \w*oo\w*       matches: food, boo, oodles
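The two example patterns can be tried with `re.findall`. The sample sentence is made up for this illustration.

```python
import re

# Hypothetical sample text containing the slide's example matches.
text = "I paid 3 dollars, then 50 dollars, then 982 dollars for food and oodles of boo."

# \d+ : one or more digits
amounts = re.findall(r"\d+ dollars", text)

# \w* : zero or more word characters on either side of "oo"
oo_words = re.findall(r"\w*oo\w*", text)

print(amounts)
print(oo_words)
```

Note that in Python `\w` covers letters, digits, and the underscore, so `\w*oo\w*` would also match a token such as foo_1.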
Escape character
\ is the general escape character; e.g. \. is not a wildcard, but matches a period (.).
If you want to use \ in a string, it has to be escaped: \\
^ beginning of line
$ end of line
\b word boundary, i.e. location with \w on one side but not on the other.
???
the together
Example
(?<=[Hh]e) \w+ed (?=\w+ly)
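A short sketch of how this pattern behaves in Python. The lookbehind `(?<=[Hh]e)` and the lookahead `(?=\w+ly)` constrain the context but are not part of the match itself. The function name `find_verb` and the test sentences are hypothetical.

```python
import re

PATTERN = re.compile(r"(?<=[Hh]e) \w+ed (?=\w+ly)")

def find_verb(sentence: str):
    """Return the -ed word that follows 'He'/'he' and precedes an -ly word, if any."""
    m = PATTERN.search(sentence)
    # The match includes the surrounding spaces written in the pattern; strip them.
    return m.group().strip() if m else None

print(find_verb("He walked slowly home"))
print(find_verb("He walked home"))
```

The lookbehind only checks the two characters immediately to the left, so the he at the end of She also satisfies it.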
# Anchor to start of line
# maybe some spaces
# user: group of word characters
# the domain:
# some non-space characters
# finally, a space character
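The pattern pieces that these comments annotated are not preserved here. The sketch below is one possible reconstruction using Python's `re.VERBOSE` mode, which allows whitespace and comments inside a pattern. Every pattern fragment, the `@` separator, the name `ADDRESS`, and the sample address are assumptions made for illustration.

```python
import re

ADDRESS = re.compile(r"""
    ^        # Anchor to start of line
    \s*      # maybe some spaces
    (\w+)    # user: group of word characters
    @        # assumed separator between user and domain
    (\S+)    # the domain: some non-space characters
    \s       # finally, a space character
""", re.VERBOSE)

m = ADDRESS.match("  alice@example.org said hello")
if m:
    user, domain = m.groups()
    print(user, domain)
```

Because the pattern ends by requiring a space character, an address at the very end of a line with nothing after it would not match.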
Implemented with regular expression substitution!
  s/.* YOU ARE (depressed|sad) .*/I AM SORRY TO HEAR THAT YOU ARE \1/
  s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE?/
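The two substitutions can be sketched in Python as a list of (pattern, reply) rules. Case-insensitive matching and padding the input with spaces (so the patterns' literal spaces can match at the edges) are assumptions of this sketch, and `RULES` and `respond` are hypothetical names.

```python
import re

RULES = [
    (re.compile(r".* YOU ARE (depressed|sad) .*", re.IGNORECASE),
     r"I AM SORRY TO HEAR THAT YOU ARE \1"),
    (re.compile(r".* always .*", re.IGNORECASE),
     "CAN YOU THINK OF A SPECIFIC EXAMPLE?"),
]

def respond(utterance: str):
    """Return the reply of the first rule that matches, or None."""
    padded = f" {utterance} "  # assumption: pad so leading/trailing spaces can match
    for pattern, reply in RULES:
        m = pattern.fullmatch(padded)
        if m:
            # expand() fills \1 with the text captured by the first group.
            return m.expand(reply)
    return None

print(respond("I think you are sad today"))
print(respond("They always interrupt me"))
```

Rule order matters: the first matching rule wins, so more specific patterns should come before more general ones.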
Non-deterministic FSAs
[Figure: q0 -b-> q1 -a-> q2 -!-> q3, with a loop on a at q1]
More than one outgoing transition on a

  State   b   a     !
  0       1   -     -
  1       -   1,2   -
  2       -   -     3
  3       -   -     -

Transition relation, rather than transition function.
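A transition relation maps a (state, symbol) pair to a set of next states. One way to run such a machine, sketched below, is to track the set of all states it could currently be in. The sketch assumes start state 0 and accept state 3; `DELTA`, `ACCEPT`, and `nd_recognize_parallel` are hypothetical names.

```python
# Transition relation: (state, symbol) -> set of possible next states.
DELTA = {
    (0, "b"): {1},
    (1, "a"): {1, 2},  # the non-deterministic choice
    (2, "!"): {3},
}
ACCEPT = {3}  # assumption: state 3 is the accept state

def nd_recognize_parallel(tape: str) -> bool:
    current = {0}  # every state the machine might be in
    for symbol in tape:
        current = {nxt
                   for state in current
                   for nxt in DELTA.get((state, symbol), set())}
        if not current:
            return False  # every path has died
    # Accept if at least one surviving path ends in an accept state.
    return bool(current & ACCEPT)
```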
Ubiquitous problem in this course: How to efficiently search through various possible paths (parses) to find one that works / the most likely one, etc.
Solutions
Look-ahead
Peek ahead to help decide which path to take.
Parallelism
At each choice, take every path in parallel.
Backup
At each choice point, mark the input / state.
If we fail, go back and try another path.
Need a stack (or queue) of markers.
Marker = machine state.
Collection of current state & markers = search state.
Depth-first search (or breadth-first search).
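The backup strategy can be sketched with an explicit agenda of markers, where each marker is a (state, tape position) pair. The machine below is the non-deterministic automaton for the b, a..., ! language, assuming start state 0 and accept state 3; all names are hypothetical.

```python
# Transition relation: (state, symbol) -> set of possible next states.
DELTA = {
    (0, "b"): {1},
    (1, "a"): {1, 2},
    (2, "!"): {3},
}
ACCEPT = {3}  # assumption: state 3 is the accept state

def nd_recognize(tape: str) -> bool:
    agenda = [(0, 0)]  # stack of markers: (state, position on the tape)
    while agenda:
        state, pos = agenda.pop()  # popping from the end gives depth-first search
        if pos == len(tape):
            if state in ACCEPT:
                return True
            continue  # this path failed: back up to another marker
        for nxt in DELTA.get((state, tape[pos]), ()):
            agenda.append((nxt, pos + 1))  # mark every alternative at this choice point
    return False  # agenda exhausted: no path works
```

Replacing the list with a `collections.deque` and popping from the left would turn the same loop into breadth-first search. The loop terminates because every new marker advances the tape position by one.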
Morphology
The study of the sub-word units of meaning.
disconnect = dis- ("not") + connect ("to attach")
Making a word plural:
  If word is regular, add -s
  If word ends in y, change the y to -ies
  If word ends in x, add -es
  ...
Examples:
  dog → dogs
  baby → babies
  fox → foxes
Recognizing that foxes breaks down into the morphemes fox and -es is called Morphological Parsing.
Parsing = taking an input and producing some sort of structure for it.
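A toy sketch of the plural rules and of parsing in the opposite direction. The rules are deliberately simplified (for example, they would wrongly turn boy into boies), the stem list is made up, and `pluralize`, `parse_plural`, and the `+PL` tag are hypothetical names for this illustration.

```python
def pluralize(word: str) -> str:
    """Apply the three toy rules: y -> ies, x -> xes, otherwise add s."""
    if word.endswith("y"):
        return word[:-1] + "ies"
    if word.endswith("x"):
        return word + "es"
    return word + "s"

def parse_plural(surface: str, stems):
    """Generate-and-test parse: find a known stem whose plural is the surface form."""
    for stem in stems:
        if pluralize(stem) == surface:
            return (stem, "+PL")  # structure produced for the input
    return None

STEMS = ["dog", "baby", "fox"]  # hypothetical mini-lexicon
print(pluralize("fox"))
print(parse_plural("foxes", STEMS))
```

Parsing here reuses the generation rules and checks candidates against a lexicon, which only scales to tiny examples but shows the input-to-structure direction.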
Morphology, briefly
morpheme: minimal meaning-bearing unit
stem: main morpheme of a word, e.g. fox
affixes: add additional meanings, e.g. +es
  includes prefixes, suffixes, infixes, circumfixes, e.g. un-, -ly, ...
... concatenative morphology, non-concatenative
FSTs can be used to transform a word's surface form into morphemes (or vice versa!).
An entire lexicon can be encoded as an FST.
Reversal: If L1 is regular, so is the language consisting of the set of all reversals of strings in L1.
Intersection: If L1 and L2 are regular languages, so is the language consisting of all strings that are in both L1 and L2.
Difference: If L1 and L2 are regular languages, so is the language consisting of all strings in L1 that are not in L2.
Complementation: If L1 is a regular language, so is the set of all possible strings that are not in L1.
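Three of these closure properties can be sketched constructively on deterministic automata: complementation flips the accept states, intersection runs two machines in lockstep (the product construction), and difference combines the two. The sketch assumes complete DFAs over a shared two-letter alphabet; the example machines `EVEN_A` and `ENDS_B` and all function names are hypothetical.

```python
from itertools import product

ALPHABET = "ab"  # assumption: both machines are complete over this alphabet

# A DFA is (start, accepting_states, delta) with delta[(state, symbol)] -> state.
EVEN_A = (0, {0}, {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1})  # even number of a's
ENDS_B = (0, {1}, {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1})  # ends in b

def states_of(dfa):
    start, _, delta = dfa
    return {start} | {s for s, _ in delta} | set(delta.values())

def accepts(dfa, text):
    start, accepting, delta = dfa
    state = start
    for symbol in text:  # text must use only symbols from ALPHABET
        state = delta[(state, symbol)]
    return state in accepting

def complement(dfa):
    # Same machine, with accepting and non-accepting states swapped.
    start, accepting, delta = dfa
    return (start, states_of(dfa) - accepting, delta)

def intersection(d1, d2):
    # Product construction: a state is a pair, one component per machine.
    (s1, a1, t1), (s2, a2, t2) = d1, d2
    pairs = list(product(states_of(d1), states_of(d2)))
    delta = {((p, q), c): (t1[(p, c)], t2[(q, c)])
             for p, q in pairs for c in ALPHABET}
    accepting = {(p, q) for p, q in pairs if p in a1 and q in a2}
    return ((s1, s2), accepting, delta)

def difference(d1, d2):
    # L1 minus L2 is L1 intersected with the complement of L2.
    return intersection(d1, complement(d2))
```

For example, `accepts(intersection(EVEN_A, ENDS_B), "aab")` checks a string with two a's that ends in b. Complementation by flipping accept states only works because the machines are deterministic and complete.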
Friday, February 3, 2005 3:30 - 5:00 PM CMPS 150/151 (Computer Science Building) Refreshments will be served.
Feel free to come do it in office hours!
Due next Thursday, one week from today. (Don't wait until Wednesday to install Python!)
Recommended schedule:
  Idea by Saturday
  Coded/tested by Monday
  Write-up by Wednesday
Thank you!