CS312 NLP Lecture 2 Basic Text Processing
1
Regular expressions
➢ A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence
of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms
for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are
developed in theoretical computer science and formal language theory.
The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene
formalized the concept of a regular language. They came into common use with Unix text-processing utilities.
Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard
and another, widely used, being the Perl syntax. Regular expressions are used in search engines, in search and
replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK, and in
lexical analysis. Regular expressions are supported in many programming languages. (Wikipedia)
○ regular expression
○ regular expressions
○ Regular expression
○ Regular expressions
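The four strings above differ only in the capitalisation of the first letter and in the optional plural "s", so a single pattern can cover all of them. The following is a minimal sketch using Python's standard `re` module; the variable names and the sample sentence are illustrative assumptions, not part of the lecture.

```python
import re

# [Rr] accepts either case of the first letter; s? makes the plural "s" optional.
PATTERN = re.compile(r"[Rr]egular expressions?")

variants = [
    "regular expression",
    "regular expressions",
    "Regular expression",
    "Regular expressions",
]

# Every variant should be matched in full by the single pattern.
all_matched = all(PATTERN.fullmatch(v) is not None for v in variants)

# Illustrative sentence (an assumption, not from the slides).
sample = ("Regular expressions are supported in many languages; "
          "a regular expression is a pattern.")
matches = PATTERN.findall(sample)

print(all_matched)  # True
print(matches)      # ['Regular expressions', 'regular expression']
```

A case-insensitive flag (`re.IGNORECASE`) would also work, but it would additionally accept forms such as "REGULAR EXPRESSION", which may or may not be wanted.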
2
Regular expressions
➢ Fundamental operators
○ Concatenation (implicit)
○ Disjunctions
■ [] or |
■ Ranges ([A-Z], [a-z], [0-9])
○ Kleene operators
■ * (zero or more), + (one or more)
➢ Others
○ Negation (e.g. [^a], [^a-z])
○ Optionality and wildcard: ? (zero or one), . (any single character)
○ Anchors (^, $)
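Each operator listed above can be tried out directly. The sketch below uses Python's `re.findall`; the helper name `found` and all test strings are made up for illustration, and other regex flavours (POSIX, Perl) may differ in details.

```python
import re

def found(pattern, text):
    """Return every non-overlapping match of pattern in text."""
    return re.findall(pattern, text)

text = "The cat sat; the CAT ran. 42 cats, 7 dogs!"  # illustrative input

results = {
    # Disjunction inside brackets: T or t, then "he" (concatenation is implicit).
    "brackets": found(r"[Tt]he", text),
    # Disjunction with the pipe: whole alternatives.
    "pipe": found(r"cat|dog", text),
    # Range plus Kleene plus: one or more digits.
    "range_plus": found(r"[0-9]+", text),
    # Kleene star: "o" followed by zero or more "o", then "h".
    "star": found(r"oo*h", "oh ooh oooh h"),
    # Optional character: the "u" may be absent.
    "optional": found(r"colou?r", "color colour"),
    # Wildcard: any single character between "c" and "t".
    "wildcard": found(r"c.t", "cat cut c t ct"),
    # Negation: any character that is not a lowercase letter.
    "negation": found(r"[^a-z]", "aB3"),
    # Anchors: ^ ties the match to the start, $ to the end.
    "start": found(r"^The", text),
    "end": found(r"dogs!$", text),
}

for name, matches in results.items():
    print(name, matches)
```

Note that `^` means negation only as the first character inside brackets; outside brackets it is the start-of-string anchor.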
5
Substitutions & Simple chatbots
➢
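The body of this slide did not survive extraction, so the following is only a guess at the kind of example the title points to: a rule-based chatbot that answers by regex substitution, with a capture group copied into the reply. The rules, the replies, and the function name `respond` are all hypothetical and written in the spirit of ELIZA-style systems; they are not taken from the lecture.

```python
import re

# Hypothetical rules: (pattern, reply template). In a template, \1 is replaced
# by whatever the first parenthesised group captured. Rules are tried in order.
RULES = [
    (re.compile(r".*\bI(?: am|'m) (depressed|sad)\b.*", re.IGNORECASE),
     r"I am sorry to hear you are \1"),
    (re.compile(r".*\ball\b.*", re.IGNORECASE),
     "In what way?"),
    (re.compile(r".*\balways\b.*", re.IGNORECASE),
     "Can you think of a specific example?"),
]

DEFAULT_REPLY = "Please go on."  # assumed fallback when no rule applies

def respond(utterance):
    """Return the reply of the first rule whose pattern matches the whole utterance."""
    for pattern, template in RULES:
        if pattern.fullmatch(utterance):
            # Substitute the entire utterance with the filled-in template.
            return pattern.sub(template, utterance, count=1)
    return DEFAULT_REPLY

print(respond("I'm sad today."))       # I am sorry to hear you are sad
print(respond("They are all alike."))  # In what way?
print(respond("He is always late"))    # Can you think of a specific example?
print(respond("Hello"))                # Please go on.
```

Because each pattern is wrapped in `.*` on both sides, the substitution replaces the whole input rather than just the matched phrase, which is what lets the program answer with a complete sentence.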
6
Summary
➢ Regular expressions
➢ Word tokenization
➢ Word normalization
➢ Sentence segmentation
10