CS312 NLP Lecture 2 Basic Text Processing

Uploaded by

anna tran

Lecture 2: Basic text processing

CS312 - Natural language processing


Spring 2024

Regular expressions
➢ A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence
of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms
for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are
developed in theoretical computer science and formal language theory.
The concept of regular expressions began in the 1950s, when the American mathematician Stephen Cole Kleene
formalized the concept of a regular language. They came into common use with Unix text-processing utilities.
Different syntaxes for writing regular expressions have existed since the 1980s, one being the POSIX standard
and another, widely used, being the Perl syntax. Regular expressions are used in search engines, in search and
replace dialogs of word processors and text editors, in text processing utilities such as sed and AWK, and in
lexical analysis. Regular expressions are supported in many programming languages. (Wikipedia)
○ regular expression

○ regular expressions

○ Regular expression

○ Regular expressions
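
The four target strings above differ only in the case of the first letter and an optional plural "s". A minimal Python sketch of one pattern that covers all four (the pattern and the names `PATTERN`, `targets`, and `matched` are illustrative assumptions, not taken from the slides):

```python
import re

# Illustrative pattern: [Rr] accepts either capitalization of the first
# letter, and s? makes the trailing plural "s" optional.
PATTERN = re.compile(r"[Rr]egular expressions?")

targets = [
    "regular expression",
    "regular expressions",
    "Regular expression",
    "Regular expressions",
]

# fullmatch succeeds only if the entire string matches the pattern.
matched = [t for t in targets if PATTERN.fullmatch(t)]
```

Here `matched` ends up equal to `targets`, while a string such as "regular expresion" would be rejected.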
Regular expressions
➢ Fundamental operators
○ Concatenation (implicit)
○ Disjunctions
■ [] or |
■ Ranges ([A-Z], [a-z], [0-9])
○ Kleene star and plus operators
■ *, +

➢ Others
○ Negations: (e.g. [^a], [^a-z])
○ Optionality and wildcard: ?, .
○ Anchors (^, $)
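
The operators listed above can be tried out with Python's `re` module. This is a small sketch on a made-up sentence; the sample text and the dictionary keys are assumptions chosen for illustration only.

```python
import re

text = "The cat sat. Cats nap; a colour or a color? 42 baaa!"

results = {
    # Concatenation is implicit: "c", then "a", then "t".
    "concatenation": re.findall(r"cat", text),
    # Disjunction with a character class: "C" or "c".
    "char_class": re.findall(r"[Cc]at", text),
    # Disjunction with the pipe: either whole alternative.
    "pipe": re.findall(r"colour|color", text),
    # Range plus "+": one or more digits.
    "range_plus": re.findall(r"[0-9]+", text),
    # Kleene star: "ba" followed by zero or more further "a"s.
    "star": re.findall(r"baa*", text),
    # "?": the preceding "u" is optional.
    "optional": re.findall(r"colou?r", text),
    # ".": any single character between "c" and "t".
    "wildcard": re.findall(r"c.t", text),
    # Negation: single characters that are not letters, digits, or spaces.
    "negation": re.findall(r"[^A-Za-z0-9 ]", text),
    # Anchors: "^" is the start of the string, "$" is the end.
    "anchor_start": re.findall(r"^The", text),
    "anchor_end": re.findall(r"!$", text),
}
```

Note that `^` means negation only as the first character inside square brackets; outside brackets it is the start-of-string anchor.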
Substitutions & Simple chatbots
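
A minimal sketch of regex substitution and of an ELIZA-style responder built on it. The rule list, the fallback reply, and the function name `respond` are hypothetical and chosen for illustration; they are not the lecture's own rules.

```python
import re

# Substitution with a capture group: \1 refers back to the text
# matched by the first parenthesized group.
bracketed = re.sub(r"([0-9]+)", r"<\1>", "the 35 boxes")

# Hypothetical rules: (pattern, response template), tried in order.
RULES = [
    (re.compile(r".*\bI am (sad|depressed)\b.*", re.IGNORECASE),
     r"I am sorry to hear you are \1."),
    (re.compile(r".*\bI need (.+)", re.IGNORECASE),
     r"Why do you need \1?"),
    (re.compile(r".*\ball\b.*", re.IGNORECASE),
     "In what way?"),
]


def respond(utterance: str) -> str:
    """Return the response of the first rule whose pattern matches."""
    # Drop surrounding whitespace and trailing sentence punctuation.
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.fullmatch(text)
        if match:
            # expand() fills \1 in the template with the captured text.
            return match.expand(template)
    # Assumed fallback when no rule applies.
    return "Please go on."
```

For example, `respond("I need some help.")` returns "Why do you need some help?", and an utterance that matches no rule falls through to the fallback reply. Rule order matters, since the first matching rule wins.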

Summary
➢ Regular expressions

➢ Substitution & ELIZA

➢ Word tokenization

➢ Word normalization

➢ Sentence segmentation
