Chapter 2 - Lexical Analysis - Regular Expressions
CSCE 354
Dr. Razauddin
University of Hail, Kingdom of Saudi Arabia
2024-2025
• The front end of the compiler performs analysis; the back end
does synthesis. The analysis is usually broken up into lexical
analysis, syntax analysis, and semantic analysis.
Examples of tokens:
•Type tokens (id, number, real, . . . )
•Punctuation tokens (IF, void, return, . . . )
•Punctuation tokens built from alphabetic characters are called reserved words (keywords)
Lexical analysis is the first phase of the compiler; the lexical analyzer is also
known as a scanner.
A programming language classifies lexical tokens into a finite set of token types.
For example, some of the token types of a typical programming language are identifiers,
numbers, and reserved words such as if.
Examples of nontokens
comment /* try again */
preprocessor directive #include<stdio.h>
macro NUMS
REGULAR EXPRESSIONS
Symbol:
For each symbol a in the alphabet of the language, the regular expression a denotes the
language containing just the string a.
Alternation:
Given two regular expressions M and N, the alternation operator, written as a vertical bar |,
makes a new regular expression M | N. A string is in the language of M | N if it is in the language
of M or in the language of N. Thus, the language of a | b contains the two strings a and b.
Concatenation:
Given two regular expressions M and N, the concatenation operator · makes a
new regular expression M · N. A string is in the language of M · N if it is the
concatenation of any two strings α and β such that α is in the language of M and β is in
the language of N. Thus, the regular expression (a | b) · a defines the language
containing the two strings aa and ba.
Epsilon:
The regular expression ε represents a language whose only string is the
empty string. Thus, (a · b) | ε represents the language {"", "ab"}.
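The constructs above can be tried out with Python's re module. This is only a minimal sketch: the helper name language_of and the list of candidate strings are assumptions made for illustration. Python's notation has no explicit concatenation operator and no ε symbol, so juxtaposition and an empty alternative stand in for them.

```python
import re

def language_of(pattern, candidates):
    """Return the candidate strings that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s) is not None]

# A small, hypothetical pool of strings to test against each expression.
candidates = ["", "a", "b", "aa", "ba", "ab", "bb"]

symbol = language_of(r"a", candidates)               # symbol: a
alternation = language_of(r"a|b", candidates)        # alternation: a | b
concatenation = language_of(r"(a|b)a", candidates)   # concatenation: (a | b) · a
with_epsilon = language_of(r"(ab)|", candidates)     # (a · b) | ε, empty alternative as ε

print(symbol, alternation, concatenation, with_epsilon)
```

Only strings from the candidate pool are reported, so the lists show which of those strings belong to each language rather than enumerating a language in general.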
Examples
Operations
letter = [a-z] | [A-Z] or [a-zA-Z]
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = + | - or [+-]
Decimal = (sign)?(digit)+
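As a rough illustration, the definitions above can be written in Python's regex syntax. The constant names and the helper is_decimal are assumptions for this sketch, not part of the slides.

```python
import re

LETTER = r"[a-zA-Z]"             # letter = [a-z] | [A-Z]
DIGIT = r"[0-9]"                 # digit  = 0 | 1 | ... | 9
SIGN = r"[+-]"                   # sign   = + | -
DECIMAL = rf"{SIGN}?{DIGIT}+"    # Decimal = (sign)?(digit)+

def is_decimal(text):
    """True if the whole string is an optionally signed run of digits."""
    return re.fullmatch(DECIMAL, text) is not None

for sample in ["42", "-7", "+105", "+", "4.2", ""]:
    print(repr(sample), is_decimal(sample))
```

The ? makes the sign optional and the + requires at least one digit, so a bare sign or an empty string is rejected.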
The comments for this lexer begin with two dashes, contain only alphabetic characters, and end with
a newline.
Finally, a lexical specification should be complete, always matching some initial substring of the
input; we can always achieve this by having a rule that matches any single character (and in this case,
prints an "illegal character" error message and continues).
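The comment rule and the catch-all rule can be sketched as a small scanning loop. Everything named here (scan, the WORD rule standing in for the real token rules, the error message format) is an assumption for illustration, and the errors are collected in a list instead of being printed.

```python
import re

COMMENT = re.compile(r"--[a-zA-Z]*\n")   # two dashes, letters only, then newline
WHITESPACE = re.compile(r"[ \t\n]+")
WORD = re.compile(r"[a-z]+")             # hypothetical stand-in for real token rules

def scan(text):
    """Return (tokens, errors); comments and white space produce no token."""
    tokens, errors = [], []
    pos = 0
    while pos < len(text):
        for kind, rule in (("skip", COMMENT), ("skip", WHITESPACE), ("word", WORD)):
            m = rule.match(text, pos)
            if m:
                if kind == "word":
                    tokens.append(m.group())
                pos = m.end()
                break
        else:
            # Catch-all rule: consume any single character, report it, continue.
            errors.append(f"illegal character {text[pos]!r} at offset {pos}")
            pos += 1
    return tokens, errors

tokens, errors = scan("ab --note\n$cd")
print(tokens)
print(errors)
```

Because the final branch always consumes one character, the loop makes progress on every input, which is what makes the specification complete.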
Regular expressions for some tokens.
There are two important disambiguation rules used by Lex, JavaCC, SableCC, and other
similar lexical-analyzer generators:
Longest match: The longest initial substring of the input that can match any regular
expression is taken as the next token.
Rule priority: For a particular longest initial substring, the first regular expression that
can match determines its token-type. This means that the order of writing down the
regular-expression rules has significance.
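For example, does if8 match as a single identifier or as the two tokens if and 8? By the
longest-match rule it is a single identifier, because if8 is a longer initial substring than if.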
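The two disambiguation rules can be sketched in a few lines of Python. The rule list (IF, ID, NUM with these patterns) and the function name next_token are assumptions chosen for illustration. This is not how Lex or JavaCC work internally; they compile the rules into an automaton.

```python
import re

# Order matters: earlier rules win ties (rule priority).
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
]

def next_token(text, pos=0):
    """Return (token_type, lexeme) for the next token, or None if no rule matches."""
    best = None
    for kind, rule in RULES:
        m = rule.match(text, pos)
        # Replace only on a strictly longer match (longest match);
        # on equal length the earlier rule is kept (rule priority).
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (kind, m.group())
    return best

print(next_token("if8"))   # ID wins: "if8" is longer than "if"
print(next_token("if"))    # IF wins the tie because it is listed first
```

The sketch relies on each pattern's greedy quantifiers to find that rule's own longest match, which holds for these simple patterns but is not guaranteed for arbitrary Python regular expressions.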