
Compiler Design

CSCE 354

Dr. Razauddin
University of Hail, Kingdom of Saudi Arabia

2024-2025
Chapter 2

Lexical Analysis: Application of Regular Expressions in Lexical Scanners
OVERVIEW
• To translate a program from one language into another, a
compiler must first pull it apart and understand its structure
and meaning, then put it together in a different way.

• The front end of the compiler performs analysis; the back end does synthesis. The analysis is usually broken up into:

Lexical analysis: breaking the input into individual words or "tokens";

Syntax analysis: parsing the phrase structure of the program;

Semantic analysis: calculating the program's meaning.


What is a Token?

A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language.

Example of tokens:
• Type tokens (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
Lexical analysis is the first phase of the compiler; the component that performs it is also known as a scanner.

It converts the high-level input program into a sequence of tokens.

1. Lexical analysis can be implemented with deterministic finite automata (DFAs), as the sketch below illustrates.
2. The output is a sequence of tokens that is sent to the parser for syntax analysis.
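To make point 1 concrete, here is a minimal sketch (not from the slides; the function name and states are illustrative) of a hand-coded DFA that accepts identifiers of the form letter (letter | digit)*:

```python
# A two-state DFA for the pattern letter (letter | digit)*.
def is_identifier(s: str) -> bool:
    state = "START"
    for ch in s:
        if state == "START":
            # First character must be a letter.
            state = "ID" if ch.isalpha() else "REJECT"
        elif state == "ID":
            # Subsequent characters may be letters or digits.
            state = "ID" if ch.isalnum() else "REJECT"
        else:
            return False        # dead state: no way to accept
    return state == "ID"        # "ID" is the only accepting state

print(is_identifier("sum"))     # True
print(is_identifier("n14"))     # True
print(is_identifier("4you"))    # False: starts with a digit
```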
LEXICAL TOKENS
 A lexical token is a sequence of characters that can be treated as a unit in the
grammar of a programming language.

 A programming language classifies lexical tokens into a finite set of token types.

 For example, some of the token types of a typical programming language are

Type   Examples                     Type     Examples

ID     foo  n14  last               COMMA    ,
NUM    73  0  00  515  082          NOTEQ    !=
REAL   66.1  .5  10.  1e67  5.5e-10 LPAREN   (
IF     if                           RPAREN   )
Punctuation tokens such as IF, VOID, RETURN constructed from alphabetic characters
are called reserved words and, in most languages, cannot be used as identifiers.

Examples of nontokens
comment /* try again */
preprocessor directive #include<stdio.h>

preprocessor directive #define NUMS 5, 6

macro NUMS
REGULAR EXPRESSIONS

Each regular expression stands for a set of strings.

Symbol:

For each symbol a in the alphabet of the language, the regular expression a denotes the
language containing just the string a.

Alternation:

Given two regular expressions M and N, the alternation operator, written as a vertical bar, makes a new regular expression M | N. A string is in the language of M | N if it is in the language of M or in the language of N. Thus, the language of a | b contains the two strings a and b.
Concatenation:
Given two regular expressions M and N, the concatenation operator · makes a new regular expression M · N. A string is in the language of M · N if it is the concatenation of any two strings α and β such that α is in the language of M and β is in the language of N. Thus, the regular expression (a | b) · a defines the language containing the two strings aa and ba.

Epsilon: The regular expression ε represents a language whose only string is the empty string. Thus, (a · b) | ε represents the language {"", "ab"}.

Repetition: Given a regular expression M, its Kleene closure is M*. A string is in M* if it is the concatenation of zero or more strings, all of which are in M. Thus, ((a | b) · a)* represents the infinite set {"", "aa", "ba", "aaaa", "baaa", "aaba", "baba", "aaaaaa", . . . }.
Using symbols, alternation, concatenation, epsilon, and Kleene closure we can
specify the set of ASCII characters corresponding to the lexical tokens of a
programming language.

Examples

(0 | 1)* · 0      Binary numbers that are multiples of two.

b*(abb*)*(a|ε)    Strings of a's and b's with no consecutive a's.

(a|b)*aa(a|b)*    Strings of a's and b's containing consecutive a's.
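As a quick check, these three patterns can be tried with Python's re module. Note that ε has no literal form in re syntax, so in this hedged translation (a|ε) is written a?:

```python
import re

even_binary = re.compile(r"(0|1)*0")         # multiples of two end in 0
no_consec_a = re.compile(r"b*(abb*)*a?")     # (a|epsilon) becomes a?
has_aa      = re.compile(r"(a|b)*aa(a|b)*")  # contains consecutive a's

print(bool(even_binary.fullmatch("1010")))   # True: ends in 0
print(bool(even_binary.fullmatch("101")))    # False: ends in 1
print(bool(no_consec_a.fullmatch("ababb")))  # True: no "aa" anywhere
print(bool(no_consec_a.fullmatch("aab")))    # False: has consecutive a's
print(bool(has_aa.fullmatch("babaab")))      # True: contains "aa"
```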


EXAMPLES
ab | c        (a · b) | c
[abcd]        (a | b | c | d)
[b-g]         [bcdefg]
[b-gM-Qkr]    [bcdefgMNOPQkr]
ε             The empty string.
""            Another way to write the empty string.
M | N         Alternation, choosing from M or N.
M · N         Concatenation, an M followed by an N.
MN            Another way to write concatenation.
M*            Repetition (zero or more times).
M+            Repetition (one or more times).
M?            Optional, zero or one occurrence of M.
[a-zA-Z]      Character set alternation.
.             A period stands for any single character except newline.
"a.+*"        Quotation: a string in quotes stands for itself literally.


Regular expressions specify languages by defining patterns for finite strings of symbols.

The grammar defined by regular expressions is known as a regular grammar. The language defined by a regular grammar is known as a regular language.

Operations

The various operations on languages are:

 Union of two languages L and M is written as

L U M = {s | s is in L or s is in M}

 Concatenation of two languages L and M is written as

LM = {st | s is in L and t is in M}

 The Kleene closure of a language L is written as

L* = the set of strings formed by concatenating zero or more strings from L.
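A small sketch of these operations on finite languages using Python sets (the helper name kleene and the length cutoff are illustrative assumptions, since L* itself is infinite):

```python
L = {"a", "b"}
M = {"c"}

union  = L | M                          # {s | s in L or s in M}
concat = {s + t for s in L for t in M}  # {st | s in L, t in M}

def kleene(lang, max_len):
    """Approximate L* by listing its strings up to max_len characters."""
    result = {""}                       # epsilon is always in L*
    frontier = {""}
    while frontier:
        # Extend each frontier string by one string from the language.
        frontier = {s + t for s in frontier for t in lang
                    if len(s + t) <= max_len}
        result |= frontier
    return result

print(union)                 # {'a', 'b', 'c'}
print(concat)                # {'ac', 'bc'}
print(sorted(kleene(L, 2)))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```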
Notations
If r and s are regular expressions denoting the languages L(r) and
L(s), then

 Union : (r)|(s) is a regular expression denoting L(r) U L(s)

 Concatenation : (r)(s) is a regular expression denoting L(r)L(s)

 Kleene closure : (r)* is a regular expression denoting (L(r))*

 (r) is a regular expression denoting L(r)


Precedence and Associativity

 *, concatenation (·), and | (the pipe sign) are left-associative.

 * has the highest precedence.

 Concatenation (·) has the second-highest precedence.

 | (the pipe sign) has the lowest precedence of all.
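Python's re module follows the same precedence conventions, which makes for a quick sanity check (a sketch, not part of the slides):

```python
import re

# "ab|c" means "(ab)|c" because concatenation binds tighter than |.
p = re.compile(r"ab|c")
print(bool(p.fullmatch("ab")))    # True: matches the (ab) branch
print(bool(p.fullmatch("c")))     # True: matches the c branch
print(bool(p.fullmatch("ac")))    # False: would require a(b|c)

# "ab*" means "a(b*)" because * binds tighter than concatenation.
q = re.compile(r"ab*")
print(bool(q.fullmatch("abbb")))  # True
print(bool(q.fullmatch("abab")))  # False: would require (ab)*
```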


Representing occurrences of symbols using regular expressions

letter = [a-z] or [A-Z]

digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

sign = [+ | -]

Representation of language tokens using regular expressions

Decimal = (sign)?(digit)+

Identifier = (letter)(letter | digit)*
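A hedged translation of these two token patterns into Python re syntax (variable names are illustrative):

```python
import re

sign       = r"[+-]"
digit      = r"[0-9]"
letter     = r"[a-zA-Z]"
decimal    = re.compile(f"({sign})?({digit})+")          # (sign)?(digit)+
identifier = re.compile(f"({letter})({letter}|{digit})*")  # letter(letter|digit)*

print(bool(decimal.fullmatch("-42")))     # True
print(bool(decimal.fullmatch("4.2")))     # False: no dot in this pattern
print(bool(identifier.fullmatch("n14")))  # True
print(bool(identifier.fullmatch("14n")))  # False: must start with a letter
```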


Regular expressions for some tokens.
 Comments and white space are not reported back to the parser.

 Instead, they are discarded and the lexer resumes.

 The comments for this lexer begin with two dashes, contain only alphabetic characters, and end with a newline.

 Finally, a lexical specification should be complete, always matching some initial substring of the
input; we can always achieve this by having a rule that matches any single character (and in this case,
prints an "illegal character" error message and continues).

 These rules are a bit ambiguous.


 For example, does if8 match as a single identifier or as the two tokens if and 8?

 There are two important disambiguation rules used by Lex, JavaCC, SableCC, and other similar lexical-analyzer generators:

 Longest match: The longest initial substring of the input that can match any regular
expression is taken as the next token.

 Rule priority: For a particular longest initial substring, the first regular expression that
can match determines its token-type. This means that the order of writing down the
regular-expression rules has significance.


 Thus, if8 matches as an identifier by the longest-match rule, and if matches as a reserved word by rule priority.
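Below is a minimal sketch of a lexer loop applying both rules (the rule names and structure are illustrative, not how Lex is actually implemented). Because rules are tried in order, IF outranks ID on ties, yet a strictly longer ID match still wins:

```python
import re

RULES = [
    ("IF",  re.compile(r"if")),                 # listed first: higher priority
    ("ID",  re.compile(r"[a-zA-Z][a-zA-Z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("WS",  re.compile(r"[ \t\n]+")),           # matched but discarded
]

def next_token(text, pos):
    best = None  # (length, token_type, lexeme)
    for token_type, pattern in RULES:
        m = pattern.match(text, pos)
        # Longest match wins; on a tie the earlier rule wins, because a
        # later rule must be strictly longer to displace the current best.
        if m and (best is None or len(m.group()) > best[0]):
            best = (len(m.group()), token_type, m.group())
    return best

def tokenize(text):
    pos, tokens = 0, []
    while pos < len(text):
        best = next_token(text, pos)
        if best is None:
            raise SyntaxError(f"illegal character {text[pos]!r}")
        length, token_type, lexeme = best
        if token_type != "WS":                  # white space is not reported
            tokens.append((token_type, lexeme))
        pos += length
    return tokens

print(tokenize("if8"))   # [('ID', 'if8')]               -- longest match
print(tokenize("if 8"))  # [('IF', 'if'), ('NUM', '8')]  -- rule priority
```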
Exercise
• The input source code is converted into a sequence of tokens.
int sum = a + b;
• Steps in Lexical Analysis:
• Input Source Code:
The lexical analyzer (lexer) reads the source code as a sequence of
characters.
• Tokenization:
The lexer divides this sequence into a list of meaningful symbols, known
as tokens. Each token represents a basic element of the programming
language.
• int – keyword
• sum – identifier
• = – assignment operator
• a – identifier
• + – addition operator
• b – identifier
• ; – semicolon (delimiter)
• Output Tokens:
For the given source code, the lexer might produce the following tokens:
Key Points:
Whitespace and Comments: The lexer typically ignores whitespace and comments, as
they do not affect the meaning of the code.
Error Handling: If the lexer encounters an invalid sequence of characters that cannot
be classified into a valid token, it raises an error. For example, if the input contains an
unexpected character like #, which is not valid in the given context, an error is
reported.
• Output:
KEYWORD: int
ID: sum
ASSIGN: =
ID: a
PLUS: +
ID: b
SEMICOLON: ;
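One possible way to reproduce this token stream with a single combined pattern (the token names match the output above; the approach is a sketch, not the only implementation):

```python
import re

TOKEN_SPEC = [
    ("KEYWORD",   r"\bint\b"),
    ("ID",        r"[a-zA-Z_][a-zA-Z0-9_]*"),
    ("ASSIGN",    r"="),
    ("PLUS",      r"\+"),
    ("SEMICOLON", r";"),
    ("WS",        r"\s+"),
]
# One alternation with a named group per rule; order gives rule priority.
PATTERN = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

for m in PATTERN.finditer("int sum = a + b;"):
    if m.lastgroup != "WS":          # white space is discarded, not reported
        print(f"{m.lastgroup}: {m.group()}")
```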
Exercise
• The input source code is converted into a sequence of tokens.
def greet(name):
    print("Hello, " + name + "!")
greet("Alice")
• Steps in Lexical Analysis:
• Input Source Code:
The lexical analyzer (lexer) reads the source code as a sequence of
characters.
• Tokenization:
The lexer divides this sequence into a list of meaningful symbols, known as
tokens. Each token represents a basic element of the programming
language.
• def, print – keywords
• greet, name – identifiers
• + – operator
• ( ) : – punctuation
• "Hello, ", "!", "Alice" – string literals
• white space – delimiter (discarded)
• Output Tokens:
For the given source code, the lexer might produce the following tokens:
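As an aside (not from the slides), Python's standard tokenize module can display the real token stream for this example:

```python
import io
import tokenize

src = 'def greet(name):\n    print("Hello, " + name + "!")\ngreet("Alice")\n'

# generate_tokens takes a readline callable and yields TokenInfo tuples.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    if tok.string.strip():                      # skip whitespace-only tokens
        print(tokenize.tok_name[tok.exact_type], repr(tok.string))
```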
By Students:
• Convert the following input source code into a sequence of tokens:
1. total = 3.14 * radius * radius;
2. if (x >= 10) {
y = x * 2;
}
3. x = 5 + 3;
4. a = b - 2;
