
Chapter 2 – Lexical Analysis

Compiler
• A compiler translates a program from one language to another

Source code → Front End → Back End → Target code

• Front End: Analysis


• Takes input source code
• Returns Abstract Syntax Tree and symbol table
• Back End: Synthesis
• Takes AST and symbol table
• Returns machine-executable binary code, or virtual machine code
Front End

Lexical Analysis → Syntax Analysis → Semantic Analysis

• Lexical Analysis: breaks input into individual words – “tokens”


• Syntax Analysis: parses the phrase structure of program
• Semantic Analysis: calculates meaning of program
The Role of the Lexical Analyzer

-> read the input characters of the source program
-> group them into lexemes
-> produce as output a sequence of tokens, one for each lexeme in the source
program
Lexing & Parsing
• From strings to data structures

Strings/Files --Lexing--> Tokens --Parsing--> Abstract Syntax Trees
Interactions between the lexical analyzer
and the parser
Tokens, Patterns and Lexemes
• A pattern is a description of the form that the lexemes of a token may take
(the set of rules that defines a TOKEN).

• A lexeme is a sequence of characters in the source program that matches
the pattern for a token and is identified by the lexical analyzer as an instance of that
token.

• A token is a pair consisting of a token name and an optional attribute
value.
• Common token names are
• identifiers: names the programmer chooses
• keywords: names already in the programming language
• separators (also known as punctuators): punctuation characters and paired-delimiters
• operators: symbols that operate on arguments and produce results
• literals: numeric, logical, textual, reference literals
• ………..
Tokens, Patterns and Lexemes
• Consider this expression in the programming language C:
sum=3+2;
• Tokenized and represented by the following table:
Lexeme Token Name
sum Identifier
= Operator
3 Literal
+ Operator
2 Literal
; Separator
Tokens, Patterns and Lexemes
Lexeme Token Name
if (y <= t) y = y - 3; if Keyword
( Open parenthesis
y Identifier
<= Comparison operator
t Identifier
) Close parenthesis
y Identifier
= Assignment operator
y Identifier
- Arithmetic operator
3 Integer
; Semicolon
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer
must provide the subsequent compiler phases additional information
about the particular lexeme that matched.

• For example, the pattern for token number matches both 0 and 1, but it is
extremely important for the code generator to know which lexeme was
found in the source program.
Tokens, Patterns and Lexemes
cout << 3+2+3;
Lexeme The following tokens are returned by
scanner to parser in specified order
cout <identifier, ‘cout’>
<< <operator, ‘<<‘>
3 <literal, ‘3’>
+ <operator, ‘+’>
2 <literal, ‘2’>
+ <operator, ‘+’>
3 <literal, ‘3’>
; <punctuator, ‘;’>
Tokens
if (num1 == num2)
result = 1;
else
result = 0;

\tif (num1 == num2)\n\t\tresult = 1;\n\telse\n\t\tresult = 0;


Tokens
• Token class
• In English: noun, verb, adjective, …..

• In a programming language: identifier, keyword, (, ), number, …


Tokens
• Token classes correspond to sets of strings.

• Identifier:
- Identifiers are strings of letters, digits, and underscores, starting with a letter or an
underscore
num1, result, name20, _result, …..
• Integer:
- A non-empty string of digits
10, 89, 001, 00, …….
• Keyword:
- A fixed set of reserved words
if, else, for, while, ….
• Whitespace:
- A non-empty sequence of blanks, newlines, and tabs
Lexical Analysis

Strings/Files --Lexing--> Tokens <name, attribute> --Parsing--> Abstract Syntax Trees
Lexical Analysis

result=50 --Lexing--> <id, ‘result’> <op, ‘=’> <int, ‘50’> --Parsing--> Abstract Syntax Trees
Lexical Analysis
\tif (num1 == num2)\n\t\tresult = 1;\n\telse\n\t\tresult = 0;

=> Go through and identify the tokens of the substrings.

Whitespace: A non-empty sequence of blanks, newlines, and tabs


Keywords: A fixed set of reserved words
Identifiers: Identifiers are strings of letters, digits, and underscores, starting with a letter or an
underscore
Numbers
Operators
OpenParenthesis
CloseParenthesis
Semicolon
Lexical Analysis: Regular expression
• Lexical structure = token classes

• Token classes correspond to sets of strings.


- Use regular expressions to specify which set of strings belongs to each token class
Lexical Analysis: Regular expressions
• Single character
‘a’ = {“a”}
• Epsilon
ε = {“”}
• Union
A + B = {a | a∈A} ∪ {b | b ∈B}
• Concatenation
AB = {ab | a∈A ∧ b ∈B}
• Iteration
A* = ∪ (i≥0) A^i , where A^i = A……A (i times) and A^0 = ε
Lexical Analysis: Regular expressions
• The regular expressions over Σ are the smallest set of expressions including

R = ε
| ‘c’ where c ∈ Σ
| A+B where A, B are regular expressions over Σ
| AB where A, B are regular expressions over Σ
| A* where A is a regular expression over Σ
Lexical Analysis: Regular expressions
Σ = {0, 1}

1* = ∪ (i≥0) 1^i = ε + 1 + 11 + 111 + 1111 + ……..

(1+0)1 = {ab | a ∈ 1+0 ∧ b ∈ 1} = 11 + 01

0* + 1* = {0^i | i≥0} ∪ {1^i | i≥0}
= ε + 0 + 00 + 000 + 0000 + ……….
+ ε + 1 + 11 + 111 + 1111 + ……..

(0+1)* = ∪ (i≥0) (0+1)^i
= ε + (0+1) + (0+1)(0+1) + …… + (0+1)……(0+1)
= all strings of 0’s and 1’s
= Σ*
Lexical Analysis
Meaning function L maps syntax to semantics

L(e) = M

L : regular expression -> set of strings

‘a’ = {“a”} => L(‘a’) = {“a”}


ε = {“”} => L(ε) = {“”}
A+B=A∪B => L(A + B) = L(A) ∪ L(B)
AB = {ab | a∈A ∧ b ∈B} => L(AB) = {ab | a∈L(A) ∧ b ∈L(B)}
A* = ∪ (i≥0) A^i => L(A*) = ∪ (i≥0) L(A^i)
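The definitions above can be prototyped directly: represent each language as a Python set of strings, and truncate the infinite union in A* at a fixed depth. A minimal sketch; the helper names (epsilon, char, union, concat, star) are our own, not from any library:

```python
# A sketch of the meaning function L: each combinator returns the set of
# strings denoted by a regular expression.

def epsilon():
    return {""}

def char(c):
    return {c}

def union(A, B):
    # L(A + B) = L(A) U L(B)
    return A | B

def concat(A, B):
    # L(AB) = {ab | a in L(A), b in L(B)}
    return {a + b for a in A for b in B}

def star(A, depth=3):
    # L(A*) is infinite; approximate it with A^0 + A^1 + ... + A^depth.
    result, level = {""}, {""}
    for _ in range(depth):
        level = concat(level, A)
        result |= level
    return result
```

For example, concat(union(char('1'), char('0')), char('1')) yields {'11', '01'}, matching the (1+0)1 example above.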
Regular Expression
• keyword: A fixed set of reserved words (“if” or “else” or “for” or …..)
Regular expression for if: ‘i’’f’
Regular expression for else: ‘e’’l’’s’’e’
Regular expression for for: ‘f’’o’’r’

Regular expression for keyword:


‘i’’f’ + ‘e’’l’’s’’e’ + ‘f’’o’’r’ + ……….
=> ‘if’ + ‘else’ + ‘for’ + ……….
Regular Expression
• Integer: a non-empty string of digits

- regular expression for the set of strings corresponding to all the single
digits

digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’

integer = digit digit* = digit+


Identifier: strings of letters, digits, and underscores, starting with a letter or
an underscore.

digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
= [0-9]
letter_ = [a-zA-Z_]
identifier = letter_(letter_ + digit)*
Whitespace: a non-empty sequence of blanks, newlines, and tabs

whitespace = (‘ ‘ + ‘\n’ + ‘\t’)+


[email protected]

=> Make regular expression for this email address:

letter+’@’letter+’.’letter+’.’letter+
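As a sanity check, the token classes defined above translate almost directly into Python's `re` syntax: the union `+` becomes `|`, and the letter/digit sets become character classes. A small sketch:

```python
import re

# Anchored patterns corresponding to the classes defined above.
integer    = re.compile(r"[0-9]+\Z")
identifier = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*\Z")
whitespace = re.compile(r"[ \n\t]+\Z")
```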
Regular Expression
• At least one: AA* ≡ A+

• Union: A|B ≡ A+B

• Option: A+ε ≡ A?

• Range: ‘a’ + ‘b’ + … + ‘z’ ≡ [a-z]

• Excluded range: complement of [a-z] ≡ [^a-z]


Number in Pascal: A floating point number can have some digits, an

optional fraction and an optional exponent (3.15E+10, 8E-3, 15.6, …)


digit = ‘0’+’1’+’2’+’3’+’4’+’5’+’6’+’7’+’8’+’9’
digits = digit+
opt_fraction = (‘.’digits) + ε = (‘.’digits)?
opt_exponent = (‘E’(‘+’ + ’-’ + ε)digits) + ε = (‘E’(‘+’ + ‘-’)?digits)?
num = digits opt_fraction opt_exponent
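The opt_fraction/opt_exponent definition above maps, for example, to a Python regex built piece by piece; a sketch:

```python
import re

# Mirrors the definitions above: digits, opt_fraction, opt_exponent, num.
digits       = r"[0-9]+"
opt_fraction = rf"(\.{digits})?"
opt_exponent = rf"(E[+-]?{digits})?"
num          = re.compile(rf"{digits}{opt_fraction}{opt_exponent}\Z")
```

This accepts 3.15E+10, 8E-3, and 15.6; note that at least one digit before the optional fraction is required by this definition.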
Regular Expression
• Regular expressions describe many useful languages

• Regular languages are a language specification


• We still need an implementation
Regular Expressions => Lexical Spec
1. Write a regular expression for the lexemes of each token class
• number = digit+
• keyword = ‘if’ + ‘else’ + …
• identifier = letter_(letter_ + digit)*
• openPar = ‘(‘
• closePar = ‘)’
• ………..

2. Construct R, matching all lexemes for all tokens


R = keyword + identifier + number + …..
= R1 + R2 + ….
• (This step is done automatically by tools like flex)
3. Let input be x1…xn
For 1 ≤ i ≤ n check x1…xi ∈ L(R) ?

4. If success, then we know that


x1…xi ∈ L(Rj) for some j

R = R1 + R2 + R3 + …..

5. Remove x1 ….xn from input and go to (3)


How much input is used?

If x1…xi ∈ L(R)
and x1…xj ∈ L(R)
with i ≠ j

Rule: Pick the longest possible string in L(R)

– Pick j if j > i
– The “maximal munch”
Which token is used?
x1…xi ∈ L(Rj)
x1…xi ∈ L(Rk) => which token is used?

Keywords = ‘if’ + ‘else’ + ….


Identifiers = letter(letter + digit)*

if L(Keywords)
if L(Identifiers)
=> Choose the rule listed FIRST.
• What if no rule matches?
x1…xi ∉ L(R)

Error = all strings not in the language of our lexical specification

Make a regular expression for error strings and PUT IT LAST IN PRIORITY
(lowest priority)
• Regular expressions are a concise notation for string patterns

• Use in lexical analysis requires small extensions


• To resolve ambiguities
• Matches as long as possible
• Highest priority match
• To handle errors
• Make a regular expression for error strings and PUT IT LAST IN PRIORITY.
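The disambiguation rules above (longest match, first-listed rule on ties, an error class last) can be sketched as a naive lexer. The token classes and the tiny rule set below are illustrative, not from any real language specification:

```python
import re

# Token classes in priority order; ERROR is last so it fires only when
# nothing else matches.
RULES = [
    ("KEYWORD",    r"if|else|for|while"),
    ("IDENTIFIER", r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("NUMBER",     r"[0-9]+"),
    ("OPERATOR",   r"==|<=|>=|=|\+|-"),
    ("SEPARATOR",  r"[();{}]"),
    ("WHITESPACE", r"[ \n\t]+"),
    ("ERROR",      r"."),
]

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        best_name, best_len = None, 0
        for name, pattern in RULES:
            m = re.match(pattern, text[i:])
            # Longest match wins; ties go to the rule listed first.
            if m and len(m.group()) > best_len:
                best_name, best_len = name, len(m.group())
        tokens.append((best_name, text[i:i + best_len]))
        i += best_len
    return tokens
```

Note how "if" matches both KEYWORD and IDENTIFIER with the same length, so the first-listed rule (KEYWORD) wins, while "ifx" is a longer IDENTIFIER match and therefore lexes as an identifier.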
Make a regular expression for:
• A keyword is a reserved word whose meaning is already defined by the
programming language. We cannot use a keyword for any other purpose
in a program. Every programming language has some set of
keywords.
Examples: int, do, while, void, return, …………
Make a regular expression for:
• Identifiers
Identifiers are the names given to different programming elements. Whether it is the
name given to a variable, a function, or any other programming element,
all follow some basic naming conventions listed below:

1. Keywords must not be used as an identifier.
2. An identifier must begin with a letter a-z A-Z or an underscore _ symbol.
3. An identifier can contain letters a-z A-Z, digits 0-9 and the underscore _ symbol.
4. An identifier must not contain any special character (e.g. !@$*.'[] etc.) except
underscore _.
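The four rules above can be checked mechanically; a minimal sketch, where the keyword set shown is only an illustrative subset:

```python
import re

# Illustrative subset of keywords; a real language has its own fixed set.
KEYWORDS = {"int", "do", "while", "void", "return", "if", "else"}
IDENT_RE = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*\Z")

def is_identifier(s):
    # Rules 2-4: starts with a letter or underscore, then only letters,
    # digits, and underscores. Rule 1: must not be a keyword.
    return bool(IDENT_RE.match(s)) and s not in KEYWORDS
```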
Make a regular expression for:
• Operator
Operators are the symbols used for arithmetical or logical operations.
Different programming languages provide different sets of operators; some
common operators are:
• Arithmetic operator (+, -, *, /, %)
• Assignment operator (=)
• Relational operator (>, <, >=, <=, ==, !=)
• Logical operator (&&, ||, !)
• Bitwise operator (&, |, ^, ~, <<, >>)
• Increment/Decrement operator (++, --)
• Conditional/Ternary operator (? :)
Make a regular expression for:
• Literals
Literals are constant values that are used for performing various operations and
calculations. There are basically three types of literals:
1. Integer literal
An integer literal represents integer or numeric values.
Example: 1, 100, -12312 etc.
2. Floating point literal
A floating point literal represents fractional values.
Example: 2.123, 1.02, -2.33, 13e54, -23.3 etc.
3. Character literal
A character literal represents character values. Single characters are enclosed in single
quotes (' ') while sequences of characters are enclosed in double quotes (" ").
Example: 'a', 'n', "Hello", "Hello123" etc.
Finite Automata
• Regular expressions = specification
• Finite automata = implementation

• A finite automaton consists of

• An input alphabet Σ
• A finite set of states S
• A start state q0
• A set of accepting states F ⊆ S
• A set of transitions δ: state --input--> state
Finite Automata
• Transition
s1 --a--> s2
• Is read:
In state s1, on input a, go to state s2

• If at end of input and in an accepting state => accept

• Otherwise => reject
• Terminates in a state s ∉ F, or
• Gets stuck (no transition on the current input)
Finite Automata
• A state

• The start state

• An accepting state

• A transition (an arrow between states labeled with an input symbol, e.g. a)
Finite Automata
• A finite automaton that accepts only “a”

q0 --a--> q1

• What happens if the input strings are:


• “a”
• “b”
• “ab”

• The language of a finite automaton is the set of accepted strings.


Finite Automata
• A finite automaton accepting any number of 0’s followed by a single 1.

q0 --0--> q0 (self-loop), q0 --1--> q1 (accepting)

Input “001”: q0 --0--> q0 --0--> q0 --1--> q1, end of input in accepting state q1 => Accept
Input “011”: q0 --0--> q0 --1--> q1, stuck on the remaining ‘1’ => Reject
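The two traces above can be reproduced by simulating the automaton directly; a minimal sketch:

```python
def accepts_zeros_then_one(s):
    # DFA from the slide: q0 loops on '0' and moves to q1 on '1';
    # q1 is the only accepting state and has no outgoing transitions.
    delta = {("q0", "0"): "q0", ("q0", "1"): "q1"}
    state = "q0"
    for ch in s:
        if (state, ch) not in delta:
            return False              # stuck => reject
        state = delta[(state, ch)]
    return state == "q1"              # accept only if we end in q1
```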
Regular Expressions to non-deterministic
finite automata (NFA)

Lexical Specification → Regular Expressions → Non-deterministic Finite Automata (NFA) → Deterministic Finite Automata (DFA) → Table-driven Implementation of DFA
Regular Expressions to NFA
• For each kind of regular expression, define an equivalent NFA that accepts
exactly the same language as the regular expression.

NFA for regular expression M: a machine with one start state and one accepting state, accepting exactly L(M)

• For ε: start --ε--> accept

• For input a: start --a--> accept
Regular Expressions to NFA
• Concatenation
• For RS: run the machine for R, then take an ε-move from R’s accepting state into the machine for S

• Union
• For R + S: a new start state with ε-moves into the machines for R and S, and ε-moves from their accepting states into a new accepting state

• Iteration
• For R*: a new start state with an ε-move into the machine for R and an ε-move directly to a new accepting state; R’s accepting state has ε-moves back to R’s start and on to the new accepting state
Regular Expressions to NFA
• Consider the regular expression (0+1)(01)*
• For 0: start --0--> accept

• For 1: start --1--> accept

• For 0 + 1: a new start state with ε-moves into the machines for 0 and 1, and ε-moves from each into a new accepting state
Regular Expressions to NFA
• Consider the regular expression (0+1)(01)*

• For 01 0 ε 1

ε
• For (01)*
ε 0 ε 1 ε

ε
Regular Expressions to NFA
• Consider the regular expression (0+1)(01)*

The resulting NFA (states A–L):
A --ε--> B, A --ε--> D, B --0--> C, D --1--> E, C --ε--> F, E --ε--> F,
F --ε--> G, G --ε--> H, G --ε--> L, H --0--> I, I --ε--> J, J --1--> K,
K --ε--> H, K --ε--> L; start state A, accepting state L
Regular Expressions to non-deterministic
finite automata (NFA)

Lexical Specification → Regular Expressions → Non-deterministic Finite Automata (NFA) → Deterministic Finite Automata (DFA) → Table-driven Implementation of DFA
NFA to DFA
• Simulate the NFA
• Each state of the DFA
= a non-empty subset of states of the NFA
• Start state of the DFA
= the set of NFA states reachable through ε-moves from the NFA start state
• Add a transition S --a--> S’ to the DFA if
– S’ is the set of NFA states reachable from any
state in S after seeing the input a, considering ε-moves as well
• Final states of the DFA
= the sets that include the final state of the NFA
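The three rules above are the classic subset construction; a minimal sketch, with the NFA given as transition tables (the dictionary encoding is our own choice):

```python
def epsilon_closure(states, eps):
    # All NFA states reachable from `states` via epsilon-moves alone.
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in eps.get(s, []):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)

def nfa_to_dfa(start, accept, delta, eps, alphabet):
    # DFA start state: epsilon-closure of the NFA start state.
    dfa_start = epsilon_closure({start}, eps)
    dfa_delta, seen, worklist = {}, {dfa_start}, [dfa_start]
    while worklist:
        S = worklist.pop()
        for a in alphabet:
            # NFA states reachable from any state in S on input a ...
            moved = {t for s in S for t in delta.get((s, a), [])}
            if not moved:
                continue
            # ... considering epsilon-moves as well.
            T = epsilon_closure(moved, eps)
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                worklist.append(T)
    # DFA accepting states: subsets containing the NFA final state.
    dfa_accept = {S for S in seen if accept in S}
    return dfa_start, dfa_accept, dfa_delta

# Tiny illustrative NFA for a*b: state 0 loops on 'a' and moves to 1 on 'b'.
d_start, d_accept, d_delta = nfa_to_dfa(0, 1, {(0, "a"): [0], (0, "b"): [1]}, {}, "ab")
```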
NFA to DFA
• NFA for (0+1)(01)* (states A–L):
A --ε--> B, A --ε--> D, B --0--> C, D --1--> E, C --ε--> F, E --ε--> F,
F --ε--> G, G --ε--> H, G --ε--> L, H --0--> I, I --ε--> J, J --1--> K,
K --ε--> H, K --ε--> L; start state A, accepting state L

• Resulting DFA (start state ABD; accepting states CFGHL, EFGHL, KLH):
ABD --0--> CFGHL, ABD --1--> EFGHL
CFGHL --0--> IJ, EFGHL --0--> IJ
IJ --1--> KLH, KLH --0--> IJ
Regular Expressions to non-deterministic
finite automata (NFA)

Lexical Specification → Regular Expressions → Non-deterministic Finite Automata (NFA) → Deterministic Finite Automata (DFA) → Table-driven Implementation of DFA
Implementation of DFA
• A DFA can be implemented by a 2D table T
– One dimension is “states”
– Other dimension is “input symbol”

– For every transition Si --a--> Sk define T[i, a] = k

(rows indexed by states, columns by input symbols; entry T[i, a] gives the next state)
Implementation of DFA
• DFA for (0+1)(01)*:
S0 --0--> S1, S0 --1--> S2, S1 --0--> S3, S2 --0--> S3, S3 --1--> S4, S4 --0--> S3

        0    1
S0      S1   S2
S1      S3   -
S2      S3   -
S3      -    S4
S4      S3   -
Implementation of DFA
i = 0;
state = 0;
while (input[i]) {
    /* column index: map characters '0'/'1' to 0/1 */
    state = T[state][input[i++] - '0'];
}
/* accept if the final state is accepting */

(transition table T as on the previous slide)
Implementation of DFA
• DFA for (0+1)(01)* and its transition table:
S0 --0--> S1, S0 --1--> S2, S1 --0--> S3, S2 --0--> S3, S3 --1--> S4, S4 --0--> S3

        0    1
S0      S1   S2
S1      S3   -
S2      S3   -
S3      -    S4
S4      S3   -
Implementation of NFA
• Transition table for the NFA of (0+1)(01)* (start state A, accepting state L):

State   0     1     ε
A       -     -     {B, D}
B       {C}   -     -
C       -     -     {F}
D       -     {E}   -
E       -     -     {F}
F       -     -     {G}
G       -     -     {H, L}
H       {I}   -     -
I       -     -     {J}
J       -     {K}   -
K       -     -     {L, H}
L       -     -     -
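An NFA with a table like the one above can be run directly: keep the set of currently possible states and take the ε-closure after every input symbol. A sketch using these transitions (the dictionary encoding is our own):

```python
def run_nfa(s, start, accepting, delta, eps):
    # Simulate the NFA: track the set of possible states, taking the
    # epsilon-closure after the start and after every input symbol.
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for t in eps.get(q, []):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for ch in s:
        current = closure({t for q in current for t in delta.get((q, ch), [])})
    return bool(current & accepting)

# Transition tables for the NFA of (0+1)(01)* from the table above.
EPS = {"A": ["B", "D"], "C": ["F"], "E": ["F"], "F": ["G"],
       "G": ["H", "L"], "I": ["J"], "K": ["L", "H"]}
DELTA = {("B", "0"): ["C"], ("D", "1"): ["E"],
         ("H", "0"): ["I"], ("J", "1"): ["K"]}
```

For example, "001" is accepted (a "0" from (0+1) followed by one "01" repetition), while "011" is not.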
Summary
• Conversion of NFA to DFA is the key step
• DFAs are faster to run but less compact: the tables can be very large
• NFAs are slower to simulate but more concise
• In practice, tools provide tradeoffs between speed and space
• Tools generally offer a series of options, via configuration files or
command-line flags, that let you choose whether to be closer
to a full DFA or to a pure NFA
Assignment 1 (Lexical Analyzer)
