
Department of Computer Science

Compiler Design (COSC 4103)


2. Lexical Analysis

Lexical Analyzer

[Figure: the source program flows into the lexical analyzer, which hands a token to the parser on each "get next token" request; both components read and update the symbol table.]

LEXICAL ANALYZER
• Scan input
• Remove whitespace, newlines, …
• Identify tokens
• Create symbol table
• Insert tokens into symbol table
• Generate errors
• Send tokens to parser

PARSER
• Perform syntax analysis
• Actions dictated by token order
• Update symbol table entries
• Create abstract representation of source
• Generate errors
TOKEN, PATTERN and LEXEME
What are the major terms for lexical analysis?
• TOKEN
  A classification for a common set of strings.
  Examples include Identifier, Integer, Float, Assign, etc.
• PATTERN
  The rules which characterize the set of strings for a token.
  Ex: integers [0-9]+
  Recall file and OS wildcards ([A-Z]*.*)
• LEXEME
  The actual sequence of characters that matches a pattern and is classified by a token.
  Identifiers: x, count, name, etc.
  Integers: 345, 20, -12, etc.
TOKEN, PATTERN and LEXEME cont…

Token     Sample Lexemes          Informal Description of Pattern
const     const                   const
if        if                      characters i, f
else      else                    characters e, l, s, e
relop     <, <=, =, <>, >, >=     < or <= or = or <> or >= or >
id        pi, count, D2           letter followed by letters and digits
num       3.1416, 0, 6.02E23      any numeric constant
literal   "core dumped"           any characters between " and " except "

The pattern classifies the lexeme. Actual lexeme values are critical; this info is:
1. Stored in the symbol table
2. Returned to the parser
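As a minimal sketch (not from the slides), the <token class, lexeme> pair handed back to the parser can be modeled as a small struct; the names Token, TokenKind, and make_token are hypothetical:

#include <string.h>

/* Hypothetical token classes taken from the table above. */
typedef enum { TK_CONST, TK_IF, TK_ELSE, TK_RELOP, TK_ID, TK_NUM, TK_LITERAL } TokenKind;

/* A token couples the class (used by the parser) with the matched lexeme
   (stored in / looked up from the symbol table). */
typedef struct {
    TokenKind kind;
    char      lexeme[64];    /* assumed maximum lexeme length */
} Token;

static Token make_token(TokenKind kind, const char *lexeme) {
    Token t;
    t.kind = kind;
    strncpy(t.lexeme, lexeme, sizeof t.lexeme - 1);
    t.lexeme[sizeof t.lexeme - 1] = '\0';
    return t;
}

For example, make_token(TK_ID, "count") classifies the lexeme count as an identifier.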
Examples of Non-Tokens
• comment: /* do not change */
• preprocessor directive: #include <stdio.h>
• preprocessor directive: #define NUM 5
• blanks, tabs, newlines
Handling Lexical Errors
• Error handling is very localized with respect to the input source.
• For example: whil ( x := 0 ) do
  generates no lexical errors, because whil simply matches the identifier pattern.
• In what situations do errors occur?
  When no prefix of the remaining input matches any defined token.
• Possible error recovery actions:
  • Deleting or inserting input characters
  • Replacing or transposing characters
  • Or skip ahead to the next separator and "ignore" the problem
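A minimal sketch of the last recovery action (panic mode), not from the slides; treating whitespace and ';' as separators is an assumption:

#include <ctype.h>
#include <stdio.h>

/* Assumption for this sketch: whitespace and ';' count as separators. */
static int is_separator(int c) { return isspace(c) || c == ';'; }

/* Panic-mode recovery: discard characters until a separator is reached,
   so scanning can resume at a plausible token boundary. */
static void panic_mode_recover(FILE *src) {
    int c = getc(src);
    while (c != EOF && !is_separator(c))
        c = getc(src);
    if (c != EOF)
        ungetc(c, src);    /* leave the separator for the normal scanner */
}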
Input Buffering
• To find the end of a token, the LA may need to read one or more characters beyond the next lexeme.
  E.g., to find the end of an ID, or to distinguish > from >=, = from ==
• Buffer pairs
  • Address efficiency concerns
  • Used with a lookahead on the input

[Figure: using a pair of input buffers holding "E = M * C * * 2 eof", with a lexemeBegin pointer and a forward pointer scanning ahead.]
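A minimal sketch of the buffer-pair idea (not from the slides); the half size N, the variable names, and the EOF sentinel byte are assumptions, and the sentinel trick assumes no 0xFF bytes appear in the source text:

#include <stdio.h>

#define N 4096                        /* size of each buffer half (assumed) */

static char buf[2 * N + 2];           /* two halves, each followed by a sentinel   */
static char *lexeme_begin = buf;      /* start of the current lexeme (not used here) */
static char *forward      = buf;      /* scans ahead to find the end of a lexeme   */

/* Refill one half and terminate the valid data with an EOF sentinel byte. */
static void load_half(FILE *src, char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = (char)EOF;
}

/* Advance the forward pointer; on hitting a sentinel, reload the other half.
   Call load_half(src, buf) once before scanning; *forward is then the first
   character and advance() returns each following character (or EOF). */
static int advance(FILE *src) {
    ++forward;
    if (*forward == (char)EOF) {
        if (forward == buf + N) {                 /* sentinel after first half  */
            load_half(src, buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) {  /* sentinel after second half */
            load_half(src, buf);
            forward = buf;
        }
        if (*forward == (char)EOF)
            return EOF;                           /* genuine end of input */
    }
    return (unsigned char)*forward;
}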
Basic Scanning Technique
• Use 1 character of look-ahead
  • Obtain the character with getc()
• Do a case analysis
  • Based on the lookahead character
  • Based on the current lexeme
• Outcome
  • If the character can extend the lexeme, all is well, go on.
  • If the character cannot extend the lexeme:
    • Figure out what the complete lexeme is and return its token
    • Put the lookahead back into the input stream
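A minimal sketch of this one-character-lookahead technique for scanning an identifier, using getc()/ungetc() as suggested above; the function name scan_identifier and the buffer handling are assumptions, not from the slides:

#include <ctype.h>
#include <stdio.h>

/* Scan an identifier (letter followed by letters/digits) from src.
   Returns the lexeme length, or 0 if the next character cannot start an id.
   The character that cannot extend the lexeme is pushed back with ungetc(). */
static int scan_identifier(FILE *src, char *lexeme, int cap) {
    int c = getc(src);
    if (c == EOF || !isalpha(c)) {            /* lookahead cannot start an id */
        if (c != EOF) ungetc(c, src);
        return 0;
    }
    int len = 0;
    while (c != EOF && (isalpha(c) || isdigit(c))) {
        if (len < cap - 1) lexeme[len++] = (char)c;   /* extend the lexeme */
        c = getc(src);                                /* next lookahead    */
    }
    if (c != EOF) ungetc(c, src);             /* put the lookahead back */
    lexeme[len] = '\0';
    return len;
}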
Lexical Analyzer: Implementation Approaches

General approaches to implement a Lexical Analyzer (LA):
1. Use a tool such as Lex
2. Write the LA in a high-level programming language
3. Write the LA in assembly language (difficult but efficient)
Formalizing Token Definition

DEFINITIONS:
• ALPHABET: a finite set of symbols, e.g. {0,1}, {a,b,c}, or {n,m,…,z}
• STRING: a finite sequence of symbols from an alphabet,
  e.g. 0011 or abbca or AABBC …
  A.K.A. word / sentence
  If S is a string, then |S| is the length of S, i.e. the number of symbols in the string S.
• ε: the empty string, with |ε| = 0
Language Concepts
A language, L, is simply any set of strings over a fixed alphabet.

Alphabet                               Languages
{0,1}                                  {0,10,100,1000,100000,…}
                                       {0,1,00,11,000,111,…}
{a,b,c}                                {abc,aabbcc,aaabbbccc,…}
{A,…,Z}                                {TEE,FORE,BALL,…}
                                       {FOR,WHILE,GOTO,…}
{A,…,Z,a,…,z,0,…,9,+,-,…,<,>,…}        {All legal C/C++ programs}
                                       {All grammatically correct English sentences}

Special languages: ∅, the EMPTY LANGUAGE; {ε}, the language that contains only the empty string.
Language & Regular Expressions
• A regular expression is a set of rules / techniques for constructing sequences of symbols (strings) from an alphabet.
• Let Σ be an alphabet and r a regular expression. Then L(r) is the language that is characterized by the rules of r.
Towards Token Definition
Regular definitions: associate names with regular expressions.
For example, C/C++ IDs:
  letter → A | B | C | … | Z | a | b | … | z
  digit → 0 | 1 | 2 | … | 9
  id → letter ( letter | digit )*

Shorthand notation:
  "+" : one or more; r* = r+ | ε and r+ = r r*
  "?" : zero or one; r? = r | ε
  [range] : a set/range of characters (replaces "|"), e.g. [A-Z] = A | B | C | … | Z

Using the shorthand, C/C++ IDs become (see the sketch below):
  id → [A-Za-z][A-Za-z0-9]*

We'll use both techniques.
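As an illustration (not from the slides), the shorthand pattern [A-Za-z][A-Za-z0-9]* can be checked directly in C; matches_id is a hypothetical helper name:

#include <ctype.h>

/* Return 1 if the whole string s matches id -> [A-Za-z][A-Za-z0-9]*, else 0. */
static int matches_id(const char *s) {
    if (!isalpha((unsigned char)*s))          /* must start with a letter     */
        return 0;
    for (++s; *s; ++s)
        if (!isalnum((unsigned char)*s))      /* then letters or digits only  */
            return 0;
    return 1;
}

For example, matches_id("count") and matches_id("D2") return 1, while matches_id("2D") returns 0.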
Token Recognition
How can we use the concepts developed so far to assist in recognizing tokens of a source language?

Assume the following tokens: if, then, else, relop, id, num
What language constructs are they used for?

Given the tokens, what are the patterns?
  if → if
  then → then
  else → else
  relop → < | <= | > | >= | = | <>
  id → letter ( letter | digit )*
  num → digit+ ( . digit+ )? ( E ( + | - )? digit+ )?

What does this last pattern represent?
Constructing Transition Diagrams for Tokens
• Transition Diagrams (TDs) are used to represent the tokens – these are automata!
• As characters are read, the relevant TDs are used to attempt to match the lexeme to a pattern
• Each TD has:
  • States: represented by circles
  • Actions: represented by arrows between the states
  • Start state: the beginning of a pattern (arrowhead)
  • Final state(s): the end of a pattern (concentric circles)
• Each TD is deterministic – no need to choose between 2 different actions!
Example: All RELOPs and id

RELOP transition diagram (a * on a final state means retract one character):
  start state 0: on < go to state 1; on = go to state 5; on > go to state 6
  state 1: on = go to state 2, return(relop, LE)
           on > go to state 3, return(relop, NE)
           on other go to state 4*, return(relop, LT)
  state 5: return(relop, EQ)
  state 6: on = go to state 7, return(relop, GE)
           on other go to state 8*, return(relop, GT)

id transition diagram:
  start state 0: on letter go to state 1
  state 1: loop on letter or digit; on other go to state 2*, return(id, lexeme)
Important Final Notes on Transition Diagrams & Lexical Analyzers

How does this work? The driver below simulates the transition diagrams. What does the whitespace test do? It stays in state 0 and advances the beginning of the lexeme past blanks, tabs, and newlines.

state = 0;
token nexttoken()
{  while (1) {
     switch (state) {
     case 0:
       c = nextchar();                 /* c is the lookahead character */
       if (c == blank || c == tab || c == newline) {
         state = 0;
         lexeme_beginning++;           /* advance beginning of lexeme */
       }
       else if (c == '<') state = 1;
       else if (c == '=') state = 5;
       else if (c == '>') state = 6;
       else state = fail();
       break;
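The slide stops after state 0. As a hedged sketch (not from the slides), the remaining RELOP states of the transition diagram above could continue the same switch; retract() (push the lookahead back) and relop_token() (build a relop token with the given attribute LT, LE, EQ, NE, GT, or GE) are assumed helpers:

     case 1:                            /* seen '<'                    */
       c = nextchar();
       if (c == '=')      { state = 0; return relop_token(LE); }
       else if (c == '>') { state = 0; return relop_token(NE); }
       else { retract(1);   state = 0; return relop_token(LT); }   /* other */
     case 5:                            /* seen '='                    */
       state = 0; return relop_token(EQ);
     case 6:                            /* seen '>'                    */
       c = nextchar();
       if (c == '=')      { state = 0; return relop_token(GE); }
       else { retract(1);   state = 0; return relop_token(GT); }   /* other */
     }  /* end switch */
   }    /* end while  */
}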
Tokens / Patterns / Regular Expressions
Lexical analysis searches for matches of a lexeme to a pattern.
The lexical analyzer returns: <actual lexeme, symbolic identifier of token>

Example:
  Token          Symbolic ID
  if             1
  then           2
  else           3
  >, >=, <, …    4
  :=             5
  id             6
  int            7
  real           8

The set of all regular expressions plus the symbolic ids plus the analyzer define the required functionality.

REs → NFA → DFA (a program for pattern matching)


Automata & Language Theory
• Terminology
  • FSA: a recognizer that takes an input string and determines whether it's a valid string of the language.
  • Non-Deterministic FSA (NFA): has several alternative actions for the same input symbol
  • Deterministic FSA (DFA): has at most 1 action for any given input symbol
• Bottom line
  • expressive power(NFA) == expressive power(DFA)
  • the conversion can be automated
Finite Automata & Language Theory
• Finite automaton: a recognizer that takes an input string and determines whether it's a valid sentence of the language
• Non-deterministic: has more than one alternative action for the same input symbol, so it cannot be simulated directly by a simple algorithm
• Deterministic: has at most one action for a given input symbol
Both types are used to recognize regular expressions.
Representing NFAs
• Transition diagrams: states (circles), arcs, final states, …
• Transition tables: more suitable for representation within a computer

Fig. 3: an NFA with
  S = { 0, 1, 2, 3 }, Σ = { a, b }, s0 = 0, F = { 3 }
  [Diagram: start state 0 loops on a and b; 0 →a→ 1, 1 →b→ 2, 2 →b→ 3]

Transition table (state vs. input):
  state    a        b
  0        {0, 1}   {0}
  1        --       {2}
  2        --       {3}

ε (null) moves are possible: i →ε→ j switches state without consuming any input symbol.
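A minimal sketch of how such a transition table could be stored in a program (not from the slides); each entry is a bit set of successor states, so state 0 on input a maps to the set {0, 1}:

#define NUM_STATES 4
enum { SYM_A, SYM_B, NUM_SYMBOLS };

/* nfa_delta[s][x] is a bit set of NFA successor states: bit i set => state i. */
static const unsigned nfa_delta[NUM_STATES][NUM_SYMBOLS] = {
    /* state 0 */ { (1u << 0) | (1u << 1), (1u << 0) },   /* a: {0,1}  b: {0} */
    /* state 1 */ { 0,                     (1u << 2) },   /* a: --     b: {2} */
    /* state 2 */ { 0,                     (1u << 3) },   /* a: --     b: {3} */
    /* state 3 */ { 0,                     0         },   /* no outgoing moves */
};

/* Move: union of the successors of every state in 'states' on symbol x. */
static unsigned nfa_move(unsigned states, int x) {
    unsigned next = 0;
    for (int s = 0; s < NUM_STATES; ++s)
        if (states & (1u << s))
            next |= nfa_delta[s][x];
    return next;
}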
NFAs, Regular Expressions & Compilation
Problems with NFAs for regular expressions:
1. Valid input might not be accepted along a particular path
2. The NFA may behave differently on the same input
Example: for Fig. 3, aabb is accepted along the path 0 → 0 → 1 → 2 → 3,
BUT it is not accepted along the equally valid path 0 → 0 → 0 → 0 → 0.

Relationship of NFAs to compilation:
1. A regular expression is "recognized" by an NFA
2. A regular expression is the "pattern" for a "token"
3. Tokens are the building blocks for lexical analysis
4. A lexical analyzer can be described by a collection of NFAs, one NFA per language token.
Deterministic Finite Automata (DFA)
• A DFA is an NFA with a few restrictions:
  • No epsilon transitions
  • For every state s, there is only one transition δ(s, x) from s for any symbol x in Σ
• Corollaries
  • Easy to implement a DFA with an algorithm!
  • Deterministic behavior
NFA to DFA Conversion
• Look at the states reachable without consuming any input, and aggregate them into macro states
• A macro state is final iff one of the NFA states it contains is final


Deterministic Finite Automata
A DFA is an NFA with the following restrictions:
• ε moves are not allowed
• For every state s ∈ S, there is one and only one path from s for every input symbol a ∈ Σ

Since the transition table doesn't offer any alternative options, a DFA is easily simulated via an algorithm:

  s ← s0
  c ← nextchar;
  while c ≠ eof do
    s ← move(s, c);
    c ← nextchar;
  end;
  if s is in F then return "yes"
  else return "no"
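A C rendering of this simulation loop, sketched below (not from the slides); the transition table is assumed to be the DFA equivalent of the Fig. 3 NFA (strings over {a, b} ending in abb), and dfa_move plays the role of move(s, c):

#include <stdio.h>

/* DFA for strings over {a,b} ending in abb.
   dfa[s][0] is the move on 'a', dfa[s][1] the move on 'b'. */
static const int dfa[4][2] = {
    {1, 0},   /* state 0 */
    {1, 2},   /* state 1 */
    {1, 3},   /* state 2 */
    {1, 0},   /* state 3 (accepting) */
};

static int dfa_move(int s, char c) { return dfa[s][c == 'b']; }

/* Direct rendering of the slide's simulation loop. */
static int simulate(const char *input) {
    int s = 0;                              /* s <- s0            */
    for (const char *p = input; *p; ++p)    /* while c != eof do  */
        s = dfa_move(s, *p);                /*   s <- move(s, c)  */
    return s == 3;                          /* is s in F ?        */
}

int main(void) {
    printf("%s\n", simulate("aabb") ? "yes" : "no");    /* yes */
    printf("%s\n", simulate("aabab") ? "yes" : "no");   /* no  */
    return 0;
}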
Conversion: NFA → DFA Algorithm
• The algorithm constructs a transition table for the DFA from the NFA
• Each state in the DFA corresponds to a SET of states of the NFA
• Why does this occur?
  • ε moves
  • non-determinism
Both require us to characterize the multiple situations that can occur while accepting the same string.
(Recall: the same input can have multiple paths through an NFA.)
Converting NFA to DFA

[Figure: an ε-NFA with start state 0; ε moves connect states 0, 1, 5 and 8, with state 1 branching via ε to an upper path 2 →a→ 3 →b→ 4 and a lower path 6 →c→ 7.]

From state 0, where can we move without consuming any input?
This forms a new state: {0, 1, 2, 6, 8}.
What transitions are defined for this new state?
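A minimal sketch of the ε-closure computation that produces the macro state {0, 1, 2, 6, 8} (not from the slides); the ε edges listed in eps_edges are read off the figure and are an assumption:

#include <stdio.h>

#define N_STATES 9

/* Assumed epsilon edges of the figure: 0->1, 0->8, 1->2, 1->6, 5->8, 5->1. */
static const int eps_edges[][2] = { {0,1}, {0,8}, {1,2}, {1,6}, {5,8}, {5,1} };
static const int n_eps = sizeof eps_edges / sizeof eps_edges[0];

/* Compute the epsilon-closure of 'set' (a bit set of NFA states) by adding
   targets of epsilon edges until nothing changes. */
static unsigned eps_closure(unsigned set) {
    int changed = 1;
    while (changed) {
        changed = 0;
        for (int i = 0; i < n_eps; ++i)
            if ((set & (1u << eps_edges[i][0])) &&      /* source in the set */
                !(set & (1u << eps_edges[i][1]))) {     /* target not yet in */
                set |= 1u << eps_edges[i][1];
                changed = 1;
            }
    }
    return set;
}

int main(void) {
    unsigned start = eps_closure(1u << 0);    /* epsilon-closure({0}) */
    for (int s = 0; s < N_STATES; ++s)
        if (start & (1u << s))
            printf("%d ", s);                 /* prints: 0 1 2 6 8 */
    printf("\n");
    return 0;
}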
