Lec2 LexicalAnalyser

The document discusses the role and functions of a lexical analyzer, which is the first phase of a compiler that reads input characters, groups them into lexemes, and produces tokens. It outlines tasks such as scanning, lexical analysis, and error recovery, as well as key terminologies like lexemes, patterns, and tokens. Additionally, it covers the construction of patterns using regular expressions and the implementation of transition diagrams for recognizing tokens.

CSN-352

Lexical Analyzer

First phase of a compiler

Reads the input characters of the source program and groups them into lexemes.

Produces as output a sequence of tokens, one token for each lexeme in the source program.

Interaction between the parser and the lexical analyzer: a getNextToken command from the parser causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
Lexical Analyzer

Tasks of Lexical Analyzer
– Scanning: stripping comments and white space.
– Lexical analysis: identifying lexemes and producing tokens from the output of the scanner.
– Correlating error messages from the compiler with the source program (e.g., keeping track of line numbers).
Lexical Analyzer

Three terminologies
– Lexeme: a sequence of characters in the source program that matches the pattern for a token.
– Pattern: a description of the form that the lexemes of a token may take.
– Token: a pair of a token name (an abstract symbol, e.g., id for identifier) and an optional attribute value.

Attribute value: differentiates tokens that share the same token name; an attribute value describes the lexeme represented by the token. For example, number is a token with value 3.14, and number is another token with value 6.02.

The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.

Token-name = id (identifier)
– The attribute value will be a pointer to the symbol-table entry for this occurrence of ‘id’.
– Associated information in the symbol table: lexeme, position first found, type, etc.
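The (token-name, attribute-value) pair can be illustrated with a small sketch, assuming a toy list-backed symbol table; the helper name install, the entry layout, and the integer "pointer" are illustrative choices, not part of the slides:

```python
# Illustrative token representation: a (token-name, attribute) pair, where the
# attribute for an id is a pointer (here: an index) into the symbol table.
symbol_table = []

def install(lexeme):
    entry = {"lexeme": lexeme, "type": None}   # position, type, etc. filled later
    symbol_table.append(entry)
    return len(symbol_table) - 1               # "pointer" to the entry

tok = ("id", install("position"))
assert tok == ("id", 0)
assert symbol_table[tok[1]]["lexeme"] == "position"
```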
Lexical Analyzer

Patterns must cover all the tokens.
– One token for each keyword. The pattern for a keyword is the same as the keyword itself.
– Tokens for the operators, either individually or in classes, such as the token comparison mentioned for all comparison operators.
– One token representing all identifiers.
– One or more tokens representing constants, such as numbers and literal strings.
– Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Lexical Analyzer

E = M * C ** 2

// Tokens generated by the lexical analyzer:
// <id, pointer to symbol-table entry for E> <assign_op> <id, pointer for M>
// <mult_op> <id, pointer for C> <exp_op> <number, 2>
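A tokenizer for this line can be sketched as follows. The token names (id, assign_op, mult_op, exp_op, number) and the regex-based scanner loop are illustrative assumptions; a real lexical analyzer would return symbol-table pointers, not lexeme strings, as id attributes:

```python
import re

# Hypothetical (name, pattern) table; ** must come before * so the longer
# lexeme wins, and "ws" entries are stripped by the scanner.
SPEC = [("id", r"[A-Za-z_]\w*"), ("number", r"\d+"),
        ("exp_op", r"\*\*"), ("mult_op", r"\*"), ("assign_op", r"="),
        ("ws", r"\s+")]

def tokenize(line):
    tokens, pos = [], 0
    while pos < len(line):
        for name, pat in SPEC:
            m = re.match(pat, line[pos:])
            if m:
                if name != "ws":               # scanner strips white space
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError("no pattern matches at position %d" % pos)
    return tokens

assert tokenize("E = M * C ** 2") == [
    ("id", "E"), ("assign_op", "="), ("id", "M"), ("mult_op", "*"),
    ("id", "C"), ("exp_op", "**"), ("number", "2")]
```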
Lexical Error
ofr i =1 to 10
{
//for loop block
}

Here ‘ofr’ matches the pattern for the token ‘id’, although the programmer intended the keyword ‘for’; on its own, the lexical analyzer cannot tell that a lexical error has occurred.

Possible error recovery actions:

- Delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left (panic-mode recovery).
- Delete one character from the remaining input.
- Insert a missing character into the remaining input.
- Replace a character by another character.
- Transpose two adjacent characters.
Lookahead function by Lexical Analyzer

We often have to look one or more characters beyond the next lexeme to make the right decision about the token name of the lexeme.

In ‘i <= 3’, the lexeme is ‘<=’: after reading ‘<’, the analyzer must look at the next character before deciding between the tokens ‘<’ and ‘<=’.

A two-buffer scheme handles large lookaheads safely with two pointers:

‘lexemeBegin’ marks the beginning of the current lexeme being matched against a token pattern.

‘forward’ moves over the buffer to find the end of the lexeme.

eof indicates the end of the characters of the source program.

Pointer positions after recognizing a lexeme as per the pattern:

Once the next lexeme is determined, forward is set to the character at its right end.

After the lexeme is recorded, lexemeBegin is set to the character immediately after the lexeme just found.
Lookahead function by Lexical Analyzer

// Lookahead code for moving the ‘forward’ pointer between the buffers:

    if forward at end of first buffer then begin
        reload second buffer;
        forward = beginning of second buffer;
    end
    else if forward at end of second buffer then begin
        reload first buffer;
        forward = beginning of first buffer;
    end
    else forward = forward + 1;
Lookahead function by Lexical Analyzer
Sentinels: a sentinel is a special character, one that cannot occur in the source program, appended at the end of each buffer to mark its end; the preferred choice is ‘eof’. With sentinels, the common case needs only one test per character instead of two.

Buffer pairs with sentinels
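The sentinel idea can be sketched in Python as follows. This is an illustrative sketch (the names scan and half are my own, not from the slides): the hot loop makes only one comparison per character, and the two slower cases, reloading a buffer versus reaching the real end of input, are handled only when the sentinel is actually seen.

```python
EOF = "\0"  # sentinel: assumed never to occur in the source text

def scan(source, half=4):
    """Read `source` through fixed-size buffer halves, each terminated by an
    EOF sentinel. The common case tests only `c != EOF`."""
    out = []
    buf, pos, i = source[:half] + EOF, half, 0
    while True:
        c = buf[i]
        i += 1
        if c != EOF:
            out.append(c)                 # common case: one test per character
        elif pos < len(source):           # sentinel at end of buffer: reload
            buf, i = source[pos:pos + half] + EOF, 0
            pos += half
        else:                             # sentinel coincides with end of input
            return "".join(out)

assert scan("i <= 3;") == "i <= 3;"
```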


Constructing Patterns

Alphabet: Finite set of symbols

Types of alphabets

Binary {0,1}

ASCII {0-9, a-z, A-Z, some symbols}

Unicode - An extension of ASCII that allows many more characters to be represented

String: A finite sequence of symbols drawn from an alphabet.

Operations over strings:

Prefix, suffix, subsequence, concatenation, exponentiation, etc.

Exponentiation: define s^0 to be ϵ, and for all i > 0, define s^i to be s^(i-1)s.

Since ϵs = s, it follows that s^1 = s. Then s^2 = ss, s^3 = sss, and so on.
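The definition of string exponentiation translates directly into code; this small sketch (the helper name power is my own) follows the recurrence s^0 = ϵ, s^i = s^(i-1)s:

```python
def power(s, i):
    """String exponentiation: s^0 = "" (epsilon); s^i = s^(i-1) s for i > 0."""
    return "" if i == 0 else power(s, i - 1) + s

assert power("ab", 0) == ""
assert power("ab", 1) == "ab"     # since epsilon·s = s, s^1 = s
assert power("ab", 3) == "ababab"
```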

Language: A language is any countable set of strings over some fixed alphabet.
Constructing Patterns

Operations on Languages

- Union: L ∪ M = { s | s is in L or s is in M }
- Concatenation: LM = { st | s is in L and t is in M }
- Kleene closure: L* is obtained by concatenating L zero or more times.
- Positive closure: L+ is obtained by concatenating L one or more times.
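For finite languages these operations can be sketched directly with Python sets. The helper names concat and closure are my own, and closure builds only a finite approximation of L*, since the full closure is infinite:

```python
def concat(L, M):
    """Concatenation LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def closure(L, upto):
    """Finite approximation of the Kleene closure L*: L^0 ∪ L^1 ∪ ... ∪ L^upto."""
    result, power = {""}, {""}        # L^0 = {epsilon}
    for _ in range(upto):
        power = concat(power, L)
        result |= power
    return result

L = {"a", "b"}
assert L | {"0", "1"} == {"a", "b", "0", "1"}                 # union
assert concat(L, L) == {"aa", "ab", "ba", "bb"}               # concatenation
assert closure(L, 2) == {"", "a", "b", "aa", "ab", "ba", "bb"}
```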
Constructing Patterns
L = {A, B, ..., Z, a, b, ..., z}
D = {0, 1, 2, ..., 9}
Two languages whose strings all have length 1.

- L ∪ D is the language with 62 strings of length one, each of which is either one letter or one digit.
- LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
- L^4 is the set of all 4-letter strings.
- D+ is the set of all strings of one or more digits.
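These counts can be checked with a short sketch, using Python's string module for the letter and digit alphabets; L^4 is only counted, not materialized, since it has 52^4 members:

```python
import string

L = set(string.ascii_letters)    # the 52 letters A-Z, a-z
D = set(string.digits)           # the 10 digits 0-9

assert len(L | D) == 62                              # L ∪ D: one letter or one digit
assert len({l + d for l in L for d in D}) == 520     # LD: letter followed by digit
assert len(L) ** 4 == 7311616                        # |L^4| = 52^4 four-letter strings
```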
Constructing Patterns using Regular Expressions

Define a language for valid C identifiers. Regular expressions provide a way to define languages by applying language operators to symbols of the alphabet.

- We use italics for symbols of the language, and boldface for their corresponding regular expressions.

r = L_( L_ | D )*

L_ - any letter or _,
() - groups subexpressions,
| - union (alternation),
* - zero or more occurrences.

r is a regular expression for the language L(r).

Fundamental rules:
- ϵ is a regular expression, and L(ϵ) is {ϵ}, the language whose sole member is the empty string.
- If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
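The identifier pattern can be tried out with Python's re module, which stands in here for the abstract RE notation; the bracket-class spelling [A-Za-z_] of L_ is an assumption about the intended alphabet:

```python
import re

# Regular expression for C identifiers: a letter or underscore followed by
# letters, underscores, or digits (the pattern r = L_( L_ | D )* above).
ident = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")

assert ident.match("count_1")
assert ident.match("_tmp")
assert not ident.match("1count")   # cannot start with a digit
```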
Constructing Patterns using Regular Expressions
Making larger regular expressions from smaller regular expressions r and s:

(r)|(s) is a regular expression denoting the language L(r) ∪ L(s).

(r)(s) is a regular expression denoting the language L(r)L(s).

(r)* is a regular expression denoting (L(r))*.

(r)+ is a regular expression denoting (L(r))+.

Let Σ = {a, b} // alphabet

- r = a|b, L(a|b) = {a, b}.

- r = (a|b)(a|b), L = {aa, ab, ba, bb}. Another regular expression for the same language is r = aa|ab|ba|bb.

- The language consisting of all strings of zero or more a's is r = a*, with L(a*) = {ϵ, a, aa, aaa, ...}.

- What does (a|b)* denote? The set of all strings of a's and b's, including ϵ. Another regular expression for the same language is (a*b*)*.

- a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings consisting of zero or more a's and ending in b.
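These equivalences can be spot-checked with a small helper built on Python's re.fullmatch; the helper name matches is my own, and Python's regex dialect is a superset of the notation used here:

```python
import re

def matches(pattern, s):
    """True iff the whole string s is in L(pattern)."""
    return re.fullmatch(pattern, s) is not None

assert matches(r"(a|b)(a|b)", "ab")
assert matches(r"aa|ab|ba|bb", "ab")      # same language, written out
assert matches(r"a|a*b", "aaab")          # zero or more a's ending in b
assert not matches(r"a|a*b", "aba")
assert matches(r"(a|b)*", "") and matches(r"(a*b*)*", "")
```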
Algebraic laws for regular expressions
Regular Definition
- For convenience, we give names to regular expressions to be used in subsequent expressions; this is called assigning a definition.

- A regular definition is a sequence of definitions:

d1 -> r1
d2 -> r2
...
dn -> rn

- Each di is a new symbol, not in the alphabet Σ and distinct from the other d's.
- Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.

// Regular definition for the language of C identifiers:
letter_ -> A | B | ... | Z | a | b | ... | z | _
digit   -> 0 | 1 | ... | 9
id      -> letter_ ( letter_ | digit )*
Extension operators for REs
- One or more instances, using positive closure: r+

- Zero or more instances, using Kleene closure: r*

- r* = r+ | ϵ, and r+ = rr* = r*r

- Zero or one instance: r?  (r? = r | ϵ)

- Character classes: r = a1|a2|...|an can be written as [a1a2...an], or as [a1-an] when the symbols a1, ..., an form a consecutive range.

// Regular definition for the language of C identifiers, using the extension operators:
letter_ -> [A-Za-z_]
digit   -> [0-9]
id      -> letter_ ( letter_ | digit )*
Other regular expression operators (\ or ^)
Backslash (\): special characters must be turned off when they are part of the string being matched.
- By using double quotes around the special characters: "**".
or
- By using a backslash before each special character: \*\*.

Caret (^): we use ^ to represent a complemented character class, i.e., any character except the ones listed in the character class.

- [^A-Za-z] matches any character that is not an uppercase or lowercase letter.
- [^\^] represents any character but the caret (or newline, since newline cannot be in any character class).

- ^the[a-z]*: here ^ matches the beginning of the line. // ^ outside the class []
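Python's re module uses the same backslash, character-class, and caret conventions, so the examples above can be checked directly (illustrative sketch):

```python
import re

assert re.search(r"\*\*", "C ** 2")            # backslash turns '*' off
assert re.fullmatch(r"[^A-Za-z]", "7")         # complemented class: not a letter
assert not re.fullmatch(r"[^A-Za-z]", "x")
assert re.match(r"^the[a-z]*", "theory")       # ^ outside [] anchors line start
assert not re.match(r"^the[a-z]*", "atheist")
```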
Other regular expression operators
Recognition of Tokens
Grammar for a language of branching statements:

Possible terminals (which can be interpreted as tokens): if, else, then, relop, id, number

The lexical analyzer will extract tokens based on the patterns for the tokens; in addition it will strip out white space (blanks, tabs, newlines):
Transition Diagrams
- Transition diagram:
- An intermediate step in recognizing a string as a lexeme.
- A lexical analyzer implements transition diagrams built from the REs.

- Transition diagrams have two components:

- States: a collection of nodes/circles called states.
- Edges: directed edges connecting one state to another, each labeled with a symbol.

- Initial state: the transition diagram starts with an initial/start state, marked by an incoming edge labeled ‘start’.

- Final state: a final/accepting state, indicating a lexeme has been found, with an associated action if required. Represented by a double circle.

- Double circle with *: at times the forward pointer must move one position beyond the end of the lexeme before the lexeme can be determined, even though that extra character is not part of the lexeme. A * on a final state indicates retracting the pointer one character; more than one * indicates retracting more than one character.
Transition Diagrams
Relop -> < | <= | <> | = | > | >=

Each accepting state returns the token relop with an attribute constant indicating which operator was found (symbol-table entries record the type of relop).

// Note: state 4 has a * to indicate that we must retract the input one position
// (the character after ‘<’ was neither ‘=’ nor ‘>’).

Transition diagram for relop
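The relop diagram can be hand-coded as a small function. This is an illustrative Python sketch: the state numbering and the attribute names LT, LE, NE, EQ, GT, GE follow the usual textbook presentation, and a returned retract count stands in for the * retraction:

```python
def relop(s):
    """Scan a relational operator at the start of s.
    Returns (token, attribute, retract); retract=1 mirrors the * on a final
    state: the last character read is not part of the lexeme.
    Assumes s is padded so one character of lookahead always exists."""
    c = s[0]
    if c == "<":
        if s[1] == "=": return ("relop", "LE", 0)
        if s[1] == ">": return ("relop", "NE", 0)
        return ("relop", "LT", 1)      # state 4: retract one position (*)
    if c == "=":
        return ("relop", "EQ", 0)
    if c == ">":
        if s[1] == "=": return ("relop", "GE", 0)
        return ("relop", "GT", 1)      # retract one position (*)
    return None                        # no relop at this position

assert relop("<= 3") == ("relop", "LE", 0)
assert relop("< 3")  == ("relop", "LT", 1)
assert relop("<>")   == ("relop", "NE", 0)
assert relop("> x")  == ("relop", "GT", 1)
```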


Transition Diagrams: differentiating keywords and identifiers
- Keywords (e.g., if, else, for, while) are the reserved words of a programming language, but they fit the pattern of an identifier.

- The symbol table is pre-loaded with an entry for each reserved word of the programming language, recording its token name (this initialization is not part of the lexical analysis process).

- How do we differentiate keywords from identifiers in a transition diagram?

Method-1
// We retract one position to get the actual lexeme.

A transition diagram for ids and keywords

- installID() checks whether the lexeme already exists in the symbol table; if not, it makes an entry in the symbol table for the lexeme. Either way it returns a pointer to the symbol-table entry.
- getToken() returns the right token name from the symbol-table entry: either id or one of the keyword tokens that was initially installed in the table.

Method-2: Create a separate transition diagram for each keyword.

- In this case, we prioritize the keyword transition diagrams over the identifier diagram.
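Method-1 can be sketched as follows. The symbol-table layout (a plain dict mapping lexemes to token names) is an assumption for illustration, while installID and getToken follow the names used above:

```python
# Symbol table pre-loaded with reserved words; by convention here the token
# name of a keyword is the keyword itself.
KEYWORDS = {"if", "else", "for", "while"}
symtab = {kw: kw for kw in KEYWORDS}

def installID(lexeme):
    """Return a pointer (here: the key) to the symbol-table entry, adding one if new."""
    if lexeme not in symtab:
        symtab[lexeme] = "id"
    return lexeme

def getToken(ptr):
    """Return the token name recorded for this entry: 'id' or a keyword token."""
    return symtab[ptr]

assert getToken(installID("count")) == "id"
assert getToken(installID("for")) == "for"    # reserved word wins over id
```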
Lex Compiler
- Lex is a lexical-analyzer generator; the tool that translates Lex specifications is called the Lex compiler.
- We specify a lexical analyzer by writing regular expressions that describe the patterns for tokens.
- It generates the transition diagrams in the background and creates a file lex.yy.c. // yy refers to the yacc parser generator, commonly used together with lex.

Creating a lexical analyzer with Lex:
lex.l -> (Lex compiler) -> lex.yy.c -> (C compiler) -> a.out; a.out then maps an input stream to a sequence of tokens.
Lex Compiler: Structure of the Lex Program

declarations
%%
translation rules
%%
auxiliary functions

The declarations section has manifest constants and regular definitions.

Translation rules have the form: Pattern {action}

Transition Graph: Automata

- Lex translates all REs into automata in the background to check whether a string belongs to the language of the RE or not.
- Automata: states and edges. Nodes are states, and labeled edges encode the transition function.
- Very similar to a transition diagram, except:
a) The same symbol can label edges from one state to several different states, and
b) An edge may be labeled by ϵ, the empty string, instead of, or in addition to, symbols from the input alphabet.

Two types of finite automata:

1. Nondeterministic Finite Automata (NFA) – may have more than one transition edge per alphabet symbol from a state.
2. Deterministic Finite Automata (DFA) – exactly one transition edge per alphabet symbol from each state.
Transition Graph: Automata

An automaton consists of:

1. A finite set of states
2. An input alphabet
3. A transition function
4. A start state
5. A set of final (accepting) states
Transition Table: states in rows and input symbols (including ϵ) in columns; the entry in a cell is the value of the transition function for that state-input pair.

NFA for the language of all strings of a's and b's ending in the particular string abb.
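Simulating an NFA means tracking the set of states it could be in after each input symbol. This sketch hard-codes a small NFA for (a|b)*abb, the example above of strings ending in abb; the state numbering is an assumption:

```python
# Transition table as a dict: (state, symbol) -> set of next states.
# State 0 loops on a and b, and nondeterministically guesses the 'a' of 'abb'.
NFA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, 3

def accepts(s):
    states = {START}
    for c in s:
        states = set().union(*(NFA.get((q, c), set()) for q in states))
    return ACCEPT in states

assert accepts("abb")
assert accepts("aabb")
assert accepts("babb")
assert not accepts("abab")
assert not accepts("bb")
```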
Transition Graph: Automata
- Lexical-analyzer software implements automata in the background.
- Constructing an NFA is more straightforward than constructing a DFA on paper.
- But simulating an NFA is less straightforward than simulating a DFA.
- Hence, NFAs are converted into equivalent DFAs.
