Lec2 LexicalAnalyser

The document discusses the role and functions of a lexical analyzer, which is the first phase of a compiler that reads input characters, groups them into lexemes, and produces tokens. It outlines tasks such as scanning, lexical analysis, and error recovery, as well as key terminologies like lexemes, patterns, and tokens. Additionally, it covers the construction of patterns using regular expressions and the implementation of transition diagrams for recognizing tokens.

CSN-352

Lexical Analyzer

First phase of a compiler

Reads the input characters of the source program and groups them into lexemes.

Produces as output a sequence of tokens, one token for each lexeme in the source program.

Interaction between the parser and the lexical analyzer: a getNextToken command from the parser causes the lexical analyzer to read characters from its input until it can identify the next lexeme and produce for it the next token, which it returns to the parser.
Lexical Analyzer

Tasks of Lexical Analyzer
– Scanning: stripping comments and white space.
– Lexical analysis: identifying lexemes and producing tokens from the output of the scanner.
– Correlating error messages from the compiler with the source program (e.g., keeping track of line numbers).
Lexical Analyzer

Three terminologies
– Lexeme: a sequence of characters in the source program that matches the pattern for a token.
– Pattern: a description of the form that the lexemes of a token may take.
– Token: a pair of a token name (an abstract symbol, e.g., id for identifier) and an optional attribute value.

Attribute value: differentiates tokens that share the same token name; an attribute value describes the lexeme represented by the token. For example, number is a token with value 3.14, and number is another token with value 6.02.

The token name influences parsing decisions, while the attribute value influences the translation of tokens after the parse.

Token-name = id (identifier)
– The attribute value will be a pointer to the symbol-table entry for this occurrence of ‘id’.
– Associated information in the symbol table: lexeme, position first found, type, etc.
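The (token-name, attribute-value) pair can be illustrated with a small sketch, assuming a toy list-backed symbol table; the helper name install, the entry layout, and the integer "pointer" are illustrative choices, not part of the slides:

```python
# Illustrative token representation: a (token-name, attribute) pair, where the
# attribute for an id is a pointer (here: an index) into the symbol table.
symbol_table = []

def install(lexeme):
    entry = {"lexeme": lexeme, "type": None}   # position, type, etc. filled later
    symbol_table.append(entry)
    return len(symbol_table) - 1               # "pointer" to the entry

tok = ("id", install("position"))
assert tok == ("id", 0)
assert symbol_table[tok[1]]["lexeme"] == "position"
```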
Lexical Analyzer

Patterns must cover all the tokens.
– One token for each keyword. The pattern for a keyword is the same as the keyword itself.
– Tokens for the operators, either individually or in classes, such as the token comparison mentioned for all comparison operators.
– One token representing all identifiers.
– One or more tokens representing constants, such as numbers and literal strings.
– Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
Lexical Analyzer

E = M * C ** 2

// Tokens generated by the lexical analyzer:
// <id, pointer to symbol-table entry for E> <assign_op> <id, pointer for M>
// <mult_op> <id, pointer for C> <exp_op> <number, 2>
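A tokenizer for this line can be sketched as follows. The token names (id, assign_op, mult_op, exp_op, number) and the regex-based scanner loop are illustrative assumptions; a real lexical analyzer would return symbol-table pointers, not lexeme strings, as id attributes:

```python
import re

# Hypothetical (name, pattern) table; ** must come before * so the longer
# lexeme wins, and "ws" entries are stripped by the scanner.
SPEC = [("id", r"[A-Za-z_]\w*"), ("number", r"\d+"),
        ("exp_op", r"\*\*"), ("mult_op", r"\*"), ("assign_op", r"="),
        ("ws", r"\s+")]

def tokenize(line):
    tokens, pos = [], 0
    while pos < len(line):
        for name, pat in SPEC:
            m = re.match(pat, line[pos:])
            if m:
                if name != "ws":               # scanner strips white space
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError("no pattern matches at position %d" % pos)
    return tokens

assert tokenize("E = M * C ** 2") == [
    ("id", "E"), ("assign_op", "="), ("id", "M"), ("mult_op", "*"),
    ("id", "C"), ("exp_op", "**"), ("number", "2")]
```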
Lexical Error
ofr i =1 to 10
{
//for loop block
}

Here ‘ofr’ matches the pattern for the token ‘id’, although the programmer intended the keyword ‘for’; on its own, the lexical analyzer cannot tell that a lexical error has occurred.

Possible error recovery actions:

- Delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left (panic-mode recovery).
- Delete one character from the remaining input.
- Insert a missing character into the remaining input.
- Replace a character by another character.
- Transpose two adjacent characters.
Lookahead function by Lexical Analyzer

We often have to look one or more characters beyond the next lexeme to make the right decision about the token name of the lexeme.

In ‘i <= 3’, the lexeme is ‘<=’: after reading ‘<’, the analyzer must look at the next character before deciding between the tokens ‘<’ and ‘<=’.

A two-buffer scheme handles large lookaheads safely with two pointers:

‘lexemeBegin’ marks the beginning of the current lexeme being matched against a token pattern.

‘forward’ moves over the buffer to find the end of the lexeme.

eof indicates the end of the characters of the source program.

Pointer positions after recognizing a lexeme as per the pattern:

Once the next lexeme is determined, forward is set to the character at its right end.

After the lexeme is recorded, lexemeBegin is set to the character immediately after the lexeme just found.
Lookahead function by Lexical Analyzer

// Lookahead code for moving the ‘forward’ pointer between the buffers:

    if forward at end of first buffer then begin
        reload second buffer;
        forward = beginning of second buffer;
    end
    else if forward at end of second buffer then begin
        reload first buffer;
        forward = beginning of first buffer;
    end
    else forward = forward + 1;
Lookahead function by Lexical Analyzer
Sentinels: a sentinel is a special character, one that cannot occur in the source program, appended at the end of each buffer to mark its end; the preferred choice is ‘eof’. With sentinels, the common case needs only one test per character instead of two.

Buffer pairs with sentinels
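The sentinel idea can be sketched in Python as follows. This is an illustrative sketch (the names scan and half are my own, not from the slides): the hot loop makes only one comparison per character, and the two slower cases, reloading a buffer versus reaching the real end of input, are handled only when the sentinel is actually seen.

```python
EOF = "\0"  # sentinel: assumed never to occur in the source text

def scan(source, half=4):
    """Read `source` through fixed-size buffer halves, each terminated by an
    EOF sentinel. The common case tests only `c != EOF`."""
    out = []
    buf, pos, i = source[:half] + EOF, half, 0
    while True:
        c = buf[i]
        i += 1
        if c != EOF:
            out.append(c)                 # common case: one test per character
        elif pos < len(source):           # sentinel at end of buffer: reload
            buf, i = source[pos:pos + half] + EOF, 0
            pos += half
        else:                             # sentinel coincides with end of input
            return "".join(out)

assert scan("i <= 3;") == "i <= 3;"
```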


Constructing Patterns

Alphabet: Finite set of symbols

Types of alphabets

Binary {0,1}

ASCII {0-9, a-z, A-Z, some symbols}

Unicode - An extension of ASCII that allows many more characters to be represented

String: A finite sequence of symbols drawn from an alphabet.

Operations over strings:

Prefix, suffix, subsequence, concatenation, exponentiation, etc.

Exponentiation: define s^0 to be ϵ, and for all i > 0, define s^i to be s^(i-1)s.

Since ϵs = s, it follows that s^1 = s. Then s^2 = ss, s^3 = sss, and so on.
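The definition of string exponentiation translates directly into code; this small sketch (the helper name power is my own) follows the recurrence s^0 = ϵ, s^i = s^(i-1)s:

```python
def power(s, i):
    """String exponentiation: s^0 = "" (epsilon); s^i = s^(i-1) s for i > 0."""
    return "" if i == 0 else power(s, i - 1) + s

assert power("ab", 0) == ""
assert power("ab", 1) == "ab"     # since epsilon·s = s, s^1 = s
assert power("ab", 3) == "ababab"
```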

Language: A language is any countable set of strings over some fixed alphabet.
Constructing Patterns

Operations on Languages

- Union: L ∪ M = { s | s is in L or s is in M }
- Concatenation: LM = { st | s is in L and t is in M }
- Kleene closure: L* is obtained by concatenating L zero or more times.
- Positive closure: L+ is obtained by concatenating L one or more times.
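For finite languages these operations can be sketched directly with Python sets. The helper names concat and closure are my own, and closure builds only a finite approximation of L*, since the full closure is infinite:

```python
def concat(L, M):
    """Concatenation LM: every string of L followed by every string of M."""
    return {s + t for s in L for t in M}

def closure(L, upto):
    """Finite approximation of the Kleene closure L*: L^0 ∪ L^1 ∪ ... ∪ L^upto."""
    result, power = {""}, {""}        # L^0 = {epsilon}
    for _ in range(upto):
        power = concat(power, L)
        result |= power
    return result

L = {"a", "b"}
assert L | {"0", "1"} == {"a", "b", "0", "1"}                 # union
assert concat(L, L) == {"aa", "ab", "ba", "bb"}               # concatenation
assert closure(L, 2) == {"", "a", "b", "aa", "ab", "ba", "bb"}
```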
Constructing Patterns
L = {A, B, ..., Z, a, b, ..., z}
D = {0, 1, 2, ..., 9}
Two languages whose strings all have length 1.

- L ∪ D is the language with 62 strings of length one, each of which is either one letter or one digit.
- LD is the set of 520 strings of length two, each consisting of one letter followed by one digit.
- L^4 is the set of all 4-letter strings.
- D+ is the set of all strings of one or more digits.
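These counts can be checked with a short sketch, using Python's string module for the letter and digit alphabets; L^4 is only counted, not materialized, since it has 52^4 members:

```python
import string

L = set(string.ascii_letters)    # the 52 letters A-Z, a-z
D = set(string.digits)           # the 10 digits 0-9

assert len(L | D) == 62                              # L ∪ D: one letter or one digit
assert len({l + d for l in L for d in D}) == 520     # LD: letter followed by digit
assert len(L) ** 4 == 7311616                        # |L^4| = 52^4 four-letter strings
```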
Constructing Patterns using Regular Expressions

Define a language for valid C identifiers. Regular expressions provide a way to define languages by applying language operators to symbols of the alphabet.

- We use italics for symbols of the language, and boldface for their corresponding regular expressions.

r = L_( L_ | D )*

L_ - any letter or _,
() - groups subexpressions,
| - union (alternation),
* - zero or more occurrences.

r is a regular expression for the language L(r).

Fundamental rules:
- ϵ is a regular expression, and L(ϵ) is {ϵ}, the language whose sole member is the empty string.
- If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
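The identifier pattern can be tried out with Python's re module, which stands in here for the abstract RE notation; the bracket-class spelling [A-Za-z_] of L_ is an assumption about the intended alphabet:

```python
import re

# Regular expression for C identifiers: a letter or underscore followed by
# letters, underscores, or digits (the pattern r = L_( L_ | D )* above).
ident = re.compile(r"[A-Za-z_][A-Za-z_0-9]*\Z")

assert ident.match("count_1")
assert ident.match("_tmp")
assert not ident.match("1count")   # cannot start with a digit
```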
Constructing Patterns using Regular Expressions
Making larger regular expressions from smaller regular expressions r and s:

(r)|(s) is a regular expression denoting the language L(r) ∪ L(s).

(r)(s) is a regular expression denoting the language L(r)L(s).

(r)* is a regular expression denoting (L(r))*.

(r)+ is a regular expression denoting (L(r))+.

Let Σ = {a, b} // alphabet

- r = a|b, L(a|b) = {a, b}.

- r = (a|b)(a|b), L = {aa, ab, ba, bb}. Another regular expression for the same language is r = aa|ab|ba|bb.

- The language consisting of all strings of zero or more a's is r = a*, with L(a*) = {ϵ, a, aa, aaa, ...}.

- What does (a|b)* denote? The set of all strings of a's and b's, including ϵ. Another regular expression for the same language is (a*b*)*.

- a|a*b denotes the language {a, b, ab, aab, aaab, ...}, that is, the string a and all strings consisting of zero or more a's and ending in b.
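These equivalences can be spot-checked with a small helper built on Python's re.fullmatch; the helper name matches is my own, and Python's regex dialect is a superset of the notation used here:

```python
import re

def matches(pattern, s):
    """True iff the whole string s is in L(pattern)."""
    return re.fullmatch(pattern, s) is not None

assert matches(r"(a|b)(a|b)", "ab")
assert matches(r"aa|ab|ba|bb", "ab")      # same language, written out
assert matches(r"a|a*b", "aaab")          # zero or more a's ending in b
assert not matches(r"a|a*b", "aba")
assert matches(r"(a|b)*", "") and matches(r"(a*b*)*", "")
```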
Algebraic laws for regular expressions
Regular Definition
- For convenience, we give names to regular expressions to be used in subsequent expressions; this is called assigning a definition.

- A regular definition is a sequence of definitions:

d1 -> r1
d2 -> r2
...
dn -> rn

- Each di is a new symbol, not in the alphabet Σ and distinct from the other d's.
- Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, ..., di-1}.

// Regular definition for the language of C identifiers:
letter_ -> A | B | ... | Z | a | b | ... | z | _
digit   -> 0 | 1 | ... | 9
id      -> letter_ ( letter_ | digit )*
Extension operators for REs
- One or more instances, using positive closure: r+

- Zero or more instances, using Kleene closure: r*

- r* = r+ | ϵ, and r+ = rr* = r*r

- Zero or one instance: r?  (r? = r | ϵ)

- Character classes: r = a1|a2|...|an can be written as [a1a2...an], or as [a1-an] when the symbols a1, ..., an form a consecutive range.

// Regular definition for the language of C identifiers, using the extension operators:
letter_ -> [A-Za-z_]
digit   -> [0-9]
id      -> letter_ ( letter_ | digit )*
Other regular expression operators (\ or ^)
Backslash (\): special characters must be turned off when they are part of the string being matched.
- By using double quotes around the special characters: "**".
or
- By using a backslash before each special character: \*\*.

Caret (^): we use ^ to represent a complemented character class, i.e., any character except the ones listed in the character class.

- [^A-Za-z] matches any character that is not an uppercase or lowercase letter.
- [^\^] represents any character but the caret (or newline, since newline cannot be in any character class).

- ^the[a-z]*: here ^ matches the beginning of the line. // ^ outside the class []
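Python's re module uses the same backslash, character-class, and caret conventions, so the examples above can be checked directly (illustrative sketch):

```python
import re

assert re.search(r"\*\*", "C ** 2")            # backslash turns '*' off
assert re.fullmatch(r"[^A-Za-z]", "7")         # complemented class: not a letter
assert not re.fullmatch(r"[^A-Za-z]", "x")
assert re.match(r"^the[a-z]*", "theory")       # ^ outside [] anchors line start
assert not re.match(r"^the[a-z]*", "atheist")
```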
Other regular expression operators
Recognition of Tokens
Grammar for a language of branching statements:

Possible terminals (which can be interpreted as tokens): if, else, then, relop, id, number

The lexical analyzer will extract tokens based on the patterns for the tokens; in addition it will strip out white space (blanks, tabs, newlines):
Transition Diagrams
- Transition diagram:
- An intermediate step in recognizing a string as a lexeme.
- A lexical analyzer implements transition diagrams built from the REs.

- Transition diagrams have two components:

- States: a collection of nodes/circles called states.
- Edges: directed edges connecting one state to another, each labeled with a symbol.

- Initial state: the transition diagram starts with an initial/start state, marked by an incoming edge labeled ‘start’.

- Final state: a final/accepting state, indicating a lexeme has been found, with an associated action if required. Represented by a double circle.

- Double circle with *: at times the forward pointer must move one position beyond the end of the lexeme before the lexeme can be determined, even though that extra character is not part of the lexeme. A * on a final state indicates retracting the pointer one character; more than one * indicates retracting more than one character.
Transition Diagrams
Relop -> < | <= | <> | = | > | >=

Each accepting state returns the token relop with an attribute constant indicating which operator was found (symbol-table entries record the type of relop).

// Note: state 4 has a * to indicate that we must retract the input one position
// (the character after ‘<’ was neither ‘=’ nor ‘>’).

Transition diagram for relop
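The relop diagram can be hand-coded as a small function. This is an illustrative Python sketch: the state numbering and the attribute names LT, LE, NE, EQ, GT, GE follow the usual textbook presentation, and a returned retract count stands in for the * retraction:

```python
def relop(s):
    """Scan a relational operator at the start of s.
    Returns (token, attribute, retract); retract=1 mirrors the * on a final
    state: the last character read is not part of the lexeme.
    Assumes s is padded so one character of lookahead always exists."""
    c = s[0]
    if c == "<":
        if s[1] == "=": return ("relop", "LE", 0)
        if s[1] == ">": return ("relop", "NE", 0)
        return ("relop", "LT", 1)      # state 4: retract one position (*)
    if c == "=":
        return ("relop", "EQ", 0)
    if c == ">":
        if s[1] == "=": return ("relop", "GE", 0)
        return ("relop", "GT", 1)      # retract one position (*)
    return None                        # no relop at this position

assert relop("<= 3") == ("relop", "LE", 0)
assert relop("< 3")  == ("relop", "LT", 1)
assert relop("<>")   == ("relop", "NE", 0)
assert relop("> x")  == ("relop", "GT", 1)
```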


Transition Diagrams: differentiating keywords and identifiers
- Keywords (e.g., if, else, for, while) are the reserved words of a programming language, but they fit the pattern of an identifier.

- The symbol table is pre-loaded with an entry for each reserved word of the programming language, recording its token name (this initialization is not part of the lexical analysis process).

- How do we differentiate keywords from identifiers in a transition diagram?

Method-1
// We retract one position to get the actual lexeme.

A transition diagram for ids and keywords

- installID() checks whether the lexeme already exists in the symbol table; if not, it makes an entry in the symbol table for the lexeme. Either way it returns a pointer to the symbol-table entry.
- getToken() returns the right token name from the symbol-table entry: either id or one of the keyword tokens that was initially installed in the table.

Method-2: Create a separate transition diagram for each keyword.

- In this case, we prioritize the keyword transition diagrams over the identifier diagram.
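Method-1 can be sketched as follows. The symbol-table layout (a plain dict mapping lexemes to token names) is an assumption for illustration, while installID and getToken follow the names used above:

```python
# Symbol table pre-loaded with reserved words; by convention here the token
# name of a keyword is the keyword itself.
KEYWORDS = {"if", "else", "for", "while"}
symtab = {kw: kw for kw in KEYWORDS}

def installID(lexeme):
    """Return a pointer (here: the key) to the symbol-table entry, adding one if new."""
    if lexeme not in symtab:
        symtab[lexeme] = "id"
    return lexeme

def getToken(ptr):
    """Return the token name recorded for this entry: 'id' or a keyword token."""
    return symtab[ptr]

assert getToken(installID("count")) == "id"
assert getToken(installID("for")) == "for"    # reserved word wins over id
```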
Lex Compiler
- Lex is a lexical-analyzer generator; the tool that translates Lex specifications is called the Lex compiler.
- We specify a lexical analyzer by writing regular expressions that describe the patterns for tokens.
- It generates the transition diagrams in the background and creates a file lex.yy.c. // yy refers to the yacc parser generator, commonly used together with lex.

Creating a lexical analyzer with Lex:
lex.l -> (Lex compiler) -> lex.yy.c -> (C compiler) -> a.out; a.out then maps an input stream to a sequence of tokens.
Lex Compiler: Structure of the Lex Program

declarations
%%
translation rules
%%
auxiliary functions

The declarations section has manifest constants and regular definitions.

Translation rules have the form: Pattern {action}

Transition Graph: Automata

- Lex translates all REs into automata in the background to check whether a string belongs to the language of the RE or not.
- Automata: states and edges. Nodes are states, and labeled edges encode the transition function.
- Very similar to a transition diagram, except:
a) The same symbol can label edges from one state to several different states, and
b) An edge may be labeled by ϵ, the empty string, instead of, or in addition to, symbols from the input alphabet.

Two types of finite automata:

1. Nondeterministic Finite Automata (NFA) – may have more than one transition edge per alphabet symbol from a state.
2. Deterministic Finite Automata (DFA) – exactly one transition edge per alphabet symbol from each state.
Transition Graph: Automata

An automaton consists of:

1. A finite set of states
2. An input alphabet
3. A transition function
4. A start state
5. A set of final (accepting) states
Transition Table: states in rows and input symbols (including ϵ) in columns; the entry in a cell is the value of the transition function for that state-input pair.

NFA for the language of all strings of a's and b's ending in the particular string abb.
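Simulating an NFA means tracking the set of states it could be in after each input symbol. This sketch hard-codes a small NFA for (a|b)*abb, the example above of strings ending in abb; the state numbering is an assumption:

```python
# Transition table as a dict: (state, symbol) -> set of next states.
# State 0 loops on a and b, and nondeterministically guesses the 'a' of 'abb'.
NFA = {
    (0, "a"): {0, 1}, (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}
START, ACCEPT = 0, 3

def accepts(s):
    states = {START}
    for c in s:
        states = set().union(*(NFA.get((q, c), set()) for q in states))
    return ACCEPT in states

assert accepts("abb")
assert accepts("aabb")
assert accepts("babb")
assert not accepts("abab")
assert not accepts("bb")
```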
Transition Graph: Automata
- Lexical-analyzer software implements automata in the background.
- Constructing an NFA is more straightforward than constructing a DFA on paper.
- But simulating an NFA is less straightforward than simulating a DFA.
- Hence, NFAs are converted into equivalent DFAs.
