Compiler Design CS - 2

The document provides an overview of lexical analyzers in compiler design, detailing their role in reading source text and producing tokens while tracking source coordinates for debugging. It explains key terminologies such as tokens, lexemes, and patterns, and emphasizes the importance of a separate lexical analysis phase for compiler efficiency and portability. Additionally, it discusses regular expressions and their use in defining languages and tokens in programming, along with examples and problems for constructing regular expressions.

Uploaded by

aryanvarshney782
Compiler Design

BITS Pilani
Pilani | Dubai | Goa | Hyderabad

Contact Session - 2
Introduction to Lexical Analyzer
Lexical Analyzer (A.k.a. Scanner)
• The only part of a compiler that looks at each character of
the source text and does a linear analysis
• Reads source text and produces TOKENS
• Also keeps track of the source-coordinates of each token -
which file name, line number and position
– (This is useful for debugging & error indication purposes.)
• Advantages of a separate Lexical Analyzer:
– Keeps Compiler design simple
– Improves Efficiency and
– Increases Portability

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The Role of a Lexical Analyzer
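[Figure: the syntax analyzer asks the lexical analyzer to "get next token"; the lexical analyzer asks the source program to "get next char" and hands back the next token. The figure also shows the symbol table, which contains a record for each identifier.]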
• Some terminology:
Token: a group of characters having a collective meaning.
Lexeme: a particular instance of a token.
• E.g. token: identifier, lexeme: pi, etc.
Pattern: the rule describing how a token can be formed.
• E.g. identifier: ([a-z]|[A-Z]) ([a-z]|[A-Z]|[0-9])*

• The lexical analyzer does not have to be a separate phase, but having a separate phase simplifies the design and improves efficiency and portability.



Tokens, Patterns and Lexemes
• What are Tokens?
– The basic lexical units of the language
– A sequence of abstract characters that can be treated as a unit in the grammar of the language
– A programming language classifies the tokens into a finite set of token types
• Some tokens may have attributes
– An integer constant token will have the actual integer (17, 42) as an attribute
– Identifiers will have a string with the actual id
• A note on terminology: some texts refer to
– token types as tokens, and
– tokens as lexemes

We will stick to the terms Tokens and Token Types
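
A minimal sketch of a token that carries a token type plus an optional attribute, as described above. The names TokenType, Token and make_token are illustrative assumptions, and only three token types are modelled.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional, Union


class TokenType(Enum):
    """A finite set of token types (only three, for illustration)."""
    ID = auto()
    INT = auto()
    LPAREN = auto()


@dataclass(frozen=True)
class Token:
    type: TokenType
    # The attribute is None for token types that need no extra information.
    attribute: Optional[Union[int, str]] = None


def make_token(lexeme: str) -> Token:
    """Classify a lexeme and attach its attribute, if it has one."""
    if lexeme.isascii() and lexeme.isdigit():
        return Token(TokenType.INT, int(lexeme))   # attribute: the integer value
    if lexeme == "(":
        return Token(TokenType.LPAREN)             # no attribute needed
    if lexeme.isidentifier():
        return Token(TokenType.ID, lexeme)         # attribute: the actual id
    raise ValueError(f"not a known lexeme: {lexeme!r}")


if __name__ == "__main__":
    for text in ["17", "pi", "("]:
        print(make_token(text))
```

Here 17 and 42 would share the token type INT and differ only in the attribute.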


Tokens Example
• Let us Consider the program segment:
void main() { printf("Hello World\n"); }
• The tokens of this program segment are:
1. void
2. main
3. (
4. )
5. {
6. printf
7. (
8. "Hello World\n"
9. )
10. ;
11. }
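
The token list above can be reproduced with a small scanner. This is only a sketch: the token type names (STRING, WORD, PUNCT, SPACE) and the function tokenize are assumptions made for this one program segment, not a full C scanner. It also records the line and column of each token, i.e. its source coordinates.

```python
import re
from typing import List, Tuple

# Hypothetical rules, just enough for the program segment above.
# The alternatives cannot overlap because each starts with a different character.
TOKEN_RULES = [
    ("STRING", r'"(?:\\.|[^"\\\n])*"'),      # double-quoted string with escapes
    ("WORD", r"[A-Za-z_][A-Za-z0-9_]*"),     # keywords and identifiers
    ("PUNCT", r"[(){};]"),
    ("SPACE", r"[ \t\n]+"),                  # skipped, never reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_RULES))


def tokenize(source: str) -> List[Tuple[str, str, int, int]]:
    """Return (type, lexeme, line, column) for every token in source."""
    tokens = []
    line, line_start, pos = 1, 0, 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            column = pos - line_start + 1
            raise SyntaxError(f"unexpected {source[pos]!r} at {line}:{column}")
        lexeme = match.group()
        if match.lastgroup != "SPACE":
            tokens.append((match.lastgroup, lexeme, line, pos - line_start + 1))
        # Keep the source coordinates up to date across newlines.
        newlines = lexeme.count("\n")
        if newlines:
            line += newlines
            line_start = pos + lexeme.rfind("\n") + 1
        pos = match.end()
    return tokens


if __name__ == "__main__":
    program = 'void main() { printf("Hello World\\n"); }'
    for token in tokenize(program):
        print(token)
```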
Specifications of Tokens
String Words and Sentences
1. Prefix of s: a string obtained by deleting zero or more trailing symbols of s
2. Suffix of s: a string obtained by deleting zero or more leading symbols of s
3. Substring of s: a string obtained by deleting a prefix and a suffix of s
4. Proper prefix, suffix or substring of s: a prefix, suffix or substring x of s that is nonempty and such that s ≠ x
5. Subsequence of s: a string obtained by deleting zero or more symbols of s, not necessarily contiguous
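
The terms above can be made concrete by enumerating them for a short string. The helper names below are illustrative assumptions; "proper" is taken to mean nonempty and different from s.

```python
from itertools import combinations
from typing import Set


def prefixes(s: str) -> Set[str]:
    """All prefixes of s, including the empty string and s itself."""
    return {s[:i] for i in range(len(s) + 1)}


def suffixes(s: str) -> Set[str]:
    """All suffixes of s, including the empty string and s itself."""
    return {s[i:] for i in range(len(s) + 1)}


def substrings(s: str) -> Set[str]:
    """Delete a prefix and a suffix: every contiguous slice of s."""
    return {s[i:j] for i in range(len(s) + 1) for j in range(i, len(s) + 1)}


def proper(parts: Set[str], s: str) -> Set[str]:
    """Keep only the parts that are nonempty and differ from s."""
    return {x for x in parts if x and x != s}


def subsequences(s: str) -> Set[str]:
    """Delete any symbols, contiguous or not, keeping the original order."""
    return {"".join(c) for r in range(len(s) + 1) for c in combinations(s, r)}


if __name__ == "__main__":
    s = "ban"
    print(sorted(prefixes(s)))
    print(sorted(proper(substrings(s), s)))
    print("bn" in subsequences(s), "bn" in substrings(s))
```

Every substring is also a subsequence, but not the other way round: "bn" is a subsequence of "ban" without being a substring.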
The Principle of Longest match
• In most languages, the scanner should pick the
longest possible string to make up the next
token if there is a choice
• Example
return foobar != hohum;
should be recognized as 5 tokens
RETURN ID(foobar) NEQ ID(hohum) SCOLON
not more (i.e., not parts of words or identifiers, or ! and = as separate tokens)
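
A sketch of the longest-match rule for the example above. The operator list and the function scan are assumptions chosen only to show why != stays one lexeme; a real scanner would be generated from the full token patterns.

```python
from typing import List

# Hypothetical operator set, just enough to show the effect on "!=".
OPERATORS = ["!=", "!", "==", "=", ";"]


def scan(text: str) -> List[str]:
    """Split text into lexemes, always taking the longest candidate."""
    lexemes = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            pos += 1
            continue
        if ch.isalpha() or ch == "_":
            # Longest match for words: extend while letters or digits follow.
            end = pos + 1
            while end < len(text) and (text[end].isalnum() or text[end] == "_"):
                end += 1
        else:
            # Longest match for operators: collect all that fit, keep the longest.
            candidates = [op for op in OPERATORS if text.startswith(op, pos)]
            if not candidates:
                raise SyntaxError(f"unexpected character {ch!r}")
            end = pos + len(max(candidates, key=len))
        lexemes.append(text[pos:end])
        pos = end
    return lexemes


if __name__ == "__main__":
    print(scan("return foobar != hohum;"))
```

Without the longest-match choice, "!=" would split into "!" and "=", and "foobar" could split into shorter words.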
Typical Tokens in Programming Languages
• Operators & Punctuation
– + - * / ( ) { } [ ] ; : :: < <= == = != ! …
– Each of these is a distinct lexical class ( or token type )
• Keywords
– if while for goto return switch void …
– Each of these is also a distinct lexical class (not a string)
• Identifiers
– A single ID lexical class, but parameterized by the actual id
• Integer constants
– A single INT lexical class, but parameterized by the int value
• Other constants, etc.
Tokens of a Typical Language
TYPE        EXAMPLE
ID          foo, n14, a, temp, ...
NUM         73, 0, 00, 515, +2, ...
REAL        66.1, .5, 10., 1e67, 5.5e-10, ...
KEYWORDS    IF, DO, WHILE, INT, ...
SYMBOLS     , (Comma)   != (Noteq)   ( (Lparen)   ...


Formal Definition of Languages
• Alphabet: a finite (non-empty) set of symbols, denoted by Σ
• String: a finite sequence of symbols from an alphabet; this includes the empty sequence (denoted by λ)
• Language: a set (often infinite) of finite strings
– The set of all possible finite strings over the alphabet Σ (including λ) is denoted by Σ*
• Finite specifications of (possibly infinite) languages are possible with
1. Automaton: a recognizer; a machine that accepts all strings in a language (and rejects all other strings)
2. Grammar: a generator; a system for producing all strings in the language (and no other strings)
Formal Language Definition ( Contd. )
• As already defined, a language L over an alphabet Σ is a collection of strings of elements of Σ
– The PASCAL language is the set of all strings that constitute legal PASCAL programs (infinite set)
– The language of primes is the set of all decimal digit strings that constitute prime numbers (infinite set)
– The language of C reserved words is the set of all alphabetic strings that cannot be used as identifiers in the C programming language (finite set)
• To specify some of these (possibly infinite) languages with a finite description, we use the notation of Regular Expressions
Regular Expressions
• A regular expression is always defined over some alphabet Σ
(for programming languages, commonly ASCII or Unicode)
• If E is a regular expression, L(E) is the "language" (set of strings) generated by E
• For example, for each symbol a in the alphabet of the language, the regular expression a denotes the language {a}, containing just the string a (known as a symbol)
• The regular expression that denotes the empty sequence λ is written ε
• Formal definition of regular expressions:
• Given an alphabet Σ,
• (1) ε is a regular expression that denotes {ε}, the set that contains the empty string.
• (2) For each a ∈ Σ, a is a regular expression that denotes {a}, the set containing the string a.
• (3) If r and s are regular expressions denoting the languages (sets) L(r) and L(s), then
– (r) | (s) is a regular expression denoting L(r) ∪ L(s)
– (r)(s) is a regular expression denoting L(r)L(s)
– (r)* is a regular expression denoting (L(r))*

• A regular expression is defined together with the language it denotes.
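
The three language operations in the definition can be tried out on small finite sets. This is a sketch under stated assumptions: the function names are illustrative, and star takes a max_len bound because the full closure is infinite.

```python
from typing import Set

Language = Set[str]


def union(l1: Language, l2: Language) -> Language:
    """L(r | s): every string that is in L(r) or in L(s)."""
    return l1 | l2


def concat(l1: Language, l2: Language) -> Language:
    """L(rs): every string of L(r) followed by every string of L(s)."""
    return {x + y for x in l1 for y in l2}


def star(lang: Language, max_len: int) -> Language:
    """The strings of (L(r))* whose length is at most max_len."""
    result = {""}
    frontier = {""}
    while frontier:
        # Append one more string of lang to everything found in the last round.
        frontier = {
            x + y for x in frontier for y in lang if len(x + y) <= max_len
        } - result
        result |= frontier
    return result


if __name__ == "__main__":
    a, b = {"a"}, {"b"}
    print(sorted(union(a, b)))                            # a | b
    print(sorted(concat(union(a, b), union(a, b))))       # (a | b)(a | b)
    print(sorted(union(a, concat(star(a, 2), b))))        # a | a*b, a* cut at length 2
```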



• Examples:
– Let Σ = {a, b}
a | b
(a | b)(a | b)
a*
(a | b)*
a | a*b

– We assume that '*' has the highest precedence and is left associative. Concatenation has the second highest precedence and is left associative, and '|' has the lowest precedence and is left associative.
• (a) | ((b)*(c)) = a | b*c



Operations with Regular Expressions
• Given 2 regular expressions M and N
• Alternation (denoted by |)
makes a new regular expression M | N denoting the "UNION" of the languages L(M) and L(N), i.e. L(M) ∪ L(N)
• Concatenation (denoted by . or by writing the expressions side by side)
makes a new regular expression MN denoting the language of strings from L(M) followed by strings from L(N)
• Repetition (denoted by *)
makes a new regular expression M* denoting the language that has 0 or more occurrences of strings from L(M), i.e. the Kleene closure L(M)*
Regular Expression Example
Expression          Language              Example Words
a | b               {a, b}                a, b
ab*a                {a}{b}*{a}            aa, aba, abba, abbba, ...
(ab)*               {ab}*                 ε, ab, abab, ababab, ...
abba                {abba}                abba
(0 | 1)*0           ({0} ∪ {1})*{0}       0, 00, 10, 010, 110, ... (all even binary numbers)
b*(abb*)*(a | ε)    Strings of a and b with no consecutive a's

Similarly, using the symbols |, ., * and ε, we can specify the regular expressions corresponding to the lexical tokens of a programming language using rules (a.k.a. productions).
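
Rows of the table can be checked by brute force over all short strings. This is only a sketch: it uses Python's re module, writes ε as an empty alternative, and the helper names strings_up_to and language are assumptions.

```python
import re
from itertools import product
from typing import Iterator, Set


def strings_up_to(alphabet: str, max_len: int) -> Iterator[str]:
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def language(regex: str, alphabet: str, max_len: int) -> Set[str]:
    """The strings of length <= max_len that the whole regex matches."""
    pattern = re.compile(regex)
    return {w for w in strings_up_to(alphabet, max_len) if pattern.fullmatch(w)}


# ε is written as an empty alternative, so (a | ε) becomes (a|).
NO_DOUBLE_A = r"b*(abb*)*(a|)"
EVEN_BINARY = r"(0|1)*0"

if __name__ == "__main__":
    print(sorted(language(r"ab*a", "ab", 4)))
    expected = {w for w in strings_up_to("ab", 6) if "aa" not in w}
    print(language(NO_DOUBLE_A, "ab", 6) == expected)
```

A bounded check like this cannot prove that an expression is right, but it quickly exposes a wrong one.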
Table of Operators & Abbreviations
Notation      Description
a             An ordinary character that stands for itself
ε             The empty string
M | N         Alternation: choosing from M or N
MN            Concatenation: an M followed by an N
M*            Repetition (zero or more times)
M+            Repetition (one or more times)
M?            Optional (zero or one occurrence of M)
[a-zA-Z]      Character set alternation
[abxyz]       One of the given characters (a|b|x|y|z)
.             Stands for any single character (except newline)
'a.+*'        Quotation: a string in quotes stands for itself literally
Problems
Design regular expressions for the following languages.
1) L = {w ϵ {a, b}* : every a in w is immediately preceded and followed by b}.
2) L = {w ϵ {a, b}* : |w| is even}.
3) L = {w ϵ {0, 1}* : the third character is 0}.
4) L = {w ϵ {a, b}* : w contains an odd number of a's}.
5) L = {w ϵ {0, 1}* : w has at most one pair of consecutive 1's}.
6) L = {w ϵ {a, b}* : w has more than two letters and begins and ends with the same letter}.
7) L = {w ϵ {0, 1}* : every pair of adjacent 0's appears before any pair of adjacent 1's}.
8) L = {w ϵ {0, 1}* : the length of w is odd}.

Solution
1) b (b | ab)* | ϵ
2) ((a | b) (a | b))* or (aa | ab | ba | bb)*
3) (0 | 1) (0 | 1) 0 (0 | 1)*
4) b* (ab*ab*)* a b* or b* a b* (ab*ab*)*
5) (0 | 10)* (11 | 1 | ϵ) (0 | 01)*
6) a (a | b)+ a | b (a | b)+ b
7) (0 | 10)* (1 | ϵ) (1 | 01)* (0 | ϵ)
8) (00 | 01 | 10 | 11)* (0 | 1)
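
One way to sanity-check an answer to problems 1 to 8 is to compare it with a direct membership test on every string up to some length. The sketch below does this for one candidate expression per problem, written in Python syntax with ϵ as an optional part. The predicates encode one reading of each problem statement (for example, "111" is counted as two pairs of consecutive 1's); all names are illustrative assumptions.

```python
import re
from itertools import product


def strings_up_to(alphabet, max_len):
    """Yield every string over alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)


def every_a_between_bs(w):
    return all(0 < i < len(w) - 1 and w[i - 1] == "b" and w[i + 1] == "b"
               for i, ch in enumerate(w) if ch == "a")


def pairs_of_ones(w):
    # Overlapping pairs count separately, so "111" contains two pairs.
    return sum(w[i:i + 2] == "11" for i in range(len(w) - 1))


def zeros_before_ones(w):
    last_00, first_11 = w.rfind("00"), w.find("11")
    return last_00 == -1 or first_11 == -1 or last_00 < first_11


# (problem number, alphabet, candidate regex, membership predicate)
CANDIDATES = [
    (1, "ab", r"(b(b|ab)*)?", every_a_between_bs),
    (2, "ab", r"((a|b)(a|b))*", lambda w: len(w) % 2 == 0),
    (3, "01", r"(0|1)(0|1)0(0|1)*", lambda w: len(w) >= 3 and w[2] == "0"),
    (4, "ab", r"b*(ab*ab*)*ab*", lambda w: w.count("a") % 2 == 1),
    (5, "01", r"(0|10)*(11|1)?(0|01)*", lambda w: pairs_of_ones(w) <= 1),
    (6, "ab", r"a(a|b)+a|b(a|b)+b", lambda w: len(w) > 2 and w[0] == w[-1]),
    (7, "01", r"(0|10)*1?(1|01)*0?", zeros_before_ones),
    (8, "01", r"((0|1)(0|1))*(0|1)", lambda w: len(w) % 2 == 1),
]


def mismatches(regex, alphabet, predicate, max_len=8):
    """Strings on which the regex and the predicate disagree."""
    pattern = re.compile(regex)
    return [w for w in strings_up_to(alphabet, max_len)
            if (pattern.fullmatch(w) is not None) != bool(predicate(w))]


if __name__ == "__main__":
    for number, alphabet, regex, predicate in CANDIDATES:
        print(number, regex, mismatches(regex, alphabet, predicate)[:3])
```

An empty list means the expression and the predicate agree on every string of length up to 8. A non-empty list gives counterexamples: for instance, (b|bab)* fails problem 1 because it cannot produce babab.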
Regular Expression Construction
• Problem: specify the set of unsigned numbers as a regular expression. (Examples: 1997, 19.97)
• Observations on numbers:
1. A number can be made up of one or more digits from the set 0-9: [0-9]+
2. It can optionally have a decimal point at the end, followed by 0 or more digits: "." [0-9]*
3. A number can also start with a point followed by one or more digits: "." [0-9]+
( [0-9]+ ( "." [0-9]* )? ) | ( "." [0-9]+ )
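
The three observations translate directly into a Python regular expression. This is a sketch for checking the construction; the names UNSIGNED_NUMBER and is_unsigned_number are assumptions.

```python
import re

# Digits with an optional "." and more digits, or "." followed by digits.
UNSIGNED_NUMBER = re.compile(r"[0-9]+(\.[0-9]*)?|\.[0-9]+")


def is_unsigned_number(text: str) -> bool:
    """True when the whole text is an unsigned number under the rule above."""
    return UNSIGNED_NUMBER.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["1997", "19.97", "10.", ".5", ".", "-3", "1.2.3"]:
        print(repr(text), is_unsigned_number(text))
```

A lone "." is rejected because each alternative requires at least one digit.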
Regular Expressions for Some Tokens
of a Programming Language
Regular Expression                                  Token Type
if                                                  return IF
[a-z][a-z0-9]*                                      return ID
[0-9]+                                              return NUM
([0-9]+ '.' [0-9]*) | ('.' [0-9]+)                  return REAL
('\*' [a-z]* '\n') | (' ' | '\n' | '*/')+           return Comment
.                                                   return ERROR
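
A sketch of how such a rule table can drive a scanner: every rule is tried at the current position, the longest match wins, and a tie goes to the rule listed first (so "if" is IF rather than ID). The tie-breaking policy, the SKIP rule that stands in for the comment and whitespace rule, and the function name tokens are assumptions for this sketch.

```python
import re
from typing import List, Tuple

# Rules in priority order: (token type, compiled pattern).
# SKIP handles blanks and newlines only; it is a simplification of the
# comment rule in the table.
RULES = [
    ("IF", re.compile(r"if")),
    ("ID", re.compile(r"[a-z][a-z0-9]*")),
    ("NUM", re.compile(r"[0-9]+")),
    ("REAL", re.compile(r"[0-9]+\.[0-9]*|\.[0-9]+")),
    ("SKIP", re.compile(r"[ \n]+")),
    ("ERROR", re.compile(r".")),
]


def tokens(text: str) -> List[Tuple[str, str]]:
    """Apply the rules with longest match; ties go to the earlier rule."""
    result = []
    pos = 0
    while pos < len(text):
        best_type, best_len = None, 0
        for token_type, pattern in RULES:
            m = pattern.match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so an earlier rule keeps a tie.
            if m and m.end() - pos > best_len:
                best_type, best_len = token_type, m.end() - pos
        if best_type is None:
            raise SyntaxError(f"no rule matches at offset {pos}")
        if best_type != "SKIP":
            result.append((best_type, text[pos:pos + best_len]))
        pos += best_len
    return result


if __name__ == "__main__":
    print(tokens("if if8 8 8.5 .5 ?"))
```

On the input "if if8 8 8.5 .5 ?" this yields IF, ID(if8), NUM(8), REAL(8.5), REAL(.5) and ERROR(?).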
• Regular definition
– Gives names to regular expressions so that more complicated regular expressions can be constructed from them.
d1 -> r1
d2 -> r2
...
dn -> rn
– Example:
letter -> A | B | C | … | Z | a | b | … | z
digit -> 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
identifier -> letter (letter | digit)*
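
In code, a regular definition amounts to building a larger pattern from named pieces. A minimal sketch in Python, assuming the ASCII letters and digits of the example:

```python
import re
import string

# Each definition may use the names defined before it.
letter = "[" + string.ascii_letters + "]"       # A | B | ... | Z | a | ... | z
digit = "[" + string.digits + "]"               # 0 | 1 | ... | 9
identifier = f"{letter}({letter}|{digit})*"     # letter (letter | digit)*

IDENTIFIER = re.compile(identifier)

if __name__ == "__main__":
    for text in ["temp1", "Pi", "1temp", "a_b"]:
        print(repr(text), IDENTIFIER.fullmatch(text) is not None)
```

The names are substituted textually, so a definition must not refer to itself or to a later name.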



A regular Expression Recognizer
• Given an input string,
The function of a “regular Expression Analyzer” is to say
– “YES, the input is part of the language
generated from the regular expression”
– “NO, the input isn’t part of the language
generated from the regular expression”
• Using results from Finite Automata theory and theory
of algorithms, we can automate construction of such
recognizers from Regular Expressions
Languages Associated with Regular Expressions

Example

Example

Example

More than One Regular Expression for a Language

RegEx - idioms

Analyzing a Simple Regular Expression

Another Simple Regular Expression

Given a Language, Find a Regular Expression

Regular Expressions
• https://www.javatpoint.com/examples-of-regular-expression
• https://www.w3schools.com/python/python_regex.asp

Thank you
