
Compiler Construction

CS-4207

Instructor Name: Atif Ishaq


Lecture 4-5
Today’s Lecture

• Lexical Analysis

• Attributes of Tokens

• Specification of Tokens

• Recognition of Tokens

Lexical Analysis

• Lexical analysis is the first phase of a compiler. It takes the modified source code from the language preprocessor, written in the form of sentences.

• The lexical analysis phase is also referred to as the scanning phase.

• The lexical analyzer breaks these sentences into a series of tokens, removing any whitespace and comments in the source code.

Source : https://www.tutorialspoint.com/compiler_design/compiler_design_lexical_analysis.htm
https://www.guru99.com/compiler-design-lexical-analysis.html
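
As a worked illustration (added here; the token names follow the classification used later in this lecture), a scanner might turn the statement

position = initial + rate * 60 ;

into the token stream

<id, pointer to symbol-table entry for position>
<assign_op>
<id, pointer to symbol-table entry for initial>
<add_op>
<id, pointer to symbol-table entry for rate>
<mul_op>
<number, 60>
<semicolon>
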
Role of Lexical Analyzer

• The lexical analyzer reads the input characters of the source program, groups them into lexemes, and produces as output a sequence of tokens, one for each lexeme in the source program.

• The lexical analyzer also interacts with the symbol table during this process.

• When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.

• The lexical analyzer strips out comments and whitespace (blank, newline, tab, and other characters that are used to separate tokens in the input).

• The lexical analyzer correlates error messages generated by the compiler with the source program – it associates a line number with each error message.

Role of Lexical Analyzer

• Lexical analyzers are sometimes divided into a cascade of two processes: scanning and lexical analysis.

• Scanning consists of the simple processes that do not require tokenization of the input, such as deletion of comments and compaction of consecutive whitespace characters into one (see the sketch below).

• Lexical analysis proper is the more complex portion, where the scanner produces the sequence of tokens as output.
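
A minimal sketch of such a scanning stage, written in C for illustration (the function name prescan and the choice of handling only '//' line comments are assumptions, not part of the slides):

#include <ctype.h>
#include <stdio.h>

/* Delete '//' line comments and compact runs of whitespace into one blank. */
void prescan(FILE *in, FILE *out)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (c == '/') {
            int d = fgetc(in);
            if (d == '/') {
                while ((c = fgetc(in)) != EOF && c != '\n')
                    ;                      /* delete the comment */
                c = ' ';                   /* treat it as whitespace */
            } else {
                fputc(c, out);             /* a lone '/' is not a comment */
                if (d == EOF) break;
                ungetc(d, in);
                continue;
            }
        }
        if (isspace(c)) {
            while ((c = fgetc(in)) != EOF && isspace(c))
                ;                          /* compact consecutive whitespace */
            fputc(' ', out);
            if (c == EOF) break;
            ungetc(c, in);
        } else {
            fputc(c, out);
        }
    }
}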

Tokens, Patterns, and Lexemes

A token is a pair consisting of a token name and an optional attribute value.

A token name is an abstract symbol representing a kind of lexical unit, e.g., a particular keyword, or a sequence of input characters denoting an identifier.

Tokens, Patterns, and Lexemes

A pattern is a description of the form that the lexemes of a token may take.

A pattern explains what can be a token, and these patterns are defined by means of regular expressions.

In the case of a keyword as a token, the pattern is just the sequence of characters that form the keyword.

A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.

Tokens, Patterns, and Lexemes

TOKEN        INFORMAL DESCRIPTION                      SAMPLE LEXEMES

if           characters i, f                           if
else         characters e, l, s, e                     else
comparison   < or > or <= or >= or == or !=            <= , !=
id           letter followed by letters and digits     pi , score , D2
number       any numeric constant                      3.14 , 0 , 6.02e23
literal      anything but ", surrounded by "s          "some thing"

Example : printf("Total = %d\n", score);

printf and score are lexemes matching the pattern for token id

"Total = %d\n" is a lexeme matching the pattern for token literal

Tokens – Punctuation : Separators

Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.

• Typically individual special characters such as ( { } : .. (two dots)

• Sometimes double characters : the lexical scanner looks for the longest token (see the sketch below)

  • (* , /* – comment openers in various languages

• Returned just as the identity (kind) of the token

  • And perhaps a location, for error messages and debugging purposes
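
A minimal sketch (added for illustration; the token names are made up) of how the scanner prefers the longest match when a single-character and a double-character token share a prefix, here "(" versus the "(*" comment opener:

#include <stdio.h>

enum tok { LPAREN, COMMENT_OPEN };

/* Called after '(' has been read: one character of lookahead decides. */
enum tok scan_after_lparen(FILE *in)
{
    int c = fgetc(in);
    if (c == '*')
        return COMMENT_OPEN;   /* "(*" : the longer token wins */
    if (c != EOF)
        ungetc(c, in);         /* plain "(" : push the lookahead back */
    return LPAREN;
}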

Tokens – Operators

Tokens for operators, either individually or in classes such as the token comparison.

• They are like punctuation

  • No real difference for the lexical analyzer

• Typically single or double special characters

  • Operators : + - == <= := =>

• Returned as kind of token

  • And perhaps location

Tokens – Keywords

One token for each keyword. The pattern for a keyword is the same as the keyword itself.

• Keywords are reserved identifiers

• Examples are if , do , while , try , catch

• May be distinguished from identifiers

• Returned as kind of token

  • With possible location information

Tokens – Identifiers

One token representing all identifiers.

• The rules differ between languages

  • Length , allowed characters , separators

• A names table (identifier table) needs to be built

  • A single entry for all occurrences of Var1

  • A language may be case insensitive : the same entry for VAR1, vAR1, Var1

  • The typical structure is a hash table

• The lexical analyzer returns the token kind

  • And a key (index) to the table entry

  • The table entry includes location information

Organization of Names Table

The most common structure for a names table is a hash table (a sketch follows this list).

• With a fixed number of headers

• Chain according to hash code

  • Serial search on one chain

• Hash code computed from the characters (e.g. sum mod table size)

• No hash code is perfect : expect collisions

• Avoid any arbitrary limits on table or chain size
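
A minimal C sketch of such a chained hash table (added for illustration; the header count, the entry fields, and the function names are assumptions):

#include <stdlib.h>
#include <string.h>

#define HEADERS 211                        /* fixed number of chain headers */

struct entry {
    char *name;                            /* the identifier's lexeme  */
    int   line;                            /* location information     */
    struct entry *next;                    /* chain used on collision  */
};

static struct entry *table[HEADERS];

static unsigned hash(const char *s)
{
    unsigned h = 0;
    while (*s)
        h += (unsigned char)*s++;          /* sum of characters ...    */
    return h % HEADERS;                    /* ... mod table size       */
}

/* Return the single entry for name, creating it on first occurrence. */
struct entry *lookup(const char *name, int line)
{
    unsigned h = hash(name);
    for (struct entry *e = table[h]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)    /* serial search on one chain */
            return e;
    struct entry *e = malloc(sizeof *e);   /* new chain link: no arbitrary limit */
    e->name = strdup(name);
    e->line = line;
    e->next = table[h];
    table[h] = e;
    return e;
}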

Tokens – String Literals

One or more tokens representing constants, such as numbers, literals, and strings.

• The text must be stored

  • The actual characters are important

  • Unlike identifiers, casing must be preserved

  • Character set issues : a uniform internal representation is needed

• A table is needed

  • The lexical analyzer returns a key (index) into the table (in addition to the kind)

  • It may or may not be worth hashing to avoid duplicates

Tokens – Character Literals

One or more tokens representing constants, such as numbers, literals, and strings.

• Similar issues to string literals

• The lexical analyzer returns

  • The token kind

  • The identity of the character

• Cannot assume the character set of the host machine; the target's may be different

Tokens – Numeric Literals

One or more tokens representing constants, such as numbers, literals, and strings.

• Use a table to store numeric literals

  • For example 123 = 0123 = 01_23 (Ada allows underscores in numeric literals)

  • But cannot use the predefined types for the values

    o Because the target may have different bounds

• Floating point representation is much more complex

  • Denormals, correct rounding

  • Very delicate to compute precise values

  • Host/target issues

Handling Comments

• Comments have no effect on the program

  • They can be eliminated by the scanner

  • But they may need to be retrieved by tools

• Error detection issues

  • Unclosed comments

• The scanner skips over comments and returns the next meaningful token (a sketch follows)
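
A minimal C sketch (added for illustration) of skipping a block comment while detecting the unclosed-comment error mentioned above; it assumes the opening /* has already been consumed:

#include <stdio.h>

/* Returns 1 when the closing delimiter is found; returns 0 and reports
   an unclosed comment if end of file is reached first. */
int skip_comment(FILE *in, int line)
{
    int c, prev = 0;
    while ((c = fgetc(in)) != EOF) {
        if (prev == '*' && c == '/')
            return 1;                      /* comment closed: resume scanning */
        prev = c;
    }
    fprintf(stderr, "error: unclosed comment starting at line %d\n", line);
    return 0;
}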

Attributes of Tokens

• When more than one lexeme matches a pattern

  • Additional information must be provided to the next phase about the particular lexeme

  • For example, the pattern for token number matches both 0 and 1

    • It is important for the code generator to know which lexeme was found in the source code

• Information about an identifier – for example its lexeme, its type, and the location at which it is first found – is kept in the symbol table

• The lexical analyzer returns to the parser a token name and an attribute value

  • The token name influences parsing decisions

  • The attribute value influences translation after parsing – for an identifier it is also a pointer to the symbol table entry for that identifier
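
One way to represent such a pair in C (a sketch added for illustration; the names and the union layout are assumptions):

#include <stdio.h>

enum token_name { ID, NUMBER, IF, RELOP /* ... */ };

struct token {
    enum token_name name;          /* influences parsing decisions          */
    union {                        /* attribute value, used after parsing   */
        int    symtab_index;       /* ID: index of the symbol table entry   */
        double value;              /* NUMBER: which numeric lexeme was seen */
        int    relop_kind;         /* RELOP: which comparison operator      */
    } attr;
};

With this representation the lexemes 0 and 1 both produce the token name NUMBER but carry different attr.value fields.
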
Lexical Errors

• A lexical error occurs when a character sequence cannot be scanned into any valid token.

• Facts about lexical errors

  • Lexical errors are not very common, but they should be managed by the scanner

  • Misspellings of identifiers, operators, and keywords are considered lexical errors

  • Lexical errors are in general caused by the appearance of some illegal character, mostly at the beginning of a token

  • A lexical analyzer cannot, without the help of other components, determine that an error exists in the source code

Error Recovery in Lexical Analyzer

• Suppose the string fi is encountered for the first time in a C program in the context:

fi ( a == fx()) ….

• It is difficult for the lexical analyzer to determine whether fi is a misspelling of the keyword if or an undeclared function identifier.

• fi is a valid lexeme for the token id – this id is returned to the parser – which handles the error due to the transposition of letters

• In situations where the lexical analyzer is unable to proceed because none of the patterns for tokens matches any prefix of the remaining input, the simplest recovery is "panic mode" recovery

• Delete successive characters from the remaining input until the lexical analyzer can find a well-formed token at the beginning of what input is left – this may confuse the parser
Error Recovery in Lexical Analyzer

• Other possible recovery actions are (a sketch of panic mode follows this list)

1. Delete one character from the remaining input

2. In panic mode, successive characters are ignored until we reach a well-formed token

3. Insert a missing character into the remaining input

4. Replace a character with another character

5. Transpose two adjacent characters
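
A minimal C sketch of panic-mode deletion (added for illustration; the predicate can_start_token is a deliberate simplification of "some token pattern matches a prefix of the remaining input"):

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Simplified test: can this character begin some token? */
static int can_start_token(int c)
{
    return isalpha(c) || isdigit(c) || strchr("<>=+-*/(){};,", c) != NULL;
}

void panic_mode(FILE *in)
{
    int c;
    while ((c = fgetc(in)) != EOF) {
        if (can_start_token(c)) {
            ungetc(c, in);         /* resume normal scanning from here */
            return;
        }
        /* otherwise the character is deleted from the remaining input */
    }
}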

Input Buffering

• Compilation of a large program requires processing a large number of characters, and per-character processing takes time

• Specialized buffering techniques have been developed to reduce the amount of overhead required to process a single input character

• Two buffers are used – reloaded alternately

  • Each buffer is of size N, where N is the size of a disk block (e.g. 4096 bytes)

  • One system read command can read N characters into a buffer, rather than one character per call

  • eof is used to mark the end of the source file

Input Buffering

• The string of characters between the two pointers is the current lexeme

• Two pointers into the input are maintained (a sketch follows)

1. Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.

2. Pointer forward scans ahead until a pattern match is found.
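
A minimal C sketch of the two-buffer scheme (added for illustration; it uses '\0' as the end-of-buffer/eof sentinel and therefore assumes the source contains no NUL bytes):

#include <stdio.h>

#define N 4096                         /* size of one buffer = disk block */

static char  buf[2][N + 1];            /* two buffers, each with room for a sentinel */
static char *lexemeBegin;              /* marks the beginning of the current lexeme  */
static char *forward;                  /* scans ahead until a pattern match is found */
static int   cur;                      /* buffer that 'forward' is currently in      */
static FILE *src;

/* One system read fills a whole buffer; the sentinel is placed after it. */
static void load(int which)
{
    size_t n = fread(buf[which], 1, N, src);
    buf[which][n] = '\0';
}

void init_input(FILE *f)
{
    src = f;
    cur = 0;
    load(0);
    forward = lexemeBegin = buf[0];
}

/* Advance 'forward'; switch to the other buffer when one is exhausted. */
int next_char(void)
{
    if (*forward == '\0') {
        if (forward != buf[cur] + N)
            return EOF;                /* sentinel before end of buffer: real eof */
        cur = 1 - cur;                 /* end of buffer: reload the other one     */
        load(cur);
        forward = buf[cur];
        if (*forward == '\0' && forward != buf[cur] + N)
            return EOF;
    }
    return *forward++;
}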

Specification of Tokens

• Regular languages are the most popular notation for specifying tokens because

  • They are based on a simple and useful theory

  • They are easy to understand

  • Efficient implementations exist for generating lexical analyzers from such specifications

Specification of Tokens : Strings and Languages

• An alphabet is any finite set of symbols

  • Letters, digits, punctuation

  • The set { 0 , 1 } is the binary alphabet

• A string over an alphabet is a finite sequence of symbols drawn from that alphabet.

  • The length of a string s is denoted |s|

  • The empty string is denoted by ε

• A language is any countable set of strings over some fixed alphabet

  • Abstract languages like ∅, the empty set, or { ε }, the set containing only the empty string, are languages under this definition

Specification of Tokens : Strings and Languages

The following string-related terms are commonly used:

1. A prefix of string s is any string obtained by removing zero or more symbols from the end of s. For example, ban, banana, and ε are prefixes of banana.

2. A suffix of string s is any string obtained by removing zero or more symbols from the beginning of s. For example, nana, banana, and ε are suffixes of banana.

3. A substring of s is obtained by deleting any prefix and any suffix from s. For instance, banana, nan, and ε are substrings of banana.

4. The proper prefixes, suffixes, and substrings of a string s are those prefixes, suffixes, and substrings, respectively, of s that are not ε and not equal to s itself.

5. A subsequence of s is any string formed by deleting zero or more not necessarily consecutive positions of s. For example, baan is a subsequence of banana.
Specification of Tokens : Operations on Languages
In lexical analysis the most important operations on languages are union, concatenation, and closure.

• Union
  L ∪ M = { s | s ∈ L or s ∈ M }

• Concatenation
  LM = { xy | x ∈ L and y ∈ M }

• Exponentiation
  L⁰ = { ε } ;  Lⁱ = Lⁱ⁻¹L

• Kleene closure
  L* = L⁰ ∪ L¹ ∪ L² ∪ …

• Positive closure
  L⁺ = L¹ ∪ L² ∪ L³ ∪ …
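
A small worked example (added for illustration): if L = { a, b } and M = { 0, 1 }, then

L ∪ M = { a, b, 0, 1 }
LM = { a0, a1, b0, b1 }
L² = LL = { aa, ab, ba, bb }
L* = { ε, a, b, aa, ab, ba, bb, aaa, … }   (all strings over a and b, including ε)
L⁺ = L* − { ε }
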
Specification of Tokens : Regular Expression

• Regular expressions are built recursively out of smaller regular expressions using a set of rules.

• Each regular expression r denotes a language L(r), which is also defined recursively from the languages denoted by r's subexpressions.

  • ε is a regular expression denoting the language { ε }

  • a ∈ Σ is a regular expression denoting { a }

  • If r and s are regular expressions denoting the languages L(r) and L(s) respectively, then

    • r | s is a regular expression denoting L(r) ∪ L(s)

    • rs is a regular expression denoting L(r)L(s)

    • r* is a regular expression denoting (L(r))*

    • (r) is a regular expression denoting L(r)

Specification of Tokens : Regular Expression

• A language defined by a regular expression is called a regular set

• The parentheses may be dropped if we follow the conventions that

1. The unary operator * has the highest precedence and is left associative

2. Concatenation has the second highest precedence and is left associative

3. | has the lowest precedence and is left associative

• Under these conventions the regular expression

(a) | ( (b) * (c) )

may be replaced by

a | b*c
Specification of Tokens : Regular Expression

• A language defined by a regular expression is called a regular set

• If two regular expressions r and s denote the same regular set, we say they are equivalent and write r = s – for example (a | b) = (b | a)
Specification of Tokens : Regular Definition

• If Σ is an alphabet of basic symbols, then a regular definition is a sequence of definitions of the form

d1 → r1
d2 → r2
…
dn → rn

where:

• Each di is a new symbol, not in Σ and not the same as any other of the d's, and

• Each ri is a regular expression over the alphabet Σ ∪ { d1, d2, … , di−1 }.
Specification of Tokens : Regular Definition

• Example

letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*

• A regular definition cannot be recursive

digit → digit digit | digit     (wrong)

• The following shorthands are often used:

r+ = rr*
r? = r | ε
[a-z] = a | b | c | … | z
[abc] = a | b | c

• Examples:

digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
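
For instance (examples added as a check on these definitions), digit+ matches 1950, and num matches 1950, 3.14, and 6.02E23, while id matches pi and D2 but not 2D, since an id must begin with a letter.
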
Recognition of Tokens

• A grammar that describes a simple form of branching statements and conditional expressions:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε

expr → term relop term
     | term

term → id
     | num

The terminals of the grammar are if, then, else, relop, id, and num.

For relop we use the comparison operators of the language, where = is "equals" and <> is "not equals".
Recognition of Tokens

• The patterns for the tokens of the grammar, described using regular definitions:

digit → [0-9]
digits → digit+
number → digits (. digits)? ( E (+|-)? digits )?
letter → [A-Za-z]
id → letter ( letter | digit )*
if → if
then → then
else → else
relop → < | > | <= | >= | <> | =

The token ws for recognizing white space is defined by

ws → ( blank | tab | newline )+

This token is not returned to the parser.
Recognition of Tokens

Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character
that follows the whitespace. It is the following token that gets returned to the
parser.
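
The relop pattern above can be recognized with a small transition-diagram-style function; the following C sketch is added for illustration (the token constants and the calling convention are assumptions):

#include <stdio.h>

enum relop { LT, LE, NE, EQ, GT, GE };

/* Recognize relop → < | > | <= | >= | <> | =, given the first character
   already read; one character of lookahead is retracted with ungetc()
   when the shorter token turns out to be the right answer. */
int scan_relop(FILE *in, int first, enum relop *out)
{
    int c;
    switch (first) {
    case '=':
        *out = EQ; return 1;
    case '<':
        c = fgetc(in);
        if (c == '=') { *out = LE; return 1; }
        if (c == '>') { *out = NE; return 1; }     /* <> is "not equals" */
        if (c != EOF) ungetc(c, in);               /* retract lookahead  */
        *out = LT; return 1;
    case '>':
        c = fgetc(in);
        if (c == '=') { *out = GE; return 1; }
        if (c != EOF) ungetc(c, in);
        *out = GT; return 1;
    default:
        return 0;                                  /* not a relop */
    }
}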

Lecture Outcome

• The lexical analyzer and its role

• A comprehensive discussion of tokens

• The relationship between the specification of tokens and regular expressions
