Compiler Construction CS-4207: Lecture 4-5 Instructor Name: Atif Ishaq
Lexical Analysis
Attributes of Token
Specification of Token
Recognition of Token
Lexical Analysis
Lexical Analysis is the first phase of a compiler. It takes the modified source
code from the language preprocessor, written in the form of sentences.
The lexical analyzer breaks these syntaxes into a series of tokens, by removing
any whitespace or comments in the source code.
Source : https://www.tutorialspoint.com/compiler_design/compiler_design_lexical_analysis.htm
https://www.guru99.com/compiler-design-lexical-analysis.html
Role of Lexical Analyzer
The Lexical Analyzer reads the input characters of the source program, groups
them into lexemes, and produces as output a sequence of tokens, one for each
lexeme in the source program.
The Lexical Analyzer also interacts with the symbol table during this process.
The Lexical Analyzer strips out comments and whitespace (blank, newline, tab,
and other characters that are used to separate tokens in the input).
The Lexical Analyzer correlates error messages generated by the compiler with
the source program, for example by keeping track of the number of newlines seen
so that a line number can be attached to each error message.
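The scanner loop described above can be sketched in a few lines of Python. This is a minimal illustration of grouping characters into lexemes and stripping whitespace; the token names and patterns are illustrative, not part of the lecture:

```python
import re

# Each rule pairs a token name with a regex for its pattern (illustrative set).
TOKEN_RULES = [
    ("WS",  r"[ \t\n]+"),        # whitespace: consumed but not returned
    ("NUM", r"\d+"),
    ("ID",  r"[A-Za-z_]\w*"),
    ("OP",  r"[+\-*/=]"),
]

def tokenize(source):
    """Group input characters into lexemes and return (token, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_RULES:
            m = re.match(pattern, source[pos:])
            if m:
                if name != "WS":              # strip whitespace here
                    tokens.append((name, m.group(0)))
                pos += m.end()
                break
        else:
            raise SyntaxError(f"illegal character {source[pos]!r} at position {pos}")
    return tokens

print(tokenize("count = count + 1"))
# [('ID', 'count'), ('OP', '='), ('ID', 'count'), ('OP', '+'), ('NUM', '1')]
```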
Role of Lexical Analyzer
Lexical Analyzers are sometimes divided into a cascade of two processes:
scanning and lexical analysis.
Scanning consists of the simple processes that do not require tokenization of
the input, such as deletion of comments and compaction of consecutive
whitespace characters into one.
Lexical analysis proper is the more complex portion, where the scanner produces
the sequence of tokens as output.
Tokens, Patterns, and Lexemes
A Token is a pair consisting of a token name and an optional attribute value.
The token name is an abstract symbol representing a kind of lexical unit, e.g.
a particular keyword or an identifier.
Tokens, Patterns, and Lexemes
A Pattern is a description of the form that the lexemes of a token may take.
A Pattern explains what can be a token; these patterns are defined by means of
regular expressions.
In the case of a keyword as a token, the pattern is just the sequence of
characters that form the keyword.
A Lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance
of that token.
Tokens, Patterns, and Lexemes
In the statement printf("Total = %d\n", score); the lexemes printf and score
match the pattern for the token id.
There is one token for each punctuation symbol, such as left and right
parentheses, comma, and semicolon.
Tokens – Operators
There are tokens for the operators, either individually or grouped into classes
such as the token comparison.
Examples: + - == <= := =>
Tokens – Keywords
There is one token for each keyword. The pattern for a keyword is the same as
the keyword itself.
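Since the pattern for a keyword is the keyword itself, keywords are commonly recognized by matching the identifier pattern and then consulting a reserved-word table. A minimal sketch (the keyword set here is illustrative, not a full language):

```python
# Reserved-word table: keywords recognized by lookup after matching
# the identifier pattern (illustrative subset).
KEYWORDS = {"if", "then", "else", "while"}

def classify(lexeme):
    """Classify an identifier-like lexeme as a keyword or an id."""
    return ("KEYWORD", lexeme) if lexeme in KEYWORDS else ("ID", lexeme)

print(classify("if"))     # ('KEYWORD', 'if')
print(classify("ifx"))    # ('ID', 'ifx')
```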
Tokens – Identifiers
Rules for identifiers differ between languages.
A language may be case insensitive: the same symbol-table entry then serves for
VAR1, vAR1, and Var1.
A hash code can be computed from the characters (e.g. the sum of character
codes mod the table size) to index the symbol table.
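The hash scheme mentioned above (sum of character codes mod table size) can be sketched as follows; for a case-insensitive language, normalizing before hashing makes VAR1, vAR1, and Var1 share one entry. The table size is illustrative:

```python
TABLE_SIZE = 101  # illustrative symbol-table size

def bucket(name, case_insensitive=True):
    """Hash an identifier: sum of character codes mod table size."""
    if case_insensitive:
        name = name.lower()   # VAR1, vAR1, Var1 -> same entry
    return sum(ord(c) for c in name) % TABLE_SIZE

assert bucket("VAR1") == bucket("vAR1") == bucket("Var1")
```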
Tokens – String Literals
One or more tokens represent constants, such as numbers, literals, and strings.
A literal table is needed to store the text of each string literal.
Tokens – Character Literals
For a character literal, the scanner records the token kind and the identity
of the character(s).
Tokens – Numeric Literals
For numeric literals, host/target issues arise: when the compiler runs on one
machine (the host) and generates code for another (the target), the numeric
ranges and representations of the two machines may differ.
Handling Comments
The scanner must also detect unclosed comments: if the end of file is reached
while still inside a comment, it should report an error rather than silently
consuming the rest of the program.
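Detecting an unclosed comment can be sketched for C-style block comments; the function name and error message here are illustrative:

```python
def skip_block_comment(source, pos):
    """pos points just past the opening '/*'; return the index past '*/'.
    Raises SyntaxError if the comment is never closed (EOF inside comment)."""
    end = source.find("*/", pos)
    if end == -1:
        raise SyntaxError("unclosed comment: end of file reached inside /* ... */")
    return end + 2

assert skip_block_comment("/* hi */x", 2) == 8
```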
Attributes of Tokens
For example, the pattern for the token number matches both 0 and 1, but it is
important for the code generator to know which lexeme was actually found in the
source code. The lexical analyzer therefore returns to the parser not only a
token name but also an attribute value describing the lexeme.
Information about an identifier – for example its lexeme, its type, and the
location at which it is first found – is kept in the symbol table, so the
appropriate attribute value for an identifier is a pointer to its symbol-table
entry.
Lexical Errors
A lexical error occurs when a character sequence cannot be scanned into any
valid token.
Lexical errors are not very common, but they should be managed by the scanner.
A lexical analyzer on its own, without the help of other components, often
cannot determine whether a character sequence is really an error.
Error Recovery in Lexical Analyzer
If the string fi is encountered for the first time in a C program in the context:
fi ( a == fx()) ….
fi is a valid lexeme for the token id, so the lexical analyzer returns this id
to the parser; it is the parser that must handle the error, since the lexical
analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
Suppose instead that no pattern for any token matches a prefix of the remaining
input. The simplest recovery strategy is to start deleting successive
characters from the remaining input until the lexical analyzer can find a
well-formed token at the beginning of what input is left; this recovery may,
however, confuse the parser.
Error Recovery in Lexical Analyzer
In panic mode, the successive characters are ignored until we reach a
well-formed token.
Input Buffering
One system read command can fill a buffer with N characters at a time, rather
than issuing one read per character; the scanner then takes its input from
the buffer in memory.
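The buffering idea can be sketched as a generator that issues one read call per N characters and then hands the scanner individual characters from memory; N and the function name are illustrative:

```python
import io

N = 4096  # illustrative buffer size

def chars(stream):
    """Yield characters one at a time, but read from the stream N at a time."""
    while True:
        buf = stream.read(N)   # one system-level read fills the buffer
        if not buf:
            return             # end of input
        yield from buf         # scanner consumes characters from memory

src = io.StringIO("x = 1")
assert "".join(chars(src)) == "x = 1"
```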
Specification of Tokens
Regular languages are the most popular way of specifying tokens because they
are easy to understand and because efficient recognizers (finite automata) can
be constructed for such languages.
Specification of Tokens : Strings and Languages
An alphabet is any finite set of symbols; a string over an alphabet is a finite
sequence of symbols drawn from that alphabet.
Abstract languages like ∅, the empty set, or {ϵ}, the set containing only the
empty string, are languages under this definition.
1. A prefix of string s is any string obtained by removing zero or more symbols
from the end of s. For instance, ban, banana, and ϵ are prefixes of banana.
2. A suffix of string s is any string obtained by removing zero or more symbols
from the beginning of s. For instance, nana, banana, and ϵ are suffixes of banana.
3. A substring of s is obtained by deleting any prefix and any suffix from s. For
instance, banana, nan, and ϵ are substrings of banana.
4. The proper prefixes, suffixes, and substrings of a string s are those
prefixes, suffixes, and substrings, respectively, of s that are not ϵ and not
equal to s itself.
Union
L ∪ M = { s | s ∈ L or s ∈ M }
Concatenation
LM = { xy | x ∈ L and y ∈ M }
Exponentiation
L^0 = {ϵ}; L^i = L^(i-1) L
Kleene closure
L* = ∪_{i=0,…,∞} L^i
Positive closure
L+ = ∪_{i=1,…,∞} L^i
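The operations above can be checked concretely on small finite languages. A minimal Python sketch (helper names are illustrative; the closure is necessarily a finite approximation):

```python
from itertools import product

def concat(L, M):
    """Concatenation: LM = { xy | x in L and y in M }."""
    return {x + y for x, y in product(L, M)}

def power(L, i):
    """Exponentiation: L^0 = {eps}; L^i = L^(i-1) L."""
    return {""} if i == 0 else concat(power(L, i - 1), L)

def closure(L, max_i):
    """Finite approximation of the Kleene closure L* up to L^max_i."""
    result = set()
    for i in range(max_i + 1):
        result |= power(L, i)
    return result

L, M = {"a", "b"}, {"c"}
assert L | M == {"a", "b", "c"}              # union
assert concat(L, M) == {"ac", "bc"}          # concatenation
assert power(L, 2) == {"aa", "ab", "ba", "bb"}
assert "" in closure(L, 3)                   # L* always contains eps
```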
Specification of Tokens : Regular Expression
For example, under the usual precedence rules (closure highest, then
concatenation, then union), the set of strings denoted by
a | b*c
is the language containing the string a together with all strings of zero or
more b's followed by a single c.
Specification of Tokens : Regular Expression
If two regular expressions r and s denote the same regular set, then we say
they are equivalent and write r = s. For example, (a|b) = (b|a).
Specification of Tokens : Regular Definition
A regular definition is a sequence of definitions of the form
d1 → r1
d2 → r2
…
dn → rn
where:
Each di is a new symbol, not in Σ and not the same as any other of the d's, and
Each ri is a regular expression over the alphabet Σ ∪ {d1, d2, …, di-1}.
Specification of Tokens : Regular Definition
Example
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | … | 9
id → letter ( letter | digit )*
A regular definition cannot be recursive:
digits → digit digits | digit (wrong)
The following shorthands are often used:
r+ = rr*
r? = r | ϵ
[a-z] = a | b | c | … | z
[abc] = a | b | c
Examples:
digit → [0-9]
num → digit+ (. digit+)? ( E (+|-)? digit+ )?
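The regular definition for num above can be transcribed directly into Python regex syntax as a quick check (a sketch, not a full lexer; \d plays the role of [0-9]):

```python
import re

# num -> digit+ (. digit+)? ( E (+|-)? digit+ )?
NUM = re.compile(r"\d+(\.\d+)?(E[+-]?\d+)?")

for s in ["0", "6336", "3.14", "6.336E4", "1.89E-4"]:
    assert NUM.fullmatch(s)      # all match the num pattern
assert not NUM.fullmatch(".5")   # digits are required before the point
```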
Recognition of Tokens
A grammar for branching statements and conditional expressions:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ϵ
expr → term relop term
     | term
term → id
     | num
The terminals of the grammar are if, then, else, relop, id, and num.
For relop we use the comparison operators of a Pascal-like language, where =
is “equals” and <> is “not equals”.
Recognition of Tokens
ws → ( blank | tab | newline )+ ; this token is not returned to the parser.
Recognition of Tokens
Token ws is different from the other tokens in that, when we recognize it, we do
not return it to the parser, but rather restart the lexical analysis from the character
that follows the whitespace. It is the following token that gets returned to the
parser.
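The behaviour described above can be sketched as a small recognizer: ws is matched and consumed, but scanning simply restarts after it, so the following token is what gets returned. The rule list is illustrative, with relop using the Pascal-style operators from the grammar:

```python
import re

# Order matters inside relop: longer operators must be tried first.
RULES = [
    ("ws",    r"[ \t\n]+"),
    ("relop", r"<=|>=|<>|<|>|="),
    ("num",   r"\d+"),
    ("id",    r"[A-Za-z]\w*"),
]

def next_token(source, pos):
    """Return (token, lexeme, new_pos), or None at end of input.
    ws is consumed silently and never returned."""
    while pos < len(source):
        for name, pattern in RULES:
            m = re.match(pattern, source[pos:])
            if m:
                pos += m.end()
                if name == "ws":
                    break              # restart scanning after whitespace
                return (name, m.group(0), pos)
        else:
            raise SyntaxError(f"no pattern matches at position {pos}")
    return None

assert next_token("x <= 10", 1) == ("relop", "<=", 4)
```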
Lecture Outcome