Lec - 2. Scanning (Lexical Analysis) Part 1

The document discusses the scanning process in compiler construction, focusing on lexical analysis and the formation of tokens from source code. It explains the categories of tokens, their attributes, and the use of regular expressions to define patterns for these tokens. Additionally, it covers practical issues related to scanners, including token records and the handling of comments and ambiguities in token recognition.

Uploaded by

Hesham MosaAd

Lecture 02 Scanning 1

COMPILER CONSTRUCTION

Principles and Practice

Kenneth C. Louden

Downloaded by Mamdouh Farghaly ([email protected])


2. Scanning (Lexical Analysis)

PART ONE

Contents
PART ONE
2.1 The Scanning Process
2.2 Regular Expressions

2.1 The Scanning Process

The Function of a Scanner
• Reads characters from the source code and forms them into logical units called tokens
• Tokens are logical entities, usually defined as an enumerated type
– typedef enum
{IF, THEN, ELSE, PLUS, MINUS, NUM, ID,…}
TokenType;

The Categories of Tokens
• RESERVED WORDS
– Such as IF and THEN, which represent the strings of
characters “if” and “then”
• SPECIAL SYMBOLS
– Such as PLUS and MINUS, which represent the
characters “+” and “-”
• OTHER TOKENS
– Such as NUM and ID, which represent numbers and
identifiers

Relationship between a Token and Its String
• The string of characters represented by a token is called its STRING VALUE or LEXEME
• Some tokens have only one lexeme, such as reserved words
• A token may have infinitely many lexemes, such as the token ID

Relationship between a Token and Its String
• Any value associated with a token is called an attribute of the token
– The string value is an example of an attribute.
– A NUM token may have a string value such as “32767” and an actual numeric value 32767
– A PLUS token has the string value “+” as well as the arithmetic operation +
• The token can be viewed as the collection of all of its attributes
– The scanner only needs to compute as many attributes as necessary to allow further processing
– The numeric value of a NUM token need not be computed immediately

Some Practical Issues of the Scanner
• One structured data type, called a token record, collects all the attributes of a token
– typedef struct
{ TokenType tokenval;
char *stringval;
int numval;
} TokenRecord;
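
As an illustration of computing attributes only when they are needed, the following is a minimal sketch built on the token record from the slide. The helper names makeToken and numericValue are hypothetical, not part of the lecture; the sketch assumes the lexeme of a NUM token is a decimal integer that fits in an int.

```c
#include <stdlib.h>

typedef enum { IF, THEN, ELSE, PLUS, MINUS, NUM, ID } TokenType;

typedef struct {
    TokenType tokenval;  /* which token was recognized */
    char *stringval;     /* the lexeme (string value) */
    int numval;          /* numeric value, meaningful only for NUM */
} TokenRecord;

/* Hypothetical helper: record only the token and its lexeme. */
TokenRecord makeToken(TokenType type, char *lexeme) {
    TokenRecord t;
    t.tokenval = type;
    t.stringval = lexeme;
    t.numval = 0;        /* not computed yet */
    return t;
}

/* Hypothetical helper: derive the numeric attribute on demand.
   Returns 0 for tokens that are not NUM. */
int numericValue(const TokenRecord *t) {
    if (t->tokenval != NUM)
        return 0;
    return (int)strtol(t->stringval, NULL, 10);
}
```

A later compiler phase would call numericValue only for the NUM tokens whose value it actually uses.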

Some Practical Issues of the Scanner
• The scanner returns the token value only and places the other attributes in variables
TokenType getToken(void);
• As an example of the operation of getToken, consider the following line of C code:
a[index] = 4 + 2

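(Figure: the input buffer holding a[index] = 4 + 2, shown twice to illustrate the scanner's position in the buffer before and after a call to getToken.)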
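
The behaviour of getToken on this line can be sketched as follows. This is a minimal illustration, not the scanner from the textbook: the token names LBRACKET, RBRACKET, ASSIGN, ENDFILE and ERROR, the variables inputBuffer, position and tokenString, and the 40-character lexeme limit are all assumptions made for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* ID, NUM, PLUS and MINUS come from the slides; the remaining token
   names are hypothetical additions for this example. */
typedef enum { ID, NUM, PLUS, MINUS, LBRACKET, RBRACKET, ASSIGN,
               ENDFILE, ERROR } TokenType;

#define MAXTOKENLEN 40

static const char *inputBuffer = "a[index] = 4+2"; /* line from the slide */
static size_t position = 0;          /* next character to be read */
char tokenString[MAXTOKENLEN + 1];   /* lexeme of the latest token */

/* Returns the next token and leaves its lexeme in tokenString. */
TokenType getToken(void) {
    size_t len = 0;
    TokenType token;
    unsigned char c;

    /* Skip white space between tokens. */
    while (isspace((unsigned char)inputBuffer[position]))
        position++;

    c = (unsigned char)inputBuffer[position];
    if (c == '\0') {
        tokenString[0] = '\0';
        return ENDFILE;
    }
    if (isalpha(c)) {                /* identifier: letter(letter|digit)* */
        while (isalnum((unsigned char)inputBuffer[position])) {
            if (len < MAXTOKENLEN)
                tokenString[len++] = inputBuffer[position];
            position++;
        }
        token = ID;
    } else if (isdigit(c)) {         /* number: digit digit* */
        while (isdigit((unsigned char)inputBuffer[position])) {
            if (len < MAXTOKENLEN)
                tokenString[len++] = inputBuffer[position];
            position++;
        }
        token = NUM;
    } else {                         /* single-character special symbols */
        tokenString[len++] = inputBuffer[position++];
        switch (c) {
        case '+': token = PLUS; break;
        case '-': token = MINUS; break;
        case '[': token = LBRACKET; break;
        case ']': token = RBRACKET; break;
        case '=': token = ASSIGN; break;
        default:  token = ERROR; break;
        }
    }
    tokenString[len] = '\0';
    return token;
}
```

Successive calls yield ID ("a"), LBRACKET, ID ("index"), RBRACKET, ASSIGN, NUM ("4"), PLUS, NUM ("2") and finally ENDFILE. Only the token value is returned; the lexeme is left in a variable, as the slide describes.
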
2.2 Regular Expressions

Some Related Basic Concepts
• Regular expressions
– represent patterns of strings of characters.
• A regular expression r
– is completely defined by the set of strings it matches.
– This set is called the language of r, written L(r).
• The characters from which strings are formed
– are referred to as symbols.
• The set of legal symbols
– is called the alphabet and is written as the Greek letter Σ.

Some Related Basic Concepts
• A regular expression r
– contains characters from the alphabet used as patterns; for example, a is the character a used as a pattern.
• A regular expression r
– may also contain special characters called meta-characters or meta-symbols.
• An escape character can be used to turn off the special meaning of a meta-character.
– Common escape characters are the backslash and quotes.

More About Regular Expressions
2.2.1 Definition of Regular Expressions
2.2.2 Extensions to Regular Expressions
2.2.3 Regular Expressions for Programming Language Tokens

2.2.1 Definition of Regular Expressions

Basic Regular Expressions
• Single characters from the alphabet, which match themselves
– a matches the character a, written L(a) = {a}
– ε matches the empty string, written L(ε) = {ε}

Regular Expression Operations
• Choice among alternatives, indicated by
the meta-character |
• Concatenation, indicated by juxtaposition
• Repetition or “closure”, indicated by the
meta-character *

Choice Among Alternatives
• If r and s are regular expressions, then r|s is a regular expression which matches any string that is matched either by r or by s.
• In terms of languages, the language of r|s is the union of the languages of r and s: L(r|s) = L(r) ∪ L(s)
• A simple example: L(a|b) = L(a) ∪ L(b) = {a, b}
• Choice can be extended to more than two alternatives, for example L(a|b|c|d) = {a, b, c, d}.

Concatenation
• If r and s are regular expressions, then rs is their concatenation, which matches any string that is the concatenation of two strings, the first of which matches r and the second of which matches s.
• In terms of generated languages, the concatenation S1S2 of two sets of strings is the set of all strings formed by appending a string of S2 to a string of S1.
• A simple example: (a|b)c matches ac and bc
• Concatenation can also be extended to more than two regular expressions.

Repetition
• The repetition operation of a regular expression, called (Kleene) closure, is written r*, where r is a regular expression. The regular expression r* matches any finite concatenation of strings, each of which matches r.
• A simple example: a* matches the strings ε, a, aa, aaa,…
• In terms of generated languages, given a set S of strings, S* is an infinite union of sets, S* = {ε} ∪ S ∪ SS ∪ SSS ∪ …, but each element in it is a finite concatenation of strings from S

Precedence of Operations and Use of Parentheses
• The standard convention
– Repetition * has the highest precedence
– Concatenation has the next highest
– Choice | has the lowest
• A simple example
– a|bc* is interpreted as a|(b(c*))
• Parentheses are used to indicate a different precedence
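
The precedence convention can be checked experimentally. The sketch below assumes a POSIX system that provides <regex.h>; POSIX extended regular expressions follow the same convention for *, concatenation and |. The helper name matchesWhole is hypothetical, and the caller is expected to anchor the pattern with ^( … )$ so that the whole string must match.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s matches the POSIX extended regular expression
   pattern, 0 if it does not, and -1 if the pattern does not compile. */
int matchesWhole(const char *pattern, const char *s) {
    regex_t re;
    int matched;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    matched = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return matched;
}
```

With the pattern "^(a|bc*)$", the strings a, b and bcc match, but acc does not, because c* binds to b alone. With explicit parentheses, "^((a|b)c*)$" does match acc.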

Names for Regular Expressions
• Give a name to a long regular expression
– digit = 0|1|2|3|…|9
– (0|1|2|3|…|9)(0|1|2|3|…|9)* can then be written as digit digit*

Definition of Regular Expression
• A regular expression is one of the following:
(1) A basic regular expression: a single legal character a from the alphabet Σ, or the meta-character ε
(2) The form r|s, where r and s are regular expressions
(3) The form rs, where r and s are regular expressions
(4) The form r*, where r is a regular expression
(5) The form (r), where r is a regular expression

• Parentheses do not change the language: L((r)) = L(r).

Examples of Regular Expressions
Example 1:
– Σ = {a, b, c}
– The set of all strings over this alphabet that contain exactly one b
– (a|c)*b(a|c)*

Example 2:
– Σ = {a, b, c}
– The set of all strings that contain at most one b
– (a|c)*|(a|c)*b(a|c)* or, equivalently, (a|c)*(b|ε)(a|c)*
– The same language may be generated by many different regular expressions.
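
Both languages can also be recognized by a small hand-written state machine, which is a convenient way to test candidate regular expressions against example strings. The sketch below is illustrative only; the function names scanB, exactlyOneB and atMostOneB are hypothetical.

```c
/* Runs over s and returns the number of b's seen (0 or 1), or 2 if
   the string has more than one b or a character outside {a, b, c}. */
static int scanB(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        if (*s == 'a' || *s == 'c')
            continue;            /* (a|c)* leaves the state unchanged */
        else if (*s == 'b' && state < 2)
            state++;             /* one more b */
        else
            return 2;            /* illegal character */
    }
    return state;
}

/* Example 1: (a|c)*b(a|c)* */
int exactlyOneB(const char *s) { return scanB(s) == 1; }

/* Example 2: (a|c)*(b|ε)(a|c)* */
int atMostOneB(const char *s) { return scanB(s) <= 1; }
```

The two accepting conditions differ only in which final states are allowed, which mirrors the difference between b and (b|ε) in the two expressions.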

Examples of Regular Expressions
Example 3:
– Σ = {a, b}
– The set of strings consisting of a single b surrounded by the same number of a’s
– S = {b, aba, aabaa, aaabaaa, …} = { a^n b a^n | n ≥ 0 }
– This set cannot be described by a regular expression.
• “Regular expressions can’t count”
– Not all sets of strings can be generated by regular expressions.
– A set of strings that is the language of some regular expression is called a regular set, to distinguish it from other sets.
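
To see what "counting" means here, consider a hand-written recognizer for S. The sketch below (with the hypothetical name isBalancedAroundB) has to remember how many a's came before the b, and that number is unbounded; a regular expression has no way to carry such a count.

```c
#include <stddef.h>

/* Returns 1 if s has the form a^n b a^n for some n >= 0, else 0. */
int isBalancedAroundB(const char *s) {
    size_t before = 0, after = 0;

    while (*s == 'a') { before++; s++; }   /* count leading a's */
    if (*s != 'b')
        return 0;                          /* exactly one b expected */
    s++;
    while (*s == 'a') { after++; s++; }    /* count trailing a's */

    /* Nothing may follow, and the two counts must agree. */
    return *s == '\0' && before == after;
}
```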

Examples of Regular Expressions
Example 4:
– Σ = {a, b, c}
– The set of all strings that contain no two consecutive b’s
– First attempt: ( (a|c)* | (b(a|c))* )*
– Simplified: ( (a|c) | (b(a|c)) )* or (a | c | ba | bc)*
• Not yet the correct answer: these expressions cannot match a string that ends in b
The correct regular expression
– (a | c | ba | bc)* (b|ε)
– or (b|ε) (a | c | ab | cb)*
– In general, (notb | b notb)* (b|ε), where notb = a|c
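
A direct recognizer for this language only needs to remember whether the previous character was a b. The following sketch (the name noConsecutiveB is hypothetical) can be used to check candidate expressions against test strings such as b or ab, which the first attempt fails to match.

```c
/* Returns 1 if s is a string over {a, b, c} with no two consecutive
   b's, else 0. */
int noConsecutiveB(const char *s) {
    int prevWasB = 0;

    for (; *s != '\0'; s++) {
        if (*s == 'b') {
            if (prevWasB)
                return 0;        /* found "bb" */
            prevWasB = 1;
        } else if (*s == 'a' || *s == 'c') {
            prevWasB = 0;        /* a "notb" character resets the state */
        } else {
            return 0;            /* character outside the alphabet */
        }
    }
    return 1;
}
```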

Examples of Regular Expressions
Example 5:
– Σ = {a, b, c}
– ((b|c)* a (b|c)* a)* (b|c)*
– Determine a concise English description of the language
– Answer: the strings that contain an even number of a’s
– With nota = b|c, the expression reads (nota* a nota* a)* nota*
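
The English description can be cross-checked with a recognizer that tracks only the parity of the a's seen so far. This is a sketch with a hypothetical function name; note that parity needs just two states, so unlike Example 3 no unbounded counter is required.

```c
/* Returns 1 if s is a string over {a, b, c} containing an even
   number of a's (zero counts as even), else 0. */
int evenNumberOfAs(const char *s) {
    int odd = 0;                 /* 1 while an odd number of a's seen */

    for (; *s != '\0'; s++) {
        if (*s == 'a')
            odd = !odd;          /* each a flips the parity */
        else if (*s != 'b' && *s != 'c')
            return 0;            /* character outside the alphabet */
    }
    return !odd;
}
```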

2.2.2 Extensions to Regular Expressions

List of New Operations
1) one or more repetitions
r+
2) any character
period “.”
3) a range of characters
[0-9], [a-zA-Z]
4) any character not in a given set
~(a|b|c): a character that is neither a nor b nor c
[^abc] in Lex
5) optional sub-expressions
r?: the strings matched by r are optional

2.2.3 Regular Expressions for Programming Language Tokens

Numbers, Reserved Words and Identifiers
Numbers
– nat = [0-9]+
– signedNat = (+|-)?nat
– number = signedNat(“.”nat)? (E signedNat)?
Reserved Words and Identifiers
– reserved = if | while | do |………
– letter = [a-zA-Z]
– digit = [0-9]
– identifier = letter(letter|digit)*
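
The named definitions translate almost line by line into small matching functions. The sketch below is one possible hand coding, with hypothetical function names; it checks whole strings rather than scanning a prefix of the input, and it relies on isalpha and isdigit behaving as [a-zA-Z] and [0-9], which holds in the default "C" locale.

```c
#include <ctype.h>
#include <stddef.h>

/* nat = [0-9]+ : returns a pointer just past the digits, or NULL. */
static const char *matchNat(const char *s) {
    if (!isdigit((unsigned char)*s))
        return NULL;
    while (isdigit((unsigned char)*s))
        s++;
    return s;
}

/* signedNat = (+|-)?nat */
static const char *matchSignedNat(const char *s) {
    if (*s == '+' || *s == '-')
        s++;
    return matchNat(s);
}

/* number = signedNat("."nat)?(E signedNat)? ; 1 if all of s matches. */
int isNumber(const char *s) {
    const char *p = matchSignedNat(s);
    const char *q;

    if (p == NULL)
        return 0;
    if (*p == '.' && (q = matchNat(p + 1)) != NULL)
        p = q;                   /* optional fraction */
    if (*p == 'E' && (q = matchSignedNat(p + 1)) != NULL)
        p = q;                   /* optional exponent */
    return *p == '\0';
}

/* identifier = letter(letter|digit)* ; 1 if all of s matches. */
int isIdentifier(const char *s) {
    if (!isalpha((unsigned char)*s))
        return 0;
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s))
            return 0;
    return 1;
}
```

Under these definitions 12. and 1E are not numbers, because the fraction and the exponent each require at least one digit once they are started.
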
Comments
Several forms:
{ this is a Pascal comment }    {(~})*}
; this is a Scheme comment
-- this is an Ada comment    --(~newline)* newline
/* this is a C comment */
– Writing b for / and a for *, a C comment cannot be written as ba(~(ab))*ab, because ~ is restricted to single characters
– One solution for ~(ab): b*(a*~(a|b)b*)*a*
Because of the complexity of such regular expressions, comments are handled by ad hoc methods in actual scanners.
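
An ad hoc method for C comments can be as simple as the following sketch, which looks for the two-character closing delimiter directly instead of encoding "not followed by" in a regular expression. The function name skipCComment and its return conventions are assumptions made for this example.

```c
#include <stddef.h>

/* If s begins with the opening delimiter of a C comment, returns a
   pointer just past the closing delimiter. Returns s unchanged if no
   comment starts at s, and NULL if the comment is never closed. */
const char *skipCComment(const char *s) {
    if (s[0] != '/' || s[1] != '*')
        return s;                /* not a comment */
    s += 2;                      /* skip the opening delimiter */
    while (*s != '\0') {
        if (s[0] == '*' && s[1] == '/')
            return s + 2;        /* skip the closing delimiter */
        s++;
    }
    return NULL;                 /* end of input inside a comment */
}
```

Because the search starts after the opening delimiter, the input /*/ is correctly treated as an unterminated comment rather than a complete one.
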
Ambiguity
Ambiguity: some strings can be matched by several different regular expressions.
– When a string could be either an identifier or a keyword, the keyword interpretation is preferred.
– When a string could be a single token or a sequence of several tokens, the single token is preferred (the principle of longest substring).
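
Both rules are commonly implemented together: the scanner first collects the longest identifier-shaped lexeme, then looks it up in a table of reserved words. The sketch below shows only the lookup step; the table contents and the function name reservedLookup are assumptions for illustration.

```c
#include <string.h>

typedef enum { IF, THEN, ELSE, ID } TokenType;

/* Table of reserved words; a real language would list all of them. */
static const struct {
    const char *lexeme;
    TokenType token;
} reservedWords[] = {
    { "if", IF }, { "then", THEN }, { "else", ELSE }
};

/* Classifies a complete lexeme of the form letter(letter|digit)*:
   a reserved word if it is in the table, otherwise an identifier. */
TokenType reservedLookup(const char *lexeme) {
    size_t i;
    size_t count = sizeof reservedWords / sizeof reservedWords[0];

    for (i = 0; i < count; i++)
        if (strcmp(lexeme, reservedWords[i].lexeme) == 0)
            return reservedWords[i].token;   /* keyword preferred */
    return ID;
}
```

Since the lookup receives the longest lexeme, ifx is classified as ID rather than as the keyword if followed by x.
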
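White Space and Lookahead
White space:
– Delimiters: characters that are unambiguously part of other tokens act as delimiters.
– whitespace = ( newline | blank | tab | comment )+
– A language may be free format or fixed format
Lookahead:
– requires buffering of input characters and marking places for backtracking
– Fortran example: the first line below begins a DO loop (DO 99 I = 1,10), while the second assigns 1.10 to the variable DO99I; the scanner cannot tell them apart until it reaches the comma or the period
DO99I=1,10
DO99I=1.10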
