Week 5-6

Slides of Compiler Construction chapter 5-6

Uploaded by

Malik Zohaib

MODULE # 3: COMPILER CONSTRUCTION

INSTRUCTOR: DR. SAKEENA JAVAID

1
OUTLINE OF THE TOPICS TO BE COVERED TODAY

 Chomsky's classification of grammars


 Lexical analysis
 Tokens and types of tokens
 Regular Expressions and DFA’s

2
CHOMSKY'S CLASSIFICATION OF GRAMMARS

 Type-0 grammars include all formal grammars. The languages they generate are also known as the recursively enumerable languages.
 Type-1 grammars generate the context-sensitive languages. Every Type-1 grammar is also a Type-0 grammar.
 Type-2 grammars generate the context-free languages. Every Type-2 grammar is also a Type-1 grammar.
 Type-3 grammars generate the regular languages. Type 3 is
the most restricted form of grammar.
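To make the most restricted class concrete, the sketch below enumerates the strings derived by a small Type-3 (right-linear) grammar and checks them against an equivalent regular expression. The grammar itself is a hypothetical example, not one from the slides.

```python
import re

# A Type-3 (regular, right-linear) grammar: every production has the
# form A -> aB or A -> a. This example grammar generates the binary
# strings ending in 1, i.e. the regular language denoted by (0|1)*1.
grammar = {
    "S": ["0S", "1S", "1"],   # S -> 0S | 1S | 1
}

def generate(max_len):
    """Enumerate every string the grammar derives, up to max_len symbols."""
    results = set()
    frontier = ["S"]
    while frontier:
        form = frontier.pop()
        if form.isdigit():              # no nonterminal left: a sentence
            if len(form) <= max_len:
                results.add(form)
            continue
        # Right-linear: the single nonterminal is always the last symbol.
        head, nt = form[:-1], form[-1]
        if len(head) >= max_len:        # any expansion would be too long
            continue
        for rhs in grammar[nt]:
            frontier.append(head + rhs)
    return results

lang = generate(3)
# Every generated string matches the equivalent regular expression.
assert all(re.fullmatch(r"[01]*1", s) for s in lang)
```

This is exactly the sense in which Type-3 grammars and regular expressions describe the same languages.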

3
LEXICAL ANALYSIS

 First phase of a compiler is called lexical analysis or scanning
 The lexical analyzer reads the stream of characters making up the source program
 It groups the characters into meaningful sequences called lexemes
 For each lexeme, the lexical analyzer produces as output a token of the form
<token-name, attribute-value>
 In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token
 The token is passed on to the subsequent phase, syntax analysis
 Information from the symbol-table entry is needed for semantic analysis and code generation.
4
LEXICAL ANALYSIS

 For example:
 Suppose a source program contains the assignment statement
position = initial + rate * 60 … (1)
 Sequence (2) shows the representation of statement (1) after lexical analysis as a sequence of tokens
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60> … (2)
 A visual representation of this token sequence is shown next
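The token sequence in (2) can be reproduced with a minimal scanner sketch. The token names and the attribute encoding below (symbol-table index for id, numeric value for number, no attribute for operators) are illustrative assumptions, not a fixed standard.

```python
import re

# Minimal scanner for the slide's example statement. Identifiers are
# entered into a symbol table; the attribute of an id token is its
# (1-based) symbol-table index, matching sequence (2) on the slide.
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_][A-Za-z0-9_]*"),
    ("number", r"\d+"),
    ("op",     r"[=+\-*/]"),
    ("skip",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    symtab, tokens = [], []
    for m in MASTER.finditer(source):
        kind, lexeme = m.lastgroup, m.group()
        if kind == "skip":
            continue
        if kind == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))
        elif kind == "number":
            tokens.append(("number", int(lexeme)))
        else:                          # operators carry no attribute
            tokens.append((lexeme, None))
    return tokens, symtab

tokens, symtab = tokenize("position = initial + rate * 60")
# → [('id', 1), ('=', None), ('id', 2), ('+', None),
#    ('id', 3), ('*', None), ('number', 60)]
```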

5
LEXICAL ANALYSIS

 Visual representation
 All stages during compilation of the code

6
LEXICAL ANALYSIS

 Interactions between the Lexical Analyzer and the Parser
 Lexical analysers are divided into a cascade of two processes:
a) Scanning consists of the simple processes that
do not require tokenization of the input, such
as deletion of comments and compaction of
consecutive whitespace characters into one.
b) Lexical analysis is the more complex portion,
which produces tokens from the output of the
scanner.
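The scanning half of this cascade can be sketched as a simple pre-pass over the source text. The C-style comment syntax used here is an assumption for illustration; the slide does not fix a particular language.

```python
import re

# Sketch of the "scanning" process: delete comments and compact runs
# of consecutive whitespace into one blank, before tokenization proper.
def scan(source):
    source = re.sub(r"/\*.*?\*/", " ", source, flags=re.S)  # block comments
    source = re.sub(r"//[^\n]*", " ", source)               # line comments
    return re.sub(r"\s+", " ", source).strip()              # compact whitespace

print(scan("x = 1;   /* init */\n  y = x + 2; // add"))
# → "x = 1; y = x + 2;"
```

The lexical-analysis process proper then produces tokens from this cleaned-up stream.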

7
LEXICAL ANALYSIS

 Lexical Analysis Versus Parsing


 There are several reasons why the analysis portion of a compiler is normally separated into lexical analysis and parsing (syntax analysis) phases
 Simplicity of design is the most important consideration.
 While designing a new language, separating lexical and syntactic concerns can lead to a
cleaner overall language design
 Compiler efficiency is improved (using lexical analysis and buffering techniques)
 Compiler portability is enhanced (Input-device-specific peculiarities can be restricted to
the lexical analyser)

8
LEXICAL ANALYSIS

 Tokens, Patterns, and Lexemes


 When discussing lexical analysis, we use three related but distinct terms
 A token is a pair consisting of a token name and an optional attribute value
 e.g., a particular keyword, or a sequence of input characters denoting an identifier. The token names are the input
symbols that the parser processes. In what follows, we shall generally write the name of a token in boldface. We will
often refer to a token by its token name.
 A pattern is a description of the form that the lexemes of a token may take.
 The pattern is just the sequence of characters that form the keyword. For identifiers and some other tokens, the pattern
is a more complex structure that is matched by many strings
 A lexeme is a sequence of characters in the source program that matches the pattern for a token and is identified by the
lexical analyzer as an instance of that token
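The three terms can be shown side by side in code: a token name, the pattern describing its lexemes (a regular expression here), and concrete lexemes matched against the patterns. The particular patterns and the classify helper are illustrative, not from the slides.

```python
import re

# Token name -> pattern describing the form its lexemes may take.
# For a keyword, the pattern is just the keyword itself; for identifiers
# and numbers the pattern is matched by many strings.
patterns = {
    "if":     r"if",
    "id":     r"[A-Za-z][A-Za-z0-9]*",
    "number": r"\d+(\.\d+)?",
}

def classify(lexeme):
    """Return the name of the first token whose pattern the lexeme matches."""
    for token, pattern in patterns.items():
        if re.fullmatch(pattern, lexeme):
            return token
    return None

assert classify("if") == "if"        # keyword pattern checked first
assert classify("score") == "id"     # a lexeme for token id
assert classify("3.14") == "number"  # a lexeme for token number
```

Note that the keyword pattern is listed before id, so the lexeme if is classified as the keyword rather than as an identifier; real scanners use the same priority rule.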

9
LEXICAL ANALYSIS

 Example: Some typical tokens, their informally described patterns,
and some sample lexemes. To see how these concepts are used in
practice, consider the C statement:
printf("Total = %d\n", score);
 both printf and score are lexemes matching the pattern for token id, and

 "Total = %d\n" is a lexeme matching the pattern for token literal

10
LEXICAL ANALYSIS
Examples of Tokens and Lexemes

11
LEXICAL ANALYSIS

 Categorization of Tokens:
 In many programming languages, the following classes cover most or all of the
tokens:
 One token for each keyword. The pattern for a keyword is the same as the keyword itself
 Tokens for the operators, either individually or in classes such as the token comparison
 One token representing all identifiers
 One or more tokens representing constants, such as numbers and literal strings
 Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon

12
LEXICAL ANALYSIS

 Attributes for Tokens:


 When more than one lexeme can match a pattern,
 the lexical analyser must provide the subsequent compiler phases additional information about the particular lexeme that matched. For example, the pattern for token number matches both 0 and 1.
 In many cases the lexical analyser returns to the parser not only a token name, but an attribute value that describes the lexeme represented by the token;
 e.g., information about an identifier: its lexeme, its type, and the location at which it is first found

13
LEXICAL ANALYSIS

 Example 2:
 The token names and associated attribute values for the Fortran statement
E = M * C ** 2
 Please solve it with respect to tokens…

14
LEXICAL ANALYSIS

15
LEXICAL ANALYSIS

 Lexical Errors:
 It is hard for a lexical analyser to tell, without the aid of other components, that there is a source-code error.
 For instance, if the string fi is encountered for the first time in a C program in the context:
 fi ( a == f(x)) ...
 a lexical analyser cannot tell whether fi is a misspelling of the keyword if
 or an undeclared function identifier
 Since fi is a valid lexeme for the token id, the lexical analyser must return the token id to the parser
and let some other phase of the compiler
 (probably the parser in this case) handle the error due to transposition of the letters

16
LEXICAL ANALYSIS

 A situation arises in which the lexical analyser is unable to proceed


 because none of the patterns for tokens matches any prefix of the remaining input
 The simplest recovery strategy is “panic mode" recovery.
 We delete successive characters from the remaining input, until the lexical analyser can find a
well-formed token at the beginning of what input is left.
 This recovery technique may confuse the parser, but in an interactive computing environment it
may be quite adequate.
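Panic-mode recovery can be sketched directly: delete characters until some token pattern matches at the front of the remaining input. The token pattern TOKEN below is a hypothetical stand-in for the scanner's real patterns.

```python
import re

# Hypothetical union of the scanner's token patterns.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[=+\-*/();]")

def next_token(s, i):
    """Return (lexeme, next position, chars deleted by panic mode)."""
    deleted = 0
    while i < len(s):
        if s[i].isspace():
            i += 1
            continue
        m = TOKEN.match(s, i)
        if m:                        # a well-formed token begins here
            return m.group(), m.end(), deleted
        i += 1                       # panic mode: discard one character
        deleted += 1
    return None, i, deleted

tok, pos, deleted = next_token("@#x = 1", 0)
assert (tok, deleted) == ("x", 2)    # two junk characters discarded
```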

17
LEXICAL ANALYSIS

 Other possible error-recovery actions are:


 Delete one character from the remaining input.
 Insert a missing character into the remaining input.
 Replace a character by another character.
 Transpose two adjacent characters.

18
REGULAR EXPRESSIONS AND DFA

 Specifications of Tokens:
 Regular expressions are an important notation for specifying lexeme patterns.
 They cannot express all possible patterns; however, they are very effective in specifying the types of
patterns that we actually need for tokens.
 We shall study the formal notation for regular expressions
 Strings and languages
 An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and
punctuation.
 The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many
software systems. Unicode is another example.

19
REGULAR EXPRESSIONS AND DFA’S

 A string over an alphabet is a finite sequence of symbols drawn from that alphabet .
 In language theory, the terms “sentence" and “word" are often used as synonyms for
“string."
 The length of a string s, written |s|, is the number of occurrences of symbols in s.
 For example, banana is a string of length six. The empty string, denoted ε, is the string of
length zero

20
REGULAR EXPRESSIONS AND DFA’S

 A language is any countable set of strings over some fixed alphabet. This definition is
very broad.
 Abstract languages like ∅, the empty set, and {ε}, the set containing only the empty string,
are languages under this definition.
 Note that the definition of “language" does not require that any meaning be ascribed
to the strings in the language

21
REGULAR EXPRESSIONS AND DFA’S

 Operations on languages
 In lexical analysis, the most important operations on languages are union, concatenation, and closure
 Union is the familiar operation on sets. The concatenation of languages is all strings formed by taking a string
from the first language and a string from the second language, in all possible ways, and concatenating them.
 The (Kleene) closure of a language L, denoted L*, is the set of strings you get by concatenating L zero or more
times.
 Note that L⁰, "the concatenation of L zero times," is defined to be {ε}
 Finally, the positive closure, denoted L⁺, is the same as the Kleene closure but without the term L⁰. That is, ε will
not be in L⁺ unless it is in L itself.
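These three operations can be demonstrated on small finite languages. Since the Kleene closure is infinite, closure_upto below truncates it at n concatenations; the helper is illustrative.

```python
# Two small languages over disjoint alphabets.
L = {"a", "b"}
M = {"0", "1"}

union = L | M                                  # L ∪ M
concat = {x + y for x in L for y in M}         # LM

def closure_upto(lang, n):
    """L* truncated: the empty string plus all 1..n-fold concatenations."""
    result, layer = {""}, {""}
    for _ in range(n):
        layer = {x + y for x in layer for y in lang}
        result |= layer
    return result

positive = closure_upto(L, 2) - {""}           # L⁺ truncated (ε ∉ L here)

assert concat == {"a0", "a1", "b0", "b1"}
assert "" in closure_upto(L, 2)                # L⁰ = {ε}
assert "ab" in closure_upto(L, 2)
assert "" not in positive and "a" in positive
```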

22
REGULAR EXPRESSIONS AND DFA’S

23
REGULAR EXPRESSIONS AND DFA’S

 Regular Expressions
 Here are the rules that define the regular expressions over some alphabet and the languages that those expressions
denote.
 BASIS: There are two rules that form the basis:

 1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.

 2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in
its one position.
 Note that by convention, we use italics for symbols, and boldface for their corresponding regular expression

24
REGULAR EXPRESSIONS AND DFA’S

 INDUCTION:

 There are four parts to the induction whereby larger regular expressions are built from smaller ones.

 Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.

 1. (r)|(s) is a regular expression denoting the language L(r) U L(s).

 2. (r)(s) is a regular expression denoting the language L(r)L(s).

 3. (r)* is a regular expression denoting (L(r))*.

 4. (r) is a regular expression denoting L(r).

 This last rule says that we can add additional pairs of parentheses around expressions without changing the
language they denote

25
REGULAR EXPRESSIONS AND DFA’S

 Conventions for dropping the parentheses


 As defined, regular expressions often contain unnecessary pairs of parentheses.
 We may drop certain pairs of parentheses if we adopt the conventions that:
 a) The unary operator * has highest precedence and is left associative.
 b) Concatenation has second highest precedence and is left associative
 c) | has lowest precedence and is left associative
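Under these conventions, for example, a|b*c may be written without parentheses and parses as a | ((b*)c). Python's re module follows the same precedence rules, so a quick check:

```python
import re

# a|b*c under the standard precedence: either the single symbol a,
# or zero or more b's followed by exactly one c.
pattern = re.compile(r"a|b*c")

assert pattern.fullmatch("a")
assert pattern.fullmatch("bbc")
assert pattern.fullmatch("c")        # zero b's is allowed
assert not pattern.fullmatch("ab")   # NOT parsed as (a|b)*c
```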

26
27
REGULAR EXPRESSIONS AND DFA’S

28
REGULAR EXPRESSIONS AND DFA’S

 Transition Diagrams
 As an intermediate step in the construction of a lexical analyser, we first convert patterns into stylized
flowcharts, called "transition diagrams."
 We perform the conversion from regular-expression patterns to transition diagrams
 Transition diagrams have a collection of nodes or circles, called states. Each state represents a
condition that could occur during the process of scanning the input looking for a lexeme that
matches one of several patterns.
 We may think of a state as summarizing all we need to know about what characters we have seen
between the lexemeBegin pointer and the forward pointer

29
REGULAR EXPRESSIONS AND DFA’S

 DFA
 We shall assume that all our transition diagrams are deterministic, meaning that there is never more than
one edge out of a given state with a given symbol among its labels
 Conventions for the transition diagrams
 Some important conventions about transition diagrams are:
 Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the
actual lexeme may not consist of all positions between the lexemeBegin and forward pointers.
 In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the
symbol that got us to the accepting state), then we shall additionally place a * near that accepting state
 One state is designated the start state, or initial state; it is indicated by an edge, labeled “start,"

30
REGULAR EXPRESSIONS AND DFA’S

Figure: Patterns for tokens


Figure: Tokens, their patterns, and attribute values
31
REGULAR EXPRESSIONS AND DFA’S

 Example:
 A transition diagram that
recognizes the lexemes matching
the token relop.
 We begin in state 0, the start
state
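The relop transition diagram can be encoded directly as code: from state 0, each input symbol selects at most one next state, and states reached on a lookahead that is not part of the lexeme retract by one position. The attribute names (LT, LE, NE, EQ, GT, GE) are illustrative.

```python
# Direct encoding of the relop transition diagram: deterministic,
# with retraction when the lookahead symbol is not part of the lexeme.
def relop(s, i=0):
    """Return (token, attribute, position after the lexeme) or None."""
    if i >= len(s):
        return None
    c = s[i]
    if c == "<":
        if i + 1 < len(s) and s[i + 1] == "=":
            return ("relop", "LE", i + 2)
        if i + 1 < len(s) and s[i + 1] == ">":
            return ("relop", "NE", i + 2)
        return ("relop", "LT", i + 1)   # *-state: forward pointer retracted
    if c == "=":
        return ("relop", "EQ", i + 1)
    if c == ">":
        if i + 1 < len(s) and s[i + 1] == "=":
            return ("relop", "GE", i + 2)
        return ("relop", "GT", i + 1)   # *-state: forward pointer retracted
    return None                         # no relop pattern matches here

assert relop("<=") == ("relop", "LE", 2)
assert relop("<x") == ("relop", "LT", 1)   # lexeme is just "<"
assert relop("<>") == ("relop", "NE", 2)
```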

32
Thanks!

33
