Compilers Notes
A compiler is a program that reads a program written in one language, the source language, and translates it into an equivalent program in another language, the target language. As part of the translation, the compiler should also report the presence of errors in the source program.

[Figure: Source Program -> Compiler -> Target Program, with Error Messages as a side output]

There are two parts to compilation. The analysis part breaks up the source program into constituent pieces and creates an intermediate representation of the source program. The synthesis part constructs the desired target program from the intermediate representation.
Phases of Compiler
The compiler has a number of phases, plus a symbol-table manager and an error handler.

[Figure: Source Program -> Lexical Analyzer -> Syntax Analyzer -> Semantic Analyzer -> Intermediate Code Generator -> Code Optimizer -> Code Generator -> Target Program]

The cousins of the compiler are:
1. Preprocessor.
2. Assembler.
3. Loader and link-editor.

Front End vs Back End of a Compiler

The phases of a compiler are collected into a front end and a back end. The front end includes all analysis phases and the intermediate code generator. The back end includes the code optimization phase and the final code generation phase. The front end analyzes the source program and produces intermediate code, while the back end synthesizes the target program from the intermediate code.
A naive (brute-force) front end might run the phases serially:
1. The lexical analyzer takes the source program as input and produces a long string of tokens.
2. The syntax analyzer takes the output of the lexical analyzer and produces a large tree.
3. The semantic analyzer takes the output of the syntax analyzer and produces another tree.
4. Similarly, the intermediate code generator takes the tree produced by the semantic analyzer and produces intermediate code.

Minus Points

This requires an enormous amount of space to store tokens and trees, and it is very slow, since each phase would have to read and write temporary disk files.
Remedy

Use syntax-directed translation to interleave the actions of the phases.

Compiler Construction Tools
Parser Generators: the input specification is based on a context-free grammar; the organization is based on a pushdown automaton.
Scanner Generators: the input specification is based on regular expressions; the organization is based on a finite automaton.
Syntax-Directed Translation Engines: walk the parse tree and generate intermediate code as a result.
Automatic Code Generators: translate the intermediate language into machine language.
Data-Flow Engines: perform code optimization using data-flow analysis.
Syntax Definition
A context-free grammar, CFG (synonym: Backus-Naur Form, or BNF), is a common notation for specifying the syntax of a language. For example, an "IF-ELSE" statement in the C language has the form

IF (Expr) stmt ELSE stmt

In other words, it is the concatenation of:

the keyword IF ; an opening parenthesis ( ; an expression Expr ; a closing parenthesis ) ; a statement stmt ; the keyword ELSE ; finally, another statement stmt.
The syntax of an 'IF-ELSE' statement can be specified by the following 'production rule' in the CFG:

stmt → IF (Expr) stmt ELSE stmt

The arrow (→) is read as "can have the form". A context-free grammar (CFG) has four components:
1. A set of tokens, called terminals.
2. A set of variables, called nonterminals.
3. A set of production rules.
4. A designation of one of the nonterminals as the start symbol.
Multiple productions with the same nonterminal on the left, like:

list → list + digit
list → list - digit
list → digit

may be grouped together, separated by vertical bars, like:

list → list + digit | list - digit | digit
Ambiguity
A grammar is ambiguous if two or more different parse trees can be built for the same token string. Equivalently, an ambiguous grammar allows two different derivations for a token string. A grammar for a compiler should be unambiguous, since different parse trees would give a token string different meanings. Consider the following grammar:

string → string + string | string - string | 0 | 1 | . . . | 9

To show that a grammar is ambiguous, all we need is to find a "single" string that has more than one parse tree.
Figure 2.3, pg. 31

The figure above shows two different parse trees for the token string 9 - 5 + 2, corresponding to the two different ways of parenthesizing the expression: (9 - 5) + 2 and 9 - (5 + 2). The first parenthesization evaluates to 6, the second to 2. Perhaps the most famous example of ambiguity in a programming language is the dangling 'ELSE'. Consider the grammar G with the productions:

S → IF b THEN S ELSE S | IF b THEN S | a

G is ambiguous, since the sentence IF b THEN IF b THEN a ELSE a has two different parse trees, or derivation trees.

Parse tree I: [figure] This parse tree imposes the interpretation IF b THEN (IF b THEN a) ELSE a.

Parse tree II: [figure] This parse tree imposes the interpretation IF b THEN (IF b THEN a ELSE a).

The reason the grammar G is ambiguous is that an 'ELSE' can be associated with two different THENs. For this reason, programming languages which allow both IF-THEN-ELSE and IF-THEN constructs can be ambiguous.
Associativity of Operators
If an operand has operators on both sides then, by convention, the operand should be associated with the operator on the left. In most programming languages, arithmetic operators like addition, subtraction, multiplication, and division are left-associative.
Token string: 9 - 5 + 2

Production rules:

list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | . . . | 9

The parse tree for this left-associative operator is shown in Figure 2.4 on pg. 31.

In the C programming language the assignment operator, =, is right-associative. That is, the token string a = b = c should be treated as a = (b = c).
Token string: a = b = c.

Production rules:

right → letter = right | letter
letter → a | b | . . . | z

The parse tree for this right-associative operator is: [figure]
Precedence of Operators
An expression 9 + 5 * 2 has two possible interpretations:

(9 + 5) * 2 and 9 + (5 * 2)

The associativity of '+' and '*' does not resolve this ambiguity. For this reason, we need to know the relative precedence of operators. The convention is to give multiplication and division higher precedence than addition and subtraction. Only when we have operators of equal precedence do we apply the rules of associativity. So, in the example expression 9 + 5 * 2, we perform the operation of higher precedence, i.e., *, before operations of lower precedence, i.e., +. Therefore, the correct interpretation is 9 + (5 * 2).
Separate Rule
Consider the following grammar and language again:

S → IF b THEN S ELSE S | IF b THEN S | a

The ambiguity can be removed if we arbitrarily decide that an ELSE should be attached to the last preceding THEN, like: [figure]

We can revise the grammar to have two nonterminals, S1 and S2. We insist that S2 generates IF-THEN-ELSE, while S1 is free to generate either kind of statement. The rules of the new grammar are:

S1 → IF b THEN S1 | IF b THEN S2 ELSE S1 | a
S2 → IF b THEN S2 ELSE S2 | a

Although there is no general algorithm that can be used to determine whether a given grammar is ambiguous, it is certainly possible to isolate rules which lead to ambiguity. A grammar containing the productions
A → AA | α

is ambiguous because the string ααα has more than one parse tree. [figure] This ambiguity disappears if we instead use the productions

A → AB | B
B → α

or

A → BA | B
B → α

Syntax of Expressions

A grammar for arithmetic expressions looks like:

expr → expr + term | expr - term | term
term → term * factor | term / factor | factor
factor → id | num | (expr)

That is, an expr is a string of terms separated by '+' and '-', a term is a string of factors separated by '*' and '/', and a factor is a single operand or an expression wrapped inside parentheses.
Syntax-Directed Translation
Modern compilers use syntax-directed translation to interleave the actions of the compiler phases. The syntax analyzer directs the whole process during the parsing of the source code.
It calls the lexical analyzer whenever the syntax analyzer wants another token.
It performs the actions of the semantic analyzer.
It performs the actions of the intermediate code generator.
The actions of the semantic analyzer and the intermediate code generator require the passage of information up and/or down the parse tree.
We think of this information as attributes attached to the nodes of the parse tree, with the parser moving this information between parent nodes and child nodes as it applies the productions of the grammar.

Postfix Notation

Postfix notation, also called reverse Polish notation or RPN, places each binary arithmetic operator after its two operands instead of between them.

Infix expression: (9 - 5) + 2
= (9 5 -) + 2
= (9 5 -) 2 +
= 9 5 - 2 +   : postfix notation

Infix expression: 9 - (5 + 2)
= 9 - (5 2 +)
= 9 (5 2 +) -
= 9 5 2 + -   : postfix notation

Why postfix notation? There are two reasons:

There is only one interpretation.
We do not need parentheses to disambiguate the grammar.
Syntax-Directed Definitions
A syntax-directed definition uses a CFG to specify the syntactic structure of the input. It associates a set of attributes with each grammar symbol, and it associates a set of semantic rules with each production rule.
Suppose a node X in the parse tree has children Y and Z, corresponding to a production X → YZ, and let the nodes X, Y and Z have associated attributes X.a, Y.a and Z.a respectively. The annotated parse tree looks like: [diagram]

If the semantic rule {X.a := Y.a + Z.a} is associated with the production X → YZ, then the parser should add attribute 'a' of node Y and attribute 'a' of node Z together and set attribute 'a' of node X to their sum.

Synthesized Attributes

An attribute is synthesized if its value at a parent node can be determined from the attributes of its children. [diagram] Since in this example the value of node X can be determined from attribute 'a' of the Y and Z nodes, attribute 'a' is a synthesized attribute. Synthesized attributes can be evaluated by a single bottom-up traversal of the parse tree.

Example 2.6: The following figure shows the syntax-directed definition of an infix-to-postfix translator (Figure 2.5, pg. 34).

PRODUCTION               SEMANTIC RULE
expr → expr1 + term      expr.t := expr1.t || term.t || '+'
expr → expr1 - term      expr.t := expr1.t || term.t || '-'
expr → term              expr.t := term.t
term → 0                 term.t := '0'
term → 1                 term.t := '1'
  :                        :
term → 9                 term.t := '9'
Annotated parse tree corresponding to the semantic rules: [diagram]

The above annotated parse tree shows how the input infix expression 9 - 5 + 2 is translated to the postfix expression 9 5 - 2 + at the root.

Depth-First Traversals

A depth-first traversal of a parse tree is one way of evaluating attributes. Note that a syntax-directed definition does not impose any particular order, as long as the order computes the attributes of a parent after all of its children's attributes.

PROCEDURE visit (n : node)
BEGIN
    FOR each child m of n, from left to right DO
        visit (m);
    Evaluate semantic rules at node n
END

[diagram]

Translation Schemes

A translation scheme is another way of specifying a syntax-directed translation. The scheme is a CFG in which program fragments, called semantic actions, are embedded within the right sides of productions. For example,

rest → + term {print('+')} rest1

indicates that a '+' sign should be printed between the depth-first traversal of the term node and the depth-first traversal of the rest1 node.
[Diagram: Ex. 2.8]

REVISION: SYNTAX-DIRECTED TRANSLATION

Step 1: Syntax-directed definition for translating an infix expression to postfix form:

PRODUCTION               SEMANTIC RULE
expr → expr1 + term      expr.t := expr1.t || term.t || '+'
expr → expr1 - term      expr.t := expr1.t || term.t || '-'
Step 2: A translation scheme derived from the syntax-directed definition is (Figure 2.15 on pg. 39):

expr → expr + term   {print('+')}
expr → expr - term   {print('-')}
expr → term
term → 0             {print('0')}
term → 1             {print('1')}
  :                     :
term → 9             {print('9')}

Figure 2.14 on pg. 40 shows the corresponding parse tree. Note that it is not necessary to actually construct the parse tree.
Parsing
Parsing is the process of finding a parse tree for a string of tokens. Equivalently, it is the process of determining whether a string of tokens can be generated by a grammar. The worst-case time of general parsing algorithms is O(n^3), but the typical case is O(n) time. For example, the production rules of a grammar G are:

list → list + digit | list - digit | digit
digit → 0 | 1 | . . . | 9
The given token string is 9-5+2. The parse tree is: [diagram]

Each node in the parse tree is labeled by a grammar symbol. An interior node corresponds to the left side of a production; the children of the interior node correspond to the right side of the production.

The language defined by a grammar is the set of all token strings that can be derived from its start symbol. The language defined by the grammar:

list → list + digit | list - digit | digit
digit → 0 | 1 | 2 | . . . | 9

contains all lists of digits separated by plus and minus signs. The epsilon, ε, on the right side of a production denotes the empty string.

As we have mentioned above, parsing is the process of determining whether a string of tokens can be generated by a grammar. A parser must be capable of constructing the tree, or else the translation cannot be guaranteed correct. For any language that can be described by a CFG, parsing requires O(n^3) time to parse a string of n tokens. However, most programming languages are so simple that a parser requires just O(n) time with a single left-to-right scan over the input string of n tokens.

There are two types of parsing:

1. Top-down parsing (start from the start symbol and derive the string). A top-down parser builds a parse tree by starting at the root and working down towards the leaves.
   o Easy to generate by hand.
   o Examples: recursive-descent, predictive.
2. Bottom-up parsing (start from the string and reduce to the start symbol). A bottom-up parser builds a parse tree by starting at the leaves and working up towards the root.
   o Not easy to build by hand; usually compiler-generating software generates bottom-up parsers.
   o But it handles a larger class of grammars.
   o Example: LR parsers.
Top-Down Parsing
Consider the CFG with productions:

expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | . . . | 9

and the input string 9 - 5 + 2.

Step 0: Initialization: the root must be the starting symbol.
Step 1: expr → term rest
Step 2: term → 9
Step 3: rest → - term rest
Step 4: term → 5
Step 5: rest → + term rest
Step 6: term → 2
Step 7: rest → ε

In the example above, the grammar made it easy for the top-down parser to pick the correct production in each step. This is not true in general; see the example of the dangling 'else'.
Predictive Parsing
Recursive-descent parsing is a top-down method of syntax analysis that executes a set of recursive procedures to process the input. A procedure is associated with each nonterminal of the grammar. Predictive parsing is a special form of recursive-descent parsing, in which the current input token unambiguously determines the production to be applied at each step. Let the grammar be:
expr → term rest
rest → + term rest | - term rest | ε
term → 0 | 1 | . . . | 9

In recursive-descent parsing, we write code for each nonterminal of the grammar. In the case of the above grammar, we should have three procedures, corresponding to the nonterminals expr, rest, and term. Since there is only one production for the nonterminal expr, the procedure expr is:

expr ( )
{
    term ( ); rest ( );
    return;
}

Since there are three (3) productions for rest, the procedure rest uses a global variable, 'lookahead', to select the correct production, or simply selects "no action", i.e., the ε-production, indicating that the lookahead variable is neither + nor -.

rest ( )
{
    IF (lookahead == '+') {
        match ('+'); term ( ); rest ( );
        return;
    } ELSE IF (lookahead == '-') {
        match ('-'); term ( ); rest ( );
        return;
    } ELSE {
        return;
    }
}

The procedure term checks whether the global variable lookahead is a digit:

term ( )
{
    IF (isdigit (lookahead)) {
        match (lookahead);
        return;
    } ELSE {
        ReportError ( );
    }
}

After loading the first input token into the variable 'lookahead', the predictive parser is started by calling the procedure for the starting symbol, 'expr'. If the input is error-free, the parser conducts a depth-first traversal of the parse tree and returns to the caller routine through expr.

Problem with predictive parsing: left recursion.
Left Recursion
A production is left-recursive if the leftmost symbol on the right side is the same as the nonterminal on the left side. For example, expr → expr + term. If one were to code this production in a recursive-descent parser, the parser would go into an infinite loop. [diagram]

We can eliminate the left recursion by introducing new nonterminals and new production rules. For example, the left-recursive grammar:

E → E + T | T
T → T * F | F
F → (E) | id

can be redefined without left recursion as:

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → (E) | id
Getting rid of such immediate left recursion is not enough. One must also get rid of indirect left recursion, where two or more nonterminals are mutually left-recursive.
Lexical Analyzer
The main task of the lexical analyzer is to read a stream of characters as input and produce a sequence of tokens, such as names, keywords, and punctuation marks, for the syntax analyzer. It discards the white space and comments between the tokens and also keeps track of line numbers. <fig: 3.1 pp. 84>
Tokens, Patterns, Lexemes
Specification of Tokens
  o Regular Expressions
  o Notational Shorthand
Finite Automata
  o Nondeterministic Finite Automata (NFA)
  o Deterministic Finite Automata (DFA)
  o Conversion of an NFA into a DFA
  o From a Regular Expression to an NFA
Examples of tokens:

Type tokens (id, num, real, . . . )
Punctuation tokens (IF, void, return, . . . )
Alphabetic tokens (keywords)

Examples of non-tokens: comments, preprocessor directives, macros, blanks, tabs, and newlines.
Patterns
There is a set of strings in the input for which the same token is produced as output. This set of strings is described by a rule called a pattern, associated with the token. Regular expressions are an important notation for specifying patterns. For example, the pattern for the Pascal identifier token, id, is:

id → letter (letter | digit)*
Lexeme
A lexeme is a sequence of characters in the source program that is matched by the pattern for a token. For example, the pattern for the RELOP token matches six lexemes (=, <>, <, <=, >, >=), so the lexical analyzer should return a RELOP token to the parser whenever it sees any one of the six.
By definition, s0 is the empty string, ε, and s1 = s. For example, if x = ba and y = na, then xy2 = banana.

Languages

A language is a set of strings over some fixed alphabet. The language may contain a finite or an infinite number of strings. Let L and M be two languages, where L = {dog, ba, na} and M = {house, ba}. Then:
Union: L ∪ M = {dog, ba, na, house}
Concatenation: LM = {doghouse, dogba, bahouse, baba, nahouse, naba}
Exponentiation: L2 = LL

By definition: L0 = {ε} and L1 = L.
The Kleene closure of a language L, denoted by L*, is "zero or more concatenations of" L.

L* = L0 ∪ L1 ∪ L2 ∪ L3 ∪ . . . ∪ Ln ∪ . . .

For example, if L = {a, b}, then L* = {ε, a, b, aa, ab, ba, bb, aaa, aba, baa, . . . }

The positive closure of a language L, denoted by L+, is "one or more concatenations of" L.

L+ = L1 ∪ L2 ∪ L3 ∪ . . . ∪ Ln ∪ . . .

For example, if L = {a, b}, then L+ = {a, b, aa, ab, ba, bb, aaa, aba, . . . }
Regular Expressions
The regular expressions over an alphabet Σ specify a language according to the following rules:

1. ε is a regular expression that denotes {ε}, that is, the set containing the empty string.
2. If a is a symbol in the alphabet, then a is a regular expression that denotes {a}, that is, the set containing the string a.
3. Suppose r and s are regular expressions denoting the languages L(r) and L(s). Then:
   a. (r) | (s) is a regular expression denoting L(r) ∪ L(s).
   b. (r)(s) is a regular expression denoting L(r) L(s).
   c. (r)* is a regular expression denoting (L(r))*.
   d. (r) is a regular expression denoting L(r); that is, extra pairs of parentheses may be used around regular expressions.

Unnecessary parentheses can be avoided in regular expressions using the following conventions:
The unary operator * (Kleene closure) has the highest precedence and is left-associative.
Concatenation has the second-highest precedence and is left-associative.
Union has the lowest precedence and is left-associative.
Regular Definitions
A regular definition gives names to certain regular expressions and uses those names in other regular expressions. Here is a regular definition for the set of Pascal identifiers, defined as the set of strings of letters and digits beginning with a letter:

letter → A | B | . . . | Z | a | b | . . . | z
digit → 0 | 1 | 2 | . . . | 9
id → letter (letter | digit)*

The regular expression id is the pattern for the Pascal identifier token, and it is defined in terms of letter and digit, where letter is a regular expression for the set of all upper-case and lower-case letters in the alphabet and digit is the regular expression for the set of all decimal digits. The pattern for the Pascal unsigned-number token can be specified as follows:

digit → 0 | 1 | 2 | . . . | 9
digits → digit digit*
optional-fraction → . digits | ε
optional-exponent → (E (+ | - | ε) digits) | ε
num → digits optional-fraction optional-exponent

This regular definition says that:
An optional-fraction is either a decimal point followed by one or more digits, or it is missing (i.e., the empty string).
An optional-exponent is either the empty string or the letter E followed by an optional + or - sign, followed by one or more digits.
Notational Shorthand
The unary postfix operator + means "one or more instances of": (r)+ = rr*
The unary postfix operator ? means "zero or one instance of": r? = (r | ε)

Using this shorthand notation, the Pascal unsigned-number token can be written as:

digit → 0 | 1 | 2 | . . . | 9
digits → digit+
optional-fraction → (. digits)?
optional-exponent → (E (+ | -)? digits)?
num → digits optional-fraction optional-exponent
Finite Automata
A recognizer for a language is a program that takes a string x as input and answers "yes" if x is a sentence of the language and "no" otherwise. One can compile any regular expression into a recognizer by constructing a generalized transition diagram called a finite automaton. A finite automaton can be deterministic or nondeterministic, where nondeterministic means that more than one transition out of a state may be possible on the same input symbol. Both kinds of automata are capable of recognizing what regular expressions can denote.
A nondeterministic finite automaton (NFA) is a mathematical model that consists of:

1. a set of states S;
2. a set of input symbols, Σ, called the input-symbol alphabet;
3. a transition function, move, that maps state-symbol pairs to sets of states;
4. a state s0, called the initial or start state;
5. a set of states F, called the accepting or final states.
An NFA can be described by a transition graph (labeled graph), where the nodes are states and the edges show the transition function. The label on each edge is either a symbol in the alphabet, Σ, or ε, denoting the empty string. The following figure shows an NFA that recognizes the language (a|b)*abb.
FIGURE 3.19, pp. 114

This automaton is nondeterministic because when it is in state 0 and the input symbol is a, it can either go to state 1 or stay in state 0. The transition table is:

FIGURE, pp. 115

The advantage of a transition table is that it provides fast access to the transitions of a state; the disadvantage is that it can take up a lot of space. The following diagram shows the moves made in accepting the input strings abb, aabb and babb.

abb :
In general, more than one sequence of moves can lead to an accepting state; the input is accepted if at least one such sequence of moves ends up in a final state. For instance:
The language defined by an NFA is the set of input strings that the NFA accepts.
The following figure shows an NFA that recognizes aa* | bb*. Note that ε's disappear in a concatenation.

FIGURE 3.21, pp. 116

The transition table is:
Deterministic Finite Automata (DFA)

A deterministic finite automaton (DFA) is a special case of an NFA in which no state has an ε-transition and, for each state S and input symbol C, there is at most one edge labeled C leaving S.

Algorithm for simulating a DFA:

INPUT: an input string x and a DFA D with start state S0 and set of accepting states F.
OUTPUT: the answer "yes" if D accepts x; "no" otherwise.
The function move(S, C) gives the new state reached from state S on input character C. The function nextchar returns the next character in the input string.

Initialization:
    S := S0
    C := nextchar
while not end-of-file do
    S := move(S, C)
    C := nextchar
if S is in F then return "yes" else return "no"

The following figure shows a DFA that recognizes the language (a|b)*abb.

FIGURE

The transition table is:

state   a   b
  0     1   0
  1     1   2
  2     1   3
  3     1   0
With this DFA and the input string "ababb", the above algorithm follows the sequence of states 0, 1, 2, 1, 2, 3, ending in the accepting state 3.

FIGURE
In the transition table of an NFA, each entry is a set of states; in the transition table of a DFA, each entry is a single state.
The general idea behind the NFA-to-DFA construction is that each DFA state corresponds to a set of NFA states.
For example, let T be the set of all states that an NFA could reach after reading input a1 a2 . . . an; then the state that the DFA reaches after reading a1 a2 . . . an corresponds to the set T. Theoretically, the number of states of the DFA can be exponential in the number of states of the NFA, i.e., O(2^n), but in practice this worst case rarely occurs.

Algorithm: Subset construction.
INPUT: An NFA N.
OUTPUT: A DFA D accepting the same language.
METHOD: Construct a transition table DTran. Each DFA state is a set of NFA states. DTran simulates in parallel all possible moves N can make on a given input string.

Operations to keep track of sets of NFA states:

ε-Closure(S): the set of states reachable from state S via ε-transitions.
ε-Closure(T): the set of states reachable from any state in set T via ε-transitions.
move(T, a): the set of states to which there is an NFA transition from some state in T on symbol a.

Algorithm:

initially, ε-Closure(S0) is the only state in DTran, and it is unmarked
while there is an unmarked state T in DTran do
    mark T
    for each input symbol a do
        U := ε-Closure(move(T, a))
        if U is not in DTran then
            add U to DTran
        DTran[T, a] := U

The following algorithm shows the computation of the ε-Closure function:

push all states in T onto stack
initialize ε-Closure(T) to T
while stack is not empty do
    pop the top element t
    for each state u with an ε-edge from t to u do
        if u is not in ε-Closure(T) then
            add u to ε-Closure(T)
            push u onto stack
The following example illustrates the method by constructing a DFA for the NFA above.
figure
figure
figure
figure
[figure]

We get r7 = (a|b)*a. Similarly for r8 and r10, use case 2 (concatenation): [figures] And we get r11 by case 3(b): [figure] Finally, we have r = (a|b)*abb.
Code Generation
Introduction
Phases of typical compiler and position of code generation.
Since code generation is an undecidable problem (mathematically speaking), we must be content with heuristic techniques that generate "good" code (not necessarily optimal code). Code generation must do the following things:
1. Memory Management

Mapping names in the source program to addresses of data objects is done cooperatively by pass 1 (the front end) and pass 2 (the code generator). Names in quadruples are mapped to addresses in instructions. Local variables (local to functions or procedures) are stack-allocated in the activation record, while global variables are in a static area.

2. Instruction Selection

The nature of the instruction set of the target machine determines selection. Selection is "easy" if the instruction set is regular, that is, uniform and complete.

Uniform: all three-address, or all stack, or all single-address instructions.
Complete: any register can be used for any operation.

If we do not care about the efficiency of the target program, instruction selection is straightforward. For example, for the three-address code:

a := b + c
d := a + e

the (inefficient) assembly code is:

1. MOV b, R0
2. ADD c, R0
3. MOV R0, a
4. MOV a, R0
5. ADD e, R0
6. MOV R0, d
Here the fourth statement is redundant, and so is the third statement if 'a' is not subsequently used.

3. Register Allocation

Registers can be accessed faster than memory words, so frequently accessed variables should reside in registers (register allocation). Register assignment is picking a specific register for each such variable. Formally, there are two steps in register allocation:

1. Register allocation (which registers?): a register-selection process in which we select the set of variables that will reside in registers.
2. Register assignment (which variable?): here we pick the specific register that each variable will occupy.

Note that this is an NP-complete problem. Some of the issues that complicate register allocation:

1. Special uses of hardware: for example, some instructions require specific registers.
2. Software conventions: for example, register R6 (say) always holds the return address, and register R5 (say) is the stack pointer; similarly, registers may be assigned for branch-and-link, frames, heaps, etc.

4. Choice of Evaluation Order

Changing the order of evaluation may produce more efficient code. This too is an NP-complete problem, but we can bypass the hindrance by generating code for the quadruples in the order in which they were produced by the intermediate code generator.

ADD x, y, T1
ADD a, b, T2

Reordering these is legal because x, y and a, b are different (not dependent).
Typical Architecture

The target machine is:

1. byte-addressable, with 4 bytes per word;
2. equipped with n general-purpose registers, R0, R1, . . . , Rn-1;
3. such that each integer requires 2 bytes (16 bits);
4. equipped with two-address instructions of the form: mnemonic source, destination.

(A three-address machine would instead use instructions of the form op source1, source2, destination, e.g., ADD A, B, C.)

The addressing modes are:

MODE                FORM    ADDRESS                      EXAMPLE           ADDED COST
absolute            M       M                            ADD temp, R1      1
register            R       R                            ADD R0, R1        0
indexed             c(R)    c + contents(R)              ADD 100(R2), R1   1
indirect register   *R      contents(R)                  ADD *R2, *R1      0
indirect indexed    *c(R)   contents(c + contents(R))                      1
literal             #c      the constant c                                 1
Each instruction has a cost of 1 plus the added costs for the source and destination addressing modes:

cost of instruction = 1 + cost associated with the source address mode + cost associated with the destination address mode

This cost corresponds to the length (in words) of the instruction.

Examples:

1. Move register to memory: MOV R0, M. Cost = 1 + 0 + 1 = 2.
2. Indirect indexed mode: MOV *4(R0), M. Cost = 1 + 1 + 1 = 3.
3. Indexed mode: MOV 4(R0), M. Cost = 1 + 1 + 1 = 3.
4. Literal mode: MOV #1, R0. Cost = 1 + 1 + 0 = 2.
5. Memory to memory: MOV M, M. Cost = 1 + 1 + 1 = 3.