7 Formal Systems and Programming Languages: An Introduction



The objectives of this chapter are to:

1. Provide an informal introduction to formal systems and grammars
2. Demonstrate the relevance of theoretical work to practical work in the area of programming languages
3. Present formal systems and terminology that are commonly used in the literature
4. Present theoretical tools relevant to compiling techniques
5. Describe two methods for programming language specification: Backus-Naur form and canonic systems
6. Present an example of research being conducted in this area and pose some theoretical questions that are being asked about programming languages and formal systems

This chapter presents a brief survey of a complicated and varied subject; more complete treatments will be found in the references (chapter 10). Except for the description of Backus-Naur form in section 7.5 and problem 1, this chapter may be omitted from presentations of systems programming with a purely practical orientation.

7.1 USES OF FORMAL SYSTEMS IN PROGRAMMING LANGUAGES

A formal system is an uninterpreted calculus or logistic system. It consists of an alphabet, a set of words called axioms, and a finite set of relations called rules of inference. Examples of formal systems are set theory, Boolean algebra, propositional and predicate calculus, Post systems, Euclid's plane geometry, Backus normal form, and Peano arithmetic. A formal system is uninterpreted in the sense that no meaning is formally attached to the symbols of the system; there is for each of the above-mentioned systems a standard informal interpretation of the symbols, but other interpretations are generally valid so far as the systems themselves are concerned.

We generally construct a formal system in order to have a formalized model of informal, intuitive notions. A formal model can be studied mathematically, and if the model is appropriate, the results may tell us much about the notions that it portrays.

Formal systems are becoming important in the design, implementation, and study of programming languages.
Specifically, various sorts of formal systems are used for syntax specification, syntax-directed compilation, compiler verification, complexity studies, and analysis of the structure of languages.

7.1.1 Language Specification

Formal systems are used to define the form (the syntax) of a programming language. Such a definition is important to the user and to the implementer of the language. The user needs (for reference) a clear description of the language. The implementer is faced with the problems of transferability and maintenance. If the same language is to be implemented (transferred) on different machines, the legal strings of the language must be well defined so that the user-level appearance is, so far as is practicable, invariant. Further, the implementer must be concerned with maintenance. Both the user and the maintainer of a compiler need an exact specification of the acceptable strings of the corresponding language.

7.1.2 Syntax-Directed Compilers

A syntax-directed compiler uses a data base containing the syntactical rules of a source language to parse the source-language input (that is, to recognize it by finding the sequence of rules that generates the string). Formal systems are used as the data base. Because of the increase in the number of programming languages and machines, researchers have been looking into the automatic generation of compilers. The approach has been to have formal definitions of both the source language Lsource and the output language Lobject. The output of the compiler generator would be a translator

T: Lsource → Lobject

A problem similar to that of automatic compiler-writing is the problem of generating test programs to validate a language processor. If the source language is formally defined, the generative techniques described in the next section provide a method for automatically producing such test programs. This can be useful in testing software, since a computer can be more pedantic than a human tester.

7.1.3 Complexity Structure Studies

Formal systems are used to study the complexity of programming languages and of their compilers. Compiler writers and language designers want to know which features of a language inordinately increase the complexity of a compiler's recognition phase. A compiler writer also wants a basis for evaluating the performance of a compiler. He would like to know the theoretically optimum level of performance (in terms of number of steps) that he could expect for the process of translation. After implementing a version of the compiler, he could compare its performance to the theoretical limit; if it was within some tolerance, he could forgo further effort. But if the performance was two orders of magnitude worse than the theoretical bound, improvement would clearly be in order.

Analogies will be found in the work surrounding Shannon's information theory.
Shannon determined a measure for information and applied it to the coding of information. The resulting theory gave bounds on efficiency in information coding and transmission. Bell Telephone Laboratories researchers devising coding techniques for the transmission of information could compare the performance of their schemes with Shannon's bound, using it as a yardstick. Shannon's theory does not determine coding schemes; it merely provides a measure of how efficient they can be. Similarly, in complexity studies, the compiler writer would not be told what method to use, but merely whether a better method might exist.

7.1.4 Structure Analysis

Formal systems are used in attempts to prove the equivalence and the validity of programs. The work in proving the equivalence of programs is motivated by the prospects of global optimization. If there were an algorithm that recognized the equivalence of two different programs, the faster program could be used in place of the slower one. A formal theory also provides a framework for analyzing and comparing various languages. It helps answer questions such as:

1. What are the basic language features? What constructs can exist in the language?
2. How can the features be combined to build new constructs?
3. What categories of problems can be programmed in the language?
4. What is the cost or difficulty of writing a program?

These key questions may be approached through formalization. The answers are also of interest in the field of machine design; the ideal computer should efficiently perform the operations corresponding to the basic features of a language.

7.2 FORMAL SPECIFICATION

7.2.1 Approaching a Formalism

Before going deeper into formalism it is useful to analyze some of the problems in formally defining a language and to look into the intuitive basis for the definitions. A language may be thought of as a set of sentences or formulae (strings of symbols) with well-defined structure and, usually, a meaning. The rules specifying valid constructions of a language are its syntax: the syntax of a language describes its form. The meaning attached to its sentences is its semantics. For example, when we say that x + 2 is the sum of the values of x and 2, or that 2·x = x + x is true, we are referring to the usual semantics of algebra.

If all languages consisted of a finite number of legal sentences or formulae, syntactic definition would not pose a problem; it would suffice merely to list all legal sentences, and a string of symbols would be a sentence only if it appeared in the list. The problem of definition exists because almost all languages of any utility contain an unlimited (or very large) number of valid sentences. It is not possible to store a list of all valid strings for infinite languages. But it is not necessary to store the list of sentences if any member of the list can be generated whenever it is needed, even though generating all sentences may be an endless task. If an algorithm exists that will successively produce legal strings, an arbitrary string is in the language if it ever occurs in the list that is generated: if it is a legal string, it will be generated after some finite (but possibly long) time. Such a listing algorithm is called a generative specification of the language.
If the algorithm generates sentences in such an order that each new sentence is at least as long (has at least as many symbols) as the previous sentence, it is clearly possible to determine whether or not a given string is in the language: whenever the algorithm begins to generate sentences longer than the string being tested, that string cannot be in the language unless it has already been generated. This type of algorithm enables us to decide after a finite number of steps whether or not a string is a legal sentence. If such a decision can be made in finite time for every string, the language is called decidable.

An alternative type of algorithm could be used to specify languages. In this second approach the string to be tested is fed as data into the algorithm. The algorithm then analyzes the input, performs whatever computation is necessary, and produces an answer indicating "yes, the string is legal" or "no, it is not legal." This is called an analytic specification. A language with an analytic specification is decidable if the analyzer always stops after finitely many steps for every input string. Unfortunately, a formal analytic specification is often very difficult to derive; this chapter will deal primarily with generative specifications.

English is not suitable for defining languages formally because it is too vague and leads to ambiguous definitions. It is necessary to develop a formalism in which language definitions can be stated. This definition language is the syntactic metalanguage: when we employ a language to talk about some language (itself or another), we shall call the latter the object language and the former the metalanguage. A formal system can serve as the metalanguage. Symbols of the object language are called terminal symbols. Symbols of the metalanguage that denote strings in the object language are called nonterminals. To formally define the metalanguage would require a meta-metalanguage; therefore, we hope that the metalanguage is intuitively clear.
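The length-ordered decision procedure described earlier in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the generator below (strings of a's followed by an equal number of b's) and the names generate_in_length_order and is_sentence are hypothetical, and the test halts only if the generator is finite or eventually produces sentences longer than the candidate.

```python
from itertools import count
from typing import Callable, Iterator


def generate_in_length_order() -> Iterator[str]:
    """Hypothetical generative specification: yields ab, aabb, aaabbb, ...

    Each sentence is at least as long as the one before it.
    """
    for n in count(1):
        yield "a" * n + "b" * n


def is_sentence(candidate: str, generator: Callable[[], Iterator[str]]) -> bool:
    """Decide membership by scanning a length-ordered enumeration."""
    for sentence in generator():
        if sentence == candidate:
            return True
        if len(sentence) > len(candidate):
            # Every later sentence is at least this long, so the candidate
            # can no longer appear in the list.
            return False
    return False  # the enumeration ended without producing the candidate


if __name__ == "__main__":
    for text in ["aabb", "aab", "abab"]:
        print(text, is_sentence(text, generate_in_length_order))
```

The generative list is never stored; the test consumes it one sentence at a time and stops as soon as the answer is known.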

The first step of the definition process is to establish the universe of discourse; that is, it is necessary to specify the objects being discussed. The most elementary object is a symbol. Symbols are concatenated to form strings, which may or may not belong to the language.

Definition 1: An alphabet T is a finite set of symbols (terminal symbols). A formula (also called a string or a sentence) is a concatenation of symbols.

It is useful to have a notation for the class of all possible finite strings on an alphabet T. This is designated by T*. For any set U, U* represents the set of all possible finite concatenations of elements of the set U. Small Greek letters are used to denote strings. We commonly write λ to represent the null or empty string (i.e., the string which contains no elements).

Generally, a language does not include all the possible strings on its alphabet. Only certain strings are valid formulae in the language. Thus:

Definition 2: A language L is a subset of the set of finite concatenations of symbols in an alphabet T. We write this L ⊆ T*.

7.2.2 Development of a Formal Specification

Let us turn for an example to English syntax. English is not just a collection of groups of words; there is an underlying structure connecting the words. Given the sentence

The student studies hard

one can construct the sentence structure shown in figure 7.1. In particular, one can separate the Noun Phrase (NP) from the Verb Phrase (VP) and then complete the analysis by subdividing these phrases into individual words. Since all sentences have some structure, it should be possible to generate this structure in small steps and thus build up to complex sentences. We can represent structure graphically with a syntax tree; branches from each node on the tree indicate its logical subdivisions. For example, we might begin with the classification sentence and replace it by the pair NP and VP to construct one possible form that a sentence might have.
It is notationally convenient to write this as

Sentence → NP VP

It is clear from the context that NP represents the linguistic class noun phrase. In figure 7.1 we must distinguish between the string NP, which may be replaced by the string article, and the string The, which cannot be replaced. We distinguish the names of classes from words in the language by placing the meta-brackets ( and ) around symbols used to represent the linguistic classes. The first structure rules then become:

1. (sentence) → (NP) (VP)
2. (NP) → (article) (noun)

Similarly, a verb phrase consists of a verb possibly modified by an adverb:

3. (VP) → (verb)
4. (VP) → (verb) (adverb)

Figure 7.1 Syntax tree for "The student studies hard"

The way these structure transformations or rewriting rules are written allows a (VP) to have optionally one adverb, while an (NP) must have exactly one article. The last step is to list possible terminal symbol replacements for the class representatives (article), (adverb), (noun), and (verb):

5. (article) → The
6. (noun) → student
7. (verb) → studies
8. (adverb) → hard
9. (adverb) → slowly

Using these rewriting rules, it is possible to build a sentence by replacing the linguistic class symbols with the structures they represent. Figure 7.2 gives a derivation of "The student studies hard." By changing the last step and using

(adverb) → slowly

it is possible to generate instead "The student studies slowly."

Figure 7.2 Steps in derivation

The structure rewriting rules or transformations determine the form of the language generated. In our example the language is a very small subset of English. Such a system of rewriting rules constitutes an algorithm for generating sentences. By changing the symbols from words to phonemes and allowing for more complex structure transformations, one can describe much of English syntax in a similar manner. The scheme can handle most programming-language features as well. However, features that require more specification power will lead us to seek other, more discriminating methods of language specification.

7.3 FORMAL GRAMMARS

The above example indicates a way to formalize the process of generating the strings of a language. Before considering a formal definition, it is useful to analyze the structure of a sentence. We used two classes of symbols: one, enclosed in brackets, to represent linguistic classes (grammatical units used as intermediate steps in the formal generation process); and another, composed of Roman letters, from which the generated sentence was eventually formed. Because the symbols of one set are in the sentences when the generation terminates, while those of the other appear only in the intermediate steps, they are referred to as terminal and nonterminal symbols, respectively. One nonterminal symbol, the starting symbol (in our example, (sentence)), is distinguished as the symbol with which the generation process begins.

Definition 3: The terminal symbols are the symbols of the alphabet T. The nonterminal symbols are a set N of symbols not in T that represent intermediate states in the generation process. The starting symbol is a distinguished nonterminal symbol from which all strings of the language are derived.

The generation process itself consisted of applying, at each step, any one of the set of rewriting rules or productions. This process transforms the string into a new string; the process stops when there is no production that can be applied or when the string consists solely of terminal symbols.

Definition 4: A production is a string transformation rule having a left-hand side that is a pattern to match a substring (possibly all) of the string to be transformed, and a right-hand side that indicates a replacement for the matched portion of the string.

It is important to realize that any substring of the current string may be replaced by an applicable production and that only that part of the string matched by the left-hand side of the production is affected. Productions can totally replace substrings, or they may merely rearrange the symbols of the matching substring.

Definition 5: A formal grammar G is a 4-tuple G = (N, T, Σ, P) where

1. N is the set of nonterminal symbols
2. T is the set of terminal symbols
3. Σ is the starting symbol; Σ ∈ N
4. P is the set of productions α → β, where α, β ∈ (N ∪ T)* and α ≠ λ (i.e., α is not null)
5. N ∩ T is empty

Requirement (5) assures that it is always possible to distinguish nonterminals from terminals.

7.3.1 Examples of Formal Grammars

To clarify the notion of formal grammar we consider two examples.
The nonterminals will be capital Roman letters (A, B, C) and Σ; the terminal symbols will be small Roman letters (a, b, c).

Example 1:

N = {A, B, Σ}
T = {a, b}
P = { Σ → AB    (1)
      A → aA    (2)
      A → a     (3)
      B → Bb    (4)
      B → b }   (5)

Notice that in production (2) the nonterminal A occurs on both sides of the rule; (2) indicates that the class of strings corresponding to A is closed under prefixing with a. Since the only other production indicating the structure of class A is production (3), it is easy to see that A is closed under concatenation and consists of the class of finite strings of a's: a, aa, aaa, aaaa, ... Likewise, the class B consists of finite strings of b's: b, bb, bbb, bbbb, ... The language produced by the grammar consists of all strings of a's followed by a string of b's.
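As a rough illustration of Definition 5, the grammar of example 1 can be written down as a data structure and checked against the five requirements. The class name Grammar, the method violations, and the use of the letter S to stand for the starting symbol are assumptions made for this sketch; every symbol is taken to be a single character.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Grammar:
    """A formal grammar (N, T, start, P); every symbol is one character."""
    nonterminals: FrozenSet[str]
    terminals: FrozenSet[str]
    start: str
    productions: Tuple[Tuple[str, str], ...]  # (left-hand side, right-hand side)

    def violations(self) -> List[str]:
        """List the requirements of the definition that this grammar fails."""
        problems = []
        symbols = self.nonterminals | self.terminals
        if self.nonterminals & self.terminals:
            problems.append("N and T overlap")
        if self.start not in self.nonterminals:
            problems.append("starting symbol is not in N")
        for left, right in self.productions:
            if left == "":
                problems.append(f"null left-hand side in production -> {right}")
            if not set(left + right) <= symbols:
                problems.append(f"unknown symbol in {left} -> {right}")
        return problems


# Example 1, with "S" standing in for the starting symbol (an assumption).
EXAMPLE_1 = Grammar(
    nonterminals=frozenset("ABS"),
    terminals=frozenset("ab"),
    start="S",
    productions=(("S", "AB"), ("A", "aA"), ("A", "a"), ("B", "Bb"), ("B", "b")),
)

if __name__ == "__main__":
    print(EXAMPLE_1.violations())  # an empty list means all requirements hold
```

Keeping productions as plain (left, right) string pairs mirrors the definition directly; nothing in this representation limits the left-hand side to a single nonterminal.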

Example 2:

N = {A, Σ}
T = {a, b}
P = { Σ → A
      A → aAb
      A → ab }

Since the terminal alphabets are the same in examples 1 and 2, both languages are subsets of the set of strings containing a's and b's. There is a difference, however. In the language of example 1 an arbitrary number of a's may precede an equally arbitrary number of b's. But in example 2 every time an a is generated, a b is also generated. Therefore, the legal sentences of example 2 consist of a string of a's followed by an equal number of b's.

7.3.2 The Derivation of Sentences

So far we have indicated that formal grammars provide a generative specification of a language, but we have yet to formally define the process of generating a string.

Definition 6: A string ψ is immediately derived from a string φ (write φ ⇒ ψ) in a grammar G if and only if φ = γαδ, ψ = γβδ, and α → β ∈ P of G, where γ and δ represent arbitrary (possibly empty) strings.

Referring to example 1 above, suppose that φ = aABb and ψ = aaBb. Then aABb ⇒ aaBb is an immediate derivation with γ = a, α = A, δ = Bb, β = a, and the production A → a.

We now define proof in a grammar.

Definition 7: A string ψ is derived from a string φ (write φ ⇒* ψ) in a grammar G if and only if there is a sequence of strings ω0, ω1, ..., ωn-1, ωn, for n ≥ 0, such that φ = ω0, ψ = ωn, and for every 0 ≤ i < n, ωi ⇒ ωi+1 (i.e., φ = ω0 ⇒ ω1 ⇒ ... ⇒ ωn-1 ⇒ ωn = ψ), with ωi ∈ (N ∪ T)* − {λ} for every i. The list {ωi} is a proof of ψ in G.

This is a formal statement that each new string in the derivation process must come from some previously derived string by application of a production of G. The last condition rules out deriving a string from the empty string (i.e., a string without symbols). A typical derivation in the grammar of example 1 is

Σ ⇒ AB ⇒ aAB ⇒ aABb ⇒ aaBb ⇒ aabb

hence Σ ⇒* aabb (also aAB ⇒* aaBb, etc.).

7.3.3 Sentential Forms and Sentences

The above definitions specify formally the generation process. It is now necessary to designate which of the set of possible derivations terminate on strings of the language.

Definition 8: A sentential form is any string which can be derived from the starting symbol Σ.
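Definitions 6 through 8 can be made concrete with a small sketch. The function immediate_derivations applies one production at one position, and sentential_forms repeats that step breadth-first from the starting symbol. The function names, the letter S for the starting symbol, and the length bound are assumptions of this illustration; pruning by length is safe here only because no production in examples 1 and 2 shortens a string.

```python
from collections import deque

# Productions of examples 1 and 2, with "S" standing in for the starting symbol.
PRODUCTIONS_1 = [("S", "AB"), ("A", "aA"), ("A", "a"), ("B", "Bb"), ("B", "b")]
PRODUCTIONS_2 = [("S", "A"), ("A", "aAb"), ("A", "ab")]


def immediate_derivations(string, productions):
    """All strings obtained from `string` by applying one production once."""
    results = set()
    for left, right in productions:
        position = string.find(left)
        while position != -1:
            # Replace only the matched substring; the surrounding context is kept.
            results.add(string[:position] + right + string[position + len(left):])
            position = string.find(left, position + 1)
    return results


def sentential_forms(productions, start="S", max_length=4):
    """Strings derivable from `start` whose length never exceeds max_length."""
    seen, queue = {start}, deque([start])
    while queue:
        form = queue.popleft()
        for successor in immediate_derivations(form, productions):
            if len(successor) <= max_length and successor not in seen:
                seen.add(successor)
                queue.append(successor)
    return seen


def terminal_only(forms, terminals="ab"):
    """The forms that contain terminal symbols only, shortest first."""
    chosen = [form for form in forms if set(form) <= set(terminals)]
    return sorted(chosen, key=lambda form: (len(form), form))


if __name__ == "__main__":
    print(terminal_only(sentential_forms(PRODUCTIONS_1)))
    print(terminal_only(sentential_forms(PRODUCTIONS_2)))  # ['ab', 'aabb']
```

Each step of the derivation given in the text (for instance aABb to aaBb) appears among the results of immediate_derivations for the preceding string.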

Examples of sentential forms in the grammar of example 1 are aAB and aabb. The sentential or sentence-like forms include formulae which have nonterminal symbols in the final string.

Definition 9: A sentence is a sentential form containing only terminal symbols.

Therefore, aabb is a sentence, but aAB is not.

Definition 10: The language L defined by a grammar G (write L(G)) is the set of sentences that can be derived from Σ in G: L(G) = { ω ∈ T* | Σ ⇒* ω }. L is called ambiguous if it contains a formula for which there is more than one distinct proof in which the left-most nonterminal is replaced at each step.

7.4 HIERARCHY OF LANGUAGES

The definition of production allows for a wide variety of string transformations. Certain restrictions on the form of productions give grammars producing subclasses of the class of formal languages, e.g., linear grammars, producing regular languages. Noam Chomsky has constructed a system of four language types that classify some languages according to such restrictions.

The most general type of grammar imposes no restrictions on the productions. In particular, productions that eliminate (erase) symbols are permitted. This allows the intermediate strings to expand and contract. An example of an erasing production is aAB → aB, in which A is erased from the context aAB. A grammar (as we have defined it) without restriction is called a type 0 grammar.

The simplest restriction which produces a strictly smaller class of languages is to require the right-hand side of every production to have at least as many symbols as the left-hand side. A grammar with this restriction is called a type 1 or noncontracting context-sensitive grammar because α → β is possible only if length(β) ≥ length(α). Examples of productions in a type 1 grammar are

bB → Bb (interchange symbols)
αr → βr (where length(β) ≥ length(α))

Definition 11: The length of a string is the number of symbols in the string. If α consists of a single symbol, length(α) = 1; length(λ) = 0 (the null string).
For strings α, β, length(αβ) = length(α) + length(β).

In both type 0 and type 1 grammars the left-hand side α can be any string. "Context-sensitive" refers to the fact that some productions may recognize context; e.g., in the case of αr → βr, the transformation α → β occurs only in the context r. An example of a type 1 grammar is:

Example 3:

N = {A, B, Σ}
T = {a, b, c}
P = { Σ → Abc
      Ab → aAbB
      Bb → bB
      Bc → bcc
      A → a }

This grammar generates strings of the form a^n b^n c^n for n ≥ 1.

If the left-hand side of every production is restricted to a single nonterminal symbol, a production can be applied regardless of the context in which that symbol occurs. Grammars with this restriction (and nonblank right-hand strings) are called type 2, context-free, or simple phrase-structure grammars. The latter name comes from an analogy to the method that we used to generate the sentence "The student studies hard." Our grammar satisfied the single-symbol restriction, and every nonterminal symbol expanded into a word or phrase; for example, (sentence) became the concatenation of a noun phrase (NP) and a verb phrase (VP). Two more context-free grammars appeared in examples 1 and 2 above. A subclass of context-free languages called bounded-context languages has become important in practical compiling.

A third type of restriction on the productions restricts the number of terminals and nonterminals that each step can create. When at most one nonterminal symbol is used in each of the right- and left-hand sides of a production, the production is said to be linear. If the nonterminal occurs to the right of all other symbols on the right-hand side of a production, the production is called a right-linear production. Similarly, if the nonterminal occurs to the left of all the other symbols, the production is called a left-linear production. A grammar is called right- (left-) linear if each of its productions is right- (left-) linear; such grammars are called type 3, and a language is called regular if it can be generated by a right- or left-linear grammar.

Each of the above restrictions includes those above it. These types form a hierarchy that is summarized in figure 7.3. We remark without proof that no type 3 grammar can generate the language defined in example 2, so type 3 is a strict subset of type 2. Similarly, no type 2 grammar can generate the language of example 3, so type 2 is a strict subset of type 1. Finally, type 1 is a strict subset of type 0.
We classify languages according to the type of grammar that can generate them: a type i language is a language that can be generated by a grammar of type i (but not by one of type i+1, for i=0,1,or 2). Languages can also be defined in terms of machines (e.g., translators, interpreters) that accept them. To each of these general language types there is a corresponding type of abstract machine: for example, regular languages are languages that can be recognized by finite-state machines; while type 0 languages are all languages that can be recognized by a class of machines called Turing machines. The table in figure 7.3 indicates the type of abstract machine corresponding to each class of formal grammar.
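The production restrictions of this section can be checked mechanically. The sketch below returns the most restrictive type (3, 2, 1, or 0) whose restrictions every production of a grammar satisfies. The function name grammar_type, the single-character symbols, and the letter S for the starting symbol are assumptions of this illustration. Note that it classifies a grammar, not a language: a language is of type i only if no grammar of a more restricted type generates it, which this check cannot establish.

```python
def grammar_type(productions, nonterminals):
    """Most restrictive type (3, 2, 1 or 0) satisfied by every production."""
    nonterminals = set(nonterminals)

    def context_free(left, right):
        # A single nonterminal on the left and a nonblank right-hand side.
        return len(left) == 1 and left in nonterminals and len(right) > 0

    def linear(right, side):
        inner = [symbol for symbol in right if symbol in nonterminals]
        if not inner:
            return True                      # terminals only, e.g. A -> a
        if len(inner) > 1:
            return False
        edge = right[-1] if side == "right" else right[0]
        return edge in nonterminals          # nonterminal at the required end

    if all(context_free(left, right) for left, right in productions):
        for side in ("right", "left"):
            # All productions must be linear on the same side.
            if all(linear(right, side) for _, right in productions):
                return 3
        return 2
    if all(len(right) >= len(left) for left, right in productions):
        return 1
    return 0


EXAMPLE_1 = [("S", "AB"), ("A", "aA"), ("A", "a"), ("B", "Bb"), ("B", "b")]
EXAMPLE_3 = [("S", "Abc"), ("Ab", "aAbB"), ("Bb", "bB"), ("Bc", "bcc"), ("A", "a")]

if __name__ == "__main__":
    print(grammar_type(EXAMPLE_1, "ABS"))           # 2
    print(grammar_type(EXAMPLE_3, "ABS"))           # 1
    print(grammar_type([("aAB", "aB")], "AB"))      # 0 (erasing production)
```

The grammar of example 1 is reported as type 2 because its first production has two nonterminals on the right, even though the language it generates could also be produced by a right-linear grammar.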
Type | Type of language and recognizing automaton | Production restrictions (form α → β)
-----|--------------------------------------------|-------------------------------------
0 | Contracting context-sensitive (Post systems): Turing machines | α, β ∈ (N ∪ T)*; α ≠ λ
1 | Noncontracting context-sensitive: nondeterministic linear-bounded automata | α, β ∈ (N ∪ T)* − λ and length(α) ≤ length(β)
2 | Context-free: nondeterministic push-down storage automata | A → β; β ∈ (N ∪ T)* − λ; A ∈ N
3 | Regular or finite-state: finite-state automata | Right linear: A → aB, A → a. Left linear: A → Ba, A → a. a ∈ T; A, B ∈ N

Notation: By convention X* consists of all finite strings of symbols from the set X, including the empty string λ.

Figure 7.3 Basic formal grammars

7.5 BACKUS-NAUR FORM (BACKUS NORMAL FORM, BNF)

BNF is a notation for writing grammars that is commonly used to specify the syntax of programming languages. In BNF nonterminals are written as names enclosed in corner brackets < >. The sign → is written ::= (read "is replaced by"). Alternative ways of rewriting a given nonterminal are separated by a vertical bar, | (read "or"). Using BNF notation, example 1 of the last section would be written:

<Σ> ::= <A> <B>        read "Σ is replaced by A followed by B"
<A> ::= a <A> | a      read "A is replaced by a followed by A, or by a"
<B> ::= <B> b | b

As an example of BNF, we specify in figure 7.4 a FORTRAN-like language consisting only of GO TO statements. The GO TO statements have statement labels and reference labels of arbitrary length. It would of course be desirable to have all the reference labels appear among the list of statement labels. We will see that we cannot impose this restriction on a language using Backus Normal Form.

<letter> ::= A | B | C | ... | Z        (read "letter is replaced by A or B ...")
<identifier> ::= <letter> | <letter> <identifier>
<go to stm> ::= GO TO <identifier> ;
<program> ::= <identifier> <go to stm> | <identifier> <go to stm> <program>

Figure 7.4 A BNF specification of a subset of a FORTRAN-like language

An example of a program in this language would be:

AB GO TO XY;
XY GO TO WXYZA;
WXYZA GO TO AB;

According to the specification it is possible to generate a program such as

A GO TO B;
C GO TO D;

where no statements are labeled B or D. We would like to exclude such programs. However, there is no way to do this formally in a BNF specification. The set <program> is much larger than the set that we wish to call valid programs. BNF notation is equivalent to context-free (or phrase-structure) grammar,

so class symbols expand without reference to surrounding context. We would like to distinguish a subset of programs in which every reference label occurs as a statement label. To express such a relationship or function, we need a more powerful formal system, one with the capability of cross-reference between elements of the sentence structure that it generates. Cross-reference is a context-sensitive feature.

7.6 CANONIC SYSTEMS

We present canonic systems as another, more powerful method for defining languages. The author feels that canonic systems provide a useful vehicle for a theoretical approach to language. They also exemplify some of the theoretical work currently being done, and the same basic questions that we pose about them may be directed at any formal language specification.

A canonic system is a type of formal system that operates on several sets of strings over a finite alphabet. Canonic systems (equivalent to Smullyan's elementary formal systems) are a variant of Post's canonical systems. In canonic systems the general framework of productions, or string-transformation rules, is replaced by a system of axioms (canons) and by the logical rules of substitution for variables and detachment (modus ponens). A canonic system defines a set of interrelated predicates, each of which is a set of strings.

Canonic systems have been used to specify the syntax and the translation of programming languages (More; Donovan and Ledgard, 1967). They have served as a data base for a generalized translator for computer languages (Alsop, 1967; Altman); theorems as to their mathematical power and their formal properties have been proven (Doyle, Mandl); and they have been used to study the complexity of translators and languages (Haggerty, 1969). The ultimate goal of this research has been to say something about programming languages. The author hopes to use this system to prove things about programming languages and their translators.
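The label restriction that motivates a more powerful system can be made concrete with a short sketch for the GO TO language of figure 7.4. The first function accepts exactly the form that the BNF describes; the second applies the cross-reference check that the BNF cannot state. The names parse_program and labels_resolve are hypothetical, and the treatment of blanks between statements is an assumption, since the figure does not address spacing.

```python
import re

# One labeled statement: <identifier> GO TO <identifier> ;
STATEMENT = re.compile(r"\s*([A-Z]+) GO TO ([A-Z]+);")


def parse_program(text):
    """Return (statement label, reference label) pairs, or None if malformed."""
    pairs, position = [], 0
    while True:
        match = STATEMENT.match(text, position)
        if match is None:
            break
        pairs.append((match.group(1), match.group(2)))
        position = match.end()
    if not pairs or text[position:].strip():
        return None  # no statement at all, or unparsed text left over
    return pairs


def labels_resolve(pairs):
    """True when every reference label also occurs as a statement label."""
    defined = {label for label, _ in pairs}
    return all(reference in defined for _, reference in pairs)


if __name__ == "__main__":
    good = parse_program("AB GO TO XY; XY GO TO WXYZA; WXYZA GO TO AB;")
    bad = parse_program("A GO TO B; C GO TO D;")
    print(labels_resolve(good), labels_resolve(bad))  # True False
```

Both programs pass the first stage, which corresponds to membership in the set <program>; only the second stage, which compares labels across statements, separates them.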
Consider the inadequacies of the Backus-Naur Form (BNF) specification of the syntax of programming languages. In BNF it is impossible to describe many of the constraints that exist in programming languages, such as the restriction that a program is not acceptable to a translator, even though correct in form, unless all the reference labels in the program correspond to statement labels. These features are sometimes referred to as context-sensitive features. Some people feel that they really refer to the meaning of a language and that they are semantic and not syntactic. However, these features must be specified in specifying the translation of a language. The distinction between syntax and semantics is not always clear. For example, the statement

20 GO TO 20

may be syntactically a legal statement, but is it semantically correct? (One might argue that this statement is useful for determining how long a computer will run before making an error.)

The translation of a programming language is specified by a canonic system that generates a set of ordered pairs of the form

<statement in source language, its translation in the target language>

Such a set of rules would specify the translation of a computer language. If the target language is understood, such a specification could be said to define the semantics of the language.

Like BNF, canonic systems generate the strings of a language. The recognition process is a different problem: an algorithm that uses a canonic system specification

of a programming language to recognize and produce the translation of strings has been developed and implemented.

The issue of whether canonic systems specifying the context-sensitive parts of a programming language specify the syntax or the semantics of a language is academic; the real goal is to use a specification of a language to say something about the language and its translation. If we are to define the translation of a language, we must specify all the legal strings that get translated in that language. We need a system powerful enough to exclude illegal strings, whether on the basis of syntax or of semantics; hence the motivation for a more powerful system than BNF. Yet, in their most general form, canonic systems are so powerful that they introduce many undecidability problems. A requirement exists, therefore, to determine more precisely the power of canonic systems and to restrict their power to the point at which they are adequate to handle the features we wish to define, but not powerful enough to introduce problems we cannot cope with in our models.

We will first introduce canonic systems informally. A canonic system consists of a number of canons, logical rules which state that certain premises imply certain conclusions. A predicate is a name given to a well-defined set of strings over the alphabet of the object language. For a programming language these sets are defined in such a way as to aid the user of the canonic system. The assertion sign ⊢ is used to separate the premises from the conclusions. The general form of a canon is

a; b; ...; c ⊢ z

and is read "from the premises a, b, ..., c can be asserted z." For example, we might define a set number:

⊢ 1 digit
⊢ 2 digit
⊢ 3 digit
x digit ⊢ x number
x digit; y number ⊢ yx number

This system defines number as the set of strings over the symbols 1, 2, and 3.
Any terminal strings may be substituted for the variables x and y, but no conclusion can be drawn unless the resulting premises are true; that is, unless the resulting premise statements have been previously reached as conclusions. The first three canons, which have no premises, are axioms: their conclusions are immediate.

PREDICATE, VARIABLE, TERM, REMARK, CANON

In the above example, digit and number are predicates of degree 1 that name certain sets of strings. A predicate of degree 2 names a set of ordered pairs of strings. A term is a string of concatenated variables and terminal symbols; a remark is a term followed by a predicate symbol. If R1; ...; Rn-1 ⊢ Rn is a canon, R1, ..., Rn-1 are the premises and Rn is the conclusion. Each of the Ri is a remark.

The following notation will be used:

1. Lower-case letters will be used as variables.
2. Italicized strings of letters will be used as predicate symbols.
3. The notation <x1 < x2 < ... < xn> will be used to denote n-tuples. Terms of degree 1 will be denoted by their single component without the brackets.
4. A series of canons with identical premises R1; ...; Rn and different conclusions t1 P, ..., tm P will be abbreviated (see Figure 7.5 for an example)

R1; ...; Rn ⊢ t1 + t2 + t3 + ... + tm P

FORMAL DEFINITION

Definition 12: A canonic system is a sextuple <S> = (C, V, M, P, S, D) where

C is a finite set of canons
V is an alphabet of terminal symbols used to form the strings generated (i.e., provable) by <S>
M is a finite set of variable symbols (variables)
P is a finite set of predicate symbols (predicates) used to name sets of n-tuples. The number of components in the n-tuples denoted by a predicate is the degree of the predicate
S is a finite set of punctuation signs used in writing canons
D ⊆ P is a set of sentence predicates, the union of which will be defined to be the language specified by the canonic system

The usual punctuation signs are ⊢, the comma, <, >, and ;.

SUBSTITUTION AND DETACHMENT

An instance of a canon is the result of substituting strings from V* for the variables that appear in the canon. Substitution must be uniform: occurrences of a single variable must all be replaced by the same string.

The rule of inference (called modus ponens, or detachment) states that if R1; ...; Rn-1 ⊢ Rn is an instance of a canon, the remark Rn can be derived (in the canonic system) only if the premises R1; ...; Rn-1 are all in the system. (Note that in the case of an axiom there are no premises.) Rn is then immediately derived from R1, ..., Rn-1.

A proof or derivation of a remark R in canonic system <S> is a finite sequence of remarks R1, ..., Rn, R, every member of which can be immediately derived from one or more of the preceding remarks. R is in <S> (can be derived or proven in <S>) if and only if there exists a proof for R in <S>.
7.6.1 Example: Syntax Specification

In this section we present an example of the use of canonic systems to specify the syntax of a programming language. Our example defines the same FORTRAN subset that was used to demonstrate the use of Backus-Naur Form. A canonic system specification of this subset is given in Figure 7.5.

1 ⊢ A + B + C + ... + Z letter
2 ℓ letter ⊢ ℓ identifier
3 ℓ letter; y identifier ⊢ yℓ identifier
4 y identifier ⊢ GO TO y go to stm
5 i identifier; x go to stm ⊢ ix program
6 i identifier; x go to stm; p program ⊢ ixp program

Figure 7.5 Canonic system specification of syntax

We can generate programs using the specification. For example, the string A GO TO B may be generated using the fifth canon with terminal strings substituted as below:

A identifier; GO TO B go to stm ⊢ A GO TO B program

A is in letter by axiom 1 and so, by canon 2, is in identifier. Therefore, the first premise of canon 5 is satisfied. The second premise may be satisfied by canons 4, 2, and 1. Therefore, the following sequence, using substitution and modus ponens repeatedly, derives A GO TO B program (see Figure 7.6).

1 A letter                 C.1, MP
2 A identifier             1, C.2, MP
3 B letter                 C.1, MP
4 B identifier             3, C.2, MP
5 GO TO B go to stm        4, C.4, MP
6 A GO TO B program        5, 2, C.5, MP

(Justifications have the form k, C.n, MP, where MP = modus ponens, C.n = canon n in Figure 7.5, and k = a line in this figure.)

Figure 7.6 Derivation of a string in a canonic system

We now construct a canonic system that specifies the same language but with the restriction that all reference labels appear among the statement labels. Thus the program below

A GO TO B
C GO TO D

will not be a legal program, because the reference labels B and D are not among the statement labels A and C. To effect this restriction, we will eventually generate a set of ordered triples: a program, a list of the reference labels in the program, and a list of the statement labels in the program. Out of this set we want only the programs whose list of reference labels bears the relationship in to the list of statement labels. To accomplish this, we rewrite the canonic system of Figure 7.5 as follows:

7 ⊢ A + B + C + ... + Z letter
8 ℓ letter ⊢ ℓ identifier
9 ℓ letter; y identifier ⊢ yℓ identifier
10 y identifier ⊢ <GO TO y<y> go to stm with ref label

Canons 7-9 are the same as those used previously. However, canon 10 differs in that it defines a set of ordered pairs. Each member of this set consists of a pair of strings GO TO y and y.
That is, each element consists of a GO TO statement with a reference label, and the reference label. When generating the legal GO TO statements, we must keep track of the reference labels.

11 s identifier; <x<r> go to stm with ref label ⊢ <s x<s<r> prog with stm labels and ref labels

The above canon defines a set of ordered triples. The first element of each member of this set is a string consisting of an identifier followed by a GO TO statement; the second element is the statement label; the third element is the reference label. An instance of canon 11 is

A identifier; <GO TO MZ<MZ> go to stm with ref label ⊢ <A GO TO MZ<A<MZ> prog with stm labels and ref labels

12 i identifier; <x<ℓ> go to stm with ref label; <p<s<r> prog with stm labels and ref labels ⊢ <i x p<si<rℓ> prog with stm labels and ref labels

Canon 12 generates the set of ordered triples of which the first element is a program consisting of GO TO statements, the second element is a list of statement labels, and the third element is a list of reference labels (elements of the lists are separated by commas).

13 <p<s<r> prog with stm labels and ref labels; <r<s> in ⊢ p program

The above canon states that, given an ordered triple of which the first element is a program, the second element is a list of statement labels, and the third is a list of reference labels, and given also that the list of reference labels bears the relationship in to the list of statement labels, then the program is a legal program.

We now define the relationship in as a set of ordered pairs of which the first element is a list and the second element is a list of labels containing those of the first list. Canons 14-16 define list:

14 ⊢ list (the term of this axiom is the null string)
15 i identifier ⊢ i, list
16 x list; y list ⊢ xy list

Canons 17 and 18 define the predicate in:

17 x list; y list; z list ⊢ <y<xyz> in
18 <a<ℓ> in; <b<ℓ> in ⊢ <ab<ℓ> in

Canons 7-18 together define the set of legal programs.

7.6.2 Specification of Translation

Canonic systems may be used to specify the translation of a language. A translation is a function and may be defined by a set of ordered pairs, of which the first element is a legal program and the second element is its translation into the target language.
For example, a specification of the translation of PL/I into 360 assembly language ultimately specifies the set of ordered pairs

<legal PL/I program<360 assembly-language program>

Figure 7.7 is a canonic system specifying the translation of the GO TO subset of FORTRAN into the assembly language of the IBM 360 (BAL). To keep this specification simple, we have not included the restriction that reference labels be in the list of statement labels, nor, as would be the case in real BAL, have we limited the length of identifiers.

⊢ A + B + C + ... + Z letter
ℓ letter ⊢ ℓ identifier
ℓ letter; y identifier ⊢ yℓ identifier
y identifier ⊢ <GO TO y<B y> go to stm with translation
i identifier; <x<y> go to stm with translation ⊢ <ix<iy> translation
i identifier; <x<y> go to stm with translation; <p<t> translation ⊢ <ixp<iyt> translation

Figure 7.7 A canonic system specification of the translation of a subset of FORTRAN to BAL

7.6.3 Recognition and Translation Algorithm

A canonic system specifies a language by a set of rules that may generate legal strings. An interesting problem arises: can a canonic system be used as a basis for recognizing strings from the defined set? In addition, if the members of the defined set are ordered pairs, triples, etc., can the algorithm be used to produce the missing terms corresponding to a given term?

One straightforward recognition algorithm would be simply to generate all possible strings until a match is found. This section discusses a more efficient algorithm, which is capable of recognizing strings produced by a Backus-Naur Form system. It was developed by Alsop as an extension of an algorithm presented by Cheatham and Sattley. The algorithm, which has been implemented on a timesharing system at M.I.T., allows a user to type in strings of the language. The program will proceed to recognize the strings and produce the corresponding translation. The algorithm is presented in this chapter in order to illustrate techniques used in syntax-directed compilers and recursion.

The program has two phases. A preliminary phase checks the syntax of the canonic system used. The principal phase scans the input string, determines whether it satisfies the canonic system definition, and generates any associated translations.

The algorithm is principally top-down; it attempts first to match the input string against the sentence predicate of the canonic system (e.g. program) and it arrives only through recursion at a lower-level predicate (e.g. integer or digit).

The following is a simplified statement of the algorithm for canonic systems involving only predicates of degree 1. The simplified algorithm (Figure 7.8) will later be expanded to include more general cases.

Imagine an arbitrary character string with an imaginary pointer to the left of the first character, and a canonic system that defines a set of strings. We wish to determine whether the character string is a member of the set.

Step 1. The program considers in sequence those canons that directly define the set in question, and performs the following steps (2 through 6) for each such canon.

Figure 7.8 Flowchart of simplified algorithm

Step 2. The conclusion of the canon is matched, item by item, against the input string. If the item in the conclusion is a terminal character, step 3 is performed; if it is a variable, step 4 is performed. If the end of the conclusion is reached, the algorithm proceeds to step 5.

Step 3. The item in the conclusion is a terminal character. It is compared with the character in the input string to the right of the pointer. If they are identical, the program returns to step 2 to consider the next item in the conclusion, with the pointer shifted to the right. If not, the scan fails and the program returns to step 1 to consider any remaining canons that might produce the string.

Step 4. The item in the conclusion is a variable, and the program must determine by recursion the definition of the variable in terms of the input string. In other words, it must determine the number of characters from the input string, commencing with the character to the right of the pointer, which should be allotted to the definition of this variable. To accomplish this, the program assembles a new input string, which is a copy of all input characters to the right of the pointer, and picks a predicate among the premises of the canon that contains the variable. After saving its present state in a stack (pushing), the program returns to step 1 to determine the definition of the variable by examining the canons defining the premise predicate chosen. If there is no response upon return, the scan fails and the program returns to step 1 to consider alternative definitions of the string. If there is a response, the program compares it with the original input string to determine the definition of the variable and moves the pointer to its new position following the definition. The algorithm returns to step 2.

Step 5. The scan of the conclusion is complete, and the definition (in terms of the input characters) of the variables appearing in the conclusion has been recorded. The algorithm now inspects the premises. Those premises used in step 4 to determine the definitions of the variables in the conclusion may already be asserted, since they were used to generate the definitions. However, a variable may appear twice in the premises, and we must ensure that the string which forms the definition of the variable is a member of both sets. The algorithm forms an input string from the definition of the variable and operates recursively to determine whether the other premise containing the variable is also true; i.e., whether the string which is the definition of the variable is also a member of the second set named as a premise predicate. Upon return, if there is no response, the algorithm returns to step 1 to pursue alternatives as before. If there is a response, the program ensures that the string has been fully scanned. If there are still more unchecked premises, it treats them in the same manner. After all such premises have been successfully verified, the simplified algorithm proceeds to the last step.

Step 6. The results of the scan at this level, which constitute the response for the next higher level of calling, are assembled. There are no results if the scan failed. Otherwise, they consist of the input string with the imaginary pointer resting at the spot where the scan of the conclusion was completed. The algorithm now returns to step 1 if there are more canons directly defining the set of which the input string is possibly a member. Since each canon could conceivably add to the results, the program must actually be equipped to handle multiple results, and hence multiple responses at the next higher level, and check out each possibility. The example which follows will serve to clarify the problem. If there are no further canons, the program proceeds to step 7.

Step 7. The program pops its state; that is, it returns to pick up where it left off at the next higher level. If the highest level has been reached, then the results are examined for a completely scanned input string. If such a condition is found, the input string is a member of the originally defined set. If not, there exists a syntax error in the string. It is not clear that the set of all syntactically incorrect strings will be detected by the algorithm; this recognition problem is unsolvable in general.

A simple example will serve to illustrate the process and the problems involved in multiple answers. Consider the following canonic system:

1 ⊢ 1 digit
2 ⊢ 2 digit
3 ⊢ 3 digit
4 d digit ⊢ d integer
5 d digit; i integer ⊢ di integer

We wish to determine by use of the algorithm whether the string 31 is in integer. The process is described in shorthand fashion in Figure 7.9.

For simplicity and efficiency in the algorithm, canons are not allowed to be left-recursive (e.g. the canon i integer; d digit ⊢ id integer is not allowed).

Figure 7.10 outlines an extension of the algorithm that handles predicates of degree greater than 1, for which one or more of the terms are not known and are desired as translated output.
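The simplified algorithm can be tried out on the digit/integer system with a short Python sketch. It is an illustration under stated assumptions, not the program described in the text: the table CANONS and the functions scan and recognize are names invented here, the stack of saved states is replaced by ordinary recursion, and step 5 is omitted because in this system every variable appears in exactly one premise.

```python
# Each canon is (premises, conclusion_items, predicate). `premises` maps a
# variable to the predicate that defines it. This encoding is hypothetical.
CANONS = [
    ({}, ["1"], "digit"),
    ({}, ["2"], "digit"),
    ({}, ["3"], "digit"),
    ({"d": "digit"}, ["d"], "integer"),
    ({"d": "digit", "i": "integer"}, ["d", "i"], "integer"),
]


def scan(predicate, text, start):
    """Return every pointer position at which a string in `predicate`
    that begins at `start` can end (the "multiple results" of step 6)."""
    results = set()
    for premises, items, pred in CANONS:           # step 1: try each canon
        if pred != predicate:
            continue
        positions = {start}
        for item in items:                         # step 2: item by item
            following = set()
            for p in positions:
                if item in premises:               # step 4: variable, recurse
                    following |= scan(premises[item], text, p)
                elif text.startswith(item, p):     # step 3: terminal character
                    following.add(p + len(item))
            positions = following
        results |= positions                       # step 6: collect responses
    return results


def recognize(predicate, text):
    """Step 7: accept only if some result scans the whole input string."""
    return len(text) in scan(predicate, text, 0)


if __name__ == "__main__":
    print(scan("integer", "31", 0), recognize("integer", "31"))
```

For the string 31 the scan of integer yields two pointer positions, one after 3 (canon 4) and one after 31 (canon 5); only the second is a complete scan. This is the multiple-answer situation the text describes. The recursion terminates because no canon in this system begins its conclusion with a variable defined by its own predicate.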

FIGURE 7.9 Steps in the recognition of a string

7.7 Canonic systems and formal systems


Robert Mandl has proven a general theorem relating canonic systems to various types of formal grammars.

THEOREM: For every type of grammar there exists a class of canonic systems with the property that for every grammar of the type under consideration there exists a canonic system that generates the same language and that belongs to a corresponding class. Further, that class of canonic systems can be constructed.

Figure 7.10 Flowchart of general algorithm to recognize and generate missing terms

We give one example of an equivalence proof: the equivalence of a class of canonic systems to linear grammars. First we establish a correspondence of elements of a canonic system (C, V, M, P, S, D) to elements of a grammar (N, T, P, Σ). Then we show that both systems generate the same strings. We also choose D = {Σ}, and call Σ the sentence symbol (sentence predicate).

Definition 13: A canonic system is a right-linear single-premise canonic system (RLCS) if each canon is of the form

⊢ a A    or    x A ⊢ ax B

where x ∈ M; A, B ∈ P; a ∈ V.

THEOREM: For every regular grammar there is a right-linear single-premise canonic system that generates the same language and, conversely, for every RLCS there is a regular grammar generating the same language. (A similar result holds for LLCS, the left-linear counterpart.) In other words, the class of RLCS is equivalent in generative power to the class of regular grammars.

Proof: 1. Let G = (NG, TG, PG, Σ) be a regular grammar. Construct the following RLCS: <S> = (C, V, M, P, S, {Σ}), where

V = TG
M = {x}
P = {A | A ∈ NG}
S = {⊢, ;}
C = {⊢ a A for every production A → a in PG;
     x B ⊢ ax A for every production A → aB in PG}

To prove that the grammar and the canonic system generate the same set, we show by the construction how a derivation in either of these two formal systems can be simulated step by step by a derivation in the other system. The proof will be by induction on the number of steps used in the derivation.

BASIS OF INDUCTION: A one-step derivation in <S> must be of the form ⊢ a A, and when ⊢ a A is assertable in a derivation in <S>, A → a is applicable in a derivation in G. A one-step derivation in G must be of the form A ⇒ a (this is a derivation from A).
When this production is applicable, ⊢ a A is assertable and is the corresponding one-step derivation in <S>.

INDUCTION STEP: We assume for derivations of up to k-1 steps that the simulation is possible in both directions. By hypothesis a (k-1)-step canonic-system derivation of β B corresponds to a derivation B ⇒* β in G, and vice versa. At the kth step of a derivation, if a canon ⊢ a A (a production A → a) is used, we are in a situation similar to that considered above under Basis of Induction; if the other type of canon (production) is used, we distinguish between the two directions of simulation:

a) (Derivation in the canonic system, simulated by the grammar.) Assume that a canon of the form x B ⊢ ax A is used in the kth step of a derivation; its applicability implies the existence of a proof in <S> of β B. Since this derivation takes only k-1 steps, by the induction hypothesis we have B ⇒* β in G. An instance of x B ⊢ ax A with x = β completes a proof of aβ A in <S>. The corresponding rule A → aB in G yields A ⇒* aβ. Therefore all derivations in <S> can be simulated step by step by derivations in G, including in particular the Σ-derivations. This proves L(<S>) ⊆ L(G).

b) (Derivation in the grammar, simulated by the canonic system.) Consider a k-step derivation A ⇒* γ. Suppose that the first production applied is A → aB; its applicability implies the existence of a string β, β ∈ TG*, such that B ⇒* β in G is k-1 steps long and γ = aβ. By the induction hypothesis this (k-1)-step derivation in G can be simulated by a (k-1)-step derivation in <S>. The canon corresponding to the rule A → aB is x B ⊢ ax A. Substituting β for x yields a k-step proof of γ A in <S>. Hence Σ ⇒* γ in G implies that γ Σ is in <S>, and L(G) ⊆ L(<S>).

Proof: 2. Let <S> = (C, V, M, P, S, {Σ}) be an RLCS. Construct G = (NG, TG, PG, Σ), where

NG = {A | A ∈ P}
TG = V
PG = {A → a for every canon ⊢ a A in C;
      A → aB for every canon x B ⊢ ax A in C}

The canons of <S> and the productions of G correspond exactly as in part 1, so the same argument gives L(<S>) = L(G). QED

Notice that these systems are strongly equivalent (derivation-for-derivation equivalent); thus

THEOREM: The class of RLCS and the class of regular grammars are strongly equivalent.

Much of the research into formal systems and grammars concerns their generative power and the structure and complexity of their languages; these qualities are interrelated. We described at the beginning of this chapter some of the motivation behind such studies.
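The construction in part 1 of the proof above is mechanical, and a small Python sketch can make the correspondence concrete. Everything named here is an assumption for illustration: productions are encoded as triples (A, a, B), with B set to None for a production A → a, and a canon is encoded as (premise predicate, terminal, conclusion predicate). The two enumeration functions are bounded by a hypothetical max_len so that they terminate.

```python
def grammar_to_rlcs(productions):
    """A -> a becomes |- a A; A -> aB becomes x B |- ax A."""
    return [(rhs, terminal, lhs) for lhs, terminal, rhs in productions]


def rlcs_language(canons, sentence_predicate, max_len):
    """Forward-chain over single-premise canons; keep strings <= max_len."""
    remarks = set()                      # pairs (string, predicate)
    changed = True
    while changed:
        changed = False
        for premise, a, conclusion in canons:
            if premise is None:          # axiom |- a A
                candidates = [a]
            else:                        # x B |- ax A, x taken from proven B's
                candidates = [a + w for w, p in sorted(remarks) if p == premise]
            for s in candidates:
                if len(s) <= max_len and (s, conclusion) not in remarks:
                    remarks.add((s, conclusion))
                    changed = True
    return {w for w, p in remarks if p == sentence_predicate}


def grammar_language(productions, start, max_len):
    """Enumerate terminal strings derivable from `start`, length <= max_len."""
    done, frontier = set(), [("", start)]
    while frontier:
        prefix, nonterminal = frontier.pop()
        for lhs, a, rhs in productions:
            if lhs != nonterminal or len(prefix) + len(a) > max_len:
                continue
            if rhs is None:
                done.add(prefix + a)
            else:
                frontier.append((prefix + a, rhs))
    return done


# Hypothetical example grammar: S -> aS | b
EXAMPLE = [("S", "a", "S"), ("S", "b", None)]

if __name__ == "__main__":
    canons = grammar_to_rlcs(EXAMPLE)
    print(rlcs_language(canons, "S", 3) == grammar_language(EXAMPLE, "S", 3))
```

On the example grammar both enumerations give the same strings up to the chosen length. That is a spot check of the theorem on one grammar, not a proof.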
Underlying all such work is the desire to simplify linguistics into a form that can be productively used and analyzed. Language is too complicated and varied, so we turn to formal systems. But the mere existence of formalism does not solve our problems. Many formal systems (for example, canonic systems and type 0 grammars) have inherent undecidability problems: in general, there is no algorithm capable of telling after a finite amount of time whether or not a given string is in the language of a given grammar. (Recall our definition of decidability in section 7.2.) Studies of generative power help us to understand how characteristics of a grammar correspond to structural features of languages and to choose the weakest grammar suitable to a given situation. At the same time, by exploring restrictions we learn about the structure of languages.

Figure 7.11 is an inclusion diagram of the relationships between certain classes of grammars. Classes of systems in the diagram include all classes below them (that is, at nodes below them in the tree). The diverging branches represent classes for which inclusion either does not exist or is not presently known.

Figure 7.11 Corresponding hierarchies

Complexity is an intuitive notion, and no satisfactory measure for it has yet been proposed. We use the term complexity of a formula to mean some measure of the difficulty involved in generating or recognizing it; complexity measures will be used to compare formulae or languages. Possible measures are the length of proof, the length of string, or the height (length of the longest path of branches) of the tree diagram of a derivation. In the case of canonic systems, one might count the number of remarks constituting a proof. These measures all deal with aspects of structure. Another way of getting at structure is to consider a machine, or automaton, corresponding to a formal system: a complexity measure might be given in terms of some characteristic of the machine, for example the amount of memory it requires or the number of steps it takes. Alternatively, a measure might treat the meaning or depth of formulae (semantic notions).

Another sort of measure might involve the difficulty of recognizing a formula, and perhaps of constructing a derivation for it. In this latter case a short or structurally simple formula might have a high measure of complexity if it involved ambiguities or if it were similar, superficially or structurally, to many other formulae.

It is obvious that these proposed measures are not in complete agreement: they lead to very different orderings of the formulae in a language and of different languages. In the case of language-translating systems, we can consider the complexity measures of equivalent formulae in different languages; we might ask, for example, whether a given translation algorithm preserves the ordering of complexities under a given measure.

The structure of a formula or sentence is important because in it lies much of the meaning of the sentence. (See the phrase-structure example in section 7.2.1.) Ambiguity is thus of specific interest. (For example, there is a theorem that states that some context-free languages are inherently ambiguous; that is, they have no unambiguous context-free grammars.) When we say that two grammars both generate a given sentence, we may want to know also whether they give the same structure for the sentence. This question relates to complexity because complexity is a structural notion.

7.8 SUMMARY

We have explored the practical use of formal systems in defining, studying, and implementing programming languages and language processors. We defined some terminology of formal linguistics (the study of grammars, formal systems, and languages). BNF was presented as a method of defining syntax and translation and of studying programming languages. As an example of research currently being conducted in this area, canonic systems were defined, some formal properties were investigated, and their power was diagrammed. We showed how canonic systems can be used to define language features that BNF, a context-free grammar, cannot specify. Finally, we discussed the notion of a complexity measure.

QUESTIONS

1. Consider the following BNF specification of simple arithmetic expressions such as A + 2 * C:

1. <arith> ::= <arith> + <term> | <term>
2. <term> ::= <factor> * <term> | <factor>
3. <factor> ::= <symbol> | <number>
4. <number> ::= 0 | 1 | 2 | 3
5. <symbol> ::= [<letter>]_1^8
6. <letter> ::= A | B | C | D

Note: [<letter>]_1^8 means 1 to 8 letters. It is shorthand for

<letter> | <letter> <letter> | <letter> <letter> <letter> | ...

Now consider:

THETA + SIGMA + 2 * 4 * ALPHA + MOO

and its parse tree:

or its compact parse tree :

Each dashed line encloses a sub-tree (headed by one node) of the entire tree. These three sub-trees show examples of left recursion, right recursion, and precedence in BNF. Precedence is the implied order of computation of the operators in the sub-trees.

a. Sub-tree 1 must be evaluated before the * operation of sub-tree 2 can be done. Thus in 2 * 4 * ALPHA, the 4 * ALPHA is computed first and the parse is (2 * (4 * ALPHA)). Thus multiplication proceeds right to left. This results from the right recursion in rule (2) of the BNF above. There <term> ::= <factor> * <term> says that <term> (which can contain multiplications) is an input on the right side to * and must be performed before the * can. <factor> on the left can contain no operators. Right recursion yields right-to-left evaluation of the operator.

b. Sub-tree 3 shows the left recursion of +, which results from rule (1) of the BNF above. Rule (1) says <arith> ::= <arith> + <term>, meaning that <arith> (which can contain additions) is an input on the left side to + and must be performed first. Thus addition proceeds left to right. Left recursion yields left-to-right evaluation of the operator.

c. Sub-tree 2 illustrates the precedence (or higher priority) of * over +, since the 2 * (sub-tree 1) of sub-tree 2 must be evaluated before the + from which the tree hangs can be performed. This precedence of * over + is expressed in rule (1) of the above BNF, where <arith> ::= <arith> + <term> states that <term>, which can contain multiplication, is an input to + and must be evaluated before +. In general, if <foo> is an input to operator α and a <foo> can contain operator β, then β has precedence over α.

Finally, recursion and precedence relate. In our example, the right * has precedence over (is performed before) the left *; the left + has precedence over the right +.
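The grouping described in parts a through c can be reproduced with a small recursive-descent parser in Python. This is a sketch under assumptions, not a method taken from the text: the function names are invented, the tokenizer accepts any run of capital letters as a <symbol> (ignoring the A to D alphabet and the eight-letter limit so that THETA and SIGMA are accepted), and the left-recursive rule (1) is implemented with a loop, the usual substitute for left recursion in top-down parsers.

```python
import re


def tokenize(text):
    """Split into symbols, single-digit numbers, and the operators + and *."""
    return re.findall(r"[A-Z]+|[0-9]|[+*]", text)


def parse_arith(tokens):
    # Rule (1): <arith> ::= <arith> + <term> | <term>
    # Left recursion is replaced by a loop that folds to the left.
    tree, rest = parse_term(tokens)
    while rest and rest[0] == "+":
        right, rest = parse_term(rest[1:])
        tree = ("+", tree, right)
    return tree, rest


def parse_term(tokens):
    # Rule (2): <term> ::= <factor> * <term> | <factor>  (right recursion)
    left, rest = parse_factor(tokens)
    if rest and rest[0] == "*":
        right, rest = parse_term(rest[1:])
        return ("*", left, right), rest
    return left, rest


def parse_factor(tokens):
    # Rule (3): <factor> ::= <symbol> | <number>
    if not tokens or tokens[0] in "+*":
        raise ValueError("operand expected")
    return tokens[0], tokens[1:]


def show(tree):
    """Render a parse as a fully parenthesized string."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    return f"({show(left)} {op} {show(right)})"


if __name__ == "__main__":
    tree, rest = parse_arith(tokenize("THETA + SIGMA + 2 * 4 * ALPHA + MOO"))
    print(show(tree))
```

For the expression in the question the sketch yields (((THETA + SIGMA) + (2 * (4 * ALPHA))) + MOO): the additions group to the left, the multiplications group to the right, and the product is formed before the sum it belongs to.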
PROBLEM

Concerning the language defined by the following BNF:

(1) <sentence> ::= <clause> <sentence> | <clause>
(2) <clause> ::= <clause> <foo> | <word> $ <cough>
(3) <foo> ::= <word> | <mumble>
(4) <word> ::= <word> $ <char> | <char>
(5) <mumble> ::= <char> <char>
(6) <cough> ::= <cough> <char> | <char> <char>
(7) <char> ::= A | B | C | D

a. Which rules contain recursion? Identify each as left or right and quote the part(s) of the rule that contain the recursion. Please identify the rules by number.
b. Which rules express a precedence relation between different operators? Quote the part(s) of the rule and state what relationship(s) they imply.
c. If you did (b) correctly, you found different rules implying opposite relationships between the same two operators. This is not wrong or even ambiguous or inconsistent. Why? (Hint: When does a relationship expressed by a rule apply? Not everywhere!) (See next question.)
d. Show the parse tree of A $ C D D

2. List four ways in which formal systems are useful in compilers or programming languages.

3. a. Differentiate among:
1) A grammar
2) A language
3) A machine
b. Which of the above describes the behavior of a BNF specification? Of a canonic system?
c. What is the relationship between a BNF and a canonic system?

4. What is the difficulty in using a generative grammar (capable of describing the language in question) to determine the legality of a particular string?

5. a. Why is BNF unsatisfactory for completely describing some languages?
b. How do canonic systems overcome this deficiency?

6. The following Backus-Naur Form grammar generates string expressions which contain string operands and binary operators. The transformations indicated by these operators are:

operator          description               examples
(juxtaposition)   Concatenate               ABC DE G → ABCDEG
*                 Imbedded in               AB * CD → CDABCD;  X * AB → ABXAB
-                 Less all occurrences of   ABCDBABC - BC → ADBA;  XX - X → λ

The BNF specification is:

<string expression> ::= <partial string> | <string expression> <partial string>
<partial string> ::= <partial string> - <nested string> | <nested string>
<nested string> ::= <basic string> | <nested string> * <basic string>
<basic string> ::= λ | <basic string> <letter>
<letter> ::= A | B | ... | Y | Z

where λ represents the null string.

a. Generate two examples of legal string expressions (showing the parse tree). Use every rule at least once.
b. List the operators in order of precedence, highest precedence first (see problem 7.1 for the definition of precedence).

In an arithmetic expression A+B*C, the rule for parsing this statement must make us pass over the + and examine the *. In other words, the rule must recognize that B is the left operand of * and not the right operand of +. In order to make the decision, in the absence of parentheses, we must look at the operators on both sides of an operand: the operand belongs to the operator of highest precedence or, if both operators have the same precedence, to the left-most operator of the two. One may depict the relative precedence of operators in a two-dimensional array where the left operator is the one to the left of the operand.

c. Fill in the precedence array indicating the relative precedence of each pair of operators:

7. In chapter 8 (Compilers), Figure 8.16, there is a BNF specification of a very small subset of PL/I.
a. Show the parse tree of: COST = RATE * (START - FINISH) + 2 * RATE * (START - FINISH - 100)
b. Show the parse tree of the program in Figure 8.1.
c. Is there any precedence expressed between the arithmetic operators (+, -, *, /) of this subset? In what order are arithmetic operations performed in an expression? Is this order ambiguous? Are parenthesized expressions legal?
d. Extend the BNF to have regular arithmetic precedence.

8. The following is a canonic system description of a language L.

(1) ⊢ P + Q + R + S letter
(2) j letter ⊢ j id
(3) i id ⊢ i primary
(4) j letter; i id ⊢ ij id
(5) q primary ⊢ q term
(6) t term; q primary ⊢ t*q term
(7) t term ⊢ t exp
(8) t term; e exp ⊢ e+t exp
(9) e exp ⊢ (e) primary

a. Are any precedence relations expressed between + and *? If so, which canon(s) express them and what are they?
b. L already has parenthesized expressions (by canon (9)). Add the exponentiation operator $ to L. $ has precedence over * and +. $ should evaluate right to left. Thus:
P + Q * R $ S means P + (Q * (R $ S)) and
P + Q $ R $ S means P + (Q $ (R $ S))

9. Any BNF system may be written as an equivalent canonic system, and sometimes a canonic system can be written as an equivalent BNF system. The canonic system of problem 8 can be written in BNF. One technique to write the BNF equivalent is to:

1) For every canon, write one BNF rule.
2) Let the canonic system's predicates (the italicized names) be the nonterminals in BNF (the bracketed names).

EXAMPLES

Canonic system                     BNF
⊢ P + Q + R + S letter             <letter> ::= P | Q | R | S
j letter ⊢ j id                    <id> ::= <letter>
j letter; i id ⊢ ij id             <id> ::= <id> <letter>

a. Following our example, write the BNF equivalent of the canonic system of problem 8. b. A canonic system can express a language more complex than BNF (or context free) languages. Such a language cannot have a BNF equivalent. Does the following have a BNF equivalent? If so, write it. If not, indicate which canons and predicates cannot be expressed in BNF. (1) (2) (3) (4) (5) (6) (7) (8) (9) A+ B+C+ D letter 0+ 1+ 2+ 3 digit a letter a thing a letter & b thing ab thing list i thing i, list a+b list ab list a+ b+ c list b in abc a in m & b in m ab in m

(10) t thing GO TO t goto-stmt
(11) t thing & s thing s= t assign-label-stmt
(12) s goto-stmt s stmt-seq
(13) s assign-label-stmt s stmt-seq

10. Suggest a possible measure of complexity for a language or a translator specified by a canonic system.

11. Take as a simple measure m(C,t) of a theorem t in a canonic system C the number of remarks in the shortest proof of t in C. That is, if r1, r2, ..., rk is the shortest proof of t in C, then m(C,t) = k.
a. If t is a theorem of C, then prove m(C,t) is computable.
b. Take a new measure function m(C,n) = maximum {m(C,t) | t is provable in C and |t| = n};
c. Prove that all canonic systems can be reduced to a canonic system of only single-premise canons.
d. Find a correspondence between this measure function and the number of steps it may take to recognize a string t defined by a canonic system C.

12. a. For your measure function of problem 10, prove the theorems of problems 11a and 11b.
b. Find closed-form upper and lower bounds for your measure function.

8 Compilers

This chapter has two purposes:
1. To present a general model of a compiler that may be used as a basis for designing and studying compilers.
2. To create an appreciation of the difficulty and cost of implementing and using particular features of languages.

To accomplish this we have divided the chapter into three parts. PART 1 presents a simple example and introduces a general model of a compiler. PART 2 elaborates upon the model and explains its inner workings in detail. PART 3 uses the model to demonstrate the implementation of advanced features discussed in Chapter 6 (e.g. data structures, recursion, storage allocation, block structure, conditions and pointers).

PART 1

8.1 STATEMENT OF PROBLEM


A compiler accepts a program written in a higher level language as input and produces its machine language equivalent as output. Here in Part 1, we examine a simple PL/I program and become familiar with the issues we must face in trying to compile it.

WCM: PROCEDURE (RATE, START, FINISH);
     DECLARE (COST, RATE, START, FINISH) FIXED BINARY (31) STATIC;
     COST = RATE * (START-FINISH) + 2 * RATE * (START-FINISH-100);
     RETURN (COST);
     END;

FIGURE 8.1 MINI-PL/I program example

What must the compiler do in order to produce the machine language equivalent of WCM?
1. Recognize certain strings as basic elements, e.g. recognize that COST is a variable, WCM is a label, PROCEDURE is a keyword, and = is an operator
2. Recognize combinations of elements as syntactic units and interpret their meaning, e.g. ascertain that the first statement is a procedure name with three arguments, that the next statement defines four variables to be fixed binary numbers of 31 bits, that the third statement is an assignment statement that requires seven computations, and that the last statement is a return statement with one argument

3. Allocate storage and assign locations for all variables in this program
4. Generate the appropriate object code.

8.1.1 Problem No. 1- Recognizing Basic Elements

The action of parsing the source program into the proper syntactic classes is known as lexical analysis. The program is scanned and separated as shown in Figure 8.2. The operational details for this step involve conceptually simple string processing techniques. The source program is scanned sequentially. The basic elements or tokens are delineated by blanks, operators, and special symbols and thereby recognized as identifiers, literals, or terminal symbols (operators, keywords).

The basic elements (identifiers and literals) are placed into tables. As other phases recognize the use and meaning of the elements, further information is entered into these tables (e.g. precision, data type, length, and storage class). Other phases of the compiler use the attributes of each basic element and must therefore have access to this information. Either all the information about each element is passed to other phases or, more typically, the source string itself is converted into a string of uniform symbols. Uniform symbols are of fixed size and consist of the syntactic class and a pointer to the table entry of the associated basic element. Figure 8.3 depicts the uniform symbols for the first statement of the example program. Because the uniform symbols are of fixed size, converting to them makes the later phases of the compiler simpler. Also, testing for equality is now just a matter of comparing syntactic classes and pointers rather than comparing long character strings.

Figure 8.2 Lexical analysis- tokens of example program

This lexical process can be done in one continuous pass through the data by creating an intermediate form of the program consisting of a chain or table of tokens. Alternatively, some schemes reduce the size of the token table by only parsing tokens as necessary, and discarding those that are no longer needed. Current compilers use both of these approaches. Either method of lexical analysis will discover and note lexical errors (e.g. invalid characters and improper identifiers). The lexical phase also discards comments since they have no effect on the processing of the program.
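The lexical step described above can be illustrated with a short Python sketch. It is not the book's implementation: the function name lex, the class codes TRM, IDN and LIT, the contents of the TERMINALS list, and the use of list indexes as "pointers" are all assumptions made for illustration. It scans the source once, enters identifiers and literals into tables, and emits fixed-size uniform symbols of the form (class, index).

```python
import re

# Hypothetical terminal table: keywords, operators and special symbols.
TERMINALS = ["PROCEDURE", "DECLARE", "FIXED", "BINARY", "STATIC", "RETURN",
             "END", ":", "(", ")", ",", ";", "=", "*", "+", "-", "/"]

# One token: an identifier/keyword, an integer literal, or a special symbol.
TOKEN_RE = re.compile(r"\s*([A-Za-z][A-Za-z0-9_]*|\d+|[:(),;=*+\-/])")

def lex(source):
    """Return (identifier table, literal table, uniform symbol list)."""
    identifiers, literals, uniform = [], [], []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            # An invalid character is a lexical error.
            raise ValueError(f"lexical error at position {pos}")
        token = match.group(1)
        pos = match.end()
        if token in TERMINALS:
            uniform.append(("TRM", TERMINALS.index(token)))
        elif token[0].isdigit():
            if token not in literals:
                literals.append(token)
            uniform.append(("LIT", literals.index(token)))
        else:
            if token not in identifiers:
                identifiers.append(token)
            uniform.append(("IDN", identifiers.index(token)))
    return identifiers, literals, uniform

ids, lits, symbols = lex("WCM: PROCEDURE (RATE, START, FINISH);")
```

Every uniform symbol is the same small pair regardless of how long the token was, so later phases can compare symbols by class and index instead of comparing character strings.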

Figure 8.3 Uniform symbols of first statement

Notice that the uniform symbol is the same length whether the token is 1 or 31 characters long. Other phases of the compiler deal mainly with the small uniform symbol, but they may access any attribute of the token by following a pointer.

8.1.2 Problem No. 2- Recognizing Syntactic Units and Interpreting Meaning

Once the program has been broken down into tokens or uniform symbols, the compiler must (1) recognize the phrases (syntactic constructions); each phrase is a semantic entity and is a string of tokens that has an associated meaning; and (2) interpret the meaning of the constructions.

The first of these two steps is concerned solely with recognizing and thus separating the basic syntactic constructs in the source program. This process is known as syntax analysis. For our example, we arbitrarily take the statements as phrases. The program is scanned and separated as shown in Figure 8.4. Syntax analysis also notes syntactic errors and assures some sort of recovery so that the compiler can continue to look for other compilation errors. Some compilers attempt to guess what the programmer did wrong and correct it.

Once the syntax of a statement has been ascertained, the second step is to interpret the meaning (semantics). Associated with each syntactic construction is a defined meaning. This may be in the form of actual code or an intermediate form of the construction.

There are many ways of operationally recognizing the basic constructs (syntax analysis) and interpreting their meaning. In Part 2, we chose to use very general methods. They use rules (reductions) which specify the syntax form of the source language. These reductions define the basic syntax construction and the appropriate compiler routine (action routine) to be executed when a construction is recognized. The action routines interpret the meaning of the constructions and generate either code or an intermediate form of the constructions.
For example, a reduction might specify: if an identifier is followed by an = sign, then call the action routine ARITHMETIC_STM.

Figure 8.4 Syntax analysis- syntactic units of example program

8.1.3 Intermediate Form


Once the syntactic construction is determined, the compiler can generate object code for each construction. Typically, however, the compiler creates an intermediate form of the source program. The intermediate form affords two advantages: (1) it facilitates optimization of object code; and (2) it allows a logical separation between the machine-independent phases (lexical, syntax, interpretation) and the machine-dependent phases (code generation and assembly).

Using an intermediate form raises two questions: (1) what form? and (2) what are the rules for converting source code into that form? The form depends on the syntactic construction, e.g. arithmetic, nonarithmetic, or nonexecutable statements.

8.1.3.1 ARITHMETIC STATEMENTS

One intermediate form of an arithmetic statement is a parse tree. From the formal methods of the previous chapter, e.g. the general syntax tree of problem 1, Chapter 7, we can obtain the parse tree of Figure 8.5. The rules for converting an arithmetic statement into a parse tree are:
1. Any variable is a terminal node of the tree.
2. For every operator, construct (in the order dictated by the rules of algebra) a binary (two-branched) tree whose left branch is the tree for operand 1 and whose right branch is the tree for operand 2.

Figure 8.5 Tree- intermediate form of example arithmetic statement

The tree for the arithmetic statement of the example program is depicted in Figure 8.5. Although this picture makes it easy for us to visualize the structure of the statement, it is not a practical method for a compiler.

The compiler may use as an intermediate form a linear representation of the parse tree called a matrix. In a matrix, operations of the program are listed sequentially in the order they would be executed. Each matrix entry has one operator and two operands. The operands are uniform symbols denoting either variables, literals, or other matrix entries Mi (i denotes a matrix entry number). The matrix that would be generated for the arithmetic statement is depicted in Figure 8.6. (For ease of reading we use the actual symbols as entries instead of their uniform symbols.)

The tree (Fig. 8.5) and the matrix (Fig. 8.6) are equivalent representations of the assignment statement. The reader can form the tree of such a statement using rules of elementary algebra. The process of constructing a tree or matrix from an arithmetic statement is the subject of many compiler books. We leave it as an exercise for the reader (see question 1 of Chapter 7).

8.1.3.2 NONARITHMETIC STATEMENTS

The problem of creating an accurate intermediate form for nonarithmetic executable statements is similar to that for arithmetic ones. The statements DO, IF, GO TO, etc., can all be replaced by a sequential ordering of individual matrix entries (Fig. 8.7). The operators in the matrix are defined in later phases of the compiler so that proper code can be produced.

Figure 8.6 Matrix for example arithmetic statement

Figure 8.7 Example matrix for RETURN and END statement
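The construction of a matrix from an arithmetic statement can be sketched as follows. This is a minimal Python illustration, assuming ordinary algebraic precedence (* and / before + and -, left to right, parentheses first); it is a plain recursive-descent parser, not the reduction-driven method that Part 2 describes, and the function name build_matrix is hypothetical. Each entry is a triple (operator, operand 1, operand 2), and an operand of the form Mi refers to the result of entry i.

```python
import re

def build_matrix(statement):
    """Translate 'TARGET = expression' into a list of matrix entries."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*|\d+|[-+*/()=]", statement)
    matrix = []
    pos = 0

    def emit(op, left, right):
        # Append one entry and return the name of its result (M1, M2, ...).
        matrix.append((op, left, right))
        return f"M{len(matrix)}"

    def primary():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = expression()
            pos += 1  # skip the closing ")"
            return value
        return token

    def term():
        nonlocal pos
        left = primary()
        while pos < len(tokens) and tokens[pos] in ("*", "/"):
            op = tokens[pos]
            pos += 1
            right = primary()
            left = emit(op, left, right)
        return left

    def expression():
        nonlocal pos
        left = term()
        while pos < len(tokens) and tokens[pos] in ("+", "-"):
            op = tokens[pos]
            pos += 1
            right = term()
            left = emit(op, left, right)
        return left

    target = tokens[0]
    pos = 2  # skip the target and the "=" sign
    emit("=", target, expression())
    return matrix

matrix = build_matrix(
    "COST = RATE * (START - FINISH) + 2 * RATE * (START - FINISH - 100)")
for number, entry in enumerate(matrix, start=1):
    print(number, *entry)
```

For the assignment statement of Figure 8.1 the sketch yields eight entries, listed in the order in which the operations would be executed. It performs no error checking, so a malformed statement is not diagnosed.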

8.1.3.3 NONEXECUTABLE STATEMENTS


Nonexecutable statements like DECLARE or FORTRAN's DIMENSION give the compiler information that clarifies the referencing or allocation of variables and associated storage. There is no intermediate form for these statements. Instead, the information contained in a nonexecutable statement is entered into tables, to be used by the other parts of the compiler. For instance, for the DECLARE statement in our example the interpretation phase would note the data type, precision, and storage class (FIXED BINARY, 31 bits, STATIC) in the identifier table for each of the variables COST, RATE, START, and FINISH (as in Fig. 8.8).

In a high-powered language like PL/I, where some storage allocation and data definition are delayed until execution time, certain tables are used by the object program as well as by the compiler. Therefore, some declarative information is passed on to the object program to be used at execution time.

So far we have discussed lexical analysis, syntax analysis, and interpretation (semantics). Now we will discuss how the last two general problems associated with the PL/I example may be solved.

8.1.4 Problem No. 3- Storage Allocation

At some time we must reserve the proper amounts of storage required by our program. In our example the DCL statement gives us the proper information:

DECLARE (COST, RATE, START, FINISH) FIXED BINARY (31) STATIC;

The interpretation phase constructs the entries in the table of Figure 8.8. The storage allocation routine scans the identifier table and assigns a location to each scalar. In the case of fixed binary numbers occupying 32 bits (31 binary digits and 1 sign bit), it will assign the first to relative location 0, the second to location 4, and so on. The absolute address is unknown until load time. The storage assignment phase must also ensure that 4 words of 31 binary digits and 1 sign bit are reserved at load time. The compiler uses these relative assigned addresses in later phases for proper accessing. Similarly, storage is assigned for the temporary locations that will contain intermediate results of the matrix (e.g. M1, M2, M3, M4, M5, M6, M7).

Name     Base     Scale   Precision (bits)   Storage class   Location
COST     BINARY   FIXED   31                 STATIC          0
RATE     BINARY   FIXED   31                 STATIC          4
START    BINARY   FIXED   31                 STATIC          8
FINISH   BINARY   FIXED   31                 STATIC          12

Figure 8.8 Identifier table
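The assignment of relative locations in Figure 8.8 can be sketched in a few lines of Python. The dictionary layout of the identifier table, the function name assign_storage, and the fixed 4-byte word size are assumptions for illustration; only STATIC entries are handled, since the other storage classes are allocated at execution time.

```python
def assign_storage(identifier_table, word_size=4):
    """Assign consecutive relative locations to STATIC scalars.

    Returns the total number of bytes that must be reserved at load time.
    """
    location = 0
    for entry in identifier_table:
        if entry["storage_class"] == "STATIC":
            entry["location"] = location
            location += word_size
    return location

# Identifier table as left by the interpretation phase (locations not yet set).
table = [
    {"name": name, "base": "BINARY", "scale": "FIXED",
     "precision": 31, "storage_class": "STATIC"}
    for name in ("COST", "RATE", "START", "FINISH")
]

total_bytes = assign_storage(table)
for entry in table:
    print(entry["name"], entry["location"])
```

The locations are relative offsets only; the absolute addresses remain unknown until load time.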

For a language like FORTRAN, where all storage is assigned at compilation time, this is a straightforward procedure. We would look at an entry in the table, find its base, scale, and precision, and reserve the proper amount of storage. However, in a language like PL/I only STATIC storage is assigned this way. There are three other types of storage (AUTOMATIC, CONTROLLED and BASED) that are not. For variables declared as AUTOMATIC, CONTROLLED or BASED, information about each variable is included in the object deck and thus is available at execution time. Further, the compiler generates for statements such as PROCEDURE, BEGIN, and ALLOCATE the proper object code to reserve the necessary storage for these variables. In a similar way, END, RETURN and FREE statements must generate object code to free this dynamic storage for other use. This requires that these statements have associated entries in the matrix that will in turn cause the proper code to be generated.

Accessing this type of storage is difficult. Typically, the compiler doesn't even know whether or not it will be allocated, much less the location at which it is to be allocated. References to this type of data may be made relative to registers that can be updated at execution time to point at the dynamically allocated data.

8.1.5 Problem No. 4- Code Generation

Once the compiler has created the matrix and tables of supporting information, it may generate the object code. One scheme is to have a table (code production table) defining each type of matrix operation with the associated object code (Fig. 8.9). The code generation phase would scan the matrix and generate, for each entry, the code defined in the table, using the operands of the matrix entries to further specialize the code. This process is depicted in Figure 8.10 using the code definitions of Figure 8.9.

One way to view this scheme is that we are treating operators in the matrix as macro calls, operands as arguments, and the production table as macro definitions.
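The macro-expansion view of code generation can be made concrete with a small Python sketch. The templates are those of Figure 8.9; the names CODE_PRODUCTIONS, MATRIX, operand and generate are hypothetical, and writing numeric literals in the assembler form =F'n' is an assumption about the intended notation.

```python
# Code production table: one template list per matrix operator (Fig. 8.9).
CODE_PRODUCTIONS = {
    "-": ["L 1,&OPERAND1", "S 1,&OPERAND2", "ST 1,M&N"],
    "*": ["L 1,&OPERAND1", "M 0,&OPERAND2", "ST 1,M&N"],
    "+": ["L 1,&OPERAND1", "A 1,&OPERAND2", "ST 1,M&N"],
    "=": ["L 1,&OPERAND2", "ST 1,&OPERAND1"],
}

# Matrix for the assignment statement of the example program.
MATRIX = [
    ("-", "START", "FINISH"),
    ("*", "RATE", "M1"),
    ("*", "2", "RATE"),
    ("-", "START", "FINISH"),
    ("-", "M4", "100"),
    ("*", "M3", "M5"),
    ("+", "M2", "M6"),
    ("=", "COST", "M7"),
]

def operand(symbol):
    """Write numeric literals as fullword literals; leave names unchanged."""
    return f"=F'{symbol}'" if symbol.isdigit() else symbol

def generate(matrix):
    """Expand every matrix entry through its code production."""
    code = []
    for n, (op, first, second) in enumerate(matrix, start=1):
        for template in CODE_PRODUCTIONS[op]:
            line = (template.replace("&OPERAND1", operand(first))
                            .replace("&OPERAND2", operand(second))
                            .replace("&N", str(n)))
            code.append(line)
    return code

for line in generate(MATRIX):
    print(line)
```

Because each entry is expanded in isolation, the generator cannot notice that a value it has just stored is reloaded by the very next instruction.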

Standard code definitions for -, *, +, =

-    L    1,&OPERAND1
     S    1,&OPERAND2
     ST   1,M&N

*    L    1,&OPERAND1
     M    0,&OPERAND2
     ST   1,M&N

+    L    1,&OPERAND1
     A    1,&OPERAND2
     ST   1,M&N

=    L    1,&OPERAND2
     ST   1,&OPERAND1

where &OPERAND1 is the first operand of the matrix entry, &OPERAND2 is the second operand, and &N is the line number of the matrix entry being examined

FIGURE 8.9 Code definitions (productions) for -, *, +, =

Matrix                     Generated code
1  -  START  FINISH        L    1,START
                           S    1,FINISH
                           ST   1,M1
2  *  RATE   M1            L    1,RATE
                           M    0,M1
                           ST   1,M2
3  *  2      RATE          L    1,=F'2'
                           M    0,RATE
                           ST   1,M3
4  -  START  FINISH        L    1,START
                           S    1,FINISH
                           ST   1,M4
5  -  M4     100           L    1,M4
                           S    1,=F'100'
                           ST   1,M5
6  *  M3     M5            L    1,M3
                           M    0,M5
                           ST   1,M6
7  +  M2     M6            L    1,M2
                           A    1,M6
                           ST   1,M7
8  =  COST   M7            L    1,M7
                           ST   1,COST

FIGURE 8.10 Example code generation

When we examine Figure 8.10 carefully, three questions arise.
1. Was it a good idea to generate code directly from the matrix? Lines 1 and 4 of our matrix are identical and thus resulted in redundant code.
2. Have we made the best use of the machine we have at our disposal? Lines 12 (ST 1,M4) and 13 (L 1,M4) of the generated code are wasteful; they do not change the result of any computation. Also we have used only 2 out of the 16 registers and only one type of instruction (RX) of the five types available on the 360. The RR instructions, which are faster, have not been used.
3. Can we generate machine language directly? The example used assembly language. What would have happened if one of the operands in the matrix was a label on a statement that wasn't generated yet?

The first two of these questions are issues of optimization. The first refers to the optimality of the matrix as an intermediate form (machine-independent) and the second to the optimality of the actual machine code (machine-dependent). If a compiler first eliminates the redundant code of question 1, it will not waste time trying to improve the eliminated code in question 2.
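The duplication noted in question 1 (matrix lines 1 and 4) can be detected mechanically. The Python sketch below is one possible way to do it, not the book's algorithm: it assumes that no operand is assigned a new value between the two occurrences, which holds inside the single assignment statement of the example. The function name is hypothetical, and a deleted entry is represented by None so that the remaining entries keep their numbers.

```python
def eliminate_common_subexpressions(matrix):
    """Delete duplicate entries and redirect references to the survivor."""
    seen = {}    # (operator, operand1, operand2) -> name of surviving entry
    alias = {}   # name of deleted entry -> name of surviving entry
    result = []
    for n, (op, first, second) in enumerate(matrix, start=1):
        # References to deleted entries are redirected before comparison.
        first = alias.get(first, first)
        second = alias.get(second, second)
        key = (op, first, second)
        if op != "=" and key in seen:
            alias[f"M{n}"] = seen[key]
            result.append(None)          # entry n is deleted
        else:
            seen[key] = f"M{n}"
            result.append((op, first, second))
    return result

matrix = [
    ("-", "START", "FINISH"),
    ("*", "RATE", "M1"),
    ("*", "2", "RATE"),
    ("-", "START", "FINISH"),
    ("-", "M4", "100"),
    ("*", "M3", "M5"),
    ("+", "M2", "M6"),
    ("=", "COST", "M7"),
]

optimized = eliminate_common_subexpressions(matrix)
for number, entry in enumerate(optimized, start=1):
    print(number, entry)
```

Entry 4 disappears and entry 5 now refers to M1 instead of M4; every other entry is unchanged.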

8.1.5.1 OPTIMIZATION (MACHINE-INDEPENDENT)

The issue raised by question 1 provides an example of one type of machine-independent optimization problem. Operationally, when a subexpression occurs in the same statement more than once (a common subexpression), we can delete all duplicate matrix entries and modify all references to the deleted entries so that they refer to the remaining copy of that subexpression (Fig. 8.11). The resulting code, shown in the middle column of Figure 8.12, is an improvement over that in Figure 8.10.

Matrix with                  Matrix after elimination of
common subexpression         common subexpression
1  -  START  FINISH          1  -  START  FINISH
2  *  RATE   M1              2  *  RATE   M1
3  *  2      RATE            3  *  2      RATE
4  -  START  FINISH          4
5  -  M4     100             5  -  M1     100
6  *  M3     M5              6  *  M3     M5
7  +  M2     M6              7  +  M2     M6
8  =  COST   M7              8  =  COST   M7

FIGURE 8.11 Elimination of common subexpressions

Other machine-independent optimization steps are:
1. Compile time computation of operations, both of whose operands are constants
2. Movement of computations involving nonvarying operands out of loops

3. Use of properties of Boolean expressions to minimize their computation

These will be discussed in more detail in Part 2. At this point it is important to realize that machine-independent optimization of the matrix is possible and that logically it should occur before we use the matrix as a basis for code generation. Such a process is called the optimization phase and is logically separated from code generation.

8.1.5.2 OPTIMIZATION (MACHINE-DEPENDENT)

The problem raised by question 2 can best be illustrated by comparing two possible versions of the code that can be generated from our optimized matrix example:

Optimized matrix         First try          Improved code
1  -  START  FINISH      L    1,START       L    1,START
                         S    1,FINISH      S    1,FINISH      (M1 in R1)
                         ST   1,M1
2  *  RATE   M1          L    1,RATE        L    3,RATE
                         M    0,M1          MR   2,1           (M2 in R3)
                         ST   1,M2
3  *  2      RATE        L    1,=F'2'       L    5,=F'2'
                         M    0,RATE        M    4,RATE        (M3 in R5)
                         ST   1,M3
4
5  -  M1     100         L    1,M1
                         S    1,=F'100'     S    1,=F'100'     (M5 in R1)
                         ST   1,M5
6  *  M3     M5          L    1,M3          LR   7,5
                         M    0,M5          MR   6,1           (M6 in R7)
                         ST   1,M6
7  +  M2     M6          L    1,M2
                         A    1,M6          AR   3,7           (M7 in R3)
                         ST   1,M7
8  =  COST   M7          L    1,M7
                         ST   1,COST        ST   3,COST

                         80 bytes           36 bytes

Figure 8.12 Machine-dependent optimized code

Figure 8.12 depicts the matrix that we previously optimized by eliminating a common subexpression (M4). Next to each matrix entry is the code generated using the operators as defined in Figure 8.9. The third column is even better code in that it uses less storage and is faster due to a more appropriate mix of instructions. How did we get this more efficient version?
1. We made better use of temporary storage by employing as many of the 360's 16 registers as we could and by not storing our intermediate results unless necessary. This reduced the number of loads and stores from 14 to 5.
2. We used shorter and faster 360 instructions whenever possible (MR instead of M).

This example of machine-dependent optimization has reduced both the memory space needed for the program and the execution time of the object program by a factor of 2. Machine-dependent optimization is typically done while generating code. Operationally, we can extend the previous scheme of a code generator to that of a sophisticated macro processor with conditional macro pseudo-ops. In this way we could vary the instructions generated according to the availability and contents of temporary storage. This type of scheme would incorporate machine-dependent optimization into the code generation phase.

8.1.5.3 ASSEMBLY PHASE

In the beginning of this section we stated that the task of a compiler was to produce the machine language equivalent of a source program. Yet in our examples of code generation we have been producing assembly language. Literals and symbolic addresses are easier for the reader, but the compiler must produce the actual machine language. The compiler can generate references to actual core locations in place of literals and variables assigned by a storage allocation routine. However, labels cannot be assigned values until the final code has been generated. Therefore, coupled with code generation are:
1. Generating code
2. Defining labels and resolving all references

We separate (1) the code generation phase and (2) the assembly phase, since they are logically distinct and are often implemented as such. Operationally, the assembly phase is similar to pass 2 of the assembler.

8.1.6 General Model of Compiler

In analyzing the compilation of our simple PL/I program we have found seven distinct logical problems, as follows and as summarized in Figure 8.13.
1. Lexical analysis - recognition of basic elements and creation of uniform symbols
2. Syntax analysis - recognition of basic syntactic constructs through reductions
3. Interpretation - definition of exact meaning, creation of matrix and tables by action routines
4. Machine-independent optimization - creation of more optimal matrix
5. Storage assignment - modification of identifier and literal tables. It makes entries in the matrix that allow code generation to create code that allocates dynamic storage, and that also allow the assembly phase to reserve the proper amounts of STATIC storage.
6. Code generation - use of macro processor to produce more optimal assembly code
7. Assembly and output - resolving symbolic addresses (labels) and generating machine language

These will be the names of the seven phases in our model compiler. Phases 1 through 4 are machine-independent and language-dependent. Phases 5 through 7 are machine-dependent and language-independent. For reasons of efficiency, in actual implementations these phases may not be separate modules of code.

We might evaluate a compiler not only by the object code produced but also by the amount of core it occupies and the time it takes to execute. Unfortunately, these dimensions of optimality are often inversely proportional to each other. Likewise, the optimality of the code typically is inversely proportional to the complexity, size and execution time of the compiler itself. These are tradeoffs that we must be aware of throughout this chapter.

We also mentioned or assumed the following data bases which are used by the compiler and which form the lines of communication between the phases:
A. Source code - in our example, the simple PL/I program in Figure 8.1
B. Uniform symbol table - consists of a full or partial list of the tokens as they appear in the program. Created by lexical analysis and used for syntax analysis and interpretation (Fig. 8.3)
C. Terminal table - a permanent table which lists all key words and special symbols of the language in symbolic form (pointed to by uniform symbols in Fig. 8.3)
D. Identifier table - contains all variables in the program (four in example) and temporary storage and any information needed to reference or allocate storage for them; created by lexical analysis, modified by interpretation and storage allocation, and referenced by code generation and assembly (Fig. 8.8). The table may also contain information on all temporary locations that the compiler creates for use during execution of the source program (e.g., temporary matrix entries).
E. Literal table - contains all constants in the program (two in example). Creation and use similar to D above
F. Reductions - permanent table of decision rules in the form of patterns for matching with the uniform symbol table to discover syntactic structure
G. Matrix - intermediate form of program which is created by the action routines, optimized, and then used for code generation (Fig. 8.6)
H. Code productions - permanent table of definitions. There is one entry defining code for each possible matrix operator (Fig. 8.9).

I. Assembly code - assembly language version of the program which is created by the code generation phase and is input to the assembly phase
J. Relocatable object code - final output of the assembly phase, ready to be used as input to the loader

These phases and the data bases and their interaction are summarized in Figure 8.13. This figure represents the model compiler described in Part 2 and used in Part 3 of this chapter. In reading these parts the reader should refer back to the figure for an overview.

PART 2

8.2 PHASES OF COMPILER

In this chapter we examine in detail the seven phases of the compiler model. Each phase (Fig. 8.13) is described chronologically, and data bases are introduced into the discussion as the compiler creates or first references them. For reasons of efficiency or because of particular features of the source languages, actual implementations may use more or less elaborate data bases and algorithms than those presented. For example, FORTRAN- or COBOL-like languages may not require the compiler to establish as many attributes in the identifier table as would be needed for languages like PL/I or ALGOL. The algorithms presented similarly may not be the best for a particular situation. However, they do provide the basic guidelines to be followed in compiler design. We feel that the reader may extend the data bases easily to include the exceptions and special cases he will encounter when implementing his own compiler.

8.2.1 Lexical Phase

8.2.1.1 TASKS

The three tasks of the lexical analysis phase are:
1. To parse the source program into the basic elements or tokens of the language
2. To build a literal table and an identifier table
3. To build a uniform symbol table

8.2.1.2 DATA BASES

These tasks involve manipulations of five data bases. Possible forms for these are:

1. Source program - original form of program; appears to the compiler as a string of characters

2. Terminal table - a permanent data base that has an entry for each terminal symbol (e.g., arithmetic operators, keywords, nonalphameric symbols). Each entry consists of the terminal symbol, an indication of its classification (operator, break character), and its precedence (used in later phases). See problem 1 of Chapter 7.

3. Literal table - created by lexical analysis to describe all literals used in the source program. There is one entry for each literal, consisting of a value, a number of attributes, an address denoting the location of the literal at execution time (filled in by a later phase), and other information (e.g., in some implementations we may wish to distinguish between literals used by the program and those used by the compiler, such as the literal 31 in the expression BINARY FIXED (31)). The attributes, such as data type or precision, can be deduced from the literal itself and filled in by lexical analysis.

4. Identifier table - created by lexical analysis to describe all identifiers used in the source program. There is one entry for each identifier. Lexical analysis creates the entry and places the name of the identifier into that entry. Since in many languages identifiers may be from 1 to 31 symbols long, the lexical phase may enter a pointer in the identifier table for efficiency of storage. The pointer points to the name in a table of names. Later phases will fill in the data attributes and address of each identifier.

5. Uniform symbol table - created by lexical analysis to represent the program as a string of tokens rather than of individual characters. (Spaces and comments in the source are not represented by uniform symbols and are not used by future phases. There is one uniform symbol for every token in the program.) Each uniform symbol contains the identification of the table of which the token is a member (e.g., a pointer to the table or a code) and its index within that table.
