CIT316 Summary

This document provides an overview of formal grammars, languages, and automata, detailing the components and semantics of grammars, the types of grammars in the Chomsky hierarchy, and the operations on languages. It also discusses the role of formal languages in computer science, particularly in compiler design, including the structure and phases of a compiler and the importance of translators. Additionally, it covers lexical analysis, regular expressions, and the construction of lexical analysers.

REVIEW OF GRAMMARS, LANGUAGES AND AUTOMATA

Formal Grammar

A formal grammar (sometimes called a grammar) is a set of rules of a specific kind for forming strings in a formal language.

A recogniser is a function in computing that determines whether a given string belongs to the language or is grammatically incorrect.

Automata theory is a formalism used by formal language theory to describe recognisers.

Parsing is the process of recognising an utterance (a string, in natural languages) by breaking it down into a set of symbols and analysing each one against the grammar of the language.

An ambiguous grammar is a grammar with multiple ways of generating the same single string.

The Syntax of Grammars

A grammar G consists of the following components:
 a finite set N of non-terminal symbols, none of which appear in strings formed from G;
 a finite set Σ of terminal symbols that is disjoint from N;
 a finite set P of production rules, each rule of the form
(Σ ∪ N)* N (Σ ∪ N)* → (Σ ∪ N)*;
 a distinguished symbol S ∈ N, the start symbol.
A grammar is then formally defined as the tuple (N, Σ, P, S).

The Semantics of Grammars

The operation of a grammar can be defined in terms of relations on strings:
 given a grammar G = (N, Σ, P, S), the binary relation ⇒G (pronounced "G derives in one step") on strings in (Σ ∪ N)* is defined by:
x ⇒G y iff ∃ u, v, p, q ∈ (Σ ∪ N)*: x = upv ∧ p → q ∈ P ∧ y = uqv
 the relation ⇒G* (pronounced "G derives in zero or more steps") is defined as the reflexive transitive closure of ⇒G.


 a sentential form is a member of (Σ ∪ N)* that can be derived in a finite number of steps from the start symbol S; i.e. a sentential form is a member of {w ∈ (Σ ∪ N)* | S ⇒G* w}
 a sentential form that contains no non-terminal symbols (i.e. is a member of Σ*) is called a sentence
 the language of G, denoted L(G), is defined as all those sentences that can be derived in a finite number of steps from the start symbol S, that is, the set {w ∈ Σ* | S ⇒G* w}.
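As a small worked example (an illustrative grammar, not one taken from the course text): let G = ({S}, {a, b}, P, S) with productions S → aSb and S → ε. Then S ⇒ aSb ⇒ aaSbb ⇒ aabb is a derivation in three steps; aSb and aaSbb are sentential forms; aabb is a sentence; and L(G) = { aⁿbⁿ | n ≥ 0 }.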
Types of Grammars and Automata
In the field of Computer Science, there are four basic types of grammars:
a. Type-0 grammars (unrestricted grammars) include all formal grammars.
b. Type-1 grammars (context-sensitive grammars) generate the context-sensitive
languages.
c. Type-2 grammars (context-free grammars) generate the context-free languages.
d. Type-3 grammars (regular grammars) generate the regular languages.

Type-0: Unrestricted Grammars

These grammars generate exactly all languages that can be recognised by a Turing machine. These languages are also known as the recursively enumerable languages.

Type-1: Context-Sensitive Grammars

These grammars have rules of the form αAβ → αγβ, with A a non-terminal and α, β and γ strings of terminals and non-terminals, where γ is non-empty. The languages described by these grammars are exactly those that can be recognised by a linear bounded automaton.

Type-2: Context-Free Grammars

A context-free grammar is a grammar in which the left-hand side of each production rule consists of only a single non-terminal symbol. The languages that can be generated by context-free grammars are called context-free languages.

For every context-free language, a machine can be built that takes a string as input and determines in O(n³) time whether the string is a member of the language, where n is the length of the string.

Type-3: Regular Grammars

In regular grammars, the left-hand side is again only a single non-terminal symbol, but now the right-hand side is also restricted. The right side may be the empty string, a single terminal symbol, or a single terminal symbol followed by a non-terminal symbol, but nothing else.
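For instance (an illustrative example of my own), the rules S → aS and S → b both satisfy this restriction, and together they form a regular grammar generating the regular language a*b.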

Chomsky Hierarchy

The Chomsky hierarchy (occasionally referred to as the Chomsky–Schützenberger hierarchy) is a containment hierarchy of classes of formal grammars.
Every regular language is context-free; every context-free language, not containing the empty string, is context-sensitive; every context-sensitive language is recursive; and every recursive language is recursively enumerable.

Formal Language

A formal language is a set of words, i.e. finite strings of letters, symbols, or tokens.

The alphabet over which a formal language is defined is the set of symbols from which its words are formed.

The elements of an alphabet are called its letters.

A word over an alphabet can be any finite sequence, or string, of letters.

The set of all words over an alphabet Σ is usually denoted by Σ* (using the Kleene
star).

For any alphabet there is only one word of length 0, the empty word, which is often
denoted by e, ε or λ.

A formal language is often defined by means of a formal grammar (also called its formation rules); words that belong to a formal language are sometimes called well-formed words.

Language-Specification Formalisms


 those strings generated by some formal grammar;
 those strings described or matched by a particular regular expression;
 those strings accepted by some automaton, such as a Turing machine or finite
state automaton;
 those strings for which some decision procedure (an algorithm that asks a
sequence of related YES/NO questions) produces the answer YES.

Typical questions asked about such formalisms include:


 What is their expressive power? (Can formalism X describe every language that
formalism Y can describe? Can it describe other languages?)
 What is their recognisability? (How difficult is it to decide whether a given word
belongs to a language described by formalism X?)
 What is their comparability? (How difficult is it to decide whether two languages,
one described in formalism X and one in formalism Y, or in X again, are actually
the same language?).
Operations on Languages
Certain operations on languages are common. These include the standard set operations, such as union, intersection, and complement. Another class of operations is the element-wise application of string operations.
a. Concatenation: L1L2 consists of all strings of the form vw where v is a string from L1 and w is a string from L2.
b. Intersection: L1 ∩ L2 consists of all strings which are contained in both languages.
c. Complement: ¬L of a language, with respect to a given alphabet, consists of all strings over the alphabet that are not in the language.
d. Kleene star: the language consisting of all words that are concatenations of 0 or more words in the original language.
e. Reversal: let e be the empty word; then eR = e, and for each non-empty word w = x1…xn over some alphabet, let wR = xn…x1; then for a formal language L, LR = {wR | w ∈ L}.
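These element-wise operations are easy to see on finite languages. Below is a minimal Python sketch (illustrative only; the function names are mine, and the Kleene star is truncated to a bounded number of repetitions, since the full star of a non-empty language is infinite):

def concat(L1, L2):
    # Concatenation L1L2: every v in L1 followed by every w in L2.
    return {v + w for v in L1 for w in L2}

def star(L, max_repeats=3):
    # Bounded approximation of the Kleene star L*.
    result, current = {""}, {""}     # zero repetitions give the empty word
    for _ in range(max_repeats):
        current = concat(current, L)
        result |= current
    return result

def reverse(L):
    # L^R = { w^R | w in L }.
    return {w[::-1] for w in L}

L1, L2 = {"a", "ab"}, {"b", ""}
print(concat(L1, L2))    # {'a', 'ab', 'abb'}
print(star({"ab"}, 2))   # {'', 'ab', 'abab'}
print(reverse(L1))       # {'a', 'ba'}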

Closure Properties

A class of languages is closed under a particular operation when the operation, applied to languages in the class, always produces a language in the same class again.

The context-free languages are known to be closed under union, concatenation, and intersection with regular languages, but not closed under intersection or complement.

Uses of Formal Languages

1. In computer science they are used, among other things, for the precise definition of data formats and the syntax of programming languages.
2. Formal languages play a crucial role in the development of compilers.
3. Formal constructs are needed for the formal specification of programme semantics.
4. Formal languages are also used in logic and in the foundations of mathematics to represent the syntax of formal theories.

The study of interpretations of formal languages is called formal semantics.

WHAT IS A COMPILER?
Translators
A translator is a programme that takes as input a programme written in one programming language (the source language) and produces as output a programme in another language (the object or target language).

Compiler
A compiler takes as input (source) a high-level language and produces as output a low-level language (assembly or machine language).

Interpreter
An interpreter executes a simplified language, called intermediate code.

Assembler
An assembler takes as source an assembly language and produces as output (target language) machine language.

Preprocessor
A preprocessor is a translator that translates a programme in one high-level language into an equivalent programme in another high-level language.

Why Do We Need Translators?

We need translators to overcome the rigour of programming in machine language, which involves communicating directly with a computer in terms of bits, registers, and primitive machine operations.

The Challenge in Compiler Development


1. Many variations of:
a) programming languages (Java, Fortran)
b) programming paradigms (OOP, functional, logic)
c) computer architectures (MIPS, SPARC, Intel)
d) operating systems (Windows, Linux)
2. Qualities of a compiler:
a) the compiler itself must be bug-free
b) it must generate correct machine code
c) the generated machine code must run fast
d) the compiler itself must run proportionally fast
e) the compiler must be portable (modular, supporting separate compilation)
f) it must print good diagnostics and error messages
g) the generated code must work well with existing debuggers
h) it must have consistent and predictable optimisation.
3. In-depth knowledge of:
a) programming languages (variables, parameters, scoping, memory allocation)
b) theory (automata, context-free languages, etc.)
c) algorithms and data structures (hash tables, graphs, dynamic programming)
d) computer architecture (assembly programming)
e) software engineering.

Compiler Architecture
The front end consists of the following phases:
 scanning: a scanner groups input characters into tokens
 parsing: a parser recognises sequences of tokens according to some grammar and generates Abstract Syntax Trees (ASTs)
 semantic analysis: performs type checking and translates ASTs into intermediate representations (IRs)
 optimisation: optimises IRs.
The back end consists of the following phases:
 instruction selection: maps IRs into assembly code
 code optimisation: optimises the assembly code using control-flow and data-flow analyses, register allocation, etc.
 code emission: generates machine code from assembly code.

The operating system provides the following utilities to execute the object file:
 linking: a linker takes several object files and libraries as input and produces one executable object file
 loading: a loader loads an executable object file into memory
 relocatable shared libraries: allow effective memory use when many different applications share the same code.

The Structure of a Compiler

We can identify four components:
i. Front-End: the front-end is responsible for the analysis of the structure and meaning of the source text. Here we have the
a) lexical analyser,
b) syntactic analyser, and
c) semantic analyser.
ii. Back-End: the back-end is responsible for generating the target language. Here we have the
a) intermediate code optimiser,
b) code generator, and
c) code optimiser.
iii. Tables of Information: this includes the symbol table and other information tables.
iv. Run-Time Library: this is used for run-time system support.

Languages for Writing Compilers

a. Machine language
b. Assembly language
c. High-level language, or a high-level language with bootstrapping facilities, for flexibility and porting

Phases of a Compiler
1. The Lexical Analyser: also referred to as the scanner. It separates characters of the source language into groups that logically belong together; these groups are called tokens.
2. The Syntax Analyser: this groups tokens together into syntactic structures. For example, the three tokens representing A+B might be grouped into a syntactic structure called an expression. Expressions might further be combined to form statements.
3. The Intermediate Code Generator: this uses the structure produced by the syntax analyser to create a stream of simple instructions.
4. Code Optimisation: this is an optional phase designed to improve the intermediate code so that the ultimate object programme runs faster and/or takes less space.
5. Code Generation: produces the object code by deciding on the memory locations for data, selecting code to access each datum, and selecting the registers in which each computation is to be done.
6. Table Management or Bookkeeping: a symbol table (data structure) keeps track of the names used by the programme and records essential information about each, such as its type (integer, real, etc.).
7. The Error Handler: this is invoked when a flaw in the source programme is detected.

Passes
Portions of one or more phases are combined into a module called a pass.

Reducing the Number of Passes

This can be done by merging phases into one, using a technique known as 'backpatching', which generates output with 'slots' that can be filled in later, after more of the input is read.
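A minimal sketch of the backpatching idea in Python (illustrative only; the instruction format and helper names are mine, not from the course text):

code = []

def emit(instr):
    # Append an instruction and return its address.
    code.append(instr)
    return len(code) - 1

def backpatch(addr, target):
    # Fill the '_' slot left in a forward jump once its target is known.
    code[addr] = code[addr].replace("_", str(target))

jump_at = emit("JUMP _")          # target not yet known: leave a slot
emit("ADD r1, r2")
emit("SUB r3, r4")
backpatch(jump_at, len(code))     # the target address is now known
emit("HALT")
print(code)                       # ['JUMP 3', 'ADD r1, r2', 'SUB r3, r4', 'HALT']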

Cross Compilation
i. Write a new back-end in C to generate code for computer B.
ii. Compile the new back-end using the existing C compiler running on computer A, generating code for computer B.
iii. Use this new compiler to generate a complete compiler for computer B.
iv. We now have a complete compiler for computer B that will run on computer B.
v. Copy this new compiler across and run it on computer B (this is cross compilation).

Operations of a Compiler

i. Front-end:
a) lexical analysis (or scanning)
b) syntax analysis
c) semantic analysis
ii. Back-end:
a) intermediate code optimisation
b) code generation
c) code optimisation
T-Diagrams
They are used to describe the nature of a compilation. A T-diagram is usually in the form of a T (left-right: source and target language; top-bottom: compiler implementation language).

Lexical Analysis
A token is the smallest unit recognisable by the compiler.

Generally, we have four classes of tokens:

1. Keywords
2. Identifiers
3. Constants
4. Delimiters

Construction of a Lexical Analyser

i. Hand implementation
a) Input Buffering: the lexical analyser scans the characters of the source programme one at a time to discover tokens.
b) Transition Diagram (TD): one way to begin the design of any programme is to describe the behaviour of the programme by a flowchart.
ii. Automatic generation of a lexical analyser (from REs)

Regular Expressions (REs)

Regular expressions are a very convenient form of representing (possibly infinite) sets of strings, called regular sets.

Basis
 ε is a regular expression that denotes { ε }.
 A single character a is a regular expression that denotes { a }.

Induction
Suppose r and s are regular expressions that denote the languages L(r) and L(s):
 (r)|(s) is a regular expression that denotes L(r) ∪ L(s)
 (r)(s) is a regular expression that denotes L(r)L(s)
 (r)* is a regular expression that denotes L(r)*
 (r) is a regular expression that denotes L(r).

Precedence
1. The Kleene star operator * has the highest precedence and is left associative.
2. Concatenation has the next highest precedence and is left associative.
3. The union operator | has the lowest precedence and is left associative.

Extensions of Regular Expressions

 Positive closure: r+ = rr*
 Zero or one instance: r? = ε | r
 Character classes:
 [abc] = a | b | c
 [0-9] = 0 | 1 | 2 | … | 9

For convenience, we can give names to REs so we can refer to them by name. For example:
 for_keyword = for
 letter = [a-zA-Z]
 digit = [0-9]
 identifier = letter (letter | digit)*
 sign = + | - | ε
 integer = sign (0 | [1-9] digit*)
 decimal = integer . digit*
 real = (integer | decimal) E sign digit+
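These named REs translate almost directly into Python's re syntax. A sketch (assumptions of mine: the optional sign is written [+-]? since Python regexes have no ε alternative, and names are composed by string interpolation):

import re

letter     = r"[a-zA-Z]"
digit      = r"[0-9]"
identifier = rf"{letter}({letter}|{digit})*"
integer    = rf"[+-]?(0|[1-9]{digit}*)"
decimal    = rf"{integer}\.{digit}*"
real       = rf"({integer}|{decimal})E[+-]?{digit}+"

for rx, s in [(identifier, "x1"), (integer, "-42"), (real, "3.14E+2")]:
    print(s, bool(re.fullmatch(rx, s)))   # all three print True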

Today regular expressions come in many different forms.

a. The earliest and simplest are the Kleene regular expressions.
b. Awk and egrep extended grep's regular expressions with union and parentheses.
c. POSIX has a standard for Unix regular expressions.
d. Perl has an amazingly rich set of regular expression operators.
e. Python uses PCRE-style regular expressions.

Lex Regular Expressions

The following symbols in Lex regular expressions have special meanings:
\ " . ^ $ [ ] * + ? { } | ( ) /
To turn off their special meaning, precede the symbol by \.

Examples of Lex regular expressions and the strings they match are:
1. "a.*b" matches the string a.*b.
2. . matches any character except a newline.
3. ^ matches the empty string at the beginning of a line.
4. $ matches the empty string at the end of a line.
5. [abc] matches an a, or a b, or a c.
6. [a-z] matches any lowercase letter between a and z.
7. [A-Za-z0-9] matches any alphanumeric character.
8. [^abc] matches any character except an a, or a b, or a c.
9. [^0-9] matches any non-numeric character.
10. a* matches a string of zero or more a's.
11. a+ matches a string of one or more a's.
12. a? matches a string of zero or one a's.
13. a{2,5} matches any string consisting of two to five a's.
14. (a) matches an a.
15. a/b matches an a when followed by a b.
16. \n matches a newline.
17. \t matches a tab.

Tokens/Patterns/Lexemes/Attributes

A token is a pair consisting of a token name and an optional attribute value, e.g. <id, ptr to symbol table>, <=>.
Some regular expressions corresponding to tokens:
 Keyword = BEGIN | END | IF | THEN | ELSE
 Identifier = letter (letter | digit)*
 Constant = digit*
 Relop = < | <= | <> | > | >=

A pattern is a description of the form that the lexemes making up a token in a source program may have, e.g. identifiers in C: [_A-Za-z][_A-Za-z0-9]*

A lexeme is a sequence of characters that matches the pattern for a token, e.g.,
 identifiers: count, x1, i, position
 keywords: if
 operators: =, ==, !=, +=

An attribute of a token is usually a pointer to the symbol table entry that gives additional information about the token, such as its type, value, line number, etc.
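A toy scanner illustrating the token/lexeme pairing (a Python sketch of my own; the token classes follow the examples above):

import re

TOKEN_SPEC = [
    ("KEYWORD", r"\b(if|then|else|begin|end)\b"),
    ("ID",      r"[_A-Za-z][_A-Za-z0-9]*"),      # the C identifier pattern above
    ("CONST",   r"[0-9]+"),
    ("RELOP",   r"<=|<>|>=|<|>"),
    ("SKIP",    r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(src):
    for m in MASTER.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())        # <token name, lexeme> pair

print(list(tokens("if count <= 100 then x1")))
# [('KEYWORD', 'if'), ('ID', 'count'), ('RELOP', '<='), ('CONST', '100'),
#  ('KEYWORD', 'then'), ('ID', 'x1')]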

Languages for Specifying Lexical Analysers

There are tools that can generate lexical analysers. An example of such a tool is LEX.

Specifying a Lexical Analyser with Lex

Lex is a special-purpose programming language for creating programmes to process streams of input characters.

A Lex programme has the following form:

 declarations
 translation rules
 auxiliary functions

The input to the LEX compiler is called LEX source, and the output of the LEX compiler is called a lexical analyser.

LEX Source
The LEX source programme consists of two parts:
a. The auxiliary definitions: statements of the form
D1 = R1
D2 = R2
…
where each Di is a distinct name and Ri is a regular expression whose symbols are chosen from the alphabet of the language.
b. The translation rules: these are of the form
P1 {A1}
P2 {A2}
…
Pm {Am}
where each Pi is a regular expression, called a pattern, over the alphabet.

Steps in Lex Implementation

1. Read the input language specification
2. Construct an NFA with epsilon-moves
3. Convert the NFA to a DFA
4. Optimise the DFA
5. Generate parsing tables and code

Finite Automata
A recogniser for a language L is a programme that takes as input a string x and answers "yes" if x is a sentence of L and "no" otherwise.

Nondeterministic Finite Automaton (NFA)

A nondeterministic finite automaton (NFA) consists of:
 a finite set of states S
 an input alphabet consisting of a finite set of symbols Σ
 a transition function δ that maps S × (Σ ∪ {ε}) to subsets of S
 an initial state s0 in S
 a subset of S called the final (or accepting) states.

An NFA accepts an input string x if there is a path in the transition graph from the
initial state to a final state that spells out x. The language defined by an NFA is the set
of strings accepted by the NFA.
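The definition above can be executed directly. A Python sketch (the dict-based transition function, the state numbering and the example automaton are my own):

EPS = ""   # representing epsilon by the empty string

def eps_closure(states, delta):
    # All states reachable from `states` using epsilon-moves only.
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), set()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure

def nfa_accepts(x, delta, start, finals):
    current = eps_closure({start}, delta)
    for a in x:
        moved = set()
        for s in current:
            moved |= delta.get((s, a), set())
        current = eps_closure(moved, delta)
    return bool(current & finals)     # accept if any final state is reached

# Hypothetical NFA for (a|b)*ab: states 0, 1, 2; final state 2.
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
print(nfa_accepts("aab", delta, 0, {2}))   # True
print(nfa_accepts("aba", delta, 0, {2}))   # False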

Deterministic Finite Automata (DFAs)

A deterministic finite automaton (DFA) is an NFA in which:
 there are no ε-moves, and
 for each state s and input symbol a there is exactly one transition out of s labelled a.

Equivalence of Regular Expressions and Finite Automata

Regular expressions and finite automata define the same class of languages, namely the regular sets. In computer science theory we showed that:
 every regular expression can be converted into an equivalent NFA using the McNaughton-Yamada-Thompson algorithm;
 every NFA can be converted into an equivalent DFA using the subset construction;
 every finite automaton can be converted into a regular expression using Kleene's algorithm.

Converting a Regular Expression to a DFA

This is accomplished in two steps: first the RE is converted into a nondeterministic finite automaton (NFA), and then the NFA is converted into a DFA (by the subset construction).
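A compact sketch of the subset construction (assuming, for brevity, an NFA without ε-moves and reusing the dict representation from the NFA sketch above; in general each subset would also be closed under ε-moves):

def subset_construction(delta, start, finals, alphabet):
    start_set = frozenset({start})
    dfa_delta, seen, work = {}, {start_set}, [start_set]
    while work:
        S = work.pop()
        for a in alphabet:
            # The DFA state reached from subset S on symbol a.
            T = frozenset(t for s in S for t in delta.get((s, a), set()))
            dfa_delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                work.append(T)
    dfa_finals = {S for S in seen if S & finals}
    return dfa_delta, start_set, dfa_finals

# The same hypothetical NFA for (a|b)*ab:
delta = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}}
dfa_delta, q0, F = subset_construction(delta, 0, {2}, "ab")
print(len({S for (S, _) in dfa_delta}))   # 3 DFA states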

The Parser

The parser takes tokens from the lexer (scanner or lexical analyser) and builds a parse tree.

Role of the Parser

 The parser reads the sequence of tokens generated by the lexical analyser.
 It verifies that the sequence of tokens obeys the correct syntactic structure of the programming language by generating a parse tree, implicitly or explicitly, for the sequence of tokens.
 It enters information about tokens into the symbol table.
 It reports syntactic errors to the user.

Context-Free Grammars (CFGs)

A CFG is sometimes called Backus-Naur Form (BNF). It consists of:
1. a finite set of terminal symbols,
2. a finite nonempty set of nonterminal symbols,
3. one distinguished nonterminal called the start symbol, and
4. a finite set of rewrite rules, called productions, each of the form A → α, where A is a nonterminal and α is a string (possibly empty) of terminals and nonterminals.

A context-free grammar is formally represented by the tuple CFG = (N, T, P, s), where:
N and T are finite sets of nonterminals and terminals, under the assumption that N and T are disjoint;
P is a finite set of productions such that each production is of the form A → α (where A is a nonterminal and α is a string of symbols);
s is the start symbol.

Notational Conventions
Terminals are usually denoted by:
 lower-case letters early in the alphabet: a, b, c;
 operator symbols: +, -, *, etc.;
 punctuation symbols: (, ), {, }, ;, etc.;
 digits: 0, 1, 2, ..., 9;
 boldface strings: if, else, etc.
Nonterminals are usually denoted by:
 upper-case letters early in the alphabet: A, B, C;
 the letter S, representing the start symbol;
 lower-case italic names: expr, stmt, etc.

Grammar symbols, that is, either terminals or nonterminals, are represented by upper-case letters late in the alphabet: X, Y, Z.

Strings of terminals only are represented by lower-case letters late in the alphabet: u, v, w, ..., z.

Productions are represented in the following way: A → α1, A → α2. Alternatives in productions are represented by A → α1 | α2, etc.

Parse Trees
A parse tree is a graphical representation of a derivation that filters out the choices regarding the order of replacement. Its purpose is to make explicit the hierarchical syntactic structure of sentences that is implied by the grammar.

Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be ambiguous, i.e. an ambiguous grammar is one that produces more than one leftmost or more than one rightmost derivation for some sentence.
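A standard illustration (not necessarily the course's example): with the grammar E → E + E | E * E | id, the sentence id + id * id has two distinct parse trees, one grouping it as id + (id * id) and the other as (id + id) * id, so the grammar is ambiguous.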

Left Recursion


"A grammar is left-recursive if we can find some non-terminal A which will
eventually derive a sentential form with itself as the left-symbol."
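For example, the immediately left-recursive rules A → Aα | β can be rewritten without left recursion as A → βA′ and A′ → αA′ | ε; concretely, E → E + T | T becomes E → T E′ with E′ → + T E′ | ε.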

Parsing Techniques

There are two types of parsing:
a. Top-down Parsing (start from the start symbol and derive the string): a top-down parser builds a parse tree by starting at the root and working down towards the leaves, e.g. recursive-descent parsers and predictive parsers.
b. Bottom-up Parsing (start from the string and reduce to the start symbol): a bottom-up parser builds a parse tree by starting at the leaves and working up towards the root. Characteristics:
a) it is not easy to construct by hand, and is usually generated by parser-generating software;
b) it handles a larger class of grammars.
Bottom-Up Parsing
In programming language compilers, bottom-up parsing is a parsing method that works by identifying terminal symbols first and combining them successively to produce nonterminals. The productions of the grammar are used to build a parse tree of a programme written in human-readable source code that can be compiled to assembly language or pseudo-code.

Types of Bottom-up Parsers

The common classes of bottom-up parsers are:
 LR parsers, e.g.
 LR(0): no lookahead symbol
 SLR(1): simple, with one lookahead symbol
 LALR(1): lookahead bottom-up, e.g. YACC
 LR(k)
 Precedence parsers
 Simple precedence parser
 Operator-precedence parser
 Extended precedence parser.

Shift-Reduce Parsers

These bottom-up parsers examine the input tokens and either shift (push) them onto a stack or reduce elements at the top of the stack, replacing a right-hand side by a left-hand side.

Action Table
The following is a description of what can be held in an action table.
Actions
a. Shift: push the token onto the stack.
b. Reduce: remove the handle from the stack and push the corresponding nonterminal.
c. Accept: recognise the sentence when the stack contains only the distinguished symbol and the input is empty.
d. Error: happens when none of the above is possible; means the original input was not a sentence.

Handles
A handle of a string is a substring that matches the right side of a production whose
reduction to the nonterminal on the left represents one step along the reverse of a
rightmost derivation.
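As an illustrative trace (using a toy grammar of my choosing, E → E + id | id, on the input id + id):

Stack      Input       Action
$          id + id $   shift
$ id       + id $      reduce by E → id (handle: id)
$ E        + id $      shift
$ E +      id $        shift
$ E + id   $           reduce by E → E + id (handle: E + id)
$ E        $           accept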

Operator Precedence Parser

An operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar.

Operator Grammars

Operator grammars have the property that no production right side is empty or has two or more adjacent non-terminals.
Operator Precedence Grammar
An operator precedence grammar is an ε-free operator grammar in which the precedence relations are disjoint, i.e. for any pair a, b, never more than one of the relations a < b, a = b, a > b is true.

Methods of Generating Relationships between Operators

a. Intuition Method: we can use the associativity and precedence of the operators to assign the relations intuitively, as follows:
i. If operator θ1 has higher precedence than θ2, then we make θ1 > θ2 and θ2 < θ1.
ii. If θ1 and θ2 are operators of equal precedence, then either we make θ1 > θ2 and θ2 > θ1 if the operators are left associative, or we make θ1 < θ2 and θ2 < θ1 if they are right associative.

Simple Precedence Parser

A simple precedence parser is like the operator precedence parser, but can be used only with simple precedence grammars.

Top-Down Parsing Techniques


Top-down parsing is a way of finding a leftmost derivation for an input string or
constructing a parse tree for the input string starting from the root and creating the
nodes of the parse tree in pre-order.

Difficulties with Top-Down Parsing

i. Which production do you apply? If the right production is not applied, you will not get the input string.
ii. A left-recursive production can lead to continual cycling without ever reaching the answer, i.e. the input string.
iii. The sequence in which productions are applied determines whether the input string will be found.
iv. If a wrong production is applied, parsing has to be restarted.

Minimising the Difficulties

i. Left recursion can be removed from the grammar.
ii. A method called factoring, i.e. left factoring, is used to ensure that the right sequence/order is followed.

LL(k) GRAMMARS
LL(k) grammars are those for which the left parser can be made to work deterministically, if it is allowed to look at k input symbols to the right of its current input position.
Recursive Descent Parsing
Recursive descent is a strategy for top-down parsing. Recursive-descent parsers may operate with backtracking, i.e. they make repeated scans of the input in order to decide which production to consider next for tree expansion.
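A minimal recursive-descent sketch in Python (a predictive variant that needs no backtracking; the toy grammar E → T { + T }, T → id | ( E ) is my own):

def parse(toks):
    pos = 0
    def peek():
        return toks[pos] if pos < len(toks) else None
    def eat(t):
        nonlocal pos
        if peek() != t:
            raise SyntaxError(f"expected {t!r}, got {peek()!r}")
        pos += 1
    def E():                      # E -> T { '+' T }
        T()
        while peek() == "+":
            eat("+")
            T()
    def T():                      # T -> 'id' | '(' E ')'
        if peek() == "id":
            eat("id")
        elif peek() == "(":
            eat("(")
            E()
            eat(")")
        else:
            raise SyntaxError(f"unexpected {peek()!r}")
    E()
    if pos != len(toks):
        raise SyntaxError("trailing input")
    return True

print(parse(["id", "+", "(", "id", "+", "id", ")"]))   # True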

LR(k) GRAMMARS
These are grammars for which the right parser can be made to work deterministically if it is allowed to look at k input symbols to the right of its current input position.

Why Study LR Grammars?

 LR(1) grammars are often used to construct parsers.
 Virtually all context-free programming language constructs can be expressed in an LR(1) form.
 LR grammars are the most general grammars parsable by a deterministic, bottom-up parser.
 Efficient parsers can be implemented for LR(1) grammars.
 LR parsers detect an error as soon as possible in a left-to-right scan of the input.
 LR grammars describe a proper superset of the languages recognised by predictive (i.e. LL) parsers.

Benefits of LR Parsing
a. LR parsing can handle a larger range of languages than LL parsing.
b. LR parsers can be constructed to recognise virtually all programming language constructs for which context-free grammars can be written.
c. It is more general than operator precedence or any other common shift-reduce technique.
d. LR parsing also dominates the common forms of top-down parsing without backtracking.

Drawback of LR Parsing
It is too much work to implement an LR parser by hand; therefore a specialised tool called an LR parser generator is used.

Techniques for Producing LR Parsing Tables

1. Simple LR (SLR(1) for short): the easiest to implement, but it may fail to produce tables for certain grammars on which the other methods succeed. Characteristics:
a) smallest class of grammars
b) smallest tables (number of states)
c) simple, fast construction
2. Canonical LR (or LR(1)): the most powerful; works with a large class of grammars but is expensive to implement. Characteristics:
a) full set of LR(1) grammars
b) largest tables (number of states)
c) slow, expensive construction with large tables
3. Lookahead LR (LALR(1) for short): intermediate in power; works for most programming languages and can be implemented efficiently. Characteristics:
a) intermediate-sized set of grammars
b) same number of states as SLR(1)
c) canonical construction is slow and large
d) better construction techniques exist

Errors during Lexical Analysis

i. Strange characters: some programming languages do not use all possible characters.
ii. Long quoted strings: many programming languages do not allow quoted strings to extend over more than one line.
iii. Invalid numbers: a number such as 123.45.67 could be detected.

Errors during Syntax Analysis


During syntax analysis, the compiler is usually trying to decide what to do next on the
basis of expecting one of a small number of tokens.

Errors during Semantic Analysis

Among the most common errors reported during semantic analysis are:
1. "identifier not declared": either an omitted declaration or a misspelt identifier.
2. Incompatible use of types, e.g. assigning a boolean to a string.
3. Parameter or subscript miscount.

Run-Time Speed versus Safety

 attempt to divide by 0
 overflow (and possibly underflow) during arithmetic operations
 attempt to use a variable before it has been set to some sensible value (undefined variable)
 attempt to refer to a non-existent array element (invalid subscript)
 attempt to set a variable (defined as having a limited range) to some value outside this range
 various errors related to pointers:
 attempt to use a pointer before it has been set to point to somewhere useful
 attempt to use a nil pointer, which explicitly doesn't point anywhere useful
 attempt to use a pointer which points outside the array it should point to
 attempt to use a pointer after the memory it points to has been released.
Cheap Detection of 'Undefined'
Murray Langton has had some success in checking for 'undefined' in a 140,000 line
safety-critical legacy FORTRAN programme. The fundamental idea is to set all
global variables to recognisably strange values which are highly likely to produce
visibly strange results if used.
For an IBM mainframe, the strange values were:
 REAL set to -9.87654E70
 INTEGER set to -123456789
 CHAR set to '?'

How to Check for 'Undefined'

The basic idea is to ensure that all variables are flagged as 'undefined' when declared. Some languages allow simultaneous declaration and initialisation, in which case a variable is flagged as 'defined'. Whenever a value is assigned to a variable the flag is changed to 'defined'. Whenever a variable is used the flag is checked and an error is reported if it is 'undefined'.

The simplest way of providing a flag is to use some specific value which is (hopefully) unlikely to appear in practice. Particular values depend on the type of the variable involved:
1. Boolean: a value such as 255 or 128 is a suitable flag.
2. Character: 127 or 128 or 255 may be suitable choices.
3. Integer: the largest negative number, e.g. -32768 (for 16 bits, where the range is -32768 to +32767).
4. Real: NaN (not a number), on most IEEE-standard hardware.
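The flagging scheme can be mimicked in a few lines of Python (an illustrative sketch of my own; a real compiler would plant such checks in the generated code):

UNDEF = object()          # unique sentinel standing for 'undefined'

class Var:
    def __init__(self):
        self.value = UNDEF            # flagged 'undefined' at declaration
    def set(self, v):
        self.value = v                # assignment flips the flag to 'defined'
    def get(self):
        if self.value is UNDEF:
            raise RuntimeError("use of undefined variable")
        return self.value

x = Var()
x.set(42)
print(x.get())    # 42
y = Var()
y.get()           # raises RuntimeError: use of undefined variable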

Semantic Analysis
Semantic analysis is roughly the equivalent of checking that some ordinary text written in a natural language (e.g. English) is meaningful, regardless of whether it is grammatically correct.

Symbol Tables
A compiler uses a symbol table, i.e. a table with two fields, a name field and an information field, to keep track of scope and binding information about names. The two common symbol table mechanisms are linear lists and hash tables.

The items that are usually entered into a symbol table are:
 variable names
 defined constants
 procedure and function names
 literal constants and strings
 source text labels
 compiler-generated temporaries

Symbol Table Organisation

A linear list is slow to access but simple to implement. A hash table is fast but more complex. Tree structures give intermediate performance.
Characteristics of symbol table structures:
1. Linear List
a) O(n) probes per lookup
b) easy to expand: no fixed size
c) one allocation per insertion
2. Ordered Linear List
a) O(log₂ n) probes per lookup using binary search
b) insertion is expensive (to reorganise the list)
3. Binary Tree
a) O(n) probes per lookup if the tree is unbalanced
b) O(log₂ n) probes per lookup if the tree is balanced
c) easy to expand with no fixed size
d) one allocation per insertion
4. Hash Table
a) O(1) probes per lookup on average
b) expansion costs vary with the specific scheme

Attribute Information
Attributes are the internal representation of declarations. The symbol table associates names with attributes. Names may have different attributes, such as those below, depending on their meaning:
a. Variables: type, procedure level, frame offset
b. Types: type descriptor, data size/alignment
c. Constants: type, value
d. Procedures: formals (names/types), result type, block information (local declarations), frame size.
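A hash-based symbol table with per-name attribute records might look like the following Python sketch (the field names are illustrative, not from the course text):

class SymbolTable:
    def __init__(self):
        self.table = {}                     # name -> attribute record

    def insert(self, name, **attrs):
        if name in self.table:
            raise KeyError(f"{name} already declared")
        self.table[name] = attrs            # O(1) probes on average

    def lookup(self, name):
        return self.table.get(name)         # None if the name is not declared

st = SymbolTable()
st.insert("count", kind="variable", type="integer", offset=8)
st.insert("pi", kind="constant", type="real", value=3.14159)
print(st.lookup("count"))   # {'kind': 'variable', 'type': 'integer', 'offset': 8}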

Intermediate Code Generator

The data structure passed between the analysis and synthesis phases is called the intermediate representation (IR) of the programme.
Intermediate representations may be:
 assembly-language-like, or
 an abstract syntax tree.

Intermediate Languages/Representations


There are three types of intermediate representation:
a. Syntax Trees
b. Postfix notation
c. Three Address Code

Graphical Representations
A syntax tree depicts the natural hierarchical structure of a source programme.
Three-Address Code
Three-address code is a sequence of statements of the general form:
x := y op z

Types of Three-Address Statements

1) Assignment statements of the form x := y op z (where op is a binary arithmetic or logical operation).
2) Assignment statements of the form x := op y (where op is a unary operation, e.g. unary minus, logical negation, or a shift operator).
3) Copy statements of the form x := y, where the value of y is assigned to x.
4) The unconditional jump goto L.
5) Conditional jumps such as if x relop y goto L.
6) Procedure calls, call p,n, and returns of values from functions, return y.
7) Indexed assignments of the form x := y[i] and x[i] := y.
8) Address and pointer assignments: x := &y and x := *y.
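For example (an illustrative translation of my own), the assignment a := b * c + b * c might be rendered as:

t1 := b * c
t2 := b * c
t3 := t1 + t2
a := t3

where t1, t2 and t3 are compiler-generated temporaries (common sub-expression elimination, discussed below, would later collapse t1 and t2).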

Code Generation
The primary objective of the code generator is to convert atoms or syntax trees to
instructions.

Target Programmes
The output of the code generator is the target programme. It may take on a variety of
forms: absolute machine language, relocatable machine language, or assembly
language.

Code Optimisation
Optimisation within a compiler is concerned with improving in some way the
generated object code while ensuring the result is identical.

Optimisations fall into three categories:


a. Speed: improving the runtime performance of the generated object code. This is the
most common optimisation
b. Space: reducing the size of the generated object code
c. Safety: reducing the possibility of data structures becoming corrupted (for example,
ensuring that an illegal array element is not written to).

Many "speed" optimisations make the code larger, and many "space" optimisations
make the code slower. This is known as the space-time trade-off.

Criteria for Code-Improving Transformations

Simply stated, the best programme transformations are those that yield the most benefit for the least effort.
An optimising compiler should have several properties:
i. a transformation must preserve the meaning of programmes;
ii. a transformation must, on average, speed up programmes by a measurable amount;
iii. a transformation must be worth the effort.

Improving Transformations
The code produced by straightforward compiling algorithms can be made to run faster using code-improving transformations. Compilers using such transformations are called optimising compilers.

Common Optimisation Algorithms


 Function-preserving transformations
 common sub-expressions identification/elimination
 copy propagation
 dead-code elimination
 Loop optimisations
 induction variables and reduction in strength
 code motion
 Function Chunking.

A code-improving transformation is called local if it is performed by looking at statements within one basic block, and global if it is performed by looking at statements not only in one basic block but also in other blocks of the programme.

Function-Preserving Transformations
i. Common sub-expression elimination
ii. Copy propagation
iii. Dead-code elimination
iv. Constant folding

Loop Optimisations
Techniques for loop optimisation include:
 Strength Reduction, which replaces an expensive (time-consuming) operator by a faster one;
 Induction Variable Elimination, which eliminates variables from inner loops;
 Code Motion, which moves pieces of code outside loops.
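Illustrative before/after sketches in Python-style pseudocode (examples of my own choosing):

Code motion (the loop-invariant expression y * 4 is hoisted out of the loop):
  before:  for i in range(n): a[i] = y * 4 + i
  after:   t = y * 4
           for i in range(n): a[i] = t + i

Strength reduction (the multiplication i * 4 is replaced by a running addition):
  before:  for i in range(n): b[i] = i * 4
  after:   t = 0
           for i in range(n): b[i] = t; t += 4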

Function Chunking
Function chunking is a compiler optimisation for improving code locality. Profiling
information is used to move rarely executed code outside of the main function body.
This allows for memory pages with rarely executed code to be swapped out.
