CIT316 Summary
An ambiguous grammar is a grammar with multiple ways of generating the same single
string.
For every context-free language, a machine can be built that takes a string as input
and determines in O(n³) time whether the string is a member of the language, where
n is the length of the string.
The alphabet over which a formal language is defined is a set of letters (symbols)
from which the words of the language are formed.
A word over an alphabet can be any finite sequence, or string, of letters.
The set of all words over an alphabet Σ is usually denoted by Σ* (using the Kleene
star).
For any alphabet there is only one word of length 0, the empty word, which is often
denoted by e, ε or λ.
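For example, with the alphabet Σ = {a, b}, the words over Σ include a, ab, bba and the
empty word ε, and Σ* = {ε, a, b, aa, ab, ba, bb, aaa, ...} is the set of all such words.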
A formal language is often defined by means of a formal grammar (also called its
formation rules);
words that belong to a formal language are sometimes called well-formed words.
The context-free languages are known to be closed under union, concatenation, and
intersection with regular languages, but not closed under intersection or
complement.
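For example, L1 = { a^n b^n c^m : n, m >= 0 } and L2 = { a^m b^n c^n : n, m >= 0 } are both
context-free, but their intersection L1 ∩ L2 = { a^n b^n c^n : n >= 0 } is not context-free,
which illustrates non-closure under intersection (and, together with closure under union,
implies non-closure under complement).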
WHAT IS A COMPILER?
Translators
A translator is a programme that takes as input a programme written in one
programming language (the source language) and produces as output a programme in
another language (the object or target language).
Compiler
A compiler takes an input (source) a high level language, and produces as output, a
low level language (assembly or machine language).
Interpreter
An interpreter executes a simplified language, called intermediate code.
Assembler
An assembler takes as source an assembly language and produces as output (target
language) machine language.
Preprocessor
A preprocessor is a translator that converts a programme in one high-level language
into an equivalent programme in another high-level language.
Compiler Architecture
The front end consists of the following phases:
scanning: a scanner groups input characters into tokens
parsing: a parser recognises sequences of tokens according to some grammar and
generates Abstract Syntax Trees (ASTs)
semantic analysis: performs type checking and translates ASTs into
intermediate representations (IRs)
optimisation: optimises IRs.
The back end consists of the following phases:
instruction selection: maps IRs into assembly code
code optimisation: optimises the assembly code using control flow and data-flow
analyses, register allocation, etc
code emission: generates machine code from assembly code.
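To make the phases concrete, here is a hypothetical sketch (the token names, IR form and
target instructions are illustrative assumptions, not the course's notation) of how the
assignment x := a + b * 2 might flow through the pipeline:
scanning:            id(x)  :=  id(a)  +  id(b)  *  num(2)
parsing:             an AST with := at the root, x as its left child, and the subtree
                     for a + (b * 2) as its right child
semantic analysis:   types checked; the AST translated into the IR
                         t1 := b * 2
                         t2 := a + t1
                         x  := t2
instruction selection (hypothetical target):
                         LOAD  R1, b
                         MUL   R1, 2
                         ADD   R1, a
                         STORE R1, x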
The operating system provides the following utilities to execute the object file:
linking: A linker takes several object files and libraries as input and produces one
executable object file.
loading: A loader loads an executable object file into memory
Relocatable shared libraries: allow effective memory use when many different
applications share the same code.
Phases of a Compiler
1. The Lexical Analyser: Also referred to as the Scanner. It separates characters of
the source language into groups that logically belong together; these groups are called
tokens.
2. The Syntax Analyser: This groups tokens together into syntactic structures. For
example, the three tokens representing A+B might be grouped into a syntactic
structure called an expression. Expressions might further be combined to form
statements.
3. The Intermediate Code Generator: This uses the structure produced by the
syntax analyser to create a stream of simple instructions.
4. Code Optimisation: This is an optional phase designed to improve the
intermediate code so that the ultimate object programme runs faster and/or takes less
space.
5. Code Generation: Produces the object code by deciding on the memory locations
for data, selecting code to access each datum, and selecting the registers in which each
computation is to be done.
6. The Table Management or Bookkeeping: A symbol table (data structure) keeps
track of the names used by the programme and records essential information about
each, such as its type (integer, real, etc.)
7. The Error Handler: This is invoked when a flaw in the source programme is
detected.
Passes
Portions of one or more phases are combined into a module called a pass.
Cross Compilation
i. Write a new back-end in C to generate code for computer B.
ii. Compile the new back-end using the existing C compiler running on computer A,
generating code for computer B.
iii. Use this new compiler to generate a complete compiler for computer B.
iv. We now have a complete compiler for computer B that will run on computer B.
v. Copy this new compiler across and run it on computer B (this is cross
compilation). A sketch of these steps is given below.
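One way to picture the bootstrapping (an informal sketch, not part of the original notes;
"C -> B" means a compiler translating C into code for computer B):
steps i-ii:   the new compiler source (C -> B), compiled by the existing compiler on
              computer A, gives a C -> B compiler that runs on computer A (a cross-compiler)
steps iii-iv: compiling the same compiler source with this cross-compiler gives a
              C -> B compiler that runs on computer B
step v:       the result is copied to computer B and run there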
Lexical Analysis
A token is the smallest unit recognisable by the compiler.
Basis
ε is a regular expression that denotes { ε }.
A single character a is a regular expression that denotes { a }.
Induction
Suppose r and s are regular expressions that denote the languages L(r) and L(s):
(r)|(s) is a regular expression that denotes L(r) ∪ L(s)
(r)(s) is a regular expression that denotes L(r)L(s)
(r)* is a regular expression that denotes L(r)*
(r) is a regular expression that denotes L(r).
Precedence
1. the Kleene star operator * has the highest precedence and is left associative
2. concatenation has the next highest precedence and is left associative
3. the union operator | has the lowest precedence and is left associative
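Under these rules, for example, the regular expression a|bc* is interpreted as a|(b(c*)):
the star applies only to c, concatenation then joins b with c*, and the union is applied
last.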
For convenience, we can give names to REs so we can refer to them by their name.
For example:
for-keyword = for
letter = [a-zA-Z]
digit = [0-9]
identifier = letter (letter | digit)*
sign = + | - | ε
integer = sign (0 | [1-9] digit*)
decimal = integer . digit*
real = (integer | decimal) E sign digit+
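With the definitions above, for example: identifier matches count, x1 and i; integer
matches 0, 42 and -7; decimal matches 3.14; and real matches 3.14E-2 or 12E+3.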
Examples of Lex regular expressions and the strings they match are:
1. "a.*b" matches the string a.*b.
2. . matches any character except a newline.
3. ^ matches the empty string at the beginning of a line.
4. $ matches the empty string at the end of a line.
5. [abc] matches an a, or a b, or a c.
6. [a-z] matches any lowercase letter between a and z.
7. [A-Za-z0-9] matches any alphanumeric character.
8. [^abc] matches any character except an a, or a b, or a c.
9. [^0-9] matches any nonnumeric character.
10. a* matches a string of zero or more a's.
11. a+ matches a string of one or more a's.
12. a? matches a string of zero or one a's.
13. a{2,5} matches any string consisting of two to five a's.
14. (a) matches an a.
15. a/b matches an a, but only when followed by a b.
16. \n matches a newline.
17. \t matches a tab.
A pattern is a description of the form that the lexemes making up a token in a source
program may have, e.g., identifiers in C: [_A-Za-z][_A-Za-z0-9]*
A lexeme is a sequence of characters that matches the pattern for a token, e.g.,
identifiers: count, x1, i, position
keywords: if
operators: =, ==, !=, +=
An attribute of a token is usually a pointer to the symbol table entry that gives
additional information about the token, such as its type, value, line number, etc.
The input to the LEX compiler is called LEX source and the output of the LEX compiler
is called a lexical analyser.
LEX Source
The LEX source programme consists of two parts:
a. The auxiliary definitions: these are statements of the form
D1 = R1
D2 = R2
where each Di is a distinct name and Ri is a regular expression whose symbols are
chosen from the alphabet of the language.
b. The translation rules: these are of the form
P1 {A1}
P2 {A2}
...
Pm {Am}
where each Pi is a regular expression called a pattern over the alphabet, and each Ai
is the action to be performed when Pi matches.
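As a minimal sketch, a LEX source programme combining the two parts might look as follows
(flex-style syntax; the patterns and printed token names are illustrative assumptions):
%{
#include <stdio.h>
%}
letter    [A-Za-z]
digit     [0-9]
%%
{letter}({letter}|{digit})*    { printf("IDENTIFIER: %s\n", yytext); }
{digit}+                       { printf("INTEGER: %s\n", yytext); }
[ \t\n]+                       { /* skip whitespace */ }
.                              { printf("UNKNOWN: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }
The two named definitions (letter, digit) play the role of the auxiliary definitions
Di = Ri, and each pattern { action } pair is a translation rule Pi {Ai}.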
Finite Automata
A recogniser for a language L is a programme that takes as input a string x and
answers “yes” if x is a sentence of L and “no” otherwise.
An NFA accepts an input string x if there is a path in the transition graph from the
initial state to a final state that spells out x. The language defined by an NFA is the set
of strings accepted by the NFA.
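For example, an NFA for the regular expression (a|b)*abb can use states 0, 1, 2 and 3
(0 initial, 3 final) with transitions 0 -a-> 0, 0 -b-> 0, 0 -a-> 1, 1 -b-> 2 and
2 -b-> 3. The string aabb is accepted because the path 0 -a-> 0 -a-> 1 -b-> 2 -b-> 3
spells out aabb and ends in the final state.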
Notational Conventions
Terminals are usually denoted by:
lower-case letters early in the alphabet: a, b, c;
operator symbols: +, -, *, etc.;
punctuation symbols: (, ), {, }, ; etc.;
digits: 0, 1, 2, ..., 9;
boldface strings: if, else, etc
Nonterminals are usually denoted by:
upper-case letters early in the alphabet: A, B, C;
the letter S representing the start symbol;
lower-case italic names: expr, stmt, etc.;
Strings of terminals only are represented by lower-case letters late in the alphabet: u,
v, w, ..., z.
Parse Trees
A parse tree is a graphical representation of a derivation that filters out the order in
which replacements are chosen. Its purpose is to make explicit the hierarchical syntactic
structure of sentences that is implied by the grammar.
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be
ambiguous i.e. An ambiguous grammar is one that produces more than one leftmost
or more than one rightmost derivation for some sentence.
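A standard example is the grammar E -> E + E | E * E | id, which is ambiguous because the
sentence id + id * id has two different leftmost derivations (and hence two parse trees):
E => E + E => id + E => id + E * E => id + id * E => id + id * id
E => E * E => E + E * E => id + E * E => id + id * E => id + id * id
The first groups the sentence as id + (id * id), the second as (id + id) * id.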
Bottom-Up Parsing
In programming language compilers, bottom-up parsing is a parsing method that
works by identifying terminal symbols first, and combines them successively to
produce non-terminals. The productions of the parser can be used to build a parse tree
of a programme written in human-readable source code that can be compiled to
assembly language or pseudo code.
Action Table
The following is a description of what can be held in an action table.
Actions
a. Shift - push token onto stack
b. Reduce - remove handle from stack and push on corresponding nonterminal
c. Accept - recognise sentence when stack contains only the distinguished symbol and
input is empty
d. Error - happens when none of the above is possible; means the original input was not
a sentence.
Handles
A handle of a string is a substring that matches the right side of a production whose
reduction to the nonterminal on the left represents one step along the reverse of a
rightmost derivation.
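For example, with the grammar S -> aABe, A -> Abc | b, B -> d, the sentence abbcde is
reduced to S by the steps below; the substring replaced at each step is the handle of that
string:
abbcde  =>  aAbcde   (handle b, reduced to A)
aAbcde  =>  aAde     (handle Abc, reduced to A)
aAde    =>  aABe     (handle d, reduced to B)
aABe    =>  S        (handle aABe, reduced to S)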
LL(k) GRAMMARS
LL(k) grammars are those for which the left parser can be made to work
deterministically if it is allowed to look at k input symbols to the right of its current
input position.
Recursive Descent Parsing
Recursive descent is a strategy for top-down parsing. Recursive descent parsers may
operate with backtracking, i.e. they make repeated scans of the input in order to decide
which production to consider next for tree expansion.
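As a minimal sketch (an assumed toy grammar, not the course's), a predictive recursive
descent parser in C for expr -> term { + term }, term -> factor { * factor },
factor -> digit | ( expr ) could look like this; being predictive, it needs no backtracking:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *p;            /* current position in the input string */

static void expr(void);          /* forward declaration: expr and factor are mutually recursive */

static void error(const char *msg) {
    fprintf(stderr, "parse error: %s at '%c'\n", msg, *p ? *p : '$');
    exit(1);
}

/* factor -> digit | ( expr ) */
static void factor(void) {
    if (isdigit((unsigned char)*p)) {
        p++;                                   /* consume one digit */
    } else if (*p == '(') {
        p++;                                   /* consume '(' */
        expr();
        if (*p == ')') p++; else error("expected ')'");
    } else {
        error("expected a digit or '('");
    }
}

/* term -> factor { * factor } */
static void term(void) {
    factor();
    while (*p == '*') { p++; factor(); }
}

/* expr -> term { + term } */
static void expr(void) {
    term();
    while (*p == '+') { p++; term(); }
}

int main(void) {
    p = "1+2*(3+4)";
    expr();
    if (*p == '\0') printf("accepted\n");
    else error("unexpected trailing input");
    return 0;
}
Each nonterminal of the grammar becomes one C function, and the lookahead character *p
decides which production to expand.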
LR(k) GRAMMARS
These are grammars for which the right parser can be made to work deterministically
if it is allowed to look at k input symbols to the right of its current input position.
Benefits of LR Parsing
a. LR parsing can handle a larger range of languages than LL parsing
b. LR parsers can be constructed to recognise virtually all programming language
constructs for which context-free grammars can be written
c. It is more general than operator precedence or any other common shift-reduce
techniques
d. LR parsing also dominates the common forms of top-down parsing without
backtrack.
Drawback of LR Parsing
It is too much work to implement an LR parser by hand; therefore a specialised tool
called an LR parser generator is used.
The simplest way of providing a flag is to use some specific value which is (hopefully)
unlikely to appear in practice. Particular values depend on the type of the variable
involved:
1. Boolean: A value such as 255 or 128 is a suitable flag.
2. Character: 127 or 128 or 255 may be suitable choices.
3. Integer: the largest negative number, e.g. -32768 (for 16 bits, the range is -32768
to +32767).
4. Real: NaN (not a number), on most IEEE standard hardware; a small sketch follows.
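A minimal C sketch of the Real case, using IEEE NaN as the flag (assuming a C99 compiler
where <math.h> provides NAN and isnan; the variable name is hypothetical):
#include <math.h>
#include <stdio.h>

int main(void) {
    double rate = NAN;                 /* flag value: rate has not been given a real value yet */
    if (isnan(rate))
        printf("rate is still undefined\n");
    rate = 3.5;                        /* once assigned, the flag test no longer fires */
    if (!isnan(rate))
        printf("rate = %.2f\n", rate);
    return 0;
}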
Semantic Analysis
Semantic analysis is roughly the equivalent of checking that some ordinary text written
in a natural language (e.g. English) is meaningful, regardless of whether it is
grammatically correct.
Symbol Tables
A compiler uses a symbol table, i.e. a table with two fields, a name field and an
information field, to keep track of scope and binding information about names. The two
common symbol table mechanisms are linear lists and hash tables.
The items that are usually entered into a symbol table are:
variable names
defined constants
procedure and function names
literal constants and strings
source text labels
compiler-generated temporaries
Attribute information
Attributes are the internal representation of declarations. The symbol table associates
names with attributes. Names may have different attributes, such as those below,
depending on their meaning (a sketch of one such entry follows the list):
a. Variables: type, procedure level, frame offset
b. Types: type descriptor, data size/alignment
c. Constants: type, value
d. Procedures: formals (names/types), result type, block information (local
declarations), frame size.
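A minimal C sketch of a single symbol-table entry combining the name field with a few of
the attribute fields above (the field names are illustrative assumptions):
/* One entry in a linear-list symbol table: a name field plus an information (attribute) field. */
struct symbol {
    char *name;             /* the identifier's lexeme                */
    int   type;             /* e.g. integer, real, boolean, ...       */
    int   scope_level;      /* procedure nesting level                */
    int   frame_offset;     /* offset of a variable within its frame  */
    struct symbol *next;    /* link to the next entry in the list     */
};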
Graphical Representations
A syntax tree depicts the natural hierarchical structure of a source programme.
Three-Address Code
Three-address code is a sequence of statements of the general form:
x := y op z
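For example, the source statement x := (a + b) * c could be translated into:
t1 := a + b
t2 := t1 * c
x := t2
where t1 and t2 are compiler-generated temporaries.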
Code Generation
The primary objective of the code generator is to convert atoms or syntax trees to
instructions.
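As a sketch, on a hypothetical register machine (not any particular real instruction set),
the three-address statement t2 := t1 * c might be converted into:
LOAD  R1, t1
MUL   R1, c
STORE R1, t2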
Target Programmes
The output of the code generator is the target programme. It may take on a variety of
forms: absolute machine language, relocatable machine language, or assembly
language.
Code Optimisation
Optimisation within a compiler is concerned with improving the generated object code in
some way while ensuring that the results the programme computes remain identical.
Many "speed" optimisations make the code larger, and many "space" optimisations
make the code slower. This is known as the space-time trade-off.
Improving Transformations
The code produced by straightforward compiling algorithms can be made to run faster
using code-improving transformations. Compilers using such transformations are
called optimising compilers.
Function-Preserving Transformations
i. Common sub-expression elimination
ii. Copy propagation
iii. Dead-code elimination
iv. Constant folding (see the examples after this list)
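Small examples of each, written as three-address code or source fragments:
Common sub-expression elimination: t1 := b + c followed by t2 := b + c becomes
t1 := b + c followed by t2 := t1.
Copy propagation: after the copy x := y, a later use z := x + 1 can become z := y + 1.
Dead-code elimination: an assignment x := y is removed if x is never used afterwards.
Constant folding: t1 := 4 * 2 is evaluated at compile time to t1 := 8.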
Loop Optimisations
Techniques for loop optimisation (with small sketches following the list) include:
Strength Reduction which replaces an expensive (time consuming) operator by
a faster one;
Induction Variable Elimination which eliminates variables from inner loops;
Code Motion which moves pieces of code outside loops.
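Small sketches of these techniques:
Strength reduction: a computation such as t := 4 * i inside a loop can be replaced by
updating t with the cheaper t := t + 4 on each iteration.
Induction variable elimination: when i and t := 4 * i change in lock-step, the loop can
often be rewritten to use only t, eliminating i.
Code motion: the loop-invariant expression limit - 2 in while (i <= limit - 2) is
computed once before the loop, t := limit - 2, and the test becomes while (i <= t).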
Function Chunking
Function chunking is a compiler optimisation for improving code locality. Profiling
information is used to move rarely executed code outside of the main function body.
This allows for memory pages with rarely executed code to be swapped out.