
ZAMFARA STATE UNIVERSITY, TALATA MAFARA

DEPARTMENT OF COMPUTER SCIENCE

PROGRAMME: B.Sc. Computer Science


CSC 409: Compiler Construction II (2 units) (L30:P 0)
Grammars and languages
Computers are a balanced mix of software and hardware. Hardware is merely a
physical device, and its functions are controlled by compatible software.
Hardware understands instructions in the form of electronic charges, which are
the counterpart of binary language in software programming. Binary language has
only two symbols, 0 and 1. To instruct the hardware, codes must be written in
binary format, that is, as a series of 1s and 0s. Writing such codes directly would
be a difficult and cumbersome task for programmers, which is why we have compilers
to translate human-readable programs into machine code.
Language Processing System
We have learnt that any computer system is made of hardware and software. The
hardware understands a language, which humans cannot understand. So we write
programs in high-level language, which is easier for us to understand and remember.
These programs are then fed into a series of tools and OS components to get the desired
code that can be used by the machine. This is known as the Language Processing System.
The high-level language is converted into binary language in various phases.
A compiler is a program that converts high-level language to assembly language.
Similarly, an assembler is a program that converts the assembly language to machine-
level language.
Let us first understand how a program written in C is executed on a host
machine:
 The user writes a program in C (a high-level language).
 The C compiler compiles the program and translates it into an assembly
program (a low-level language).
 An assembler then translates the assembly program into machine code (an object file).
 A linker links all the parts of the program together for execution
(executable machine code).
 A loader loads the executable into memory, and the program is executed.
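As a concrete illustration, these stages can be invoked one at a time with the
GNU toolchain (assuming a Unix-like system; the file name prog.c is our own
example, not from the notes):

gcc -S prog.c -o prog.s   # compiler: C source to assembly
gcc -c prog.s -o prog.o   # assembler: assembly to an object file
gcc prog.o -o prog        # linker: object file(s) to an executable
./prog                    # the OS loader places it in memory and runs it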
Before diving straight into the concepts of compilers, we should understand a few other
tools that work closely with compilers.
Preprocessor
A preprocessor, generally considered a part of the compiler, is a tool that produces
input for compilers. It deals with macro processing, augmentation, file inclusion,
language extension, etc.
Interpreter
An interpreter, like a compiler, translates high-level language into low-level machine
language. The difference lies in the way they read the source code. A compiler
reads the whole source code at once, creates tokens, checks semantics, generates
intermediate code, translates the whole program, and may involve many passes. In
contrast, an interpreter reads a statement from the input, converts it to intermediate
code, executes it, then takes the next statement in sequence. If an error occurs, an
interpreter stops execution and reports it, whereas a compiler processes the whole
program even if it encounters several errors.
Assembler
An assembler translates assembly language programs into machine code. The output of
an assembler is called an object file, which contains a combination of machine
instructions as well as the data required to place these instructions in memory.
Linker
A linker is a computer program that links and merges various object files together in
order to make an executable file. All these files might have been compiled by separate
assemblers. The major task of a linker is to search for and locate referenced
modules/routines in a program and to determine the memory locations where these codes
will be loaded, so that program instructions have absolute references.
Loader
A loader is a part of the operating system and is responsible for loading executable
files into memory and executing them. It calculates the size of a program (instructions
and data) and creates memory space for it. It initializes various registers to initiate
execution.
Cross-compiler
A compiler that runs on platform (A) and is capable of generating executable code for
platform (B) is called a cross-compiler.
Source-to-source Compiler
A compiler that takes the source code of one programming language and translates it
into the source code of another programming language is called a source-to-source
compiler.

The set of all strings that can be derived from a grammar is said to be the language
generated by that grammar. The language generated by a grammar G is formally
defined as

L(G) = { w | w ∈ Σ*, S ⇒*G w }

that is, the set of strings of terminal symbols derivable from the start symbol S.

If L(G1) = L(G2), the grammar G1 is equivalent to the grammar G2.


Example
If there is a grammar
G: N = {S, A, B} T = {a, b} P = {S → AB, A → a, B → b}
Here S produces AB, we can replace A by a, and B by b. The only string in the
language is ab, i.e.,
L(G) = {ab}
Example
Suppose we have the following grammar −
G: N = {S, A, B} T = {a, b} P = {S → AB, A → aA|a, B → bB|b}
The language generated by this grammar is
L(G) = {ab, a^2b, ab^2, a^2b^2, …}
= {a^m b^n | m ≥ 1 and n ≥ 1}
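To make the set notation concrete, here is a minimal C sketch (our own
illustration, not part of the original notes) that enumerates the first few
strings of L(G) = {a^m b^n | m ≥ 1, n ≥ 1}:

#include <stdio.h>

int main(void)
{
    /* Print the strings a^m b^n for small m and n */
    for (int m = 1; m <= 3; m++)
        for (int n = 1; n <= 3; n++) {
            for (int i = 0; i < m; i++) putchar('a');
            for (int i = 0; i < n; i++) putchar('b');
            putchar('\n');   /* prints ab, abb, abbb, aab, ... */
        }
    return 0;
}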
Construction of a Grammar Generating a Language
We’ll consider some languages and convert it into a grammar G which produces those
languages.
Example
Problem − Suppose L(G) = {a^m b^n | m ≥ 0 and n > 0}. We have to find the
grammar G which produces L(G).
Solution
Since L(G) = {a^m b^n | m ≥ 0 and n > 0},
the set of strings accepted can be rewritten as
L(G) = {b, ab, bb, aab, abb, …}
Here, every string consists of at least one 'b' preceded by any number of 'a's
(possibly none).
To accept the string set {b, ab, bb, aab, abb, …….}, we have taken the productions −
S → aS , S → B, B → b and B → bB
S → B → b (Accepted)
S → B → bB → bb (Accepted)
S → aS → aB → ab (Accepted)
S → aS → aaS → aaB → aab(Accepted)
S → aS → aB → abB → abb (Accepted)
Thus, every string in L(G) is generated by this production set.
Hence the grammar is
G: ({S, A, B}, {a, b}, S, { S → aS | B, B → b | bB })
Example
Problem − Suppose L(G) = {a^m b^n | m > 0 and n ≥ 0}. We have to find the
grammar G which produces L(G).
Solution −
Since L(G) = {a^m b^n | m > 0 and n ≥ 0}, the set of strings accepted can be rewritten as
L(G) = {a, aa, ab, aaa, aab, abb, …}
Here, every string consists of at least one 'a' followed by any number of 'b's
(possibly none).
To accept the string set {a, aa, ab, aaa, aab, abb, …….}, we have taken the productions

S → aA, A → aA, A → B, B → bB, B → λ
S → aA → aB → aλ → a (Accepted)
S → aA → aaA → aaB → aaλ → aa (Accepted)
S → aA → aB → abB → abλ → ab (Accepted)
S → aA → aaA → aaaA → aaaB → aaaλ → aaa (Accepted)
S → aA → aaA → aaB → aabB → aabλ → aab (Accepted)
S → aA → aB → abB → abbB → abbλ → abb (Accepted)
Thus, every string in L(G) is generated by this production set.
Hence the grammar is
G: ({S, A, B}, {a, b}, S, {S → aA, A → aA | B, B → λ | bB })
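As a hedged illustration of how such a grammar maps onto code, the following C
sketch (function and variable names are our own) is a recursive-descent-style
recognizer for S → aA, A → aA | B, B → λ | bB; it accepts exactly the strings
a^m b^n with m > 0 and n ≥ 0:

static const char *p;   /* cursor into the input string */

static void B(void) { while (*p == 'b') p++; }       /* B -> bB | lambda */
static void A(void) { while (*p == 'a') p++; B(); }  /* A -> aA | B */

int S(const char *s)    /* S -> aA: returns 1 if s is in L(G), else 0 */
{
    p = s;
    if (*p != 'a') return 0;   /* the first symbol must be 'a' */
    p++;
    A();
    return *p == '\0';         /* accept iff the whole input was consumed */
}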

Recognition of Tokens
Tokens can be recognized by Finite Automata
A finite automaton (FA) is a simple idealized machine used to recognize patterns within
input taken from some character set (or alphabet) C. The job of an FA is to accept or
reject an input depending on whether the pattern defined by the FA occurs in the input.
There are two notations for representing finite automata:
 Transition Diagram
 Transition Table
A transition diagram is a directed, labeled graph that contains nodes and edges.
Nodes represent states, and edges represent transitions between states.
Every transition diagram has exactly one initial state, indicated by an incoming
arrow (-->), and zero or more final states, represented by double circles.
Example:

where state 1 is the initial state and state 3 is the final state.


Finite Automata for recognizing identifiers

Finite Automata for recognizing keywords


Finite Automata for recognizing numbers

Finite Automata for relational operators

Finite Automata for recognizing white spaces
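The diagrams themselves are not reproduced here. As a concrete counterpart, the
identifier automaton (a letter followed by any number of letters or digits) can
be simulated by the following minimal C sketch; the function name is our own:

#include <ctype.h>

/* Simulates the two-state identifier DFA: letter (letter | digit)* */
int is_identifier(const char *s)
{
    int state = 0;   /* 0 = start state, 1 = inside an identifier */
    for (; *s; s++) {
        if (state == 0 && isalpha((unsigned char)*s)) state = 1;
        else if (state == 1 && isalnum((unsigned char)*s)) state = 1;
        else return 0;   /* no transition on this character: reject */
    }
    return state == 1;   /* state 1 is the accepting state */
}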

Recognition of Tokens

1 Transition Diagrams
2 Recognition of Reserved Words and Identifiers
3 Completion of the Running Example
4 Architecture of a Transition-Diagram-Based Lexical Analyzer
5 Exercises for Section 3.4

In the previous section we learned how to express patterns using regular expressions.
Now, we must study how to take the patterns for all the needed tokens and build a piece
of code that examines the input string and finds a prefix that is a lexeme matching one
of the patterns. Our discussion will make use of the following running example.

Example 3.8: The grammar fragment of Fig. 3.10 describes a simple form of
branching statements and conditional expressions. This syntax is similar to that of the
language Pascal, in that then appears explicitly after conditions.
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals," because it presents an interesting structure of lexemes.

The terminals of the grammar, which are if, then, else, relop, id, and number, are the
names of tokens as far as the lexical analyzer is concerned. The patterns for these
tokens are described using regular definitions, as in Fig. 3.11. The patterns
for id and number are similar to what we saw in Example 3.7.
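Fig. 3.11 is not reproduced in these notes; in the standard presentation of this
example (Aho et al.), the regular definitions are, in outline:

digit  → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id     → letter (letter | digit)*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>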

For this language, the lexical analyzer will recognize the keywords if, then, and else,
as well as lexemes that match the patterns for relop, id, and number. To simplify
matters, we make the common assumption that keywords are also reserved words: that
is, they are not identifiers, even though their lexemes match the pattern for identifiers.
In addition, we assign the lexical analyzer the job of stripping out white-space, by
recognizing the "token" ws defined by:

ws → ( blank | tab | newline )+

Here, blank, tab, and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that, when
we recognize it, we do not return it to the parser, but rather restart the lexical analysis
from the character that follows the whitespace. It is the following token that gets
returned to the parser.

Our goal for the lexical analyzer is summarized in Fig. 3.12. That table shows, for each
lexeme or family of lexemes, which token name is returned to the parser and what
attribute value, as discussed in Section 3.1.3, is returned. Note that for the six relational
operators, symbolic constants LT, LE, and so on are used as the attribute value, in
order to indicate which instance of the token relop we have found. The particular
operator found will influence the code that is output from the compiler. •
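Fig. 3.12 is also not reproduced; in outline (again following the standard
presentation of this example), it pairs each lexeme or family of lexemes with a
token name and an attribute value:

Lexemes       Token name   Attribute value
Any ws        -            -
if            if           -
then          then         -
else          else         -
Any id        id           Pointer to table entry
Any number    number       Pointer to table entry
<             relop        LT
<=            relop        LE
=             relop        EQ
<>            relop        NE
>             relop        GT
>=            relop        GE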

1. Transition Diagrams

As an intermediate step in the construction of a lexical analyzer, we first convert
patterns into stylized flowcharts, called "transition diagrams." In this section, we
perform the conversion from regular-expression patterns to transition diagrams by
hand, but in Section 3.6, we shall see that there is a mechanical way to construct these
diagrams from collections of regular expressions.

Transition diagrams have a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input looking
for a lexeme that matches one of several patterns. We may think of a state as
summarizing all we need to know about what characters we have seen between the
lexemeBegin pointer and the forward pointer (as in the situation of Fig. 3.3).
Edges are directed from one state of the transition diagram to another.
Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the
next input symbol is a, we look for an edge out of state s labeled by a (and perhaps by
other symbols, as well). If we find such an edge, we advance the forward pointer and
enter the state of the transition diagram to which that edge leads. We shall assume that
all our transition diagrams are deterministic, meaning that there is never more than one
edge out of a given state with a given symbol among its labels. Starting in Section 3.5,
we shall relax the condition of determinism, making life much easier for the designer of
a lexical analyzer, although trickier for the implementer. Some important conventions
about transition diagrams are:

1. Certain states are said to be accepting, or final. These states indicate that a lexeme
has been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers. We always indicate an accepting state by a double
circle, and if there is an action to be taken — typically returning a token and an attribute
value to the parser — we shall attach that action to the accepting state.

2. In addition, if it is necessary to retract the forward pointer one position (i.e., the
lexeme does not include the symbol that got us to the accepting state), then we shall
additionally place a * near that accepting state. In our example, it is never necessary to
retract forward by more than one position, but if it were, we could attach any number of
*'s to the accepting state.

3. One state is designated the start state, or initial state; it is indicated by an edge,
labeled "start," entering from nowhere. The transition diagram always begins in the start
state before any input symbols have been read.

Example 3.9: Figure 3.13 is a transition diagram that recognizes the
lexemes matching the token relop. We begin in state 0, the start state. If we see < as the
first input symbol, then among the lexemes that match the pattern for relop we can only
be looking at <, <>, or <=. We therefore go to state 1, and look at the next character. If
it is =, then we recognize lexeme <=, enter state 2, and return the token relop with
attribute LE, the symbolic constant representing this particular comparison operator. If
in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to
return an indication that the not-equals operator has been found. On any other character,
the lexeme is <, and we enter state 4 to return that information. Note, however, that
state 4 has a * to indicate that we must retract the input one position.
On the other hand, if in state 0 the first character we see is =, then this one character
must be the lexeme. We immediately return that fact from state 5.
The remaining possibility is that the first character is >. Then, we must enter state 6 and
decide, on the basis of the next character, whether the lexeme is >= (if we next see the =
sign), or just > (on any other character). Note that if, in state 0, we see any
character besides <, =, or >, we cannot possibly be seeing a relop lexeme, so this
transition diagram will not be used.

2. Recognition of Reserved Words and Identifiers

Recognizing keywords and identifiers presents a problem. Usually, keywords like if
or then are reserved (as they are in our running example), so they are not identifiers
even though they look like identifiers. Thus, although we typically use a transition
diagram like that of Fig. 3.14 to search for identifier lexemes, this diagram will also
recognize the keywords if, then, and else of our running example.

There are two ways that we can handle reserved words that look like identifiers:

1. Install the reserved words in the symbol table initially. A field of the
symbol-table entry indicates that these strings are never ordinary identifiers, and
tells which token they represent. We have supposed that this method is in use in
Fig. 3.14. When we find an identifier, a call to installID places it in the symbol
table if it is not already there and returns a pointer to the symbol-table entry for
the lexeme found. Of course, any identifier not in the symbol table during lexical
analysis cannot be a reserved word, so its token is id. The function getToken
examines the symbol-table entry for the lexeme found, and returns whatever token
name the symbol table says this lexeme represents: either id or one of the keyword
tokens that was initially installed in the table.
2. Create separate transition diagrams for each keyword; an example for the
keyword then is shown in Fig. 3.15. Note that such a transition diagram consists of
states representing the situation after each successive letter of the keyword is seen,
followed by a test for a "nonletter-or-digit," i.e., any character that cannot be the
continuation of an identifier. It is necessary to check that the identifier has ended, or
else we would return token then in situations where the correct token was id, with a
lexeme like thenextvalue that has then as a proper prefix. If we adopt this approach,
then we must prioritize the tokens so that the reserved-word tokens are recognized in
preference to id when the lexeme matches both patterns. We do not use this approach
in our example, which is why the states in Fig. 3.15 are unnumbered.

3. Completion of the Running Example

The transition diagram for id's that we saw in Fig. 3.14 has a simple structure. Starting
in state 9, it checks that the lexeme begins with a letter and goes to state 10 if so. We
stay in state 10 as long as the input contains letters and digits. When we first encounter
anything but a letter or digit, we go to state 11 and accept the lexeme found. Since the
last character is not part of the identifier, we must retract the input one position, and as
discussed in Section 3.4.2, we enter what we have found in the symbol table and
determine whether we have a keyword or a true identifier.

The transition diagram for token number is shown in Fig. 3.16, and is so far the most
complex diagram we have seen. Beginning in state 12, if we see a digit, we go to state
13. In that state, we can read any number of additional digits. However, if we see
anything but a digit or a dot, we have seen a number in the form of an integer; 123 is an
example. That case is handled by entering state 20, where we return token number
and a pointer to a table of constants where the found lexeme is entered. These
mechanics are not shown on the diagram but are analogous to the way we handled
identifiers.

If we instead see a dot in state 13, then we have an "optional fraction." State 14 is
entered, and we look for one or more additional digits; state 15 is used for that purpose.
If we see an E, then we have an "optional exponent," whose recognition is the job of
states 16 through 19. Should we, in state 15, instead see anything but E or a digit, then
we have come to the end of the fraction, there is no exponent, and we return the lexeme
found, via state 21.
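As an illustrative sketch (our own, and only an accept/reject check: it omits the
retract-and-return mechanics described above), the number automaton
digit+ (. digit+)? (E (+|-)? digit+)? can be simulated in C as follows:

#include <ctype.h>

int is_number(const char *s)
{
    const char *p = s;
    if (!isdigit((unsigned char)*p)) return 0;   /* state 12: need a digit */
    while (isdigit((unsigned char)*p)) p++;      /* state 13: integer part  */
    if (*p == '.') {                             /* optional fraction       */
        p++;                                     /* state 14                */
        if (!isdigit((unsigned char)*p)) return 0;
        while (isdigit((unsigned char)*p)) p++;  /* state 15                */
    }
    if (*p == 'E') {                             /* optional exponent       */
        p++;                                     /* state 16                */
        if (*p == '+' || *p == '-') p++;         /* state 17                */
        if (!isdigit((unsigned char)*p)) return 0;
        while (isdigit((unsigned char)*p)) p++;  /* state 18                */
    }
    return *p == '\0';                           /* accept iff input is exhausted */
}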
The final transition diagram, shown in Fig. 3.17, is for whitespace. In that diagram, we
look for one or more "whitespace" characters, represented by delim in that diagram —
typically these characters would be blank, tab, newline, and perhaps other characters
that are not considered by the language design to be part of any token.

Note that in state 24, we have found a block of consecutive whitespace characters,
followed by a nonwhitespace character. We retract the input to begin at the
nonwhitespace, but we do not return to the parser. Rather, we must restart the process
of lexical analysis after the whitespace.

4. Architecture of a Transition-Diagram-Based Lexical Analyzer

There are several ways that a collection of transition diagrams can be used to build a
lexical analyzer. Regardless of the overall strategy, each state is represented by a piece
of code. We may imagine a variable state holding the number of the current state for
a transition diagram. A switch based on the value of state takes us to code for each of
the possible states, where we find the action of that state. Often, the code for a state
is itself a switch statement or multiway branch that determines the next state by reading
and examining the next input character.
Example 3.10: In Fig. 3.18 we see a sketch of getRelop(), a C++
function whose job is to simulate the transition diagram of Fig. 3.13 and return an
object of type TOKEN, that is, a pair consisting of the token name (which must
be relop in this case) and an attribute value (the code for one of the six comparison
operators in this case). getRelop() first creates a new object retToken and
initializes its first component to RELOP, the symbolic code for token relop.

We see the typical behavior of a state in case 0, the case where the current state is 0. A
function nextChar() obtains the next character from the input and assigns it to
local variable c. We then check c for the three characters we expect to find, making the
state transition dictated by the transition diagram of Fig. 3.13 in each case. For
example, if the next input character is =, we go to state 5.

If the next input character is not one that can begin a comparison operator, then a
function fail() is called. What fail() does depends on the global error-recovery
strategy of the lexical analyzer. It should reset the forward pointer to lexemeBegin, in
order to allow another transition diagram to be applied to
TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) {  /* repeat character processing until a return
                    or failure occurs */
        switch (state) {
        case 0: c = nextChar();
            if ( c == '<' ) state = 1;
            else if ( c == '=' ) state = 5;
            else if ( c == '>' ) state = 6;
            else fail();  /* lexeme is not a relop */
            break;
        case 1: ...
        ...
        case 8: retract(); retToken.attribute = GT;
            return(retToken);
        }
    }
}

Figure 3.18: Sketch of implementation of relop transition diagram

the true beginning of the unprocessed input. It might then change the value of state to
be the start state for another transition diagram, which will search for another token.
Alternatively, if there is no other transition diagram that remains unused, fail() could
initiate an error-correction phase that will try to repair the input and find a lexeme, as
discussed in Section 3.1.4.
We also show the action for state 8 in Fig. 3.18. Because state 8 bears a *, we must
retract the input pointer one position (i.e., put c back on the input stream). That task is
accomplished by the function retract(). Since state 8 represents the recognition of
lexeme >, we set the second component of the returned object, which we suppose is
named attribute, to GT, the code for this operator. •

To place the simulation of one transition diagram in perspective, let us consider the
ways code like Fig. 3.18 could fit into the entire lexical analyzer.

1. We could arrange for the transition diagrams for each token to be tried
sequentially. Then, the function fail() of Example 3.10 resets the pointer forward and
starts the next transition diagram, each time it is called. This method allows us to use
transition diagrams for the individual keywords, like the one suggested in
Fig. 3.15. We have only to use these before we use the diagram for id, in order for the
keywords to be reserved words.
2. We could run the various transition diagrams "in parallel," feeding the next input
character to all of them and allowing each one to make whatever transitions it requires.
If we use this strategy, we must be careful to resolve the case where one diagram finds
a lexeme that matches its pattern, while one or more other diagrams are still able to
process input. The normal strategy is to take the longest prefix of the input that matches
any pattern. That rule allows us to prefer identifier thenext to keyword then, or the
operator -> to -, for example.
3. The preferred approach, and the one we shall take up in the following sections, is to
combine all the transition diagrams into one. We allow the transition diagram to read
input until there is no possible next state, and then take the longest lexeme that matched
any pattern, as we discussed in item (2) above. In our running example, this
combination is easy, because no two tokens can start with the same character; i.e., the
first character immediately tells us which token we are looking for. Thus, we could
simply combine states 0, 9, 12, and 22 into one start state, leaving other transitions
intact. However, in general, the problem of combining transition diagrams for several
tokens is more complex, as we shall see shortly.
Exercises for Section 3.4

Exercise 3.4.1 : Provide transition diagrams to recognize the same languages as each of
the regular expressions in Exercise 3.3.2.

Exercise 3.4.2 : Provide transition diagrams to recognize the same languages as each of
the regular expressions in Exercise 3.3.5.

The following exercises, up to Exercise 3.4.12, introduce the Aho-Corasick algorithm
for recognizing a collection of keywords in a text string in time proportional to the
length of the text plus the sum of the lengths of the keywords.
This algorithm uses a special form of transition diagram called a trie. A trie is a tree-
structured transition diagram with distinct labels on the edges leading from a node to its
children. Leaves of the trie represent recognized keywords.
Knuth, Morris, and Pratt presented an algorithm for recognizing a single keyword
b1b2...bn in a text string. Here the trie is a transition diagram with states 0
through n. State 0 is the initial state, and state n represents acceptance, that is,
discovery of the keyword. From each state s, for s from 0 through n − 1, there is a
transition to state s + 1, labeled by symbol b_{s+1}. For example, the trie for the
keyword ababaa is a chain of states 0 through 6 whose edges are labeled a, b, a, b, a, a
in turn.

In order to process text strings rapidly and search those strings for a keyword, it is
useful to define, for keyword b1b2...bn and position s in that keyword
(corresponding to state s of its trie), a failure function, f(s), computed as in
Fig. 3.19. The objective is that b1b2...b_{f(s)} is the longest proper prefix of
b1b2...bs that is also a suffix of b1b2...bs. The reason f(s) is important is that if we are
trying to match a text string for b1b2...bn, and we have matched the first s positions,
but we then fail (i.e., the next position of the text string does not hold b_{s+1}), then f(s) is
the longest prefix of b1b2...bn that could possibly match the text string up to the
point we are at. Of course, the next character of the text string must be b_{f(s)+1}, or else
we still have problems and must consider a yet shorter prefix, which will be b_{f(f(s))}.

Figure 3.19: Algorithm to compute the failure function for keyword b1b2...bn
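The figure itself is not reproduced above; a standard formulation of this
computation (our sketch, in the same C-like pseudocode as Fig. 3.20, with line
numbers included because the exercises below refer to line (4)) is:

1)  t = 0;
2)  f(1) = 0;
3)  for (s = 1; s < n; s++) {
4)      while (t > 0 && b_{s+1} != b_{t+1}) t = f(t);   /* fall back on mismatch */
5)      if (b_{s+1} == b_{t+1}) { t = t + 1; f(s+1) = t; }
6)      else f(s+1) = 0;
7)  }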

As an example, the failure function for the trie constructed above for ababaa is:

s:    1  2  3  4  5  6
f(s): 0  0  1  2  3  1

For instance, states 3 and 1 represent prefixes aba and a, respectively. f(3) =
1 because a is the longest proper prefix of aba that is also a suffix of aba. Also, f(2) =
0, because the longest proper prefix of ab that is also a suffix is the empty string.

Exercise 3.4.3: Construct the failure function for the strings:

a) abababaab.

b) aaaaaa.

c) abbaabb.

Exercise 3.4.4: Prove, by induction on s, that the algorithm of Fig. 3.19 correctly
computes the failure function.

!! Exercise 3.4.5: Show that the assignment t = f(t) in line (4) of Fig. 3.19 is executed
at most n times. Show that, therefore, the entire algorithm takes only O(n) time on a
keyword of length n.
Having computed the failure function for a keyword b1b2...bn, we can scan a string
a1a2...am in time O(m) to tell whether the keyword occurs in the string. The algorithm,
shown in Fig. 3.20, slides the keyword along the string, trying to make progress by
matching the next character of the keyword with the next character of the string. If it
cannot do so after matching s characters, then it "slides" the keyword right s − f(s)
positions, so only the first f(s) characters of the keyword are considered matched
with the string.
s = 0;
for (i = 1; i <= m; i++) {
    while (s > 0 && a_i != b_{s+1}) s = f(s);
    if (a_i == b_{s+1}) s = s + 1;
    if (s == n) return "yes";
}
return "no";

Figure 3.20: The KMP algorithm tests whether string a1a2...am contains a single
keyword b1b2...bn as a substring in O(m + n) time

Exercise 3.4.6: Apply the KMP algorithm to test whether keyword ababaa is a
substring of:

a) abababaab.

b) abababbaa.

Exercise 3.4.7: Show that the algorithm of Fig. 3.20 correctly tells whether the
keyword is a substring of the given string. Hint: proceed by induction on i. Show that
for all i, the value of s after line (4) is the length of the longest prefix of the keyword
that is a suffix of a1a2...ai.
!! Exercise 3.4.8: Show that the algorithm of Fig. 3.20 runs in time O(m + n),
assuming that function f is already computed and its values stored in an array
indexed by s.

Exercise 3.4.9: The Fibonacci strings are defined as follows:

1. s1 = b.

2. s2 = a.

3. sk = sk−1 sk−2 for k > 2.

For example, s3 = ab, s4 = aba, and s5 = abaab.

a) What is the length of sn?

b) Construct the failure function for s6.

c) Construct the failure function for s7.

!! d) Show that the failure function for any sn can be expressed by f(1) = f(2) = 0, and,
for 2 < j ≤ |sn|, f(j) is j − |sk−1|, where k is the largest integer such that |sk| ≤ j + 1.
!! e) In the KMP algorithm, what is the largest number of consecutive applications of
the failure function, when we try to determine whether keyword sk appears in text
string sk+1?

Aho and Corasick generalized the KMP algorithm to recognize any of a set of
keywords in a text string. In this case, the trie is a true tree, with branching from the
root. There is one state for every string that is a prefix (not necessarily proper) of any
keyword. The parent of a state corresponding to string b1b2...bk is the state that
corresponds to b1b2...bk−1. A state is accepting if it corresponds to a complete
keyword. For example, Fig. 3.21 shows the trie for the keywords he, she, his, and
hers.

The failure function for the general trie is defined as follows. Suppose s is the state that
corresponds to string b1b2...bn. Then f(s) is the state that corresponds to the longest
proper suffix of b1b2...bn that is also a prefix of some keyword. For example, in the
trie of Fig. 3.21 (naming states by the prefixes they represent), the failure function
maps sh to h, she to he, his to s, and hers to s; every other state is mapped to the root.

Top-down and Bottom-up Parsing; Run-time Storage Organization


A program as source code is merely a collection of text (code, statements, etc.); to
bring it to life, actions must be performed on the target machine. A program
needs memory resources to execute instructions. A program contains names for
procedures, identifiers, etc., that require mapping to actual memory locations at
runtime.
By runtime, we mean a program in execution. The runtime environment is a state of the
target machine, which may include software libraries, environment variables, etc., that
provide services to the processes running in the system.
The runtime support system is a package, mostly generated with the executable program
itself, that facilitates communication between the process and the runtime
environment. It takes care of memory allocation and de-allocation while the program is
being executed.
Activation Trees
A program is a sequence of instructions combined into a number of procedures.
Instructions in a procedure are executed sequentially. A procedure has a start and an
end delimiter, and everything inside it is called the body of the procedure. A procedure
is identified by its name, and its body consists of the finite sequence of instructions
inside it.
The execution of a procedure is called its activation. An activation record contains all
the necessary information required to call a procedure. An activation record may
contain the following units (depending upon the source language used).
Temporaries: stores temporary and intermediate values of an expression.
Local Data: stores local data of the called procedure.
Machine Status: stores the machine status (registers, program counter, etc.) before
the procedure is called.
Control Link: stores the address of the activation record of the caller procedure.
Access Link: stores information about data outside the local scope.
Actual Parameters: stores actual parameters, i.e., parameters used to send input to
the called procedure.
Return Value: stores return values.
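To make the layout concrete, an activation record can be pictured as a C structure.
This is only a hedged sketch (field names and types are our own, and a real compiler
lays the record out on the control stack rather than as a struct):

struct ActivationRecord {
    struct ActivationRecord *control_link; /* activation record of the caller  */
    struct ActivationRecord *access_link;  /* record holding non-local data    */
    void *machine_status;  /* saved registers, program counter, etc.           */
    void *actual_params;   /* parameters supplied by the caller                */
    void *return_value;    /* where the callee's result is stored              */
    void *local_data;      /* local variables of the called procedure          */
    void *temporaries;     /* intermediate values of expressions               */
};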
Whenever a procedure is executed, its activation record is stored on the stack, also
known as control stack. When a procedure calls another procedure, the execution of the
caller is suspended until the called procedure finishes execution. At this time, the
activation record of the called procedure is stored on the stack.
We assume that the program control flows in a sequential manner and when a
procedure is called, its control is transferred to the called procedure. When a called
procedure is executed, it returns the control back to the caller. This type of control flow
makes it easier to represent a series of activations in the form of a tree, known as
the activation tree.
To understand this concept, we take a piece of code as an example:
...
printf("Enter Your Name: ");
scanf("%s", username);
show_data(username);
printf("Press any key to continue...");
...
int show_data(char *user)
{
    printf("Your name is %s", user);
    return 0;
}
...
Below is the activation tree of the code given.

Now we understand that procedures are executed in a depth-first manner; thus stack
allocation is the most suitable form of storage for procedure activations.
Storage Allocation
Runtime environment manages runtime memory requirements for the following
entities:
 Code: the text part of a program, which does not change at runtime. Its
memory requirements are known at compile time.
 Procedures: their text part is static, but they are called in an unpredictable order.
That is why stack storage is used to manage procedure calls and activations.
 Variables: variables are known only at runtime, unless they are global or
constant. The heap memory allocation scheme is used for managing allocation and
de-allocation of memory for variables at runtime.
Static Allocation
In this allocation scheme, the compilation data is bound to a fixed location in the
memory and it does not change when the program executes. As the memory
requirement and storage locations are known in advance, runtime support package for
memory allocation and de-allocation is not required.
Stack Allocation
Procedure calls and their activations are managed by means of stack memory allocation.
It works in last-in-first-out (LIFO) method and this allocation strategy is very useful for
recursive procedure calls.
Heap Allocation
Variables local to a procedure are allocated and de-allocated only at runtime. Heap
allocation is used to dynamically allocate memory to variables and claim it back
when the variables are no longer required.
Apart from the statically allocated memory area, both stack and heap memory can grow
and shrink dynamically and unexpectedly. Therefore, they cannot be provided with a
fixed amount of memory in the system.
The text part of the code is allocated a fixed amount of memory. Stack and heap
memory are arranged at the opposite extremes of the total memory allocated to the
program, and they shrink and grow toward each other.
Parameter Passing
The communication medium among procedures is known as parameter passing. The
values of the variables from a calling procedure are transferred to the called procedure
by some mechanism. Before moving ahead, first go through some basic terminologies
pertaining to the values in a program.
r-value
The value of an expression is called its r-value. The value contained in a single variable
also becomes an r-value if it appears on the right-hand side of the assignment operator.
r-values can always be assigned to some other variable.
l-value
The location of memory (address) where an expression is stored is known as the l-value
of that expression. An l-value appears on the left-hand side of an assignment operator.
For example:
day = 1;
week = day * 7;
month = 1;
year = month * 12;
From this example, we understand that constant values like 1, 7, 12, and variables like
day, week, month and year, all have r-values. Only variables have l-values as they also
represent the memory location assigned to them.
For example:
7 = x + y;
is an l-value error, as the constant 7 does not represent any memory location.
Formal Parameters
Variables that take the information passed by the caller procedure are called formal
parameters. These variables are declared in the definition of the called function.
Actual Parameters
Variables whose values or addresses are being passed to the called procedure are called
actual parameters. These variables are specified in the function call as arguments.
Example:
void fun_one()
{
    int actual_parameter = 10;
    fun_two(actual_parameter);
}
void fun_two(int formal_parameter)
{
    printf("%d", formal_parameter);
}
Formal parameters hold the information of the actual parameter, depending upon the
parameter passing technique used. It may be a value or an address.
Pass by Value
In pass by value mechanism, the calling procedure passes the r-value of actual
parameters and the compiler puts that into the called procedure’s activation record.
Formal parameters then hold the values passed by the calling procedure. If the values
held by the formal parameters are changed, it should have no impact on the actual
parameters.
Pass by Reference
In pass by reference mechanism, the l-value of the actual parameter is copied to the
activation record of the called procedure. This way, the called procedure now has the
address (memory location) of the actual parameter and the formal parameter refers to
the same memory location. Therefore, if the value pointed by the formal parameter is
changed, the impact should be seen on the actual parameter as they should also point to
the same value.
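The contrast can be seen in standard C, where arguments are passed by value and a
pointer argument simulates pass by reference (a minimal sketch; the function and
variable names are our own):

#include <stdio.h>

void by_value(int x)      { x = 99; }   /* changes only the local copy      */
void by_reference(int *x) { *x = 99; }  /* changes the caller's variable    */

int main(void)
{
    int a = 10, b = 10;
    by_value(a);        /* r-value of a is copied; a is unchanged           */
    by_reference(&b);   /* l-value (address) of b is passed; b becomes 99   */
    printf("%d %d\n", a, b);   /* prints: 10 99 */
    return 0;
}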
Pass by Copy-restore
This parameter passing mechanism works similar to ‘pass-by-reference’ except that the
changes to actual parameters are made when the called procedure ends. Upon function
call, the values of actual parameters are copied in the activation record of the called
procedure. Formal parameters if manipulated have no real-time effect on actual
parameters (as l-values are passed), but when the called procedure ends, the l-values of
formal parameters are copied to the l-values of actual parameters.
Example:
int y;
void calling_procedure()
{
    y = 10;
    copy_restore(y);   /* l-value of y is passed */
    printf("%d", y);   /* prints 99 */
}
void copy_restore(int x)
{
    x = 99;   /* y still has value 10 (unaffected) */
    y = 0;    /* y is now 0 */
}
When this function ends, the l-value of formal parameter x is copied to the actual
parameter y. Even if the value of y is changed before the procedure ends, the l-value of
x is copied to the l-value of y making it behave like call by reference.
Pass by Name
Languages like Algol provide a new kind of parameter passing mechanism that works
like preprocessor in C language. In pass by name mechanism, the name of the
procedure being called is replaced by its actual body. Pass-by-name textually
substitutes the argument expressions in a procedure call for the corresponding
parameters in the body of the procedure so that it can now work on actual parameters,
much like pass-by-reference.
