CSC 409 Part 1 - 113507
The set of all strings that can be derived from a grammar is said to be the language generated by that grammar. The language generated by a grammar G is a subset of Σ*, formally defined by

L(G) = { w | w ∈ Σ*, S ⇒* w }

that is, the set of terminal strings w derivable from the start symbol S in zero or more steps. For example, the grammar with productions S → aSb | ε generates the language { a^n b^n | n ≥ 0 }.
Recognition of Tokens
Tokens can be recognized by Finite Automata
A finite automaton (FA) is a simple idealized machine used to recognize patterns within input taken from some character set (or alphabet) C. The job of an FA is to accept or reject an input depending on whether the pattern defined by the FA occurs in the input.
There are two notations for representing finite automata:
Transition Diagram
Transition Table
A transition diagram is a directed labeled graph containing nodes and edges. The nodes represent states, and the edges represent transitions between states. Every transition diagram has exactly one initial state, indicated by an incoming arrow (-->), and zero or more final states, represented by double circles.
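For example, the identifier pattern letter ( letter | digit )* can be written in both notations. As a transition diagram, an arrow marks the initial state and a double circle marks the final state:

start --> 0 --letter--> ((1))     state 1 loops to itself on letter or digit

As a transition table, with one row per state and one column per input class (* marks the final state):

state    letter   digit
  0        1        -
  1*       1        1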
Recognition of Tokens
1 Transition Diagrams
2 Recognition of Reserved Words and Identifiers
3 Completion of the Running Example
4 Architecture of a Transition-Diagram-Based Lexical Analyzer
5 Exercises for Section 3.4
In the previous section we learned how to express patterns using regular expressions.
Now, we must study how to take the patterns for all the needed tokens and build a piece
of code that examines the input string and finds a prefix that is a lexeme matching one
of the patterns. Our discussion will make use of the following running example.
Example 3.8: The grammar fragment of Fig. 3.10 describes a simple form of branching statements and conditional expressions. This syntax is similar to that of the language Pascal, in that then appears explicitly after conditions.
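The fragment is:

stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number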
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals," because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id, and number, are the
names of tokens as far as the lexical analyzer is concerned. The patterns for these
tokens are described using regular definitions, as in Fig. 3.11. The patterns
for id and number are similar to what we saw in Example 3.7.
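For reference, the regular definitions are:

digit  → [0-9]
digits → digit+
number → digits (. digits)? ( E [+-]? digits )?
letter → [A-Za-z]
id     → letter ( letter | digit )*
if     → if
then   → then
else   → else
relop  → < | > | <= | >= | = | <>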
For this language, the lexical analyzer will recognize the keywords if, then, and else,
as well as lexemes that match the patterns for relop, id, and number. To simplify
matters, we make the common assumption that keywords are also reserved words: that
is, they are not identifiers, even though their lexemes match the pattern for identifiers.
In addition, we assign the lexical analyzer the job of stripping out whitespace, by recognizing the "token" ws defined by:

ws → ( blank | tab | newline )+
Here, blank, tab, and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that, when
we recognize it, we do not return it to the parser, but rather restart the lexical analysis
from the character that follows the whitespace. It is the following token that gets
returned to the parser.
Our goal for the lexical analyzer is summarized in Fig. 3.12. That table shows, for each
lexeme or family of lexemes, which token name is returned to the parser and what
attribute value, as discussed in Section 3.1.3, is returned. Note that for the six relational
operators, symbolic constants LT, LE, and so on are used as the attribute value, in
order to indicate which instance of the token relop we have found. The particular
operator found will influence the code that is output from the compiler. •
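In tabular form, the goal reads roughly as follows:

LEXEMES        TOKEN NAME    ATTRIBUTE VALUE
any ws         -             -
if             if            -
then           then          -
else           else          -
any id         id            pointer to table entry
any number     number        pointer to table entry
<              relop         LT
<=             relop         LE
=              relop         EQ
<>             relop         NE
>              relop         GT
>=             relop         GE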
1. Transition Diagrams
Transition diagrams have a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input looking
for a lexeme that matches one of several patterns. We may think of a state as
summarizing all we need to know about what characters we have seen between the
lexemeBegin pointer and the forward pointer (as in the situation of Fig. 3.3).
Edges are directed from one state of the transition diagram to another.
Each edge is labeled by a symbol or set of symbols. If we are in some state s, and the next input symbol is a, we look for an edge out of state s labeled by a (and perhaps by other symbols as well). If we find such an edge, we advance the forward pointer and
enter the state of the transition diagram to which that edge leads. We shall assume that
all our transition diagrams are deterministic, meaning that there is never more than one
edge out of a given state with a given symbol among its labels. Starting in Section 3.5,
we shall relax the condition of determinism, making life much easier for the designer of
a lexical analyzer, although trickier for the implementer. Some important conventions
about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme
has been found, although the actual lexeme may not consist of all positions between the
lexemeBegin and forward pointers. We always indicate an accepting state by a double
circle, and if there is an action to be taken — typically returning a token and an attribute
value to the parser — we shall attach that action to the accepting state.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the
lexeme does not include the symbol that got us to the accepting state), then we shall
additionally place a * near that accepting state. In our example, it is never necessary to
retract forward by more than one position, but if it were, we could attach any number of
*'s to the accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge,
labeled "start," entering from nowhere. The transition diagram always begins in the start
state before any input symbols have been read.
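For concreteness, the relop diagram of Fig. 3.13, which is simulated in Example 3.10 below, can be summarized in text form as follows (* marks states that retract the input):

state 0: on '<', go to 1; on '=', go to 5; on '>', go to 6
state 1: on '=', go to 2 (return relop, LE); on '>', go to 3 (return relop, NE); otherwise, go to 4* (return relop, LT)
state 5: accept (return relop, EQ)
state 6: on '=', go to 7 (return relop, GE); otherwise, go to 8* (return relop, GT)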
2. Recognition of Reserved Words and Identifiers
There are two ways that we can handle reserved words that look like identifiers:
1. Install the reserved words in the symbol table initially. A field of the symbol-table entry indicates that these strings are never ordinary identifiers, and tells which token they represent. We have supposed that this method is in use in Fig. 3.14. When we find an identifier, a call to installID places it in the symbol table if it is not already there and returns a pointer to the symbol-table entry for the lexeme found. Of course, any identifier not in the symbol table during lexical analysis cannot be a reserved word, so its token is id. The function getToken examines the symbol-table entry for the lexeme found, and returns whatever token name the symbol table says this lexeme represents: either id or one of the keyword tokens that was initially installed in the table. (A code sketch of this scheme follows the list.)
2. Create separate transition diagrams for each keyword; an example for the keyword then is shown in Fig. 3.15. Note that such a transition diagram consists of states representing the situation after each successive letter of the keyword is seen, followed by a test for a "nonletter-or-digit," i.e., any character that cannot be the continuation of an identifier. It is necessary to check that the identifier has ended, or else we would return token then in situations where the correct token was id, with a lexeme like thenextvalue that has then as a proper prefix. If we adopt this approach, then we must prioritize the tokens so that the reserved-word tokens are recognized in preference to id when the lexeme matches both patterns. We do not use this approach in our example, which is why the states in Fig. 3.15 are unnumbered.
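A minimal sketch of the first scheme in C: installID and getToken are named as in the text, but their signatures, the token codes, and the fixed-size table layout are our own assumptions for illustration.

#include <string.h>

enum { ID = 0, IF, THEN, ELSE };        /* token codes, for illustration only */

struct entry { const char *lexeme; int token; };

static struct entry table[1000] = {     /* reserved words installed up front */
    { "if", IF }, { "then", THEN }, { "else", ELSE }
};
static int nentries = 3;

/* installID: return the index of the lexeme, adding it as an ordinary id
   if it is not already present (the caller must keep the string alive) */
int installID(const char *lexeme) {
    for (int i = 0; i < nentries; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return i;
    table[nentries].lexeme = lexeme;
    table[nentries].token = ID;
    return nentries++;
}

/* getToken: the token name recorded for a symbol-table entry */
int getToken(int index) { return table[index].token; }

With the keywords pre-installed, getToken(installID("then")) yields THEN, while a fresh name such as "thenextvalue" is installed as an ordinary id.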
3. Completion of the Running Example
The transition diagram for id's that we saw in Fig. 3.14 has a simple structure. Starting in state 9, it checks that the lexeme begins with a letter and goes to state 10 if so. We stay in state 10 as long as the input contains letters and digits. When we first encounter anything but a letter or digit, we go to state 11 and accept the lexeme found. Since the last character is not part of the identifier, we must retract the input one position, and, as discussed in Section 3.4.2, we enter what we have found in the symbol table and determine whether we have a keyword or a true identifier.
The transition diagram for token number is shown in Fig. 3.16, and is so far the most complex diagram we have seen. Beginning in state 12, if we see a digit, we go to state 13. In that state, we can read any number of additional digits. However, if we see anything but a digit or a dot, we have seen a number in the form of an integer; 123 is an example. That case is handled by entering state 20, where we return token number and a pointer to a table of constants where the found lexeme is entered. These mechanics are not shown on the diagram but are analogous to the way we handled identifiers.
If we instead see a dot in state 13, then we have an "optional fraction." State 14 is
entered, and we look for one or more additional digits; state 15 is used for that purpose.
If we see an E, then we have an "optional exponent," whose recognition is the job of
states 16 through 19. Should we, in state 15, instead see anything but E or a digit, then
we have come to the end of the fraction, there is no exponent, and we return the lexeme
found, via state 21.
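In the same text form, the diagram of Fig. 3.16 amounts to the following, with * again marking states that retract the input:

state 12: on digit, go to 13
state 13: on digit, stay in 13; on '.', go to 14; on E, go to 16; otherwise, go to 20* (integer)
state 14: on digit, go to 15
state 15: on digit, stay in 15; on E, go to 16; otherwise, go to 21* (number with fraction)
state 16: on + or -, go to 17; on digit, go to 18
state 17: on digit, go to 18
state 18: on digit, stay in 18; otherwise, go to 19* (number with exponent)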
The final transition diagram, shown in Fig. 3.17, is for whitespace. In that diagram, we
look for one or more "whitespace" characters, represented by delim in that diagram —
typically these characters would be blank, tab, newline, and perhaps other characters
that are not considered by the language design to be part of any token.
Note that in state 24, we have found a block of consecutive whitespace characters,
followed by a nonwhitespace character. We retract the input to begin at the
nonwhitespace, but we do not return to the parser. Rather, we must restart the process
of lexical analysis after the whitespace.
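In the same text form, Fig. 3.17 amounts to:

state 22: on delim, go to 23
state 23: on delim, stay in 23; on any other character, go to 24* (retract, and restart lexical analysis instead of returning a token)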
4. Architecture of a Transition-Diagram-Based Lexical Analyzer
There are several ways that a collection of transition diagrams can be used to build a
lexical analyzer. Regardless of the overall strategy, each state is represented by a piece
of code. We may imagine a variable state holding the number of the current state for
a transition diagram. A switch based on the value of state takes us to code for each of
the possible states, where we find the action of that state. Often, the code for a state
is itself a switch statement or multiway branch that determines the next state by reading
and examining the next input character.
Example 3.10: In Fig. 3.18 we see a sketch of getRelop(), a C++ function whose job is to simulate the transition diagram of Fig. 3.13 and return an object of type TOKEN, that is, a pair consisting of the token name (which must be relop in this case) and an attribute value (the code for one of the six comparison operators in this case). getRelop() first creates a new object retToken and initializes its first component to RELOP, the symbolic code for token relop.
We see the typical behavior of a state in case 0, the case where the current state is 0. A
function nextChar() obtains the next character from the input and assigns it to
local variable c. We then check c for the three characters we expect to find, making the
state transition dictated by the transition diagram of Fig. 3.13 in each case. For
example, if the next input character is =, we go to state 5.
If the next input character is not one that can begin a comparison operator, then a function fail() is called. What fail() does depends on the global error-recovery strategy of the lexical analyzer. It should reset the forward pointer to lexemeBegin, in order to allow another transition diagram to be applied to the true beginning of the unprocessed input. It might then change the value of state to be the start state for another transition diagram, which will search for another token. Alternatively, if there is no other transition diagram that remains unused, fail() could initiate an error-correction phase that will try to repair the input and find a lexeme, as discussed in Section 3.1.4.

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while(1) { /* repeat character processing until a return or failure occurs */
        switch(state) {
        case 0: c = nextChar();
            if ( c == '<' ) state = 1;
            else if ( c == '=' ) state = 5;
            else if ( c == '>' ) state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1: ...
        case 8: retract(); retToken.attribute = GT;
            return(retToken);
        }
    }
}

Figure 3.18: Sketch of implementation of relop transition diagram
We also show the action for state 8 in Fig. 3.18. Because state 8 bears a *, we must retract the input pointer one position (i.e., put c back on the input stream). That task is accomplished by the function retract(). Since state 8 represents the recognition of lexeme >, we set the second component of the returned object, which we suppose is named attribute, to GT, the code for this operator. •
To place the simulation of one transition diagram in perspective, let us consider the
ways code like Fig. 3.18 could fit into the entire lexical analyzer.
1. We could arrange for the transition diagrams for each token to be tried sequentially. Then, the function fail() of Example 3.10 resets the pointer forward and starts the next transition diagram, each time it is called. This method allows us to use transition diagrams for the individual keywords, like the one suggested in Fig. 3.15. We have only to use these before we use the diagram for id, in order for the keywords to be reserved words.
2. We could run the various transition diagrams "in parallel," feeding the next input character to all of them and allowing each one to make whatever transitions it required. If we use this strategy, we must be careful to resolve the case where one diagram finds a lexeme that matches its pattern, while one or more other diagrams are still able to process input. The normal strategy is to take the longest prefix of the input that matches any pattern. That rule allows us to prefer the identifier thenext to the keyword then, or the operator -> to -, for example.
3. The preferred approach, and the one we shall take up in the following sections, is to combine all the transition diagrams into one. We allow the transition diagram to read input until there is no possible next state, and then take the longest lexeme that matched any pattern, as we discussed in item (2) above. In our running example, this combination is easy, because no two tokens can start with the same character; i.e., the first character immediately tells us which token we are looking for. Thus, we could simply combine states 0, 9, 12, and 22 into one start state, leaving other transitions intact. However, in general, the problem of combining transition diagrams for several tokens is more complex, as we shall see shortly.
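To make the combination of item (3) concrete for our running example, here is a toy driver in C. It is our own simplification: the recognizers are reduced to stubs that merely consume a lexeme over a string buffer and report its token class.

#include <ctype.h>
#include <stdio.h>

static const char *input = "count >= 10";
static int forward = 0;                    /* the forward pointer */

static void skipWs(void) {                 /* the ws diagram: consume delimiters */
    while (input[forward] == ' ' || input[forward] == '\t' || input[forward] == '\n')
        forward++;
}
static void getId(void) {                  /* states 9-11, retraction elided */
    while (isalnum((unsigned char)input[forward])) forward++;
    printf("id\n");
}
static void getNumber(void) {              /* states 12-21, integers only here */
    while (isdigit((unsigned char)input[forward])) forward++;
    printf("number\n");
}
static void getRelop(void) {               /* states 0-8, collapsed */
    char c = input[forward++];
    if (input[forward] == '=' || (c == '<' && input[forward] == '>'))
        forward++;                         /* <=, >=, or <> */
    printf("relop\n");
}

int main(void) {
    for (skipWs(); input[forward] != '\0'; skipWs()) {
        int c = (unsigned char)input[forward];
        if (isalpha(c)) getId();           /* first character selects the diagram */
        else if (isdigit(c)) getNumber();
        else getRelop();
    }
    return 0;
}

On the sample input the driver prints id, relop, number, mirroring the combined start state 0/9/12/22 described above.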
5. Exercises for Section 3.4
Exercise 3.4.1 : Provide transition diagrams to recognize the same languages as each of
the regular expressions in Exercise 3.3.2.
Exercise 3.4.2 : Provide transition diagrams to recognize the same languages as each of
the regular expressions in Exercise 3.3.5.
In order to process text strings rapidly and search those strings for a keyword, it is useful to define, for keyword b1b2...bn and position s in that keyword (corresponding to state s of its trie), a failure function, f(s), computed as in Fig. 3.19. The objective is that b1b2...bf(s) is the longest proper prefix of b1b2...bs that is also a suffix of b1b2...bs. The reason f(s) is important is that if we are trying to match a text string for b1b2...bn, and we have matched the first s positions, but we then fail (i.e., the next position of the text string does not hold bs+1), then f(s) is the longest prefix of b1b2...bn that could possibly match the text string up to the point we are at. Of course, the next character of the text string must be bf(s)+1, or else we still have problems and must consider a yet shorter prefix, which will be bf(f(s)).
Figure 3.19: Algorithm to compute the failure function for keyword b1b2...bn
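In outline, the algorithm is the following (the line numbers are the ones cited in the exercises below; b[i] denotes the i-th keyword character bi):

1) t = 0;
2) f(1) = 0;
3) for (s = 1; s < n; s++) {
4)     while (t > 0 && b[s+1] != b[t+1]) t = f(t);
5)     if (b[s+1] == b[t+1]) { t = t + 1; f(s+1) = t; }
6)     else f(s+1) = 0;
   }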
As an example, the failure function for the trie constructed above for ababaa is:

   s     1  2  3  4  5  6
  f(s)   0  0  1  2  3  1

For instance, states 3 and 1 represent prefixes aba and a, respectively. f(3) = 1 because a is the longest proper prefix of aba that is also a suffix of aba. Also, f(2) = 0, because the longest proper prefix of ab that is also a suffix is the empty string.
Exercise 3.4.3: Compute the failure function for each of the following keywords:
a) aaaaaa.
b) abbaabb.
Exercise 3.4.4: Prove, by induction on s, that the algorithm of Fig. 3.19 correctly
computes the failure function.
!! Exercise 3.4.5: Show that the assignment t = f(t) in line (4) of Fig. 3.19 is executed at most n times. Show that therefore, the entire algorithm takes only O(n) time on a keyword of length n.
Having computed the failure function for a keyword b1b2...bn, we can scan a string a1a2...am in time O(m) to tell whether the keyword occurs in the string. The algorithm, shown in Fig. 3.20, slides the keyword along the string, trying to make progress by matching the next character of the keyword with the next character of the string. If it cannot do so after matching s characters, then it "slides" the keyword right s - f(s) positions, so only the first f(s) characters of the keyword are considered matched with the string.
1) s = 0;
2) for (i = 1; i <= m; i++) {  /* a[i], b[j] denote ai and bj */
3)     while (s > 0 && a[i] != b[s+1]) s = f(s);
4)     if (a[i] == b[s+1]) s = s + 1;
5)     if (s == n) return "yes";
   }
6) return "no";
Figure 3.20: The KMP algorithm tests whether string a1a2...am contains a single keyword b1b2...bn as a substring in O(m + n) time
Exercise 3.4.6: Apply the algorithm of Fig. 3.20 to test whether the keyword ababaa is a substring of:
a) abababaab.
b) abababbaa.
Exercise 3.4.7: Show that the algorithm of Fig. 3.20 correctly tells whether the keyword is a substring of the given string. Hint: proceed by induction on i. Show that for all i, the value of s after line (4) is the length of the longest prefix of the keyword that is a suffix of a1a2...ai.
!! Exercise 3.4.8: Show that the algorithm of Fig. 3.20 runs in time O(m + n), assuming that function f is already computed and its values stored in an array indexed by s.
Exercise 3.4.9: The Fibonacci strings are defined as follows:
1. s1 = b.
2. s2 = a.
3. sk = s(k-1) s(k-2) for k > 2.
For example, s3 = ab, s4 = aba, and s5 = abaab.
Aho and Corasick generalized the KMP algorithm to recognize any of a set of keywords in a text string. In this case, the trie is a true tree, with branching from the root. There is one state for every string that is a prefix (not necessarily proper) of any keyword. The parent of a state corresponding to string b1b2...bk is the state that corresponds to b1b2...bk-1. A state is accepting if it corresponds to a complete keyword. For example, Fig. 3.21 shows the trie for the keywords he, she, his, and hers.
The failure function for the general trie is defined as follows. Suppose s is the state that corresponds to string b1b2...bn. Then f(s) is the state that corresponds to the longest proper suffix of b1b2...bn that is also a prefix of some keyword. For example, the failure function for the trie of Fig. 3.21 is:

  state s   1  2  3  4  5  6  7  8  9
   f(s)     0  0  0  1  2  0  3  0  3

where states 1 through 9 correspond to the prefixes h, he, s, sh, she, hi, his, her, and hers, respectively.
Runtime Environment
Procedures are executed in a depth-first manner; thus stack allocation is the form of storage best suited to procedure activations.
Storage Allocation
The runtime environment manages runtime memory requirements for the following entities:
Code: This is the text part of a program, which does not change at runtime. Its memory requirements are known at compile time.
Procedures: Their text part is static, but they are called in an unpredictable order. That is why stack storage is used to manage procedure calls and activations.
Variables: Variables are known only at runtime, unless they are global or constant. The heap memory allocation scheme is used for managing allocation and de-allocation of memory for variables at runtime.
Static Allocation
In this allocation scheme, the compilation data is bound to a fixed location in memory, and it does not change when the program executes. As the memory requirements and storage locations are known in advance, a runtime support package for memory allocation and de-allocation is not required.
Stack Allocation
Procedure calls and their activations are managed by means of stack memory allocation. It works in a last-in-first-out (LIFO) manner, and this allocation strategy is very useful for recursive procedure calls.
Heap Allocation
Variables local to a procedure are allocated and de-allocated only at runtime. Heap allocation is used to dynamically allocate memory to the variables and claim it back when the variables are no longer required.
Unlike the statically allocated memory area, both stack and heap memory can grow and shrink dynamically and unpredictably. Therefore, they cannot be provided with a fixed amount of memory in the system.
In a typical layout, the text part of the code is allocated a fixed amount of memory, while stack and heap memory are arranged at opposite extremes of the total memory allocated to the program. Both shrink and grow against each other.
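A typical layout, in text form:

+--------------------+  <- higher addresses
|       Stack        |  grows downward
|         v          |
|   (free memory)    |
|         ^          |
|        Heap        |  grows upward
+--------------------+
|    Static data     |  fixed size
+--------------------+
|    Code (text)     |  fixed size, known at compile time
+--------------------+  <- lower addresses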
Parameter Passing
Parameter passing is the mechanism by which procedures communicate: the values of variables from a calling procedure are transferred to the called procedure. Before moving ahead, first go through some basic terminology pertaining to the values in a program.
r-value
The value of an expression is called its r-value. The value contained in a single variable
also becomes an r-value if it appears on the right-hand side of the assignment operator.
r-values can always be assigned to some other variable.
l-value
The location of memory (address) where an expression is stored is known as the l-value of that expression. An l-value appears on the left-hand side of an assignment operator.
For example:
day = 1;
week = day * 7;
month = 1;
year = month * 12;
From this example, we understand that constant values like 1, 7, and 12, and variables like day, week, month, and year, all have r-values. Only variables have l-values, as they also represent the memory location assigned to them.
For example:
7 = x + y;
is an l-value error, as the constant 7 does not represent any memory location.
Formal Parameters
Variables that take the information passed by the caller procedure are called formal
parameters. These variables are declared in the definition of the called function.
Actual Parameters
Variables whose values or addresses are being passed to the called procedure are called
actual parameters. These variables are specified in the function call as arguments.
Example:
#include <stdio.h>

void fun_two(int formal_parameter)   /* formal parameter: declared in the callee */
{
    printf("%d\n", formal_parameter);
}

void fun_one(void)
{
    int actual_parameter = 10;
    fun_two(actual_parameter);       /* actual parameter: supplied at the call */
}
Formal parameters hold the information of the actual parameter, depending upon the
parameter passing technique used. It may be a value or an address.
Pass by Value
In pass by value mechanism, the calling procedure passes the r-value of actual
parameters and the compiler puts that into the called procedure’s activation record.
Formal parameters then hold the values passed by the calling procedure. If the values
held by the formal parameters are changed, it should have no impact on the actual
parameters.
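A small illustration in C, where arguments are passed by value (the function and variable names are our own):

#include <stdio.h>

void increment(int n)   /* n is a copy of the caller's argument */
{
    n = n + 1;          /* changes only the local copy */
}

int main(void)
{
    int x = 5;
    increment(x);
    printf("%d\n", x);  /* prints 5: the actual parameter is unchanged */
    return 0;
}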
Pass by Reference
In pass by reference mechanism, the l-value of the actual parameter is copied to the
activation record of the called procedure. This way, the called procedure now has the
address (memory location) of the actual parameter and the formal parameter refers to
the same memory location. Therefore, if the value pointed by the formal parameter is
changed, the impact should be seen on the actual parameter as they should also point to
the same value.
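C simulates pass by reference by passing the address (l-value) explicitly; a sketch:

#include <stdio.h>

void increment(int *n)  /* n holds the address of the actual parameter */
{
    *n = *n + 1;        /* modifies the caller's variable through its address */
}

int main(void)
{
    int x = 5;
    increment(&x);
    printf("%d\n", x);  /* prints 6: the actual parameter is updated */
    return 0;
}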
Pass by Copy-restore
This parameter passing mechanism works similar to ‘pass-by-reference’ except that the
changes to actual parameters are made when the called procedure ends. Upon function
call, the values of actual parameters are copied in the activation record of the called
procedure. Formal parameters if manipulated have no real-time effect on actual
parameters (as l-values are passed), but when the called procedure ends, the l-values of
formal parameters are copied to the l-values of actual parameters.
Example:
int y;
calling_procedure()
{
    y = 10;
    copy_restore(y);  // l-value of y is passed
    print y;          // prints 99
}
copy_restore(int x)
{
    x = 99;           // y still has value 10 (unaffected)
    y = 0;            // y is now 0
}
When this procedure ends, the value of the formal parameter x is copied back to the actual parameter y. So even though y is changed inside the procedure, it is the final value of x that is copied to the l-value of y, making copy-restore behave like call by reference at procedure exit.
Pass by Name
Languages like Algol provide a new kind of parameter passing mechanism that works
like preprocessor in C language. In pass by name mechanism, the name of the
procedure being called is replaced by its actual body. Pass-by-name textually
substitutes the argument expressions in a procedure call for the corresponding
parameters in the body of the procedure so that it can now work on actual parameters,
much like pass-by-reference.
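As a rough analogy, a C macro is expanded textually before compilation, much as pass-by-name substitutes argument text (the macro and names here are our own):

#include <stdio.h>

/* The macro body refers to its argument by name; the text of the
   argument expression is substituted, not its value at call time. */
#define TWICE(expr) ((expr) + (expr))

int main(void)
{
    int a = 2, b = 3;
    /* expands to ((a + b) + (a + b)); the expression a + b is
       re-evaluated at each use, as pass-by-name would do */
    printf("%d\n", TWICE(a + b));   /* prints 10 */
    return 0;
}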