System Software Manual
Lex is a tool for building lexers (lexical analyzers). It takes an arbitrary input stream
and tokenizes it. The Lex utility generates C code consisting mainly of a yylex()
function, which can serve as the interface to Yacc. Further detail on Lex is available
in its man pages; a practical approach to certain fundamentals is given here.
The General Format of a Lex File consists of three sections:
1. Definitions
2. Rules
3. User Subroutines
The Definitions section holds any external C definitions used in the Lex actions or
subroutines, e.g. preprocessor directives such as #include and #define macros. These
are simply copied to the lex.yy.c file. The other kind of definitions are Lex definitions:
substitution strings, start states, and table size declarations. The Rules section is the
core of the file; it specifies the regular expressions and their corresponding actions.
The User Subroutines section holds the definitions of the functions used in the Lex
actions.
Things to remember:
1. If no regular expression matches the input, the unmatched text is copied to the
standard output.
2. When more than one rule matches, Lex chooses the longest match; if two rules match
text of the same length, it chooses the rule given first.
3. The matched text is held in yytext, and its length in yyleng.
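The second point can be illustrated with a small C sketch. It is not how Lex is implemented (Lex builds a finite automaton); it only imitates the selection rule for a hypothetical table of literal patterns. The names rules and pick_rule are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical rule table: literal patterns only, listed in rule order. */
static const char *const rules[] = { "if", "iff", "i", "if" };
enum { NRULES = sizeof rules / sizeof rules[0] };

/* Return the index of the rule that would be chosen at the start of
 * `input`, or -1 if no rule matches. The longest match wins; on a tie the
 * earlier rule wins, because only a strictly longer match replaces it. */
int pick_rule(const char *input, size_t *match_len)
{
    assert(input != NULL && match_len != NULL);
    int best = -1;
    size_t best_len = 0;
    for (int i = 0; i < NRULES; i++) {
        size_t n = strlen(rules[i]);
        if (strncmp(input, rules[i], n) == 0 && n > best_len) {
            best = i;
            best_len = n;
        }
    }
    *match_len = best_len;
    return best;
}
```

For the input "iffy" the sketch picks rule 1 ("iff", the longest match); for "if x" rules 0 and 3 both match two characters and rule 0, the one listed first, is chosen.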
Definition Section
%%
Rules Section
%%
User Subroutines Section
2. Rules Section: Contains pattern lines and C code. Each pattern is written as a regular
expression, and the C code, also called the action part, runs when its pattern matches.
If the C code exceeds one line, it must be enclosed in braces { }.
3. User Subroutine Section: This section includes routines called from the rules.
int main(void)
{
    yylex(); /* run the lexer (scanner) */
    return 0;
}
A Lex specification is a set of patterns (the pattern parts of the rules section) that
Lex matches against the input. Each time one of the patterns matches, the Lex
program invokes the C code in the action part of that rule, which takes some
action with the matched token.
Lex translates the specification into a file containing a C routine called yylex(). The
yylex() routine recognizes expressions in a stream and performs the specified action for
each expression as it is detected.
The pattern part of the rules section is written using regular expressions (REs). An RE
is a pattern description written in a metalanguage. REs are composed of normal
characters and metacharacters.
The characters and metacharacters that form regular expressions are listed below
with their descriptions:
. Matches any single character except the newline character "\n".
[] Matches any one of the characters within the brackets; also called a character
class. If the first character is a circumflex "^", the meaning changes to match any
character except those within the brackets. A range of characters is indicated with '-'.
Example:
1. [a-z0-9] indicates the character class containing all the lowercase
letters and the digits.
2. [^ask] matches any character except a, s, and k.
+ Matches one or more occurrences of the preceding expression. Ex: a+ => a, aa, aaa, ...
[a-z]+ matches all strings of lowercase letters. (ab)+ => ab, abab, ababab, ...
whereas [ab]+ matches any non-empty string made up of a's and b's.
$ As the very last character of an RE, it matches the end of a line, so the expression
is matched only at the end of a line. Ex: ab$ matches ab only when it is immediately
followed by the end of the line.
{} Specify either repetitions (if they enclose numbers) or definition expansion (if
they enclose a name). Ex: {digit} looks up a predefined substitution named digit and
inserts it at that point in the expression. A{1,5} matches 1 to 5 occurrences of A;
with one or two numbers, the braces indicate how many times the previous RE is
allowed to match.
| Indicates alternation: matches either the preceding RE or the following RE.
Ex: (ab|cd) matches either ab or cd.
() Groups a series of REs together into a new RE. Ex: (ab|cd+)?(ef)* matches
strings such as abefef, efef, cdef, and cddd.
"..." Interprets everything within the quotation marks literally. Metacharacters
other than C escape sequences lose their meaning. Ex: "/*" matches the two characters
/ and *.
^ As the first character of an RE, it matches the beginning of a line. Also used for
negation within [].
/ Matches the preceding RE, but only if it is followed by the following RE. Ex: 0/1
matches '0' in the string '01' but does not match anything in the string '0' or '02'.
Only one slash is permitted per pattern.
<> A name or list of names in angle brackets at the beginning of a pattern makes
that pattern apply only in the given start states.
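Several of the operators above can be tried outside Lex with the POSIX regex library. This is only an approximation: POSIX extended regular expressions share [], +, ?, *, |, () and {m,n} with Lex, but they do not support Lex-specific forms such as "...", {name}, trailing context with /, or <start states>. The helper name full_match is invented for this sketch.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>

/* Return 1 if the whole of `text` matches the POSIX extended regular
 * expression `pattern`, 0 if it does not, and -1 if the pattern is too
 * long for the local buffer or fails to compile. */
int full_match(const char *pattern, const char *text)
{
    char anchored[256];
    regex_t re;

    assert(pattern != NULL && text != NULL);
    /* Wrap the pattern as ^(...)$ so that only whole-string matches count. */
    int n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For instance, full_match("(ab|cd+)?(ef)*", "cddd") returns 1, while full_match("a{1,5}", "aaaaaa") returns 0 because six occurrences exceed the upper bound.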
Lex Practice
Metacharacter   Matches
\n              newline
^               beginning of line
$               end of line
a|b             a or b
[]              character class

Expression      Matches
abc             abc
a(bc)?          a, abc
[ \t\n]+        whitespace
[a^b]           a, ^, b
[a|b]           a, |, b
a|b             a, b
Input to Lex is divided into three sections, with %% dividing the sections. This is
best illustrated by example. The first example is the shortest possible lex file:
%%
Input is copied to output, one character at a time. The first %% is always required, as
there must always be a rules section. However, if we don't specify any rules, then the
default action is to match everything and copy it to output. Defaults for input and
output are stdin and stdout, respectively. Here is the same example, with defaults
explicitly coded:
%%
    /* match everything except newline */
.   ECHO;
    /* match newline */
\n  ECHO;
%%
int yywrap(void) {
return 1;
}
int main(void) {
yylex();
return 0;
}
Two patterns have been specified in the rules section. Each pattern must begin in
column one. This is followed by whitespace (space, tab or newline), and an optional
action associated with the pattern. The action may be a single C statement, or multiple C
statements enclosed in braces. Anything not starting in column one is copied verbatim
to the generated C file. We may take advantage of this behavior to specify comments
in our lex file. In this example there are two patterns, "." and "\n", with an ECHO
action associated with each pattern. Several macros and variables are predefined by lex.
ECHO is a macro that writes the text matched by the pattern. This is the default action
for any unmatched strings. Typically, ECHO is defined to write yytext to yyout, which is
equivalent to fprintf(yyout, "%s", yytext).
Here is a program that does nothing at all. All input is matched, but no action is
associated with any pattern, so there will be no output.
%%
.
\n
The following example prepends line numbers to each line in a file. Some
implementations of lex predefine and calculate yylineno. The input file for lex is yyin,
and defaults to stdin.
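%{
int yylineno;
%}
%%
^(.*)\n     printf("%4d\t%s", ++yylineno, yytext);
%%
int main(int argc, char *argv[]) {
    /* open the file named on the command line as the scanner input */
    if (argc < 2 || (yyin = fopen(argv[1], "r")) == NULL) {
        fprintf(stderr, "usage: %s file\n", argv[0]);
        return 1;
    }
    yylex();
    fclose(yyin);
    return 0;
}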
The definitions section is composed of substitutions, code, and start states. Code in
the definitions section is simply copied as-is to the top of the generated C file, and
must be bracketed with "%{" and "%}" markers. Substitutions simplify
pattern-matching rules. For example, we may define digits and letters:
digit [0-9]
letter [A-Za-z]
%{
int count;
%}
%%
    /* match identifier */
{letter}({letter}|{digit})*     count++;
%%
int main(void) {
yylex();
printf("number of identifiers = %d\n", count);
return 0;
}
Whitespace must separate the defining term and the associated expression. References
to substitutions in the rules section are surrounded by braces ({letter}) to distinguish
them from literals. When we have a match in the rules section, the associated C code
is executed. Here is a scanner that counts the number of characters, words, and lines
in a file (similar to Unix wc):
%{
int nchar, nword, nline;
%}
%%
\n { nline++; nchar++; }
[^ \t\n]+ { nword++, nchar += yyleng; }
. { nchar++; }
%%
int main(void) {
yylex();
printf("%d\t%d\t%d\n", nchar, nword, nline);
return 0;
}
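For comparison, the same counting logic can be written by hand in C. This is a sketch for illustration only, working on a string instead of yyin; the names struct counts and count_text are invented here. Each branch corresponds to one of the three rules above.

```c
#include <assert.h>
#include <stddef.h>

struct counts { int nchar, nword, nline; };

/* Count characters, words and lines in a NUL-terminated string.
 * A word is a maximal run of characters other than space, tab and
 * newline, which is what the pattern [^ \t\n]+ matches. */
struct counts count_text(const char *s)
{
    assert(s != NULL);
    struct counts c = {0, 0, 0};
    int in_word = 0;                /* 1 while inside a word */

    for (; *s != '\0'; s++) {
        c.nchar++;                  /* every rule adds to nchar */
        if (*s == '\n')
            c.nline++;              /* rule: \n */
        if (*s == ' ' || *s == '\t' || *s == '\n') {
            in_word = 0;            /* rule: . (blank) or \n ends a word */
        } else if (!in_word) {
            in_word = 1;            /* rule: [^ \t\n]+ starts a new word */
            c.nword++;
        }
    }
    return c;
}
```

The lex version is shorter because the longest-match rule does the work of the in_word flag: a whole word is consumed by a single match of [^ \t\n]+.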
Yacc provides a general tool for imposing structure on the input to a computer
program. Yacc is the utility that generates the function yyparse, which is the
parser. Yacc accepts a context-free grammar of the LALR(1) class and generates a
bottom-up (shift-reduce) parser. The general format of a Yacc file is very similar
to that of a Lex file.
1. Declarations
2. Grammar Rules
3. Subroutines
In the Declarations section, apart from legal C declarations, there are a few
Yacc-specific declarations, which begin with a % sign.
The Yacc source must be turned into a program in the host general-purpose
language, i.e. C, using the command $ yacc -d filename.y (the -d option
writes the token definitions to y.tab.h). The yacc compiler generates a
C file called y.tab.c; the literal block, the action parts of the rules
section, and the user subroutine section of the Yacc program, where valid
C statements appear, are copied as-is into y.tab.c. This C file contains
the parser, yyparse(). When the Yacc parser runs, it repeatedly calls
yylex(), the lexical analyzer, which supplies tokens to yacc as and when
required. When an error is detected, yyparse() returns the value 1. When
the lexical analyzer returns the end-marker token and the parser accepts,
yyparse() returns the value 0.
This C file is compiled with the C compiler and linked, usually with a
library of yacc and lex subroutines. First the lex program is processed
as usual, which generates the C file lex.yy.c; then the Yacc program is
processed, which generates the C file y.tab.c. Both C files are then
compiled with the C compiler:
$ cc lex.yy.c y.tab.c -ll -ly
where -ly is the loader flag that accesses the Yacc library and -ll the
one that accesses the Lex library.
The resulting program is placed in the usual file a.out for later
execution. To end the input, press Ctrl+D.
2. yytext => Whenever the lexer matches a token, the text of the token is
stored in the null-terminated string yytext. It is an array of characters whose
contents are replaced each time a new token is matched.
3. yywrap() => When the lexer encounters an end of file, it calls the routine
yywrap() to find out what to do next. If yywrap() returns 0, the scanner
continues scanning; if it returns 1, the scanner returns a zero token to report
end of file.
4. yyin, yyout => The input and output files of lex. They default to stdin and
stdout, the standard files used in C.
5. ECHO => Writes the token to the current output file yyout. Equivalent to
fprintf(yyout, "%s", yytext);
7. output() => Writes its argument to the output file yyout, i.e. putc(c, yyout).
Also yyout().
8. unput() => Returns a character to the input stream. Also yyunput().
10. yyless() => yyless(n) keeps the first n characters of the token and pushes the rest back onto the input.
11. yymore() => Can be used to append more text to the token.
12. yyparse() => The entry point to the yacc generated parser. Returns zero on
success and non-zero on failure.
13. yyerror() => Simple error reporting routine, yyerror(char *msg).
14. % => Used to declare the definitions like %token, %start, %type, %left,
%right, %union.
15. $ => Introduces a value reference in actions. Ex: $3 refers to the value of the
third symbol on the RHS of the rule; when 12+89 is reduced by a rule such as
e : e '+' e, $3 refers to the value 89.
17. ; => Each rule in the rules section ends with a semicolon.
18. | => Specifies an alternative RHS for the same LHS in a rule. Ex:
e : e '+' e | e '-' e | e '*' e ;
20. %token => Declares the symbols that the lexer passes to the parser. The parser
calls yylex(), which returns the tokens the parser requires. All
tokens must be explicitly declared in the definitions section.
21. %left, %right, %nonassoc => Explicit means of specifying left, right, and
no associativity.
22. %start <rule name> => Specifies the rule with which the parser should start.
23. %prec => Changes the precedence level associated with a particular
grammar rule. Ex: unary minus may be given the highest level of precedence,
whereas binary minus has a lower level of precedence.
26. %type => Sets the type for non-terminals. Ex: %union { double dval; }
%type <dval> expression
27. YYABORT => Causes yyparse() to return immediately with a non-zero
value (failure).
Input to yacc is divided into three sections. The definitions section consists of token
declarations, and C code bracketed by "%{" and "%}". The BNF grammar is placed
in the rules section, and user subroutines are added in the subroutines section.
This is best illustrated by constructing a small calculator that can add and subtract
numbers. We'll begin by examining the linkage between lex and yacc. Here is the
definitions section for the yacc input file:
%token INTEGER
This definition declares an INTEGER token. When we run yacc, it generates a parser
in file y.tab.c, and also creates an include file, y.tab.h:
#ifndef YYSTYPE
#define YYSTYPE int
#endif
#define INTEGER 258
extern YYSTYPE yylval;
Lex includes this file and utilizes the definitions for token values. To obtain tokens,
yacc calls yylex. Function yylex has a return type of int, and returns the token. Values
associated with the token are returned by lex in variable yylval. For example,
[0-9]+ {
yylval = atoi(yytext);
return INTEGER;
}
would store the value of the integer in yylval, and return token INTEGER to yacc.
The type of yylval is determined by YYSTYPE. Since the default type is integer, this
works well in this case. Token values 0-255 are reserved for character values. For
example, if you had a rule such as
[-+]    return *yytext;    /* return operator */
the character value for minus or plus is returned. Note that we placed the minus sign
first so that it wouldn't be mistaken for a range designator. Generated token values
typically start around 258, as yacc reserves several values for end-of-file and error
processing. Here is the complete lex input specification for our calculator:
%{
#include "y.tab.h"
#include <stdlib.h>
void yyerror(char *);
%}
%%
[0-9]+      {
                yylval = atoi(yytext);
                return INTEGER;
            }
[-+\n]      return *yytext;     /* operators and newline */
[ \t]       ;                   /* skip whitespace */
.           yyerror("invalid character");
%%
int yywrap(void) {
    return 1;
}
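To make the token-numbering convention concrete, here is a hand-written scanner sketch in C that follows the same convention as the generated yylex: named tokens use values such as 258, single characters are returned as their own character code, and 0 marks the end of input. It is a simplification and not the code lex generates: it reads from a string, returns every non-digit character as a token, and uses the invented names next_token and token_value (the latter standing in for yylval).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

#define INTEGER 258     /* same value as in the y.tab.h excerpt */

int token_value;        /* plays the role of yylval (hypothetical name) */

/* Scan one token starting at *p and advance *p past it.
 * Returns INTEGER for a run of digits, 0 at end of input, and otherwise
 * the character itself, which is always in the range 1..255. */
int next_token(const char **p)
{
    assert(p != NULL && *p != NULL);
    while (**p == ' ' || **p == '\t')
        (*p)++;                             /* skip blanks */
    if (**p == '\0')
        return 0;                           /* end-of-input marker */
    if (isdigit((unsigned char)**p)) {
        token_value = 0;
        while (isdigit((unsigned char)**p)) {
            token_value = token_value * 10 + (**p - '0');
            (*p)++;
        }
        return INTEGER;
    }
    return (unsigned char)*(*p)++;          /* single-character token */
}
```

Because character tokens occupy 1..255 and named tokens start above that range, the parser can tell '+' (value 43 in ASCII) from INTEGER without any extra bookkeeping.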
Internally, yacc maintains two stacks in memory; a parse stack and a value stack. The
parse stack contains terminals and nonterminals, and represents the current parsing
state. The value stack is an array of YYSTYPE elements, and associates a value with
each element in the parse stack. For example, when lex returns an INTEGER token,
yacc shifts this token to the parse stack. At the same time, the corresponding yylval is
shifted to the value stack. The parse and value stacks are always synchronized, so
finding a value related to a token on the stack is easily accomplished. Here is the yacc
input specification for our calculator:
%{
#include <stdio.h>
int yylex(void);
void yyerror(char *);
%}
%token INTEGER
%%
program:
        program expr '\n'       { printf("%d\n", $2); }
        |
        ;
expr:
        INTEGER                 { $$ = $1; }
        | expr '+' expr         { $$ = $1 + $3; }
        | expr '-' expr         { $$ = $1 - $3; }
        ;
%%
void yyerror(char *s) {
    fprintf(stderr, "%s\n", s);
}
int main(void) {
    yyparse();
    return 0;
}
The rules section resembles the BNF grammar discussed earlier. The left-hand side of
a production, or nonterminal, is entered left-justified, followed by a colon. This is
followed by the right-hand side of the production. Actions associated with a rule are
entered in braces.
By utilizing left-recursion, we have specified that a program consists of zero or more
expressions. Each expression terminates with a newline. When a newline is detected,
we print the value of the expression. When we apply the rule
expr: expr '+' expr     { $$ = $1 + $3; }
we replace the right-hand side of the production in the parse stack with the left-hand
side of the same production. In this case, we pop "expr '+' expr" and push "expr".
We have reduced the stack by popping three terms off the stack, and pushing back one
term. We may reference positions in the value stack in our C code by specifying "$1"
for the first term on the right-hand side of the production, "$2" for the second, and so
on. "$$" designates the top of the stack after reduction has taken place. The above
action adds the value associated with two expressions, pops three terms off the value
stack, and pushes back a single sum. Thus, the parse and value stacks remain
synchronized.
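The effect of this reduction on the value stack can be sketched in plain C. This is a simplified model and not the code yacc generates; the array value_stack and the functions shift_value, reduce_add, top_value and stack_depth are names invented for the illustration.

```c
#include <assert.h>

#define STACK_MAX 64            /* arbitrary capacity for this sketch */

static int value_stack[STACK_MAX];
static int sp = 0;              /* number of values currently on the stack */

/* Shift: push the value that accompanies a token (the role of yylval). */
void shift_value(int v)
{
    assert(sp < STACK_MAX);
    value_stack[sp++] = v;
}

/* Reduce by expr: expr '+' expr. The three right-hand-side slots
 * correspond to $1, $2 and $3; the result becomes $$. */
void reduce_add(void)
{
    assert(sp >= 3);
    int d1 = value_stack[sp - 3];   /* $1: left operand               */
    int d3 = value_stack[sp - 1];   /* $3: right operand              */
    sp -= 3;                        /* pop three entries              */
    value_stack[sp++] = d1 + d3;    /* push $$, leaving one net entry */
}

int top_value(void)   { assert(sp > 0); return value_stack[sp - 1]; }
int stack_depth(void) { return sp; }
```

Shifting 12, a placeholder value for '+', and 89, then calling reduce_add(), leaves a single entry holding 101. The slot for '+' carries no useful value, which is why the action never refers to $2.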
Numeric values are initially entered on the stack when we reduce from INTEGER to
expr. After INTEGER is shifted to the stack, we apply the rule
expr: INTEGER           { $$ = $1; }
The INTEGER token is popped off the parse stack, followed by a push of expr. For
the value stack, we pop the integer value off the stack, and then push it back on again.
In other words, we do nothing. In fact, this is the default action, and need not be
specified. Finally, when a newline is encountered, the value associated with expr is
printed.
In the event of syntax errors, yacc calls the user-supplied function yyerror. If you
need to modify the interface to yyerror, you can alter the canned file that yacc
includes to fit your needs. The last function in our yacc specification is main, in case
you were wondering where it was. This example still has an ambiguous grammar.
Yacc will issue shift-reduce warnings, but will still process the grammar using shift as
the default operation.
Yacc Practice, Part II
In this section we will extend the calculator from the previous section to incorporate
some new functionality. New features include the arithmetic operators multiply and
divide. Parentheses may be used to override operator precedence, and single-character
variables may be specified in assignment statements. The following illustrates sample
input and calculator output:
user: 3 * (4 + 5)
calc: 27
user: x = 3 * (4 + 5)
user: y = 5
user: x
calc: 27
user: y
calc: 5
user: x + 2*y
calc: 37
The lexical analyzer returns VARIABLE and INTEGER tokens. For variables,
yylval specifies an index to sym, our symbol table. For this program, sym merely holds
the value of the associated variable. When INTEGER tokens are returned, yylval
contains the number scanned. Here is the input specification for lex:
%{
#include <stdlib.h>
#include "y.tab.h"
void yyerror(char *);
%}
%%
    /* variables */
[a-z]       {
                yylval = *yytext - 'a';
                return VARIABLE;
            }
    /* integers */
[0-9]+      {
                yylval = atoi(yytext);
                return INTEGER;
            }
    /* operators */
[-+()=/*\n] { return *yytext; }
    /* skip whitespace */
[ \t]       ;
%%
int yywrap(void) {
return 1;
}
The input specification for yacc follows. The tokens for INTEGER and VARIABLE
are utilized by yacc to create #defines in y.tab.h for use in lex. This is followed by
definitions for the arithmetic operators. We may specify %left, for left-associative, or
%right, for right-associative. The last definition listed has the highest precedence.
Thus, multiplication and division have higher precedence than addition and
subtraction. All four operators are left-associative. Using this simple technique, we are
able to disambiguate our grammar.
%{
#include <stdio.h>
void yyerror(char *);
int yylex(void);
int sym[26];
%}
%token INTEGER VARIABLE
%left '+' '-'
%left '*' '/'
%%
program:
program statement '\n'
|
;
statement:
expr { printf("%d\n", $1); }
| VARIABLE '=' expr { sym[$1] = $3; }
;
expr:
INTEGER
| VARIABLE { $$ = sym[$1]; }
| expr '+' expr { $$ = $1 + $3; }
| expr '-' expr { $$ = $1 - $3; }
| expr '*' expr { $$ = $1 * $3; }
| expr '/' expr { $$ = $1 / $3; }
| '(' expr ')' { $$ = $2; }
;
%%
void yyerror(char *s) {
fprintf(stderr, "%s\n", s);
}
int main(void) {
yyparse();
return 0;
}
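The way precedence and left-associativity settle a shift-reduce choice can be imitated with a small operator-precedence loop in C. This is a different, simpler technique than the LALR(1) tables yacc builds; it reproduces only the decision rule: reduce when the operator already on the stack has higher precedence than the lookahead, or equal precedence and is left-associative, and shift otherwise. The sketch handles integers joined by + - * / without parentheses or variables, and the names prec, apply and eval are invented here.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Precedence levels mirroring two left-associative declarations:
 * '+' and '-' low, '*' and '/' high. Anything else, including the
 * terminating '\0', gets 0. */
static int prec(char op)
{
    return (op == '+' || op == '-') ? 1 : (op == '*' || op == '/') ? 2 : 0;
}

static int apply(int a, char op, int b)
{
    switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*': return a * b;
    default:  assert(op == '/' && b != 0); return a / b;
    }
}

/* Evaluate a string such as "27 + 2*5". Malformed input trips an assert. */
int eval(const char *s)
{
    int vals[32], nv = 0;       /* value stack    */
    char ops[32];
    int no = 0;                 /* operator stack */

    assert(s != NULL);
    for (;;) {
        while (*s == ' ') s++;
        assert(isdigit((unsigned char)*s) && nv < 32);
        int v = 0;
        while (isdigit((unsigned char)*s))
            v = v * 10 + (*s++ - '0');
        vals[nv++] = v;                         /* shift the operand */
        while (*s == ' ') s++;
        char look = *s;                         /* lookahead token   */
        /* Reduce while the stacked operator has higher precedence than
         * the lookahead, or equal precedence (left-associative). */
        while (no > 0 && prec(ops[no - 1]) >= prec(look)) {
            int b = vals[--nv], a = vals[--nv];
            vals[nv++] = apply(a, ops[--no], b);
        }
        if (look == '\0')
            break;
        assert(prec(look) > 0 && no < 32);
        ops[no++] = look;                       /* shift the operator */
        s++;
    }
    return vals[0];
}
```

With x = 27 and y = 5 from the sample session, eval("27 + 2*5") gives 37 because '*' is shifted over the lower-precedence '+', while eval("10 - 4 - 3") gives 3 because equal precedence forces a reduction first, which is left-associativity.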