A Lex program is divided into three sections:
1. Header Section (%{ %})
2. Definitions Section
3. Rules Section
Each section serves a specific purpose in defining and implementing the lexer.
2. Definitions Section
Purpose:
o This section defines patterns (using regular expressions) that can be reused in the
rules section. It often includes macros for frequently used patterns like digits,
letters, and comments.
o It is written after the %} of the header section and before the first %% delimiter,
which opens the rules section.
Contents:
o Macros: Simple names used to define patterns, like DIGIT, LETTER, or
COMMENT.
o These macros can be used later in the rules section to make patterns more readable
and maintainable.
o Optionally, this section can also declare Lex options like case-insensitivity or other
configurations.
Example:
DIGIT [0-9]
LETTER [a-zA-Z]
UNDERSCORE _
COMMENT ##.*\n
Explanation:
o DIGIT is defined as a regular expression that matches any single digit [0-9].
o LETTER matches any alphabetical character.
o UNDERSCORE is an underscore (_).
o COMMENT matches any line starting with ## and ending with a newline (\n).
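To make the four definitions concrete, the C sketch below mirrors them by hand. The function names (is_digit, is_letter, is_underscore, match_comment) are hypothetical helpers written for illustration, not something Lex generates, and they assume ASCII input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* DIGIT [0-9] */
int is_digit(char c) { return c >= '0' && c <= '9'; }

/* LETTER [a-zA-Z] */
int is_letter(char c) {
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/* UNDERSCORE _ */
int is_underscore(char c) { return c == '_'; }

/* COMMENT ##.*\n
 * Returns the number of characters matched at the start of s,
 * or 0 if s does not start with a complete ## comment line. */
size_t match_comment(const char *s) {
    assert(s != NULL);
    if (strncmp(s, "##", 2) != 0) return 0;
    const char *newline = strchr(s + 2, '\n');
    if (newline == NULL) return 0;      /* the pattern requires a '\n' */
    return (size_t)(newline - s) + 1;   /* include the newline itself */
}
```

Because . in a Lex pattern never matches a newline, .* stops at the first \n, which is why the sketch searches for the first newline only.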
3. Rules Section
Purpose:
o This is the core section of the Lex program. It defines the patterns (regular
expressions) the lexer will match in the input and the actions to take when a match
is found.
o The patterns are written on the left, and the corresponding actions (typically C code)
are written on the right, separated by whitespace.
Contents:
o Pattern-Action Pairs: Regular expressions that describe the tokens you want to
recognize, followed by C code that defines what to do when that pattern is matched.
o The action can be anything from printing the token type and updating position
counters to more complex actions like generating code or building data structures.
o Special rules for whitespace or unrecognized tokens are also included here.
Example:
{DIGIT}+ { printf("NUMBER %.*s\n", yyleng, yytext); column += yyleng; }
int { printf("KEYWORD_INTEGER\n"); column += yyleng; }
\n { line++; column = 0; }
[ \t]+ { column += yyleng; }
. { printf("Error: unrecognized symbol \"%.*s\"\n", yyleng, yytext); }
Explanation:
o When the input matches the keyword int, the lexer prints KEYWORD_INTEGER
and updates the column.
o The pattern {DIGIT}+ matches any sequence of digits, prints it as a number, and
increments the column by the number of characters matched (yyleng). The braces
matter: {DIGIT} expands the macro, whereas [DIGIT] would be a character class
matching one of the letters D, I, G, or T.
o The pattern \n increments the line counter and resets the column (for newlines).
o [ \t]+ matches spaces or tabs and increases the column accordingly without any
other output.
o The rule . catches any unrecognized symbols and generates an error message.
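The position bookkeeping described above can be sketched in plain C. The struct position type and the advance function are hypothetical names chosen for this illustration (the rules themselves use two global counters), and the sketch assumes every non-newline match simply adds its length to the column.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical position record standing in for the line and column globals. */
struct position {
    int line;
    int column;
};

/* Advance pos over one matched token, the way the actions do:
 * a newline starts a new line, anything else moves the column
 * forward by the token length (yyleng in a real scanner). */
void advance(struct position *pos, const char *text, int length) {
    assert(pos != NULL && text != NULL && length > 0);
    if (text[0] == '\n') {
        pos->line++;
        pos->column = 0;
    } else {
        pos->column += length;
    }
}
```

Starting from line 1, column 0, advancing over "int", a space, and "123" leaves the column at 7; a following newline moves to line 2, column 0.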
Workflow Example:
Given input "int 123;", the Lex program would:
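o Recognize int as a keyword and print KEYWORD_INTEGER.
o Skip the space with the [ \t]+ rule, which only advances the column.
o Recognize 123 as a number and print NUMBER 123.
o Pass ; to the catch-all rule . and report it as an unrecognized symbol, because the
example rules define no pattern for semicolons. Printing SEMICOLON would need an
extra rule such as ";".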
o Track the line and column throughout the process.
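The workflow can be traced with a small hand-written C stand-in for the example rules. This is only a sketch under stated assumptions: scan is a hypothetical helper, not the code Lex generates, and it collects the output in a buffer instead of printing it. With just the five rules shown in the example, the ; has no rule of its own, so it falls through to the catch-all rule.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hand-written stand-in for the example rules. Writes one line of
 * output per token into out (capacity cap) and returns out. */
char *scan(const char *input, char *out, size_t cap) {
    assert(input != NULL && out != NULL && cap > 0);
    size_t used = 0;
    out[0] = '\0';
    for (const char *p = input; *p != '\0'; ) {
        size_t len = 1;   /* characters consumed by this match */
        int n = 0;        /* characters of output produced */
        if (*p >= '0' && *p <= '9') {                  /* digits */
            len = strspn(p, "0123456789");
            n = snprintf(out + used, cap - used, "NUMBER %.*s\n", (int)len, p);
        } else if (strncmp(p, "int", 3) == 0) {        /* int */
            len = 3;
            n = snprintf(out + used, cap - used, "KEYWORD_INTEGER\n");
        } else if (*p == ' ' || *p == '\t') {          /* [ \t]+ : no output */
            len = strspn(p, " \t");
        } else if (*p != '\n') {                       /* . : catch-all */
            n = snprintf(out + used, cap - used,
                         "Error: unrecognized symbol \"%c\"\n", *p);
        }
        if (n < 0 || (size_t)n >= cap - used) break;   /* output buffer full */
        used += (size_t)n;
        p += len;
    }
    return out;
}
```

Calling scan("int 123;", buf, sizeof buf) with a 128-byte buffer fills buf with three lines: KEYWORD_INTEGER, NUMBER 123, and the error message for ";".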
Summary:
Header Section (%{ %}): Contains C code and global variables/functions, included as-is
in the generated C file.
Definitions Section: Contains macros or reusable regular expressions for easy reference in
the rules.
Rules Section: Specifies patterns to match and the actions to execute when a match is
found. This is the core of the lexer and defines how to process input and recognize tokens.
Macros
Lex allows the use of macros in the Definitions Section to simplify the specification of complex patterns.
Once defined, macros can be reused in the Rules Section to make the code more readable and
maintainable.
2. %d:
%d is a format specifier used in printf to print an integer value. When printf encounters %d in the
format string, it looks for an integer argument to print at that position.
Example:
int line = 10;
printf("Line number: %d\n", line);
Output:
Line number: 10
3. "%.*s":
This format specifier is used to print a string with a specified length. It's particularly useful when
you want to print only part of a string.
%.*s: This tells printf to print a string, but the .* means that the number of characters to
print will be provided as an additional argument before the string itself.
o .: The dot represents precision.
o *: The * allows you to specify the precision dynamically via an argument.
Example:
char *text = "Hello, World!";
int length = 5;
printf("Text: %.*s\n", length, text);
Output:
Text: Hello
In this example, only the first 5 characters of "Hello, World!" are printed because length = 5.
In the context of Flex, yytext is the string containing the matched text, and yyleng is the length of
the matched text. So:
printf("NUMBER %.*s\n", yyleng, yytext);
This prints the word "NUMBER" followed by the matched text from yytext but only prints
yyleng characters (the length of the match).
For example, if yytext contains "12345" and yyleng is 5, it will print:
NUMBER 12345
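The precision argument is most useful when the text to print is followed by other characters in the same buffer. The sketch below shows this with snprintf; format_number is a hypothetical helper introduced only for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format "NUMBER <text>" using only the first length characters of text.
 * text does not have to be NUL-terminated at that position.
 * Returns the number of characters snprintf wanted to write. */
int format_number(char *out, size_t cap, const char *text, int length) {
    assert(out != NULL && text != NULL && length >= 0);
    return snprintf(out, cap, "NUMBER %.*s\n", length, text);
}
```

With text pointing at "12345+67" and length 5, the result is "NUMBER 12345" followed by a newline; the "+67" that follows in the buffer is never read. Flex also NUL-terminates yytext, so plain %s works there too; %.*s just makes the length explicit.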
Sample Flex Code (example.l)
%{
#include <stdio.h>
%}
%%
[0-9]+\.[0-9]+ { printf("FLOAT: %s\n", yytext); }
[0-9]+ { printf("INTEGER: %s\n", yytext); }
int { printf("KEYWORD_INTEGER\n"); /* listed before the identifier rule so it wins */ }
[a-zA-Z_][a-zA-Z0-9_]* { printf("IDENTIFIER: %s\n", yytext); }
"+"|"-"|"*"|"/"|"=" { printf("OPERATOR: %s\n", yytext); }
[ \t\n]+ ; /* Ignore whitespace */
. { printf("UNKNOWN: %s\n", yytext); }
%%
int main(void) {
    printf("Enter a String:\n");
    yylex();
    return 0;
}
int yywrap(void) {
    return 1;
}
Explanation of Key Parts
%{...%}: This section at the top is for C code, typically for #include statements or
variable declarations.
%% delimiters: Divide the file into three parts: definitions, rules, and user code.
Regular Expressions:
o [0-9]+\.[0-9]+: Matches floating-point numbers and uses yytext to output the
matched text.
o [0-9]+: Matches integers.
o [a-zA-Z_][a-zA-Z0-9_]*: Matches identifiers: a letter or underscore followed by
any number of letters, digits, or underscores.
o "+"|"-"|"*"|"/"|"=": Matches the arithmetic operators and the assignment operator.
o [ \t\n]+: Matches whitespace characters (ignored here).
o .: Catches any character not matched by other rules, marking it as "UNKNOWN."
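When several of these patterns match at the same position, flex picks the longest match, and if two matches have the same length it picks the rule that appears first in the file. The C sketch below shows only that selection step; pick_rule and the match_len array are hypothetical names for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Given the length each rule would match at the current position
 * (0 = no match), return the index of the rule the scanner picks:
 * the longest match wins, and on a tie the rule listed first wins.
 * Returns -1 when no rule matches. */
int pick_rule(const size_t match_len[], int rule_count) {
    assert(match_len != NULL && rule_count >= 0);
    int best = -1;
    size_t best_len = 0;
    for (int i = 0; i < rule_count; i++) {
        if (match_len[i] > best_len) {   /* strict '>' keeps the earlier rule on ties */
            best = i;
            best_len = match_len[i];
        }
    }
    return best;
}
```

On the input 3.14 the float pattern matches 4 characters and the integer pattern only 1, so the float rule is chosen. On the input int, a keyword rule and the identifier rule both match 3 characters, so the keyword is reported only if its rule comes first. On integer the identifier rule matches 7 characters and wins regardless of order.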
Compilation and Execution
To compile and run this flex program:
1. Generate the C code from the flex file:
flex example.l
2. Compile the generated C file:
gcc lex.yy.c -o lexer
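3. Run the lexer directly (it then reads from the keyboard) or redirect an input file into it:
./lexer (on Linux or macOS) or lexer (in Windows CMD)
./lexer < input.txt (input.txt is a placeholder file name)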
Each line of input you enter will be tokenized, with output describing each token type based on
its pattern match.
Sample Flex Code with Definitions (example.l)
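%{
#include <stdio.h>
/* Definitions and rules reconstructed from the explanation below; the printed labels are illustrative. */
%}
DIGIT [0-9]
LETTER [a-zA-Z]
ID {LETTER}({LETTER}|{DIGIT})*
INT {DIGIT}+
FLOAT {DIGIT}+\.{DIGIT}+
%%
{FLOAT} { printf("FLOAT: %s\n", yytext); }
{INT} { printf("INTEGER: %s\n", yytext); }
{ID} { printf("IDENTIFIER: %s\n", yytext); }
[ \t\n]+ ; /* Ignore whitespace */
%%
int main(void) {
    yylex(); /* Start scanning */
    return 0;
}
int yywrap(void) {
    return 1;
}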
Explanation of Key Parts
Definitions Section: Each macro is defined at the top, before the %% delimiter.
o DIGIT: Matches any single digit.
o LETTER: Matches any uppercase or lowercase letter.
o ID: Matches identifiers that start with a letter and may contain letters and digits.
o INT: Matches an integer (a sequence of digits).
o FLOAT: Matches a floating-point number (digits followed by a period and more
digits).
Rules Section: Each rule can use the defined macros by placing them in curly braces
{...}, like {ID}, {FLOAT}, etc. Flex substitutes these with the actual regular expressions
during processing.
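The way larger definitions are built from smaller ones can be mirrored in C, with each helper calling the ones beneath it. This is a sketch under assumptions: the function names are hypothetical, input is ASCII, and the ID, INT, and FLOAT patterns are taken from the descriptions above (a letter followed by letters or digits, a run of digits, and digits with a period and more digits).

```c
#include <assert.h>
#include <stddef.h>

/* DIGIT [0-9] and LETTER [a-zA-Z] */
int is_digit(char c)  { return c >= '0' && c <= '9'; }
int is_letter(char c) { return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'); }

/* INT: one or more DIGITs. Returns the match length at the start of s, 0 if none. */
size_t match_int(const char *s) {
    assert(s != NULL);
    size_t n = 0;
    while (is_digit(s[n])) n++;
    return n;
}

/* FLOAT: INT, a period, then INT again. Built entirely from match_int. */
size_t match_float(const char *s) {
    size_t whole = match_int(s);
    if (whole == 0 || s[whole] != '.') return 0;
    size_t frac = match_int(s + whole + 1);
    return frac == 0 ? 0 : whole + 1 + frac;
}

/* ID: a LETTER followed by any number of LETTERs or DIGITs. */
size_t match_id(const char *s) {
    assert(s != NULL);
    if (!is_letter(s[0])) return 0;
    size_t n = 1;
    while (is_letter(s[n]) || is_digit(s[n])) n++;
    return n;
}
```

If the digit class ever changed, only is_digit would need editing and match_int and match_float would follow, which is the same benefit the definitions section gives a Flex file.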
Advantages of Using Definitions
Readability: You can clearly see what each token type is without reading the regex
details.
Modularity: If you need to change a pattern (e.g., redefine ID to allow underscores), you
only have to update it in the definitions section.
In a flex (lexical analyzer) file, these functions are standard boilerplate, allowing the scanner to
work correctly:
1. int main(void):
o This is the entry point of a C program. Here, yylex() is called, which is the
function generated by flex to perform the lexical analysis.
o yylex() reads input from stdin, matches the patterns defined in the flex rules, and
executes the corresponding actions.
o After yylex() finishes (usually when it reaches the end of the input), main returns
0, signaling that the program finished without errors.
2. int yywrap():
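o yywrap() is a function flex calls when it reaches the end of the input file.
o It should return 1 to indicate that there is no more input to process; flex then stops
scanning. Returning 0 tells flex that more input has been set up (for example, yyin
now points at another file), and scanning continues.
o The function is required even if you are only scanning a single file, unless the
scanner is built with %option noyywrap or linked against the flex library (-lfl), which
supplies a default yywrap().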
Why are these needed?
main runs the yylex() scanner, initiating the scanning process.
yywrap provides a way to handle multiple files or stop scanning cleanly after a single file
if there's no additional input.
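The role of the return value can be seen in a small simulation. Everything here is a stand-in written for illustration: inputs, current, wrap, and scan_all are hypothetical names, and the "scanning" is just counting characters. A real flex scanner would instead point yyin at the next file inside yywrap().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scanner state: a list of inputs and the one being read. */
static const char *const inputs[] = { "int 1", "int 22", "x" };
static const size_t input_count = sizeof inputs / sizeof inputs[0];
static size_t current = 0;

/* Same contract as yywrap(): return 0 after switching to more input,
 * return 1 when there is nothing left to scan. */
int wrap(void) {
    if (current + 1 < input_count) {
        current++;
        return 0;
    }
    return 1;
}

/* Count characters across all inputs, calling wrap() at each end of input. */
size_t scan_all(void) {
    assert(input_count > 0);
    size_t total = 0;
    current = 0;
    for (;;) {
        total += strlen(inputs[current]);   /* "scan" the current input */
        if (wrap() == 1) break;             /* 1 means stop */
    }
    return total;
}
```

A yywrap() that always returns 1, as in the sample programs, corresponds to a list with a single input: the loop body runs once and stops.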
Change Code Page to UTF-8 (this helps with Amharic-language text)
By default, the Windows Command Prompt (CMD) uses a legacy code page (like CP437 or
CP850), which doesn't support UTF-8. You can change it to UTF-8 by running the following
command:
chcp 65001
This sets the code page to UTF-8 (65001).
You can verify the code page change by typing:
chcp
TO DOWNLOAD