Header Section Definitions Section Rules Section

Uploaded by

gebremolla641
In Lex (a lexical analyzer generator), the structure of a Lex program is divided into three main
sections:
1. Header Section (%{ %})
2. Definitions Section
3. Rules Section
Each section serves a specific purpose in defining and implementing the lexer.

1. Header Section (%{ %})


 Purpose:
o The header section is used to include code that will be copied verbatim into the
generated C file. This is where you typically add #include statements and declare
any global variables or functions that will be needed by the lexer.
o It's wrapped in %{ and %} to indicate that this code should be included as-is in the
C file generated by Lex.
 Contents:
o Includes: Include C standard libraries (like stdio.h, stdlib.h).
o Global Variables: Declare any global variables (e.g., int line, int column to track
the current line and column in the input).
o Function Declarations: Declare or define functions that are needed for error
handling (e.g., void yyerror(char *s) to handle error reporting).
 Example:
%{
#include <stdio.h>
#include <stdlib.h>

// Declare global variables
int line = 1, column = 0;

// Error handling function
void yyerror(char *s) {
    printf("Error: %s\n", s);
}
%}
 Explanation: The above code will be included in the generated C code before the Lex rules
are applied, making functions like yyerror() and variables like line and column available
throughout the program.

2. Definitions Section
 Purpose:
o This section defines patterns (using regular expressions) that can be reused in the
rules section. It often includes macros for frequently used patterns like digits,
letters, and comments.
o It's written after the %} of the header section and before the first %% line, which
opens the rules section.
 Contents:
o Macros: Simple names used to define patterns, like DIGIT, LETTER, or
COMMENT.
o These macros can be used later in the rules section to make patterns more readable
and maintainable.
o Optionally, this section can also declare Lex options like case-insensitivity or other
configurations.
 Example:
DIGIT [0-9]
LETTER [a-zA-Z]
UNDERSCORE _
COMMENT ##.*\n
 Explanation:
o DIGIT is defined as a regular expression that matches any single digit [0-9].
o LETTER matches any alphabetical character.
o UNDERSCORE is an underscore (_).
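o COMMENT matches ## followed by the rest of the line, up to and including the
newline (\n). The . in the pattern never matches a newline, so the match stops at
the end of the line.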

3. Rules Section
 Purpose:
o This is the core section of the Lex program. It defines the patterns (regular
expressions) the lexer will match in the input and the actions to take when a match
is found.
o The patterns are written on the left, and the corresponding actions (typically C code)
are written on the right, separated by whitespace.
 Contents:
o Pattern-Action Pairs: Regular expressions that describe the tokens you want to
recognize, followed by C code that defines what to do when that pattern is matched.
o The action can be anything from printing the token type and updating position
counters to more complex actions like generating code or building data structures.
o Special rules for whitespace or unrecognized tokens are also included here.
 Example:
{DIGIT}+ { printf("NUMBER %.*s\n", yyleng, yytext); column += yyleng; }
int      { printf("KEYWORD_INTEGER\n"); column += yyleng; }
";"      { printf("SEMICOLON\n"); column += yyleng; }
\n       { line++; column = 0; }
[ \t]+   { column += yyleng; }
.        { printf("Error: unrecognized symbol \"%.*s\"\n", yyleng, yytext); column += yyleng; }
 Explanation:
o When the input matches the keyword int, the lexer prints KEYWORD_INTEGER
and advances the column by the length of the match (yyleng).
o The pattern {DIGIT}+ matches any sequence of digits, prints it as a number, and
increments the column by the number of characters matched (yyleng). The braces
expand the DIGIT definition; written as [DIGIT]+, the pattern would instead be a
character class matching the letters D, I, G, and T.
o The pattern ";" matches a semicolon and prints SEMICOLON.
o The pattern \n increments the line counter and resets the column (for newlines).
o [ \t]+ matches spaces or tabs and increases the column accordingly without any
other output.
o The rule . catches any unrecognized symbols and generates an error message.
Workflow Example:
 Given input "int 123;", the Lex program would:
o Recognize int as a keyword and print KEYWORD_INTEGER.
o Recognize 123 as a number and print NUMBER 123.
o Recognize ; as a semicolon and print SEMICOLON.
o Track the line and column throughout the process.
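The workflow above can be sketched by hand in plain C. This is only an illustration of what the rules do for this one example, not the code that Lex generates. The names scan_sketch and emit are hypothetical, the output is collected in a buffer instead of being printed, and the error message is shortened to the label ERROR.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Position counters, as declared in the header section. */
int line = 1, column = 0;

/* Hypothetical helper: append "LABEL" or "LABEL text" plus a newline to out. */
static void emit(char *out, size_t cap, const char *label, const char *text, int len) {
    size_t used = strlen(out);
    assert(used < cap);
    if (len > 0)
        snprintf(out + used, cap - used, "%s %.*s\n", label, len, text);
    else
        snprintf(out + used, cap - used, "%s\n", label);
}

/* Hypothetical scanner: applies the example rules to input, collecting output in out. */
void scan_sketch(const char *input, char *out, size_t cap) {
    const char *p = input;
    assert(cap > 0);
    line = 1;
    column = 0;
    out[0] = '\0';
    while (*p != '\0') {
        int len = 1;
        if (*p == '\n') {                        /* \n rule */
            line++;
            column = 0;
            p++;
            continue;
        }
        if (*p == ' ' || *p == '\t') {           /* [ \t]+ rule: no output */
            while (p[len] == ' ' || p[len] == '\t') len++;
        } else if (isdigit((unsigned char)*p)) { /* {DIGIT}+ rule */
            while (isdigit((unsigned char)p[len])) len++;
            emit(out, cap, "NUMBER", p, len);
        } else if (strncmp(p, "int", 3) == 0) {  /* int rule */
            len = 3;
            emit(out, cap, "KEYWORD_INTEGER", p, 0);
        } else if (*p == ';') {                  /* ";" rule */
            emit(out, cap, "SEMICOLON", p, 0);
        } else {                                 /* catch-all . rule */
            emit(out, cap, "ERROR", p, 1);
        }
        column += len;
        p += len;
    }
}
```

Calling scan_sketch("int 123;", buf, sizeof buf) fills buf with the three lines KEYWORD_INTEGER, NUMBER 123, and SEMICOLON, and leaves line at 1 and column at 8.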
Summary:
 Header Section (%{ %}): Contains C code and global variables/functions, included as-is
in the generated C file.
 Definitions Section: Contains macros or reusable regular expressions for easy reference in
the rules.
 Rules Section: Specifies patterns to match and the actions to execute when a match is
found. This is the core of the lexer and defines how to process input and recognize tokens.
Macros
Lex allows the use of macros in the Definitions Section to simplify the specification of complex patterns.
Once defined, macros can be reused in the Rules Section to make the code more readable and
maintainable.

Why Use Macros?


 Reusability: Define a pattern once and use it multiple times in different rules.
 Readability: Instead of repeating complex regular expressions, you give them a
meaningful name.
 Maintainability: If a pattern changes, you only need to update it in one place.
How Macros Work:
 You define macros in the Definitions Section of a Lex file, right after the header section.
 Macros are typically written using regular expressions.
 You can refer to a macro by its name in the Rules Section instead of rewriting the regular
expression multiple times.
Example:
Consider a scenario where you frequently match digits, letters, or comments. Instead of repeating
these regular expressions in multiple places, you define macros:
DIGIT [0-9]
LETTER [a-zA-Z]
UNDERSCORE _
COMMENT ##.*\n
In the Rules Section, you can then use these macros like so:
{LETTER}({LETTER}|{DIGIT}|{UNDERSCORE})* { printf("Identifier found\n"); }
{DIGIT}+ { printf("Number found\n"); }
{COMMENT} { printf("Comment found\n"); }
Here, the macros {LETTER}, {DIGIT}, {UNDERSCORE}, and {COMMENT} stand in for the
corresponding regular expressions. When Lex processes the specification, it substitutes each
macro reference with its defined regular expression.
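To make the substitution concrete, here is a minimal C sketch of what the expanded identifier pattern {LETTER}({LETTER}|{DIGIT}|{UNDERSCORE})* accepts. Each small helper plays the role of one definition. The function names are hypothetical and the character ranges assume an ASCII-compatible character set; Lex itself builds a state machine rather than code like this.

```c
#include <assert.h>
#include <stddef.h>

/* One helper per definition from the Definitions Section. */
static int is_digit(char c)      { return c >= '0' && c <= '9'; }   /* DIGIT [0-9] */
static int is_letter(char c)     { return (c >= 'a' && c <= 'z') ||
                                          (c >= 'A' && c <= 'Z'); } /* LETTER [a-zA-Z] */
static int is_underscore(char c) { return c == '_'; }               /* UNDERSCORE _ */

/* Length of the identifier at the start of s, or 0 if s does not start with one. */
size_t match_identifier(const char *s) {
    size_t n = 1;
    assert(s != NULL);
    if (!is_letter(s[0]))   /* the pattern must begin with {LETTER} */
        return 0;
    while (is_letter(s[n]) || is_digit(s[n]) || is_underscore(s[n]))
        n++;                /* ({LETTER}|{DIGIT}|{UNDERSCORE})* */
    return n;
}
```

For example, match_identifier("count_1 = 2") returns 7, while match_identifier("9lives") and match_identifier("_x") both return 0 because the first character is not a letter. Changing what counts as a letter only requires editing is_letter, which is the same benefit the macro gives in the Lex file.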
Benefits of Macros in Lex:
1. Simplification: For example, instead of repeating [0-9] everywhere in your rules where
you want to match digits, you can define DIGIT as a macro and use {DIGIT}.
2. Efficiency: Makes your Lex file easier to modify. If you decide to redefine what a "digit"
is (e.g., to include more characters), you only need to change the definition of the macro.
3. Organization: Macros make your Lex specification cleaner and easier to understand by
abstracting out complex patterns.
Example without Macros:
[0-9]+ { printf("Number found\n"); }
[a-zA-Z]+ { printf("Identifier found\n"); }
##.*\n { printf("Comment found\n"); }
Now compare this to the example with macros:
DIGIT [0-9]
LETTER [a-zA-Z]
COMMENT ##.*\n

%%

{DIGIT}+ { printf("Number found\n"); }
{LETTER}+ { printf("Identifier found\n"); }
{COMMENT} { printf("Comment found\n"); }
Both examples do the same thing, but the second one using macros is much clearer, easier to
maintain, and avoids repetition.
Summary:
 A macro in Lex is a symbolic name for a pattern, typically defined using regular
expressions.
 Macros are defined in the Definitions Section and used in the Rules Section.
 They improve code readability, maintainability, and reduce repetition.
Explanation of printf, %d, and "%.*s"
1. printf:
printf is a standard C function that writes formatted output to standard output (normally the
console). The general format is:
printf(format_string, arguments);
 format_string: A string that specifies how to format the output. It can contain text and
format specifiers (e.g., %d, %s) to insert values.
 arguments: The values that will be substituted for the format specifiers in the format
string.

2. %d:
%d is a format specifier used in printf to print an integer value. When printf encounters %d in the
format string, it looks for an integer argument to print at that position.
Example:
int line = 10;
printf("Line number: %d\n", line);
Output:
Line number: 10

3. "%.*s":
This format specifier is used to print a string with a specified length. It's particularly useful when
you want to print only part of a string.
 %.*s: This tells printf to print a string, but the .* means that the number of characters to
print will be provided as an additional argument before the string itself.
o .: The dot represents precision.
o *: The * allows you to specify the precision dynamically via an argument.
Example:
char *text = "Hello, World!";
int length = 5;
printf("Text: %.*s\n", length, text);
Output:

Text: Hello
In this example, only the first 5 characters of "Hello, World!" are printed because length = 5.
In the context of Flex, yytext is the string containing the matched text, and yyleng is the length of
the matched text. So:
printf("NUMBER %.*s\n", yyleng, yytext);
This prints the word "NUMBER" followed by the matched text from yytext but only prints
yyleng characters (the length of the match).
For example, if yytext contains "12345" and yyleng is 5, it will print:
NUMBER 12345
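The precision argument is most useful when the text to print is a slice of a larger buffer and is not followed by a terminating '\0' of its own. The sketch below uses snprintf, which accepts the same format specifiers as printf, so the result can be inspected; format_token is a hypothetical helper name. In flex, yytext is null-terminated, so a plain %s would also work there, as the later samples show.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write "LABEL text" into out, taking at most len characters from text. */
int format_token(char *out, size_t cap, const char *label, const char *text, int len) {
    assert(len >= 0);
    /* %.*s consumes two arguments: first the int precision, then the string. */
    return snprintf(out, cap, "%s %.*s", label, len, text);
}
```

With the input "12345+678", format_token(buf, sizeof buf, "NUMBER", input, 5) stores "NUMBER 12345" in buf and stops before the + sign, even though the source string continues.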
Sample Flex Code (example.l)
%{
#include <stdio.h>
%}
%%
int                     { printf("KEYWORD_INTEGER\n"); } /* listed first: on equal-length matches the earlier rule wins */
[0-9]+\.[0-9]+          { printf("FLOAT: %s\n", yytext); }
[0-9]+                  { printf("INTEGER: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]*  { printf("IDENTIFIER: %s\n", yytext); }
"+"|"-"|"*"|"/"|"="     { printf("OPERATOR: %s\n", yytext); }
[ \t\n]+                ; /* Ignore whitespace */
.                       { printf("UNKNOWN: %s\n", yytext); }

%%

int main(void) {
    printf("Enter a String:\n");
    yylex();
    return 0;
}

int yywrap(void) {
    return 1;
}
Explanation of Key Parts
 %{...%}: This section at the top is for C code, typically for #include statements or
variable declarations.
 %% delimiters: Divide the file into three parts: definitions, rules, and user code.
 Regular Expressions:
o [0-9]+\.[0-9]+: Matches floating-point numbers and uses yytext to output the
matched text.
o [0-9]+: Matches integers.
o [a-zA-Z_][a-zA-Z0-9_]*: Matches identifiers: a letter or underscore followed by
any number of letters, digits, or underscores.
o "+"|"-"|"*"|"/"|"=": Matches the arithmetic operators and the assignment operator.
o [ \t\n]+: Matches whitespace characters (ignored here).
o .: Catches any character not matched by other rules, marking it as "UNKNOWN."
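When several rules can match at the same position, flex takes the longest match, and if two rules match the same length it takes the rule that appears first in the file. That is why a keyword rule such as int has to come before the identifier rule. The following C sketch imitates that selection for two rules only; the Rule type, the matcher functions, and classify are hypothetical names, and a real scanner uses a generated state machine instead.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical matchers: each returns the length matched at s, or 0. */
static size_t match_keyword_int(const char *s) {
    return strncmp(s, "int", 3) == 0 ? 3 : 0;
}

static size_t match_identifier(const char *s) {
    size_t n = 0;
    if (isalpha((unsigned char)s[0]) || s[0] == '_') {
        n = 1;
        while (isalnum((unsigned char)s[n]) || s[n] == '_')
            n++;
    }
    return n;
}

typedef struct {
    const char *name;
    size_t (*match)(const char *);
} Rule;

/* Rules in file order: the keyword is listed before the identifier. */
static const Rule rules[] = {
    { "KEYWORD_INTEGER", match_keyword_int },
    { "IDENTIFIER",      match_identifier  },
};

/* Name of the winning rule at s ("UNKNOWN" if none); *len receives the match length. */
const char *classify(const char *s, size_t *len) {
    const char *best = "UNKNOWN";
    assert(s != NULL && len != NULL);
    *len = 0;
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i].match(s);
        if (n > *len) {     /* strictly longer only, so ties keep the earlier rule */
            *len = n;
            best = rules[i].name;
        }
    }
    return best;
}
```

For the text "int x" both rules match three characters, so the earlier rule wins and classify reports KEYWORD_INTEGER. For "integer" the identifier rule matches seven characters and wins on length, so the word is not split into a keyword plus "eger".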
Compilation and Execution
To compile and run this flex program:
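1. Generate the C code from the flex file:
flex example.l
2. Compile the generated C file:
gcc lex.yy.c -o lexer
3. Run the lexer interactively, or redirect an input file into it (input.txt stands for any
text file):
./lexer
./lexer < input.txt
On Windows CMD, type lexer instead of ./lexer.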
Each line of input you enter will be tokenized, with output describing each token type based on
its pattern match.
Sample Flex Code with Definitions (example.l)
%{
#include <stdio.h>
%}

/* Definitions for common patterns */
DIGIT   [0-9]
LETTER  [a-zA-Z]
ID      {LETTER}({LETTER}|{DIGIT})*
INT     {DIGIT}+
FLOAT   {DIGIT}+\.{DIGIT}+

%%

{FLOAT}          { printf("FLOAT: %s\n", yytext); }
{INT}            { printf("INTEGER: %s\n", yytext); }
{ID}             { printf("IDENTIFIER: %s\n", yytext); }
"+"|"-"|"*"|"/"  { printf("OPERATOR: %s\n", yytext); }
[ \t\n]+         ; /* Ignore whitespace */
.                { printf("UNKNOWN: %s\n", yytext); }

%%

int main(void) {
    yylex(); // Start scanning
    return 0;
}

int yywrap(void) {
    return 1;
}
Explanation of Key Parts
 Definitions Section: Each macro is defined at the top, before the %% delimiter.
o DIGIT: Matches any single digit.
o LETTER: Matches any uppercase or lowercase letter.
o ID: Matches identifiers that start with a letter and may contain letters and digits.
o INT: Matches an integer (a sequence of digits).
o FLOAT: Matches a floating-point number (digits followed by a period and more
digits).
 Rules Section: Each rule can use the defined macros by placing them in curly braces
{...}, like {ID}, {FLOAT}, etc. Flex substitutes these with the actual regular expressions
during processing.
Advantages of Using Definitions
 Readability: You can clearly see what each token type is without reading the regex
details.
 Modularity: If you need to change a pattern (e.g., redefine ID to allow underscores), you
only have to update it in the definitions section.
The main and yywrap functions at the end of a flex (lexical analyzer) file are standard
boilerplate that lets the scanner build and run correctly:
1. int main(void):
o This is the entry point of a C program. Here, yylex() is called, which is the
function generated by flex to perform the lexical analysis.
o yylex() reads input from stdin, matches the patterns defined in the flex rules, and
executes the corresponding actions.
o After yylex() finishes (usually when it reaches the end of the input), main returns
0, signaling that the program finished without errors.
2. int yywrap():
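o yywrap() is a function the scanner calls when it reaches the end of the current input.
o Returning 1 tells the scanner that there is no more input to process, so scanning
stops. Returning 0 tells it that more input has been set up (for example, yyin now
points to another file) and scanning should continue.
o A definition of yywrap() is required even if you're only scanning a single file,
unless the file uses %option noyywrap or the program is linked with the flex
library (-lfl), which supplies a default version.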
Why are these needed?
 main runs the yylex() scanner, initiating the scanning process.
 yywrap provides a way to handle multiple files or stop scanning cleanly after a single file
if there's no additional input.
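As a rough illustration of the multiple-file case, the sketch below shows the shape such a function could take. It is written to compile on its own, so yyin is declared locally as a stand-in for the FILE pointer that flex-generated code normally provides, and the function is called yywrap_sketch to avoid clashing with a real yywrap. The names set_inputs, input_files, and files_left are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

FILE *yyin;                       /* stand-in: flex-generated code normally defines yyin */
static const char **input_files;  /* hypothetical: names of files still to be scanned */
static int files_left;

/* Hypothetical setup: remember which files should be scanned after the current one. */
void set_inputs(const char **names, int count) {
    assert(count >= 0);
    input_files = names;
    files_left = count;
}

/* Returns 0 if another file was opened (keep scanning), 1 if input is exhausted. */
int yywrap_sketch(void) {
    if (yyin != NULL && yyin != stdin) {
        fclose(yyin);             /* finished with the previous file */
        yyin = NULL;
    }
    while (files_left > 0) {
        const char *name = *input_files++;
        files_left--;
        yyin = fopen(name, "r");
        if (yyin != NULL)
            return 0;             /* more input is ready */
        /* this file could not be opened, so try the next one */
    }
    return 1;                     /* nothing left: the scanner should stop */
}
```

The single-file samples in this document correspond to the case where the list is empty, which is why their yywrap simply returns 1.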
Change Code Page to UTF-8 (helps when working with Amharic text)
By default, the Windows Command Prompt (CMD) uses a legacy code page (like CP437 or
CP850), which cannot represent UTF-8 text. You can change it to UTF-8 by running the following
command:
chcp 65001
 This sets the code page to UTF-8 (65001).
 You can verify the code page change by typing:
chcp
TO DOWNLOAD

download flex 2.5.4a
