Lex Tool
Lex is a tool that reads a specification file (typically with the .l extension), containing
regular expressions, and generates C code that implements a lexical analyzer. The
lexical analyzer scans the input text, recognizing patterns and converting it into a
sequence of tokens.
A Lex program consists of three parts, separated by %% delimiters:
Declarations
%%
Translation rules
%%
Auxiliary procedures
Definitions Section: This section contains user-defined macros, global variables, and any
header files the analyzer needs. It is where you define constants, include libraries, and
name regular expressions that will be reused in the rules section.
Rules Section: Placed between the two %% delimiters, the rules section defines regular
expressions and the actions associated with them.
Each rule consists of a regular expression pattern and an action: the pattern specifies
the strings to be matched, and the action, usually written in C code, specifies the
operation to perform when the pattern is matched.
User Code Section: Placed after the second %% delimiter, this section contains C code
that is copied into the generated C file, typically the main() function and any other
necessary initialization or cleanup code. The function yylex() (generated by Lex) is
called in this section to start the lexical analysis.
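The "Explanation of the Example" below refers to an example file that does not appear in the text. A minimal sketch consistent with the bullets that follow might look like this (the exact messages printed by the actions are illustrative):

```lex
%{
#include <stdio.h>   /* needed by printf in the actions */
%}

%%
[0-9]+      { printf("Number: %s\n", yytext); }
[ \t\n]+    { /* whitespace: matched and ignored */ }
[a-zA-Z]+   { printf("Identifier: %s\n", yytext); }
"+"         { printf("Operator: +\n"); }
"="         { printf("Assignment: =\n"); }
%%

int main() {
    yylex();         /* start scanning the input */
    return 0;
}

int yywrap() { return 1; }
```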
Explanation of the Example:
● %{ ... %}: Code enclosed between these markers is copied directly into the generated C
file. Here, we include the stdio.h header.
● %%: Delimiters that separate the different sections of the Lex file.
● Pattern-Action Pairs:
○ [0-9]+: Matches one or more digits. When matched, it prints the number.
○ [ \t\n]+: Matches whitespace and ignores it.
○ [a-zA-Z]+: Matches identifiers (alphabetic strings).
○ "+" and "=": Match the + operator and = assignment, respectively,
printing the corresponding output.
● int main(): The main function calls yylex(), which is the generated function to
start scanning the input.
Processing of the Lex File
Once you write the Lex specification file, you can feed it into the Lex tool. The Lex tool
then processes the file and performs the following steps:
1. Lexical Analysis:
○ Lex reads the specification file and generates a C source file (lex.yy.c) that
implements a lexical analyzer.
○ The Lex tool internally creates a finite automaton for the regular
expressions and embeds this automaton in the generated C code.
2. Finite Automaton (DFA/NFA):
○ NFA Construction: Internally, Lex converts each regular expression into a
Non-deterministic Finite Automaton (NFA).
○ DFA Construction: Then, it converts the NFA into a Deterministic Finite
Automaton (DFA), which is used for efficient pattern matching. Lex
optimizes the DFA to improve scanning performance.
3. Generating C Code:
○ Lex generates a C source file (lex.yy.c) that contains the code for the
lexical analyzer.
○ This file includes:
■ yylex(): A function that performs the scanning of the input stream
and matches patterns against the defined regular expressions.
■ yytext: A global variable that holds the current text matched by a
pattern.
■ yyin and yyout: Input and output files, respectively, which are used
by Lex for reading and writing data.
■ State Machine: The core of the generated code, which implements
the DFA/NFA.
4. Compiling the Generated Code:
○ After generating the lex.yy.c file, the next step is to compile it into an
executable. This can be done using a C compiler.
Example Command:
lex example.l # Generate lex.yy.c
gcc lex.yy.c -o lexer -lfl # Compile and link with the Flex library
5. Running the Lexer:
○ Once compiled, the generated lexical analyzer can be executed, which
reads the input, matches the patterns, and executes the associated
actions (e.g., printing tokens, recognizing keywords, etc.).
Example Command:
./lexer < input.txt # Runs the lexer on input.txt
Key Predefined Variables in Lex
1. yytext
● Purpose: Contains the text of the current token matched by the regular
expression.
● Type: char* (a pointer to a string).
● Usage:
○ You can access yytext in the action part of a rule to refer to the
string that was matched.
○ It is automatically updated after each successful pattern match.
Example: [a-zA-Z]+ { printf("Identifier: %s\n", yytext); }
2. yyleng
● Purpose: Stores the length of the string in yytext.
● Type: int.
● Usage:
○ Indicates the number of characters matched by the regular
expression.
○ Useful for validating the length of tokens.
Example: [a-zA-Z]+ { printf("Identifier (%d characters): %s\n", yyleng, yytext); }
3. yylineno
● Purpose: Tracks the current line number in the input file being scanned.
● Type: int.
● Usage:
○ Helps in error reporting and debugging by providing the line number
where a token is found.
○ Line tracking is not necessarily on by default; in Flex you enable it with
%option yylineno in the definitions section.
Example: . { printf("Unrecognized character '%s' at line %d\n", yytext, yylineno); }
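A minimal specification fragment that enables line tracking in Flex and reports unrecognized characters might look like this (a sketch):

```lex
%option yylineno
%%
.   { fprintf(stderr, "Unrecognized character '%s' at line %d\n", yytext, yylineno); }
```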
4. yyin
● Purpose: Points to the input file being scanned.
● Type: FILE*.
● Default Value: stdin.
● Usage:
○ By default, Lex reads from the standard input. You can assign yyin
to another file pointer to scan from a file.
Example: yyin = fopen("input.txt", "r"); // Redirect input to "input.txt"
5. yyout
● Purpose: Points to the output file for token processing.
● Type: FILE*.
● Default Value: stdout.
● Usage:
○ You can redirect output to a specific file by changing the value of
yyout.
Example: yyout = fopen("output.txt", "w"); // Redirect output to "output.txt"
6. yywrap()
● Purpose: Determines what happens when the end of the input is reached.
● Type: Function.
● Default Behavior: Returns 1, signaling the end of input.
● Usage:
○ If you want to provide additional input after reaching the end of a
file, you can override yywrap() to return 0 and continue scanning.
Example: int yywrap() {
return 1; // Indicate end of input
}
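To continue scanning from a second file instead of stopping, yywrap() can be overridden along these lines. This is a sketch intended for the user code section of a Lex specification (the filename second_input.txt is hypothetical):

```c
int yywrap() {
    static int switched = 0;
    if (!switched) {
        FILE *next = fopen("second_input.txt", "r");  /* hypothetical second file */
        if (next) {
            yyin = next;    /* redirect the scanner to the new file */
            switched = 1;
            return 0;       /* 0 = more input available, keep scanning */
        }
    }
    return 1;               /* 1 = no more input, stop */
}
```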
7. YYSTATE
● Purpose: Represents the scanner's current start condition; in Flex, YYSTATE is an
alias for YY_START, kept for AT&T lex compatibility.
● Type: int.
● Usage:
○ Used in conjunction with start conditions to manage lexical analysis
in different contexts.
○ You can explicitly set or check the scanner's state.
Example:
%x COMMENT
%%
"/*" { BEGIN(COMMENT); }
<COMMENT>"*/" { BEGIN(INITIAL); }
8. YY_START
● Purpose: Gives the scanner's current start condition.
● Type: Integer-valued macro.
● Usage:
○ Represents the current state in terms of start conditions.
○ Can be used in actions to check or set the scanner’s state.
Example:
%x STRING
%%
"\"" { printf("Entering STRING mode\n"); BEGIN(STRING); }
<STRING>. { printf("STRING content: %s\n", yytext); }
<STRING>"\"" { printf("Exiting STRING mode\n"); BEGIN(INITIAL); }
9. YY_USER_ACTION
● Purpose: Allows you to insert user-defined code that will execute before
any action for a matched token.
● Type: Macro.
● Usage:
○ Can be used for debugging or tracking purposes.
○ Typically defined in the definitions section.
Example:
%{
#define YY_USER_ACTION printf("Matched token: %s\n", yytext);
%}
%%
[a-zA-Z]+ { /* Your action here */ }
10. YY_FATAL_ERROR()
● Purpose: Handles fatal errors during lexical analysis.
● Type: Macro (in Flex it can be redefined by the user).
● Usage:
○ You can redefine this macro to customize error handling in case of
scanner failures.
Example (in the definitions section):
%{
#define YY_FATAL_ERROR(msg) \
    do { fprintf(stderr, "Fatal Error: %s\n", msg); exit(1); } while (0)
%}