Ch3 LexicalAnalysis
Lexical Analysis Phase
● The main task of the lexical analyzer is to read the input characters and, as a result, produce
a sequence of lexical units that the syntactic analyzer will use.
● The lexical analyzer and the parser (syntax analyzer) form a producer/consumer pair.
● The channel between the lexical analyzer and the syntactic analyzer is a buffer with a
capacity of a certain number of tokens. The parser sometimes needs to consult the next
tokens without consuming them.
● On receiving a "next lexical unit?" command from the parser, the lexical analyzer reads
the input characters until it can identify the next lexical unit.
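The buffered channel and the "consult without consuming" behaviour described above can be sketched in C. This is only an illustrative sketch under assumptions: the names TokenBuffer, scan_token, peek_token and next_token, the capacity BUF_CAP, and the simplification that every non-blank character is one token are all hypothetical and do not come from the course.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_CAP 4  /* assumed capacity of the token buffer */

/* Hypothetical token: here each non-blank character is one token. */
typedef struct { char name; } Token;

typedef struct {
    const char *input;   /* remaining input characters */
    Token buf[BUF_CAP];  /* tokens produced but not yet consumed */
    size_t count;        /* number of buffered tokens */
} TokenBuffer;

/* Producer: scan one token from the input; name '\0' marks end of input. */
Token scan_token(TokenBuffer *tb) {
    while (*tb->input == ' ') tb->input++;   /* skip blanks */
    Token t = { *tb->input };
    if (*tb->input != '\0') tb->input++;
    return t;
}

/* Look at the k-th upcoming token (k = 0 is the next one) without consuming it. */
Token peek_token(TokenBuffer *tb, size_t k) {
    assert(k < BUF_CAP);
    while (tb->count <= k) {
        tb->buf[tb->count] = scan_token(tb);
        tb->count++;
    }
    return tb->buf[k];
}

/* Consumer: remove and return the next token. */
Token next_token(TokenBuffer *tb) {
    Token t = peek_token(tb, 0);
    for (size_t i = 1; i < tb->count; i++) tb->buf[i - 1] = tb->buf[i];
    tb->count--;
    return t;
}
```

With the input "a + b", a call to peek_token(&tb, 1) reports the token + while leaving a as the next token that next_token will return.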
[Figure: Input → (read a character) → Lexical Analyzer → (provide a token and its attributes) → Syntactic Analyzer]
Role of Lexical Analyzer
● Read characters from the input text: reading is done character by character until a lexical
unit is formed.
● Remove blanks, comments, etc.: although spacing and comments can play a role in
separating tokens, the lexical analyzer eliminates them.
● Form lexical units (tokens).
● Pass <lexical unit,lexical value> pairs to the Syntax Analyzer:
Example:
● When the lexical analyzer encounters a sequence of digits in the input, it sends the num
token to the parser. The value of the number is sent as an attribute of the token.
● For example, the input 25+11 is transformed into the sequence of token/attribute pairs:
● <num,25> <+,> <num,11>
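The transformation of 25+11 into token/attribute pairs can be reproduced with a small hand-written scanner. This is a minimal sketch, not the course's implementation: the names TokenName, Token, next_token and print_tokens are hypothetical, and every non-digit, non-blank character is assumed to be a one-character operator token.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

typedef enum { TOK_NUM, TOK_OP, TOK_END } TokenName;  /* hypothetical token names */

typedef struct {
    TokenName name;
    int value;  /* TOK_NUM: numeric value; TOK_OP: the operator character itself */
} Token;

/* Read the next token starting at *p and advance *p past its lexeme. */
Token next_token(const char **p) {
    assert(p != NULL && *p != NULL);
    while (**p == ' ' || **p == '\t') (*p)++;   /* drop blanks */
    Token t = { TOK_END, 0 };
    if (isdigit((unsigned char)**p)) {
        t.name = TOK_NUM;
        while (isdigit((unsigned char)**p)) {   /* accumulate the value */
            t.value = t.value * 10 + (**p - '0');
            (*p)++;
        }
    } else if (**p != '\0') {
        t.name = TOK_OP;
        t.value = **p;
        (*p)++;
    }
    return t;
}

/* Print the token/attribute pairs of a whole input string. */
void print_tokens(const char *s) {
    for (Token t = next_token(&s); t.name != TOK_END; t = next_token(&s)) {
        if (t.name == TOK_NUM) printf("<num,%d> ", t.value);
        else printf("<%c,> ", (char)t.value);
    }
    printf("\n");
}
```

Calling print_tokens("25+11") prints the three pairs <num,25> <+,> <num,11> on one line.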
[Figure: Source code → Lexical Analyzer ⇄ Syntactic Analyzer (current token / next token), together with the table of symbols]
Lexical Units, lexemes and models
● For a lexical unit recognized in the source text, we need to distinguish four
important concepts: the lexical unit, the lexeme, the model (pattern), and the attribute.
● A lexical unit, also known as a lexical token, is a pair consisting of a name and an
optional value.
● The name of the lexical unit is a class of lexemes.
● For most programming languages, the following constructs are treated as lexical
units:
● Keywords.
● Operators.
● Identifiers.
● Constants.
● Punctuation symbols ( '(', ')', ',', ':', etc.).
● A lexeme is a sequence of characters in the source program that matches the model of a
lexical unit.
● Example:
● const max_length = 256;
● In the previous declaration, the string max_length is a lexeme of the lexical unit Identifier.
● For reserved words (keywords) such as const, if, while, etc., the lexeme and model
generally coincide. The model for the lexical unit const is the string const.
● For a rel_oper lexical unit representing relational operators, the model is the set of relational
operators: <, <=, ==, >=, >, !=.
● To precisely describe the models (patterns) of more complex lexical units such as identifiers
and numbers, we use regular expressions.
● Languages and tools are available for efficient recognition of regular expressions by finite
automata.
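To illustrate how a model written as a regular expression decides whether a lexeme belongs to a lexical unit, here is a short sketch that uses the POSIX regex interface (regcomp, regexec, regfree from <regex.h>, which is POSIX rather than ISO C). The two patterns and the names matches, ID_MODEL and NUM_MODEL are assumptions chosen for the example; the course does not fix the exact models.

```c
#include <assert.h>
#include <regex.h>   /* POSIX regular expressions */
#include <stddef.h>

/* Assumed models for identifiers and unsigned integers. */
const char *ID_MODEL  = "^[A-Za-z_][A-Za-z0-9_]*$";
const char *NUM_MODEL = "^[0-9]+$";

/* Return 1 if the whole string s matches the extended regular expression pattern. */
int matches(const char *pattern, const char *s) {
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0);                   /* the pattern itself must be valid */
    rc = regexec(&re, s, 0, NULL, 0);  /* 0 means "match found" */
    regfree(&re);
    return rc == 0;
}
```

Under these assumed models, max_length is accepted as an identifier and 256 as a number, while 256 is rejected as an identifier.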
● The attribute, if it exists, depends on the lexical unit in question and completes it. Of the
preceding lexical units, only the last two (identifiers and numbers) have an attribute:
● For a number, this is its value (123, -5).
● For an identifier, this is a reference to a table containing all identifiers encountered.
● For diagnostic purposes, the line number where a lexical unit first appears can be stored.
Both the lexical unit and the line number can be stored in the associated symbol table entry.
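A symbol table entry that records an identifier together with the line of its first appearance might look like the following sketch. The layout is an assumption made for illustration: the fixed capacities, the Symbol structure and the two-argument signature given to insert_id here are hypothetical, and a real compiler would more likely use a hash table.

```c
#include <assert.h>
#include <string.h>

#define MAX_SYMBOLS 64  /* assumed fixed capacity, for illustration only */
#define MAX_LEXEME  32

typedef struct {
    char lexeme[MAX_LEXEME];  /* the identifier's spelling */
    int  first_line;          /* line where it first appeared (for diagnostics) */
} Symbol;

Symbol table[MAX_SYMBOLS];
int n_symbols = 0;

/* Return the index of lexeme in the table, inserting it if absent.
   The index serves as the attribute of the identifier token.
   Returns -1 if the table is full or the lexeme is too long. */
int insert_id(const char *lexeme, int line) {
    assert(lexeme != NULL);
    for (int i = 0; i < n_symbols; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0) return i;
    if (n_symbols == MAX_SYMBOLS || strlen(lexeme) >= MAX_LEXEME) return -1;
    strcpy(table[n_symbols].lexeme, lexeme);
    table[n_symbols].first_line = line;
    return n_symbols++;
}
```

Inserting the same identifier a second time returns the existing index, so the stored line number stays that of the first appearance.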
Specification of lexical units
● Words (Strings).
● Languages.
● Regular expressions.
● Regular definitions.
Words (Strings).
● An alphabet is a finite set of symbols.
● Examples: {0, 1}, {A, C, G, T}, the set of all letters, the set of all digits, the ASCII code, etc.
● Blank characters (i.e. spaces, tabs and end-of-line marks) are generally not part of
alphabets.
● A string (or word) over an alphabet is a finite sequence of symbols drawn from it.
● Examples, relating respectively to the preceding alphabets:
● 00011011,
● ACCAGTTGAAGTGGACCTTT,
● Bonjour,
● 2001.
● The empty string, which contains no characters, is denoted ε.
● The length of a string s, |s|, is the number of occurrences of symbols in s.
● The string ε is of length 0.
● A language on an alphabet is a set of strings built on it.
● Trivial examples: ∅, the empty language, and {ε}, the language reduced to the single empty string. More
interesting examples (relative to the preceding alphabets): the set of numbers in binary notation, the set
of DNA strings, the set of words in the French language, etc.
Recognizing lexical units
Goal
● Build a lexical analyzer that isolates the lexeme associated with the next lexical unit and
produces a pair consisting of the appropriate lexical unit and an attribute value, using the
table that follows.
● Blanks, defined by the regular expression "blank", are eliminated by the lexical analyzer.
● When the lexical analyzer encounters blanks, it continues searching for a significant lexical
unit, which it returns to the syntactic analyzer.
Transition diagrams
● Transition diagrams describe the actions that are performed when the
parser calls a lexical analyzer to provide the next lexical unit.
● If the label of an arc coming out of the current state matches the input
character, we move on to the state pointed to by this arc. Otherwise, an
error is signalled.
[Figure: transition diagram with an arc labeled "letter | digit" and an accept state]
Construction of a Transition diagram
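[Figure: transition diagrams. An asterisk (*) marks an accept state that moves back one character.]
Relational operators (begin at state 0):
    0 --<--> 1
    1 --=--> 2 : return (relop, LE)
    1 -->--> 3 : return (relop, DIFF)
    1 --other--> 4 * : return (relop, LT)
    0 --=--> 5
    0 -->--> 6
    6 --=--> 7 : return (relop, GE)
    6 --other--> 8 * : return (relop, GT)
Identifiers:
    --letter--> 9, 9 --letter | digit--> 9, 9 --other--> 10 * : return (token_id (), insert_id())
Numbers:
    --digit--> 11 --.--> 12 --digit--> 13 --E--> 14 --+|- --> 15 --digit--> 16 --other--> 18 *, with loops on digit
    --digit--> 23, 23 --digit--> 23, 23 --other--> 24 *
Blanks:
    --blank--> 25, 25 --blank--> 25, 25 --other--> 26 *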
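The relational-operator diagram can be coded directly as a loop over a switch on the current state. The sketch below is one possible rendering, not the course's code: the function scan_relop, the RelopAttr enumeration and the way the lexeme length is returned are hypothetical, and the attribute EQ for state 5 is an assumption, since the action for that state is not visible in the diagram.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NONE, LT, LE, DIFF, EQ, GE, GT } RelopAttr;  /* EQ assumed for state 5 */

/* Run the relational-operator diagram on s.
   Returns the attribute and stores the lexeme length in *len;
   returns NONE (with *len == 0) if no operator starts at s. */
RelopAttr scan_relop(const char *s, size_t *len) {
    assert(s != NULL && len != NULL);
    int state = 0;
    size_t pos = 0;  /* number of characters read so far */
    for (;;) {
        char c;
        switch (state) {
        case 0:
            c = s[pos++];
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else { *len = 0; return NONE; }
            break;
        case 1:
            c = s[pos++];
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 6:
            c = s[pos++];
            state = (c == '=') ? 7 : 8;
            break;
        case 2: *len = pos; return LE;
        case 3: *len = pos; return DIFF;
        case 4: *len = pos - 1; return LT;   /* starred state: give back one character */
        case 5: *len = pos; return EQ;
        case 7: *len = pos; return GE;
        default: *len = pos - 1; return GT;  /* state 8, also starred */
        }
    }
}
```

The starred states 4 and 8 are reached only after reading one character too many, which is why they report a lexeme one character shorter than the number of characters read.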
Transition diagrams
token lexical () {
    while (TRUE) {
        switch (state) {
        case 0: c = nextchar ();
            if (c == SPACE || c == TAB || c == EOL)
                ;   /* skip blanks and stay in state 0 */
            else if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else state = failure ();   /* failure (): error recovery routine */
            break;
        /* ... States 1 to 8 */
        case 9: c = nextchar ();
            if (!isalnum (c)) state = 10;   /* stay in state 9 on a letter or digit */
            break;
        case 10: back (1);   /* back (n) moves back n characters in the buffer */
            insert_id ();
            return (token_id ());
        /* ... States 12 to 22 */
        case 23: c = nextchar ();
            if (!isdigit (c)) state = 24;   /* stay in state 23 on a digit */
            break;
        case 24: back (1);
            insert_nb ();
            return NB;
        }
    }
}
Error Handling
● Some errors are lexical in nature: for example, encountering ASCII character
number 14 (shift out), which should never appear in the source
programs of a given language.
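Detecting such an illegal character can be sketched as a simple check applied to each input character. The policy used here (printable ASCII plus tab, newline and carriage return are legal) and the names is_legal_char and first_lexical_error are assumptions for illustration; each language defines its own set of legal characters.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed policy: printable ASCII plus tab, newline and carriage return are legal. */
int is_legal_char(unsigned char c) {
    return (c >= 32 && c <= 126) || c == '\t' || c == '\n' || c == '\r';
}

/* Return the index of the first illegal character in s, or -1 if there is none. */
long first_lexical_error(const char *s) {
    assert(s != NULL);
    for (size_t i = 0; s[i] != '\0'; i++)
        if (!is_legal_char((unsigned char)s[i])) return (long)i;
    return -1;
}
```

A lexical analyzer could call such a check before forming a lexeme and report the position of the offending character instead of producing a token.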
TP 1: Flex – What is a lexical analyzer?
TP 1: Flex – How to build a lexical analyzer?
TP1 : Flex – Flex tool
%{
C declarations
%}
Regular definitions
%%
Translation rules
%%
User code (auxiliary C functions)
● Regular definitions:
● A regular definition allows you to associate a name with a regular expression and
then refer to that name, rather than the regular expression, in the rules section.
● Translation rules:
● exp1 { action1 }
  exp2 { action2 }
  ...
  expn { actionn }
● Variables:
● yyin: the file the analyzer reads from (default: stdin)
● Functions:
● int yylex ( ): function that starts the lexical analyzer.
● int yywrap ( ): function always called at the end of the input text. It returns 0
if the analysis is to continue and 1 otherwise.
TP1 : Flex – Setting up and preparing resources
First step with Flex – First Example
%%
(0|1)+ printf ("it is a binary number");
.* printf ("it is not a binary number");
%%
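Flex chooses the rule that matches the longest text and, when two rules match the same length, the rule listed first. On a line such as 0101 both rules match four characters, so the first rule wins; on 01ab the second rule matches more text and wins. For comparison, the same decision for a single line can be written by hand in C. This is a sketch only: the function name is_binary_number is hypothetical, and the empty line is treated as "not binary" by assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if line is a non-empty string made only of 0s and 1s, else 0. */
int is_binary_number(const char *line) {
    assert(line != NULL);
    if (*line == '\0') return 0;   /* (0|1)+ requires at least one symbol */
    for (const char *p = line; *p != '\0'; p++)
        if (*p != '0' && *p != '1') return 0;
    return 1;
}
```

The Flex version obtains the same result from two one-line rules, which is the main benefit of generating the analyzer from regular expressions.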
First step with Flex – Other Examples
First step with Flex – Example: if (vitesse >= 110)
i.f : LU KW m1
(a|..|z).(a|..|z|0|..|9)* = [a-z].[a-z0-9]* : LU ID m2
[0-9]+ : LU NB m3
…