
Language Theory and Automata

Ch3: Lexical Analysis


by Dr. Ameni Mejri

Academic Year 2023/2024


Outline

● Generalities (lexical analyzer, interface to the lexical analyzer, role of the lexical analyzer)
● Lexical units, lexemes and models
● Regular definitions
● Transition diagrams
● Error handling
Lexical Analysis Phase

● First compilation phase.


● Recognition of lexical units from source code (character streams).
● Main lexical units:
● Single special characters: +, =, etc.
● Double special characters: <=, ++, etc.
● Keywords: if, while, etc.
● Literal constants: 123, -5, etc.
● Identifiers: i, wind_speed, etc.
Lexical Analysis Phase

● The main task of the lexical analyzer is to read the input characters and, as a result, produce
a sequence of lexical units that the syntactic analyzer will use.
● The lexical analyzer and the parser (syntax analyzer) form a producer/consumer pair.
● The channel between the lexical analyzer and the syntactic analyzer is a buffer with a
capacity of a certain number of tokens. The parser sometimes needs to consult the next
tokens without consuming them.
● On receiving a "next lexical unit?" command from the parser, the lexical analyzer reads
the input characters until it can identify the next lexical unit.
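The producer/consumer channel described above can be sketched as a small ring buffer that lets the parser peek at upcoming tokens without consuming them. This is a minimal illustration only; TokenBuf, put_token, peek_token and next_token are invented names, not part of the course material:

```c
#include <assert.h>

/* Sketch of the lexer/parser token channel: a bounded buffer that the
   parser can consult ("peek") without consuming. TokenBuf and these
   function names are illustrative, not from the course material. */
#define CAP 8

typedef struct { int items[CAP]; int head; int count; } TokenBuf;

/* Producer side: the lexical analyzer deposits a token. */
static void put_token(TokenBuf *b, int tok) {
    b->items[(b->head + b->count) % CAP] = tok;
    b->count++;
}

/* Consult the k-th upcoming token without consuming it. */
static int peek_token(const TokenBuf *b, int k) {
    return b->items[(b->head + k) % CAP];
}

/* Consumer side: the parser's "next lexical unit?" request. */
static int next_token(TokenBuf *b) {
    int t = b->items[b->head];
    b->head = (b->head + 1) % CAP;
    b->count--;
    return t;
}
```

A real implementation would block or refill when the buffer runs empty; the sketch omits capacity checks for brevity.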

[Figure: the lexical analyzer reads characters from the input (and may return a character to it); on a "next lexical unit?" request from the syntactic analyzer, it provides a token and its attributes.]
Role of Lexical Analyzer

● Read characters from the input text: reading is done character by character until a lexical
unit is formed.
● Remove blanks, comments, etc.: although spacing and comments can play a role in
separating tokens, the lexical analyzer eliminates them.
● Form lexical units (tokens).
● Pass <lexical unit,lexical value> pairs to the Syntax Analyzer:

Example:
● when the lexical analyzer comes across a sequence of numbers as input, it sends the num
token to the parser. The value of the number is sent as an attribute of the token.
● For example, the input 25+11 is transformed into the sequence of token/attribute pairs:
● <num,25> <+,> <num,11>

● Link error messages from the compiler to the source code.


● Pre-processor processing, if any.
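The 25+11 example above can be sketched in C. This is a minimal, illustrative implementation (tokenize is an invented helper): it scans the input and writes out the <token,attribute> pairs the lexical analyzer would pass to the parser:

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Minimal sketch of the 25+11 example: scan the input and write the
   <token,attribute> pairs the lexer would hand to the parser.
   tokenize() is an illustrative helper, not part of the course code. */
static void tokenize(const char *p, char *out) {
    int pos = 0;
    while (*p) {
        if (isdigit((unsigned char)*p)) {
            int value = 0;
            while (isdigit((unsigned char)*p))          /* accumulate the number */
                value = value * 10 + (*p++ - '0');
            pos += sprintf(out + pos, "<num,%d> ", value);  /* token + attribute */
        } else {
            pos += sprintf(out + pos, "<%c,> ", *p++);      /* operator: no attribute */
        }
    }
    if (pos > 0) out[pos - 1] = '\0';   /* trim the trailing space */
    else out[0] = '\0';
}
```

Calling tokenize("25+11", buf) leaves "<num,25> <+,> <num,11>" in buf, matching the pair sequence above.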

Role of Lexical Analyzer

Typical implementation diagram for the lexical analyzer


- interaction with the syntactic analyzer

[Figure: the lexical analyzer reads the source code and supplies the current and next tokens to the syntactic analyzer; both components consult the table of symbols.]
Lexical Units, lexemes and models

● In terms of a lexical unit recognized in the source text, we need to distinguish four
important concepts:

● the lexical unit,
● the lexeme,
● possibly, an attribute,
● the model.
Lexical Units, lexemes and models

● A lexical unit, also known as a lexical token, is a pair consisting of a name and an
optional value.
● The name of the lexical unit is a class of lexemes.
● For most programming languages, the following constructs are treated as lexical
units:
● Keywords.
● Operators.
● Identifiers.
● Constants.
● Punctuation symbols ( '(', ')', ',', ':', etc.).

Lexical Units, lexemes and models

● A lexeme is a sequence of characters in the source program that matches the lexical
unit model.

● Example:
● const max_length = 256;
● In the previous declaration, the string max_length is a lexeme of the lexical unit Identifier.
Lexical Units, lexemes and models

● A model is a rule describing the strings that correspond to a lexical unit.

● For reserved words (key-words) such as const, if, while, etc., the lexeme and model
generally coincide. The model for the lexical unit const is the string const.

● For a rel_oper lexical unit representing relational operators, the model is the set of relational operators: <, <=, ==, >=, >, !=.

● To precisely describe the models (patterns) of more complex lexical units such as identifiers
and numbers, we use regular expressions.

● In pattern-driven programming, patterns are expressed by regular expressions.

● Languages and tools are available for efficient recognition of regular expressions by finite
automata.
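As a minimal illustration of recognizing a regular expression with a finite automaton in code, the pattern (0|1)+ can be hand-coded with two states. The function name is an invented one, and the sketch is for illustration only:

```c
#include <assert.h>

/* Sketch: recognizing the regular expression (0|1)+ with a hand-coded
   finite automaton. Two states: 0 (start) and 1 (accepting). The
   function name is illustrative. */
static int matches_binary(const char *s) {
    int state = 0;                  /* start state */
    for (; *s; s++) {
        if (*s == '0' || *s == '1')
            state = 1;              /* enter / stay in the accepting state */
        else
            return 0;               /* symbol outside the alphabet: reject */
    }
    return state == 1;              /* accept only if at least one digit was read */
}
```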

Lexical Units, lexemes and models

● An attribute is defined as a pointer to the symbol-table entry in which information about the lexical unit is stored.

● The attribute, if it exists, depends on the lexical unit in question and completes it. Of the constructs listed earlier, only numbers and identifiers carry an attribute:
● For a number, it is its value (123, -5).
● For an identifier, it is a reference to a table containing all identifiers encountered.

● For diagnostic purposes, the line number where a lexical unit first appears can be stored. Both the lexical unit and the line number can be stored in the associated symbol-table entry.
Lexical Units, lexemes and models

Lexical unit   Lexemes             Informal description of the model
const          const               const
if             if                  if
rel_oper       < <= == != >= >     < <= == != >= >
identifier     e pi length         Letter followed by letters, digits or the '_' character
number         3.141 256 0.196     Numerical constants
literal        "stack overflow"    Strings enclosed in quotation marks
Specification of lexical units

● Words (Strings).
● Languages.
● Regular expressions.
● Regular definitions.

Specification of lexical units

Words (Strings).
● An alphabet is a finite set of symbols.
● Examples: {0; 1}, {A; C; G; T}, the set of all letters, the set of all numbers, the ASCII code, etc.
● Blank characters (i.e. spaces, tabs and end-of-line marks) are generally not part of
alphabets.
● A string (or word) in the alphabet is a finite sequence of symbols extracted from it.
● Examples, relating respectively to the preceding alphabets:
● 00011011,
● ACCAGTTGAAGTGGACCTTT,
● Bonjour,
● 2001.
● The empty string, written ε, has no characters.
● The length of a string s, |s|, is the number of occurrences of symbols in s.
● The string ε is of length 0.
● A language over an alphabet is a set of strings built on it.
● Trivial examples: ∅, the empty language, and {ε}, the language reduced to the single empty string. More interesting examples (relative to the preceding alphabets): the set of numbers in binary notation, the set of DNA strings, the set of words in the French language, etc.
Recognizing lexical units

Goal
● Build a lexical analyzer that isolates the lexemes associated with the next lexical unit and produces a pair consisting of the appropriate lexical unit and an attribute value, using the table below.
● Blanks, defined by the RE blank, are eliminated by the lexical analyzer.
● When the lexical analyzer encounters blanks, it continues searching for a significant lexical unit, which it returns to the syntactic analyzer.
Recognizing lexical units

Regular expression   Lexical unit   Attribute value
blank                (none)         (none)
if                   if
then                 then
else                 else
id                   id             Pointer to an entry in the symbol table
float                float          Pointer to an entry in the symbol table
<                    relop          LT
<=                   relop          LE
=                    relop          EQ
<>                   relop          DIFF
>=                   relop          GE
>                    relop          GT
Transition diagrams

● Transition diagrams describe the actions that are performed when the
parser calls a lexical analyzer to provide the next lexical unit.

● An initial state of the diagram.

● Entering a state reads the next character.

● If the label of an arc coming out of the current state matches the input
character, we move on to the state pointed to by this arc. Otherwise, an
error is signalled.

Transition diagrams

Examples

Transition diagram for relational operators (a '*' marks an accepting state reached on "other", where the last character read is retracted):

begin: state 0
0 --'<'--> 1
1 --'='--> 2 : return (relop, LE)
1 --'>'--> 3 : return (relop, DIFF)
1 --other--> 4* : return (relop, LT)
0 --'='--> 5 : return (relop, EQ)
0 --'>'--> 6
6 --'='--> 7 : return (relop, GE)
6 --other--> 8* : return (relop, GT)

Transition diagram for identifiers and keywords:

begin: state 9
9 --letter--> 10
10 --letter | digit--> 10
10 --other--> 11* : return (token_id (), insert_id ())
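The identifier diagram above translates almost line-for-line into code. In this sketch (scan_identifier is an invented name), the retraction on the "other" transition is modeled by simply not counting the lookahead character in the lexeme length:

```c
#include <assert.h>
#include <ctype.h>

/* Sketch transcribing the identifier diagram (states 9-11): state 9
   requires a letter, state 10 loops on letters and digits, and the
   "other" transition enters the accepting state with a retraction,
   modeled here by not counting the lookahead character in *len. */
static int scan_identifier(const char *p, int *len) {
    if (!isalpha((unsigned char)p[0]))   /* state 9: must start with a letter */
        return 0;
    int i = 1;
    while (isalnum((unsigned char)p[i])) /* state 10: loop on letter | digit */
        i++;
    *len = i;                            /* state 11: lexeme length, lookahead retracted */
    return 1;
}
```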
Transition diagrams

● Separate keywords from identifiers.


● Place the keywords (if, then, else) in the symbol table before starting the
analysis.
● Note in the symbol table the lexical unit to be returned when one of these
strings is recognized.
● The insert_id procedure accesses the buffer where the lexical unit was
found. It works as follows:
● The symbol table is examined. If the lexeme is found with the keyword
indication, 0 is returned.
● If the lexeme is found as a variable, a pointer to a symbol table entry is
returned.
● If the lexeme is not found in the symbol table, it is stored there and a pointer to
this new entry is returned.
● The token_id procedure searches for the lexeme in the symbol table; if it's
a keyword, the corresponding lexical unit is returned, otherwise an id is
returned.
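The keyword trick above — pre-loading the reserved words into the symbol table — can be sketched as follows. This is a simplified variant of the slides' insert_id/token_id (it returns the table index for every lexeme instead of the 0-for-keyword convention); all names are illustrative:

```c
#include <assert.h>
#include <string.h>

/* Sketch: the symbol table is pre-loaded with the keywords before the
   analysis starts, so keywords and identifiers are separated by a
   simple lookup. Simplified, illustrative variant of insert_id/token_id. */
enum { TOK_IF = 1, TOK_THEN, TOK_ELSE, TOK_ID };

#define MAXSYM 64
static struct { const char *lexeme; int token; } symtab[MAXSYM] = {
    { "if", TOK_IF }, { "then", TOK_THEN }, { "else", TOK_ELSE },
};
static int nsym = 3;

/* Look the lexeme up, inserting it as an identifier if absent. */
static int insert_id(const char *lexeme) {
    for (int i = 0; i < nsym; i++)
        if (strcmp(symtab[i].lexeme, lexeme) == 0)
            return i;
    symtab[nsym].lexeme = lexeme;   /* sketch: no overflow check, and the
                                       string must outlive the table */
    symtab[nsym].token = TOK_ID;
    return nsym++;
}

/* Keyword token if the lexeme is reserved, otherwise an ordinary id. */
static int token_id(const char *lexeme) {
    return symtab[insert_id(lexeme)].token;
}
```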

Transition diagrams

[Figure: keyword processing of if and else by automata.]
Construction of a Transition diagram

The complete diagram merges all the token patterns into a single automaton (again, '*' marks an accepting state with a retraction of the last character read):

● Relational operators (states 0-8): as in the previous diagram.
● Identifiers (states 9-10): 9 --letter--> 10; 10 loops on letter | digit; on "other", accept* with return (token_id (), insert_id ()).
● Numbers with fraction and exponent (states 11-18): digit+ '.' digit+ (E (+|-)? digit+)?, accepting on "other" with retraction.
● Decimal numbers (states 19-22): digit+ '.' digit+, accepting on "other" with retraction.
● Integers (states 23-24): digit+, accepting on "other" with retraction.
● Blanks (states 25-26): blank+, accepting on "other" with retraction.
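The number branch of the diagram (integer part, optional fraction, optional exponent) can be transcribed directly. In this sketch (scan_number is an invented name), *len receives the length of the accepted lexeme, with the lookahead character retracted:

```c
#include <assert.h>
#include <ctype.h>

/* Sketch of the number branch: digit+ ('.' digit+)? (E (+|-)? digit+)?,
   mirroring the number states of the combined diagram. Illustrative. */
static int scan_number(const char *p, int *len) {
    int i = 0;
    if (!isdigit((unsigned char)p[i])) return 0;
    while (isdigit((unsigned char)p[i])) i++;        /* integer part: digit+ */
    if (p[i] == '.' && isdigit((unsigned char)p[i + 1])) {
        i++;
        while (isdigit((unsigned char)p[i])) i++;    /* fraction: digit+ */
    }
    if (p[i] == 'E' || p[i] == 'e') {
        int j = i + 1;
        if (p[j] == '+' || p[j] == '-') j++;         /* optional sign */
        if (isdigit((unsigned char)p[j])) {
            j++;
            while (isdigit((unsigned char)p[j])) j++;
            i = j;                                   /* exponent accepted */
        }                                            /* else: retract the 'E' */
    }
    *len = i;
    return 1;
}
```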
Transition diagrams

● There are a few principles to observe:

● an accepting state does not consume characters;


● a non-accepting state (usually) consumes characters;
● each arc can only consume one character at a time, possibly selected from
several possibilities;
● automata must consume as many characters as possible with each token recognized (greedy lexical analysis, or maximal-munch tokenization).

● Greedy lexical analysis reads the maximum number of characters possible before accepting a token.
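Maximal munch can be seen concretely on the relational operators: try the longest lexeme first ("<=", "<>") and only fall back to the one-character token. A sketch with illustrative names (the R_ prefix avoids clashing with anything else):

```c
#include <assert.h>

/* Sketch of maximal munch on relational operators: the two-character
   lexemes are tried first; *len reports how many characters were
   consumed. Illustrative names. */
typedef enum { R_NONE, R_LT, R_LE, R_EQ, R_DIFF, R_GE, R_GT } Relop;

static Relop scan_relop(const char *p, int *len) {
    if (p[0] == '<') {
        if (p[1] == '=') { *len = 2; return R_LE; }   /* longest match wins */
        if (p[1] == '>') { *len = 2; return R_DIFF; }
        *len = 1; return R_LT;                        /* retract: '<' alone */
    }
    if (p[0] == '=') { *len = 1; return R_EQ; }
    if (p[0] == '>') {
        if (p[1] == '=') { *len = 2; return R_GE; }
        *len = 1; return R_GT;
    }
    *len = 0;
    return R_NONE;
}
```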
Transition diagrams

token lexical () {
  while (TRUE) {
    switch (state) {
      case 0: c = nextchar ();
        if (c == SPACE || c == TAB || c == EOL) ;   /* skip blanks */
        else if (c == '<') state = 1;
        else if (c == '=') state = 5;
        else if (c == '>') state = 6;
        else state = failure ();
        break;
      /* ... states 1 to 8 ... */
      case 9: c = nextchar ();
        if (isalnum (c)) state = 10;
        else state = failure ();
        break;
      case 10: back (1);
        insert_id ();
        return (token_id ());
      /* ... states 12 to 22 ... */
      case 23: c = nextchar ();
        if (!isdigit (c)) state = 24;
        break;
      case 24: back (1);
        insert_nb ();
        return NB;
    }
  }
}

back (n) moves back n characters in the buffer; failure () is the error-recovery routine.
Error Handling
● Some errors are lexical in nature. For example, encountering the ASCII
character number 14 (shift out), which should never appear in the source
programs of a given language.

● Many errors cannot be handled by the lexical analyzer. For example:
fi ( a == f(x) )….
● This error seems lexical to us because of the spelling mistake, but 'fi' is a perfectly valid identifier.

● Examples of lexical errors:


● Identifier too long
● Character constant too long
● Invalid character
● End of file in a comment: fatal error
● End of file in a string: fatal error
● Error in a numerical constant
● Line too long...
Error Handling
There are several ways to handle a lexical error:
● Stop the whole compilation with an error message.
● Drop characters from the input until a well-formed token is available.
● Perform one or more editing operations:
● Delete a character (ensures termination).
● Insert a [suitable] character.
● Replace one character with another.
● Transpose two consecutive characters.
● Find a minimum editing distance to obtain a lexically valid program from
the source program.
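The "drop characters" strategy above can be sketched as a panic-mode recovery loop: skip input until a character that can start a well-formed token. can_start_token and panic_recover are illustrative names, and the start set used here is a toy one:

```c
#include <assert.h>
#include <ctype.h>

/* Sketch of panic-mode recovery: skip characters until one that can
   begin a well-formed token. The start set below is a toy example. */
static int can_start_token(int c) {
    return isalnum(c) || c == '<' || c == '=' || c == '>'
        || c == '(' || c == ')';
}

static const char *panic_recover(const char *p, int *dropped) {
    *dropped = 0;
    while (*p && !can_start_token((unsigned char)*p)) {
        p++;                /* deleting a character guarantees termination */
        (*dropped)++;
    }
    return p;
}
```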

TP 1: Flex – What is a lexical analyzer?

● reads a source text (sequence of characters): input
● produces a sequence of lexical units: output

⇨ Lexical unit recognition is based on the notion of regular expressions.
TP 1: Flex – How to build a lexical analyzer?

● define lexical units


● model each lexical unit by a RE
● represent each RE by an automaton
● build the global diagram
● implement the resulting diagram by hand

● Implementing a diagram with a large number of states by hand is not an easy task.
● If you need to add or modify a lexical unit, you have to go through the whole program to make the necessary changes.

⇨ Several tools exist to simplify these tasks: for example, Flex.
TP1 : Flex – Flex tool

● GNU version of Lex.
● A lexical analyzer generator.
● Accepts as input lexical units in the form of REs and produces a program written in C.
● Once the C program has been compiled, it can recognize these lexical units.
● The resulting executable reads the input text character by character until it identifies the longest prefix in the source text that matches one of the REs.
TP1 : Flex – Flex tool

● A .l specification file consists of 4 parts:

%{
Declarations in c
%}

Declaring regular definitions

%%
Translation rules
%%

Main block and auxiliary functions in C

TP1 : Flex – Flex tool

● Regular definitions:
● A regular definition associates a name with a regular expression; the name, rather than the regular expression, can then be used in the rules section.

● Translation rules:
exp1 { action1 }
exp2 { action2 }
...
expn { actionn }

● Each expi is a regular expression that models a lexical unit.
● Each actioni is a sequence of instructions in C.
TP1 : Flex – Flex tool

● Flex regular expressions:


● c : the character c.
● . : any single character, except newline.
● [ ] : one of the listed characters, e.g. [abcdABCD0123].
● - : a character range inside [ ], e.g. [a-dA-D0-3].
● * : repetition (zero or more), e.g. [ab]*.
● + : repetition (at least one), e.g. [ab]+ is [ab][ab]*.
● | : alternative, e.g. 000|110|101|011.
● ( ) : expression grouping, e.g. (0|2|4|8)*.
TP1 : Flex – Flex tool

● Variables:
● yyin: read file (default: stdin)

● yyout: write file (default: stdout)

● char yytext []: character array containing the accepted lexeme.

● int yyleng: length of the accepted lexeme.

● Functions:
● int yylex ( ) : function that starts the lexical analysis.
● int yywrap ( ) : function called at the end of the input text. It returns 0 if the analysis is to continue (with new input) and 1 otherwise.
TP1 : Flex – Setting up and preparing resources

● install the Flex and Codeblocks tools


● add to the path environment variable the "bin" location in both the
CodeBlocks and GNUWin32 folders
C:\Program Files (x86)\CodeBlocks\MinGW\bin
C:\Program Files (x86)\GnuWin32\bin (flex)
● create a folder on the desktop and start the tutorial
● From the command prompt, issue the command: flex FileName.l
● If successful, the file lex.yy.c is generated in the same directory.
● Compile the file lex.yy.c to generate the executable: gcc lex.yy.c
● Test the a.exe file obtained to check that it works correctly.

First step with Flex – First Example

1. Start by installing the Flex and CodeBlocks tools (development tools using the C language).
2. Write the following Flex program to tell whether an input string is a binary number or not. Open a new text file and type in the code below; the file must be saved with the .l extension (e.g. binary.l).

%%
(0|1)+ printf ("it is a binary number");
.* printf ("it is not a binary number");
%%

int yywrap(){ return 1; }

main() { yylex(); }
First step with Flex – First Example

1. Place the resulting file in 'C:\Program Files\GnuWin32\bin'.


2. From the command prompt, issue the command: C:\Program
Files\GnuWin32\bin> flex binary.l
3. If successful, the file lex.yy.c is generated in the same directory.
4. Compile the lex.yy.c file from the command prompt, issue the
command: C:\Program Files\GnuWin32\bin> gcc lex.yy.c to generate
the executable.
5. Test the a.exe file obtained to check that it works correctly.
First step with Flex – Other Examples

1. Modify the previous exercise to display only recognized binary numbers.
2. Write and compile the following specification file:

pairpair (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*

%%
{pairpair} printf ("[%s]: even number of a and b \n", yytext);
a*b* printf ("[%s]: first a's, then b's \n", yytext);
.
%%
int yywrap(){ return 1; }
main() { yylex(); }
First step with Flex – Other Examples

1. Test the inputs babbaaab, abbb, aabb, baabbbb, bbaabbba, baabbbbab, aaabbbba.
2. Same question, swapping the two lines:

pairpair (aa|bb)*((ab|ba)(aa|bb)*(ab|ba)(aa|bb)*)*

%%
a*b* printf ("[%s]: first a's, then b's \n", yytext);
{pairpair} printf ("[%s]: even number of a and b \n", yytext);
.
%%
int yywrap(){ return 1; }
main() { yylex(); }
First step with Flex – Other Examples

1. Is there a difference? What difference? Why or why not?


2. Consider the lexical unit id defined as follows: an identifier is a
sequence of letters and numbers. The first character must be a letter.
Using Flex, write a lexical analyzer that can recognize the lexical unit id
from an input string.
3. Modify the previous exercise so that the lexical analyzer recognizes the
two lexical units id and nb, knowing that nb is a lexical unit that
designates natural integers.

First step with Flex – Example: if (vitesse >= 110)

For the input if (vitesse >= 110):
● if : lexical unit KW, model m1 (the string if).
● vitesse : lexical unit ID, model m2 : (a|..|z)(a|..|z|0|..|9)* = [a-z][a-z0-9]*.
● 110 : lexical unit NB, model m3 : [0-9]+.
