MOD 04 - Language Description & Lexical Analysis
Compiler phases
[Figure: compiler phases, showing the source code, the lexical analyzer, the lexemes it produces, and the syntax analyzer.]
Introduction to lexical analysis
Techniques and Tools:
Regular Expressions: Define patterns for recognizing tokens using regular expressions.
Finite Automata: Construct finite automata to recognize lexical patterns efficiently.
Lexical Analyzers (Lexers): Implement lexical analyzers using tools like Lex, Flex, or
hand-written code.
Tokenization: Categorize lexemes into different token types and generate a token stream.
Error Handling:
Lexical Errors: Detect and report lexical errors such as invalid characters or unrecognized
tokens.
Error Recovery: Implement strategies for recovering from errors and continuing the lexical
analysis process.
Three major terms are used in lexical analysis:
Lexical unit (token): a pair consisting of a lexical unit name and an optional attribute value.
Pattern: a description of the form that the lexemes of a lexical unit can take. It can
be simple or complex.
Lexeme: a sequence of characters in the source program that matches the pattern of a
lexical unit.
Example
• Symbols (lexical units):
• Identifier, String, Numeric constant.
• Keyword (IF, WHILE, DO, ...)
• Operator (special symbol): <, >=,<=, ==, =, ...
• Analyze the following sentences:
• A+15+B ➔ A, +, 15, +, B
Terminology
Alphabet
The term "alphabet" refers to the set of symbols from which strings (sequences of
symbols) are formed.
An alphabet is defined as a finite, non-empty set of symbols.
Formally, let Σ be an alphabet.
Then: Σ = {a1, a2, …, an}
Where:
a1, a2, …, an are individual symbols or characters.
n is the cardinality (number of elements) of the alphabet Σ, which must be finite.
Example
Σ = {a, b, …, z}
Σ = {α, β, …, φ, +, *, −, /}
Properties:
1. Finite Set: The alphabet Σ contains a finite number of symbols.
2. Non-Empty: The alphabet Σ cannot be empty; it must contain at least one symbol.
Examples:
1. Binary Alphabet: {0,1}
Contains two symbols: 0 and 1.
2. English Alphabet: {𝑎,𝑏,𝑐,...,𝑧}
Contains 26 lowercase letters from a to z.
3. Numeric Alphabet: {0,1,2,...,9}
Contains 10 digits from 0 to 9.
Word
A word (or string) over an alphabet Σ is a finite sequence of symbols of Σ.
In a compiler, the words identified during lexical analysis (scanning) are the lexemes; each
one is classified as a token that represents a specific syntactic construct of the programming language.
Example 1
• On the alphabet ∑ = {0, 1}, we can construct words like: 101001, 11, 100.
• 𝛆 is the empty word.
• The concatenation of two words is a word.
Example 2
Let Σ = {0, 1}. If α = 100 and β = 1010, then αβ = 1001010.
α² = αα = 100100
α³ = ααα = 100100100
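The two operations of Example 2 can be sketched in C on ordinary strings. This is only an illustration: the function names word_concat and word_power are hypothetical and do not come from the course material.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Concatenation xy. Returns a newly allocated string, or NULL if the
   allocation fails. The caller frees the result. */
char *word_concat(const char *x, const char *y) {
    assert(x != NULL && y != NULL);
    size_t lx = strlen(x), ly = strlen(y);
    char *out = malloc(lx + ly + 1);
    if (out == NULL) return NULL;
    memcpy(out, x, lx);
    memcpy(out + lx, y, ly + 1);      /* also copies y's terminating '\0' */
    return out;
}

/* Power w^n: w concatenated with itself n times. w^0 is the empty word. */
char *word_power(const char *w, unsigned n) {
    assert(w != NULL);
    size_t len = strlen(w);
    char *out = malloc(len * n + 1);
    if (out == NULL) return NULL;
    for (unsigned i = 0; i < n; i++)
        memcpy(out + i * len, w, len);
    out[len * n] = '\0';
    return out;
}
```

With α = "100" and β = "1010", word_concat(α, β) yields "1001010" and word_power(α, 3) yields "100100100", as in the example above.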
Question
• How to formally define the symbols of a language?
Answer
• The most suitable models for defining the lexical units to which lexemes belong are
regular languages.
• Two ways to describe regular languages:
• Regular expressions
• Finite automata
Formal language
Example of languages
Let ∑={a, …, z}
• L0 = {a}
• L1 = {aa, ab}
• L3 = {α ∈ Σ* / |α|a ≤ 10}: the set of all words that contain at most 10 occurrences of a.
Operations on formal languages
Union
L1 ∪ L2 = {x / x ∈ L1 or x ∈ L2}
Intersection
L1 ∩ L2 = {x / x ∈ L1 and x ∈ L2}
Difference
L1 − L2 = {x / x ∈ L1 and x ∉ L2}
Complement
complement of L1 = Σ* − L1
Concatenation
L1L2 = {xy / x ∈ L1 and y ∈ L2}
L^n = LL…L (n times)
Example: L1={a,b,c} L2={1,2}
L1L2={a1,a2,b1,b2,c1,c2}
L2L1={1a,1b,1c,2a,2b,2c}
Kleene closure
Let L be a language.
L* = ⋃_{k ≥ 0} L^k
Example
L1 = {a, b}
L1* = L1^0 ∪ L1^1 ∪ L1^2 ∪ L1^3 ∪ … = {ε, a, b, aa, ab, ba, bb, …}
Definition
A regular language L on an alphabet Σ is defined recursively as follows:
{𝜀} is a regular language on Σ.
Let 𝑎 ∈ Σ then {a} is a regular language on Σ.
If R is a regular language, then R^k and R* are regular languages on Σ.
If R1 and R2 are two regular languages, then R1 ∪ R2 and R1R2 are regular languages.
Regular expression (RE)
Given an alphabet Σ, the regular expressions and the languages they describe are defined
as follows:
1. Basic Elements:
Let a ∈ Σ, then a is a regular expression (RE) which describes the language {a}.
ε (epsilon) is a regular expression that describes the language {ε}, where ε
denotes the empty string.
Example 3: Concatenation
Regular Expressions:
ab is a RE that describes the language {ab}.
a(b+c) or a(b|c) is a RE that describes the language {ab,ac}.
Language Description:
The language described by ab is the set containing the single string "ab".
The language described by a(b+c) is the set of the two strings that start with a
followed by either b or c.
Exercise 1
Give the regular expressions which describe:
1. The language of words over Σ={a, b} that start with a and end with b.
2. The language of all words over {a, b} concatenated with words over {c, d}, for example:
ab, aab, abb, aaab, abab, abababbab, 𝜀, a, b, c, d, ac, abc, abcd, ababcdcdc.
Answer
• a(a|b)*b
• (a|b)*(c|d)*
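Answers of this kind can be checked mechanically. The sketch below uses the POSIX regcomp and regexec functions from <regex.h> with extended syntax, where union is written | (the + used for union in these slides is not POSIX syntax). The helper name matches is hypothetical, and the anchors ^ and $ are added by the caller so that the whole word must match.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>

/* Returns 1 if the word w matches the POSIX extended regular expression re,
   and 0 otherwise (including when re fails to compile). Pass an anchored
   pattern such as "^a(a|b)*b$" to test the whole word. */
int matches(const char *re, const char *w) {
    assert(re != NULL && w != NULL);
    regex_t compiled;
    if (regcomp(&compiled, re, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    int rc = regexec(&compiled, w, 0, NULL, 0);
    regfree(&compiled);
    return rc == 0;
}
```

For instance, matches("^a(a|b)*b$", "abababbab") returns 1 and matches("^a(a|b)*b$", "ba") returns 0.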
Exercise 2
1. Does the word w belong to the language described by the RE r in the
following cases:
• w = 10100010 r =(0+10)*
• w= 01110110 r =(0+(11)*)*
• w= 000111100 r = ((011+11)*(00)*)*
2. Simplify the following REs:
• 𝜀 + ab + ab + abab(ab)*
• aa(b* + a) + a(ab* + aa)
• a(a+b)* + aa(a+b)*+ aaa(a+b)*
Answer
• 𝜀 + ab + abab(ab)* = 𝜀 + ab + ab(ab)*
= 𝜀 + ab(𝜀 + (ab)*)
= 𝜀 + ab(ab)*
= 𝜀 + (ab)*
= (ab)*
• aa(b* + a) + a(ab* + aa) = aa(b* + a)+ aa(b* + a)
= aa(b* + a)
• a(a + b)* + aa(a + b)* + aaa(a + b)* = (a + aa + aaa)(a + b)*
= a(a + b)*, since every word of aa(a + b)* and of aaa(a + b)* already belongs to a(a + b)*
Exercise 3
1. Prove the following equalities:
• b + ab* + aa*b + aa*ab* = a*(b + ab*)
• a*(b+ab*) = b + aa*b*
2. Give a regular expression r for the language of words over {a, b} that contain at most three occurrences of a.
Answer
1. Verification (with a⁺ = aa*)
• b + ab* + aa*b + aa*ab* = (b + ab*) + a⁺(b + ab*)
= (𝜀 + a⁺)(b + ab*)
= a*(b + ab*)
• a*(b + ab*) = a*b + a*ab*
= (𝜀 + a⁺)b + a⁺b*
= b + a⁺b + a⁺b*
= b + a⁺(b + b*)
= b + a⁺b*
= b + aa*b*
2. b*(a+𝜀)b*(a+𝜀)b*(a+𝜀)b*
Introduction to automata
Example :
A coffee machine delivers a cup of coffee for 3 DH. The machine accepts
1 DH and 2 DH coins but does not give change. A button C delivers the
coffee. Initially, button C is blocked. Once enough coins have been inserted, button C
is released and no more coins can be inserted.
Model the operation of the machine.
Automata:
Initial state e0: button blocked
State e1: when we insert 1DH
State e2: when we insert 2DH
State e3: button C is released
[Figure: finite automaton, showing the input tape, the reading head, the current state, and the transition table, with a legend for the arrows that mark the initial state and the final states.]
An automaton reads a word written on its input tape. It starts from an initial
state and changes state with each letter it reads. If it is in a final state at the
end of the word, we say that it accepts the word.
The word belongs to the language ➔ the last state is a final state.
The word does not belong to the language ➔ otherwise.
Deterministic finite automaton (DFA)
• Σ is an alphabet,
If δ(q0, a) = q1 ➔ q0 --a--> q1
Exercise
Represent the following automaton graphically:
A = <Q, Σ, δ, q0, F> with Σ={0,1}, Q = {q0, q1}, q0 is the initial state and F = {q1}.
δ    0    1
q0   q0   q1
q1   q0   q1
Language accepted by DFA
Example of verification
• 11111 (Yes)
δ(δ(δ(δ(δ(q0, 1), 1), 1), 1), 1) ∈ F
• 10010 (No)
δ(δ(δ(δ(δ(q0, 1), 0), 0), 1), 0) ∉ F
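The nested applications of δ amount to a loop over the input word. Here is a minimal sketch in C, assuming that the verification refers to the two-state automaton of the preceding exercise (Σ={0,1}, initial state q0, F={q1}); the names delta and dfa_accepts are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

enum { Q0 = 0, Q1 = 1, NUM_STATES = 2 };

/* delta[state][symbol], copied from the transition table of the exercise. */
static const int delta[NUM_STATES][2] = {
    /* q0 */ { Q0, Q1 },
    /* q1 */ { Q0, Q1 },
};

/* Returns 1 if the DFA accepts w, 0 otherwise. */
int dfa_accepts(const char *w) {
    assert(w != NULL);
    int state = Q0;                            /* start in the initial state */
    for (; *w != '\0'; w++) {
        if (*w != '0' && *w != '1') return 0;  /* symbol outside the alphabet */
        state = delta[state][*w - '0'];        /* one application of delta */
    }
    return state == Q1;                        /* accept iff the last state is in F */
}
```

Under that assumption, dfa_accepts("11111") returns 1 and dfa_accepts("10010") returns 0, in agreement with the two verifications.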
Nondeterministic finite automaton (NFA)
An NFA is defined by a 5-tuple M = < Q, Σ , δ , q0, F >, where:
Q is a finite set of states.
Σ is a finite set of input symbols (alphabet).
δ is the transition function, defined as δ : Q × ( Σ ∪ { ε } ) → 2^Q.
This function takes a state and an input symbol (or epsilon, which represents an
empty string) and returns a set of states.
q0 ∈ Q is the initial state.
F ⊆ Q is the set of accepting (or final) states.
Characteristics of an NFA
1. Multiple Transitions: From a given state and input symbol, the NFA can
transition to any number of possible next states.
2. Epsilon Transitions: The NFA can transition from one state to another
without consuming any input symbols (using epsilon transitions).
3. Acceptance of Strings: An NFA accepts an input string if there exists at least
one sequence of transitions (including epsilon transitions) starting from the
initial state and ending in an accepting state, such that the entire input string is
consumed.
Example of an NFA
Consider an NFA M = < Q, Σ , δ , q0, F > with the following components:
States: Q={q0,q1,q2}
Alphabet: Σ={a,b}
Transition function δ:
δ(q0,a)={q0,q1} (on input a, from state q0, the NFA can stay in q0 or move to q1)
δ(q0,b)={q0} (on input b, from state q0, the NFA can stay in q0)
δ(q1,b)={q2} (on input b, from state q1, the NFA can move to q2)
δ(q2,a)={q2} (on input a, from state q2, the NFA can stay in q2)
Initial state: q0
Accepting states: F={q2}
Example Execution:
To check if the NFA accepts the string "ab":
Start at q0.
On input a, move to q0 or q1.
From q0 on input b, stay in q0 (this path does not lead to acceptance).
From q1 on input b, move to q2 (this path leads to acceptance).
Since there exists a path where the NFA ends in an accepting state (q2), the NFA
accepts the string "ab".
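Instead of following each path separately, a program can track the set of all states the NFA may be in after each symbol. The sketch below does this for the example NFA, storing a set of states as a bit mask; the names StateSet, BIT and nfa_accepts are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_STATES = 3, NUM_SYMBOLS = 2 };

/* A set of states as a bit mask: bit i is set when qi is in the set. */
typedef unsigned StateSet;
#define BIT(i) (1u << (i))

/* delta[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b'. */
static const StateSet delta[NUM_STATES][NUM_SYMBOLS] = {
    /* q0 */ { BIT(0) | BIT(1), BIT(0) },
    /* q1 */ { 0,               BIT(2) },
    /* q2 */ { BIT(2),          0      },
};
static const StateSet accepting = BIT(2);      /* F = {q2} */

/* Returns 1 if at least one path through the NFA accepts w, 0 otherwise. */
int nfa_accepts(const char *w) {
    assert(w != NULL);
    StateSet current = BIT(0);                 /* start in {q0} */
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return 0;  /* symbol outside the alphabet */
        int sym = *w - 'a';
        StateSet next = 0;
        for (int q = 0; q < NUM_STATES; q++)
            if (current & BIT(q))
                next |= delta[q][sym];         /* union over all current states */
        current = next;
    }
    return (current & accepting) != 0;
}
```

On "ab" the tracked sets are {q0}, then {q0,q1}, then {q0,q2}; the last one contains q2, so the word is accepted.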
It is difficult to move directly from an RE to a DFA.
It is easy to move from an RE to an NFA.
Definition
An NFA A = <Q, Σ, δ, q0, F> is characterized by:
δ: Q × (Σ ∪ {ε}) → P(Q), where P(Q) is the set of all subsets of Q. For example:
P({a, b, c}) = {∅, {a}, {b}, {c}, {a, b}, {a, c}, {b, c}, {a, b, c}}
Several arcs labeled with the same symbol can leave the same state.
There may be ε-transitions (ε does not belong to the alphabet).
Example
Represent with a Graph the following NFA:
A=<Q, Σ, δ, q0 , F> with Σ={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, .} and
Q={q0 , q1 , q2 , q3 , q4 , q5 } q0 is the initial state and F={q5}
Transformation of NFA into DFA
1. Definitions
NFA: A 5-tuple (Q,Σ,δ,q0,F)
Q: Finite set of states
Σ: Alphabet
δ: Transition function δ:Q×Σ→2^Q
q0: Initial state
F: Set of accepting states
DFA: A 5-tuple (Q′,Σ,δ′,q0′,F′)
Q′: Finite set of states
Σ: Alphabet
δ′: Transition function δ′:Q′×Σ→Q′
q0′: Initial state
F′: Set of accepting states
3. Example
Consider the following NFA:
States: Q={q0,q1,q2}
Alphabet: Σ={a,b}
Transition function δ:
δ(q0,a)={q0,q1}
δ(q0,b)={q0}
δ(q1,b)={q2}
δ(q2,a)={q2}
Initial state: q0
Accepting state: F={q2}
Step-by-Step Construction of the DFA:
1. Initial State:
q0′=ϵ-closure({q0})={q0}
2. Transitions from {q0} :
On a:
δ({q0},a)={q0,q1}
New state in DFA: {q0,q1}
On b:
δ({q0},b)={q0}
New state in DFA: {q0} (already exists)
3. Transitions from {q0,q1} :
On a:
δ({q0,q1},a)={q0,q1} (from q0 on a we reach q0 and q1; q1 has no transition on a)
New state in DFA: {q0,q1} (already exists)
On b:
δ({q0,q1},b)={q0,q2} (from q0 on b we reach q0, and from q1 on b we reach q2)
New state in DFA: {q0,q2}
4. Transitions from {q0,q2} :
On a:
δ({q0,q2},a)={q0,q1,q2}
New state in DFA: {q0,q1,q2}
On b:
δ({q0,q2},b)={q0}
New state in DFA: {q0} (already exists)
5. Transitions from {q0,q1,q2} :
On a:
δ({q0,q1,q2},a)={q0,q1,q2}
New state in DFA: {q0,q1,q2} (already exists)
On b:
δ({q0,q1,q2},b)={q0,q2}
New state in DFA: {q0,q2} (already exists)
DFA States:
Q′={{q0},{q0,q1},{q0,q2},{q0,q1,q2}}
DFA Transitions:
δ′({q0},a)={q0,q1}
δ′({q0},b)={q0}
δ′({q0,q1},a)={q0,q1}
δ′({q0,q1},b)={q0,q2}
δ′({q0,q2},a)={q0,q1,q2}
δ′({q0,q2},b)={q0}
δ′({q0,q1,q2},a)={q0,q1,q2}
δ′({q0,q1,q2},b)={q0,q2}
DFA Initial State:
q0′={q0}
DFA Accepting States:
F′={{q0,q2},{q0,q1,q2}} (the subsets that contain the NFA accepting state q2)
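The construction can be sketched in C for this particular NFA. The sketch assumes there are no ε-transitions (as is the case here), so the ϵ-closure step is the identity. Subsets of NFA states are stored as bit masks, and the names nfa_move and subset_construction are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { NFA_STATES = 3, NUM_SYMBOLS = 2, MAX_DFA_STATES = 1 << NFA_STATES };

/* A set of NFA states as a bit mask: bit i is set when qi is in the set. */
typedef unsigned StateSet;
#define BIT(i) (1u << (i))

/* nfa_delta[state][symbol], with symbol 0 = 'a' and symbol 1 = 'b'. */
static const StateSet nfa_delta[NFA_STATES][NUM_SYMBOLS] = {
    /* q0 */ { BIT(0) | BIT(1), BIT(0) },
    /* q1 */ { 0,               BIT(2) },
    /* q2 */ { BIT(2),          0      },
};

/* Union of the NFA transitions on sym from every state in set. */
static StateSet nfa_move(StateSet set, int sym) {
    StateSet result = 0;
    for (int q = 0; q < NFA_STATES; q++)
        if (set & BIT(q))
            result |= nfa_delta[q][sym];
    return result;
}

/* Builds the DFA reachable from {q0}. subsets[i] is the set of NFA states
   that forms DFA state i, and dfa_delta[i][sym] is the index of its
   successor on sym. Returns the number of DFA states. */
int subset_construction(StateSet subsets[MAX_DFA_STATES],
                        int dfa_delta[MAX_DFA_STATES][NUM_SYMBOLS]) {
    assert(subsets != NULL && dfa_delta != NULL);
    int count = 0;
    subsets[count++] = BIT(0);                 /* initial DFA state {q0} */
    for (int i = 0; i < count; i++) {          /* entries i..count-1 are still to process */
        for (int sym = 0; sym < NUM_SYMBOLS; sym++) {
            StateSet target = nfa_move(subsets[i], sym);
            int j = 0;
            while (j < count && subsets[j] != target)
                j++;                           /* look for an existing DFA state */
            if (j == count)
                subsets[count++] = target;     /* new DFA state */
            dfa_delta[i][sym] = j;
        }
    }
    return count;
}
```

For this NFA the function discovers the subsets in the order {q0}, {q0,q1}, {q0,q2}, {q0,q1,q2}, the same four states as in the step-by-step construction. A DFA state is accepting when its mask has the bit of q2 set.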
Exercise:
Given the following NFA, convert it into an equivalent DFA using the subset construction method.
NFA Definition
States: Q={q0,q1,q2}
Alphabet: Σ={a,b}
Transition function δ:
δ(q0,a)={q0,q1}
δ(q0,b)={q0}
δ(q1,a)={q2}
δ(q1,b)={q2}
δ(q2,a)={}
δ(q2,b)={q2}
Initial state: q0
The process of minimizing a DFA involves reducing the number of states while
preserving the language it accepts.
Minimization of a DFA: Hopcroft's Algorithm
Example
Consider a DFA with the following components:
States: Q={q0,q1,q2,q3,q4}
Alphabet: Σ={a,b}
Transition function δ:
δ(q0,a)=q1
δ(q0,b)=q2
δ(q1,a)=q0
δ(q1,b)=q3
δ(q2,a)=q4
δ(q2,b)=q0
δ(q3,a)=q4
δ(q3,b)=q1
δ(q4,a)=q3
δ(q4,b)=q2
Initial state: q0
Step-by-Step Minimization
1. Initialization: partition the states into accepting and non-accepting states, which gives two groups:
P0={q0,q3}
P1={q1,q2,q4}
2. Refinement:
Iteration 1:
For each group and each input symbol,
check if transitions lead to different groups.
Checking P0={q0,q3} :
On a:
δ(q0,a)=q1 (in P1)
δ(q3,a)=q4 (in P1)
No split needed.
On b:
δ(q0,b)=q2 (in P1)
δ(q3,b)=q1 (in P1)
No split needed.
Checking P1={q1,q2,q4} :
On a:
δ(q1,a)=q0 (in P0)
δ(q2,a)=q4 (in P1)
δ(q4,a)=q3 (in P0)
Split P1 into {q2} and {q1,q4}.
Updated partitions:
P0={q0,q3}
P1={q1,q4}
P2={q2}
Iteration 2:
Checking P1={q1,q4} :
On a:
δ(q1,a)=q0 (in P0)
δ(q4,a)=q3 (in P0)
No split needed.
On b:
δ(q1,b)=q3 (in P0)
δ(q4,b)=q2 (in P2)
The targets lie in different groups, so P1 must be split into {q1} and {q4}.
Updated partitions:
P0={q0,q3}
P1={q1}
P2={q2}
P3={q4}
Rechecking P0={q0,q3} on a: δ(q0,a)=q1 (in P1) and δ(q3,a)=q4 (in P3), so P0 is split into {q0} and {q3}.
With the transition function as listed, every group ends up containing a single state, so this DFA is already minimal.
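The refinement loop can be sketched in C as follows. Note that this is the simple iterative refinement (often attributed to Moore), not Hopcroft's worklist-based algorithm; both produce the same final partition. The function name refine_partition, the array layout and the size limits are assumptions made for this sketch.

```c
#include <assert.h>
#include <string.h>

enum { MAX_STATES = 32, MAX_SYMBOLS = 8 };

/* Refines the partition {accepting, non-accepting} until it is stable.
   n is the number of states, k the number of symbols, delta[q][s] the
   successor of state q on symbol s, accepting[q] nonzero iff q is final.
   On return, block[q] is the group index of q. Returns the number of groups. */
int refine_partition(int n, int k, const int delta[][MAX_SYMBOLS],
                     const int accepting[], int block[]) {
    assert(n > 0 && n <= MAX_STATES && k > 0 && k <= MAX_SYMBOLS);
    int next[MAX_STATES];
    int count = 0;                     /* 0 forces at least one refinement pass */

    for (int q = 0; q < n; q++)
        block[q] = accepting[q] ? 1 : 0;

    for (;;) {
        int new_count = 0;
        for (int q = 0; q < n; q++) {
            /* Look for an earlier state p in the same group whose successors
               fall into the same groups as those of q for every symbol. */
            int p;
            for (p = 0; p < q; p++) {
                int same = (block[p] == block[q]);
                for (int s = 0; same && s < k; s++)
                    same = (block[delta[p][s]] == block[delta[q][s]]);
                if (same) break;
            }
            next[q] = (p < q) ? next[p] : new_count++;
        }
        memcpy(block, next, (size_t)n * sizeof block[0]);
        if (new_count == count)        /* no group was split: partition is stable */
            return count;
        count = new_count;
    }
}
```

Two states that end in the same group are equivalent and can be merged; the minimal DFA has one state per group.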
Theorem
A language is regular if and only if it is recognized by a finite automaton.
Construction of an NFA from a Regular Expression
Example
Let's construct an NFA from the regular expression : (a|b)*abb
1. Construct NFA for a and b:
For a: q0 --a--> q1
For b: q2 --b--> q3
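2. Construct NFA for a|b:
Using union:
Combined NFA for a|b, with a new initial state q4 and a new final state q5:
q4 --ε--> q0, q4 --ε--> q2
q0 --a--> q1, q2 --b--> q3
q1 --ε--> q5, q3 --ε--> q5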
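3. Construct NFA for (a|b)*:
Using Kleene star:
Combined NFA for (a|b)*, with a new initial state q6 and a new final state q7:
q6 --ε--> q4, q6 --ε--> q7
q4 --ε--> q0, q4 --ε--> q2
q0 --a--> q1, q2 --b--> q3
q1 --ε--> q5, q3 --ε--> q5
q5 --ε--> q4, q5 --ε--> q7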
4. Construct NFA for abb:
Concatenation:
q8 --a--> q9 --b--> q10 --b--> q11
5. Combine (a|b)* with abb using concatenation:
An ε-transition links the final state q7 of (a|b)* to the initial state q8 of abb:
q7 --ε--> q8
Full NFA for (a|b)*abb:
1. States:
Q={q0,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11}
2. Alphabet:
Σ={a,b}
3. Transitions:
δ(q6, ε) = {q4, q7}
δ(q4, ε) = {q0, q2}
δ(q0, a) = {q1}
δ(q1, ε) = {q5}
δ(q2, b) = {q3}
δ(q3, ε) = {q5}
δ(q5, ε) = {q4, q7}
δ(q7, ε) = {q8}
δ(q8, a) = {q9}
δ(q9, b) = {q10}
δ(q10, b) = {q11}
4. Initial State:
q6
5. Accepting State:
F={q11}
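To see this NFA at work, the sketch below simulates it in C. After every step it computes the ε-closure of the current set of states, that is, every state reachable through ε-transitions alone. The tables are transcribed from the transition list above; the names eps, on_a, on_b, eps_closure and nfa_accepts are choices made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_STATES = 12 };

/* A set of states as a bit mask: bit i is set when qi is in the set. */
typedef unsigned StateSet;
#define BIT(i) (1u << (i))

/* Transition tables; states that are not listed have no transition. */
static const StateSet eps[NUM_STATES] = {
    [1] = BIT(5), [3] = BIT(5), [4] = BIT(0) | BIT(2),
    [5] = BIT(4) | BIT(7), [6] = BIT(4) | BIT(7), [7] = BIT(8),
};
static const StateSet on_a[NUM_STATES] = { [0] = BIT(1), [8] = BIT(9) };
static const StateSet on_b[NUM_STATES] = {
    [2] = BIT(3), [9] = BIT(10), [10] = BIT(11),
};

/* Adds every state reachable from set using epsilon-transitions only. */
static StateSet eps_closure(StateSet set) {
    StateSet previous;
    do {
        previous = set;
        for (int q = 0; q < NUM_STATES; q++)
            if (set & BIT(q))
                set |= eps[q];
    } while (set != previous);                 /* stop when nothing new was added */
    return set;
}

/* Returns 1 if the NFA for (a|b)*abb accepts w, 0 otherwise. */
int nfa_accepts(const char *w) {
    assert(w != NULL);
    StateSet current = eps_closure(BIT(6));    /* initial state q6 */
    for (; *w != '\0'; w++) {
        if (*w != 'a' && *w != 'b') return 0;  /* symbol outside the alphabet */
        const StateSet *table = (*w == 'a') ? on_a : on_b;
        StateSet next = 0;
        for (int q = 0; q < NUM_STATES; q++)
            if (current & BIT(q))
                next |= table[q];
        current = eps_closure(next);
    }
    return (current & BIT(11)) != 0;           /* F = {q11} */
}
```

As expected for (a|b)*abb, nfa_accepts("abb") and nfa_accepts("babb") return 1, while nfa_accepts("ab") returns 0.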
Exercise:
Construct an NFA for the regular expression ((ab)*|a*) over the alphabet Σ = {a, b}
Flex tool
Flex (Fast Lexical Analyzer) is a tool for generating scanners, also known as lexical analyzers.
A scanner reads input text and breaks it into tokens, which are the basic building blocks for
syntax analysis in the compilation process.
Flex is widely used in conjunction with Bison, a parser generator, to create a complete
compiler or interpreter.
Overview of Flex
Input: A specification file containing patterns and corresponding actions.
Output: A C source file (typically lex.yy.c) containing the code for the lexical analyzer.
Usage: The generated lexical analyzer reads input text, matches it against the specified
patterns, and performs the associated actions.
Predefined variables and functions
Predefined Variables
1. yytext:
Type: char *
Description: A pointer to the matched text. It points to the start of the current token in the input stream.
Usage: Access the matched text in actions, e.g., printf("Matched text: %s\n", yytext);
2. yyleng:
Type: int
Description: The length of the matched text, excluding the null terminator.
Usage: Get the length of the current token, e.g., printf("Length of matched text: %d\n", yyleng);
3. yylval:
Type: Depends on the user definition
Description: Used to return values from lexical analyzer to the parser. Typically used in conjunction
with Yacc/Bison.
Usage: Set semantic values for tokens, e.g., yylval = atoi(yytext);
4. yyin:
Type: FILE *
Description: The file pointer from which Flex reads input. By default, it is stdin.
5. yyout:
Type: FILE *
Description: The file pointer to which Flex writes its output. By default, it is stdout.
Predefined Functions
1. yylex():
Type: int
Description: The main lexical analyzer function. It matches the next token in the input and executes
the associated action.
Usage: yylex();
2. yywrap():
Type: int
Description: Called when the end of the input file is reached. By default, it returns 1, indicating no
more input.
int yywrap() {
return 1; // Indicate end of input
}
3. yyrestart(FILE *new_file):
Type: void
5. yy_delete_buffer(YY_BUFFER_STATE buffer):
Type: void
This section, enclosed in %{ and %}, contains C code that will be copied
verbatim to the top of the generated C file. Here, it includes the standard I/O
library for printing.
Example of a Flex Specification File
Explanation of the Example
Token Definitions
NUMBER_TOKEN [0-9]+
ID_TOKEN [a-zA-Z_][a-zA-Z0-9_]*
...
These lines define named patterns (macros) for the various tokens using regular
expressions:
NUMBER_TOKEN: Matches one or more digits ([0-9]+). With [0-9]*, the pattern would also match the empty string.
ID_TOKEN: Matches an identifier, which starts with a letter or underscore and is
followed by letters, digits, or underscores ([a-zA-Z_][a-zA-Z0-9_]*).
PLUS_TOKEN: Matches the plus sign ("+").
MINUS_TOKEN: Matches the minus sign ("-").
TIMES_TOKEN: Matches the asterisk ("*") used for multiplication.
DIV_TOKEN: Matches the forward slash ("/") used for division.
MOD_TOKEN: Matches the percent sign ("%") used for modulo operation.
SPACE_TOKEN: Matches one or more whitespace characters ([ \t\n]+).
Rules Section
%%
{NUMBER_TOKEN} {printf("NUMBER_TOKEN\n");}
...
. {printf("ERROR\n");}
%%
This section, enclosed between %% markers, contains the rules for matching the patterns defined
above and the actions to perform when a match is found:
{NUMBER_TOKEN}: If the input matches NUMBER_TOKEN, it prints "NUMBER_TOKEN".
{ID_TOKEN}: If the input matches ID_TOKEN, it prints "ID_TOKEN".
{PLUS_TOKEN}: If the input matches PLUS_TOKEN, it prints "PLUS_TOKEN".
{MINUS_TOKEN}: If the input matches MINUS_TOKEN, it prints "MINUS_TOKEN".
{TIMES_TOKEN}: If the input matches TIMES_TOKEN, it prints "TIMES_TOKEN".
{DIV_TOKEN}: If the input matches DIV_TOKEN, it prints "DIV_TOKEN".
{MOD_TOKEN}: If the input matches MOD_TOKEN, it prints "MOD_TOKEN".
{SPACE_TOKEN}: If the input matches SPACE_TOKEN, it prints "SPACE_TOKEN".
.: If the input matches any other character not specified above, it prints "ERROR".
Main Function
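int main(int argc, char **argv) {
    /* argv[0] is the program name, so a file name requires argc >= 2. */
    if (argc >= 2) {
        yyin = fopen(argv[1], "r");
        if (yyin == NULL) {
            perror(argv[1]);   /* the file could not be opened */
            return 1;
        }
        yylex();
        fclose(yyin);
    } else {
        printf("Insufficient number of arguments\n");
        return 1;
    }
    return 0;
}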
Running the lexer on this file will produce the following output:
NUMBER_TOKEN
PLUS_TOKEN
ID_TOKEN
MINUS_TOKEN
NUMBER_TOKEN
TIMES_TOKEN
ID_TOKEN
DIV_TOKEN
ID_TOKEN
MOD_TOKEN
ID_TOKEN
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code
The following example, 'lexer.c', is a lexical analyzer in C that recognizes arithmetic
expressions with numbers, identifiers, and operators, and includes some basic error handling.
#include <stdio.h>
#include <ctype.h>
#include <string.h>

typedef enum {
    TOKEN_IDENTIFIER,
    TOKEN_NUMBER,
    TOKEN_OPERATOR,
    TOKEN_PAREN_OPEN,
    TOKEN_PAREN_CLOSE,
    TOKEN_UNKNOWN,
    TOKEN_END
} TokenType;

typedef struct {
    TokenType type;
    char text[100];
} Token;

const char *tokenTypeToString(TokenType type) {
    switch (type) {
        case TOKEN_IDENTIFIER: return "Identifier";
        case TOKEN_NUMBER: return "Number";
        case TOKEN_OPERATOR: return "Operator";
        case TOKEN_PAREN_OPEN: return "Paren Open";
        case TOKEN_PAREN_CLOSE: return "Paren Close";
        case TOKEN_UNKNOWN: return "Unknown";
        case TOKEN_END: return "End";
        default: return "Invalid";
    }
}
...
Token getNextToken(const char **input) {
    // The casts to unsigned char keep the <ctype.h> calls well defined
    // for characters whose char value is negative.
    while (isspace((unsigned char)**input)) (*input)++; // Skip whitespace
    if (**input == '\0') {
        return (Token){TOKEN_END, ""};
    }
    Token token;
    int length = 0;
    const int maxLength = (int)sizeof(token.text) - 1; // Keep room for '\0'
    if (isalpha((unsigned char)**input)) {
        // Identifier
        token.type = TOKEN_IDENTIFIER;
        while (isalnum((unsigned char)**input) && length < maxLength) {
            token.text[length++] = *(*input)++;
        }
        token.text[length] = '\0';
    } else if (isdigit((unsigned char)**input)) {
        // Number
        token.type = TOKEN_NUMBER;
        while (isdigit((unsigned char)**input) && length < maxLength) {
            token.text[length++] = *(*input)++;
        }
        token.text[length] = '\0';
    } else if (strchr("+-*/=", **input)) {
        // Operator
        token.type = TOKEN_OPERATOR;
        token.text[0] = *(*input)++;
        token.text[1] = '\0';
    } else if (**input == '(') {
        // Opening Parenthesis
        token.type = TOKEN_PAREN_OPEN;
        token.text[0] = *(*input)++;
        token.text[1] = '\0';
    } else if (**input == ')') {
        // Closing Parenthesis
        token.type = TOKEN_PAREN_CLOSE;
        token.text[0] = *(*input)++;
        token.text[1] = '\0';
    } else {
        // Unknown token
        token.type = TOKEN_UNKNOWN;
        token.text[0] = *(*input)++;
        token.text[1] = '\0';
    }
    return token;
}
int main() {
    const char *input = "var1 = 42 + (var2 * 3) / num3";
    printf("Input: %s\n", input);
    Token token;
    while ((token = getNextToken(&input)).type != TOKEN_END) {
        printToken(token);
    }
    return 0;
}
Explanation of the Code
TokenType Enumeration: Defines the different types of tokens: TOKEN_IDENTIFIER,
TOKEN_NUMBER, TOKEN_OPERATOR, TOKEN_PAREN_OPEN, TOKEN_PAREN_CLOSE,
TOKEN_UNKNOWN, and TOKEN_END.
Token Structure: Represents a token with a type and a text value.
tokenTypeToString Function: Converts a TokenType to a string for printing purposes.
printToken Function: Prints a token's type and text value.
getNextToken Function:
Takes a pointer to the input string and returns the next token.
Skips whitespace.
Recognizes identifiers (alphanumeric strings starting with a letter).
Recognizes numbers (strings of digits).
Recognizes operators (+, -, *, /, =).
Recognizes parentheses (( and )).
Recognizes unknown tokens (any other single character).
main Function:
Defines the input string to be tokenized.
Calls getNextToken in a loop to tokenize the entire input string and print each token.
Compilation and Execution
To compile and run the lexical analyzer, follow these steps:
1. Save the code to a file named lexer.c.
2. Compile the code using a C compiler, e.g., gcc:
gcc -o lexer lexer.c
3. Run the executable:
./lexer
Example Output
When you run the program with the input string "var1 = 42 + (var2 * 3) / num3", the output will be: