0% found this document useful (0 votes)
13 views

MOD 04 - Language Description & Lexical Analysis

Description is lecical something

Uploaded by

zaki04amrani
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views

MOD 04 - Language Description & Lexical Analysis

Description is lecical something

Uploaded by

zaki04amrani
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 107

Language Description &

Lexical Analysis

Pr. Moulay Youssef Hadi


Learning Objectives

 After completing this module, you should be able to:


 Specify when lexical analyzers act in the process of translating PLs into machine
language.
 Explain the inputs and outputs of a lexical analyzer and its relationship to other
compilation phases.
 Apply regular expressions (REs) to describe the syntax of legal tokens.
 Map between REs, Finite State Automata (FSAs) and their implementation.
 Describe how a lexical analyzer can be implemented.
 Distinguish the errors caught by a lexical analyzer from those caught in later
phases of the translation process.

2
Language Description & Lexical Analysis
Compiler phases

Phases of compilation process : Lexical Analysis


1. Lexical Analysis: Converts source code into tokens.
2. Syntax Analysis: Constructs a parse tree from the Syntax Analysis Analysis Phase
tokens.
3. Semantic Analysis: Checks for semantic errors and Semantic Analysis
annotates the parse tree with type information.
4. Intermediate Code Generation: Produces an Intermediate Code
intermediate representation of the code. Generation
5. Optimization: Improves the intermediate code for Code Optimization Synthesis Phase
performance.
6. Code Generation: Translates the intermediate code Target Code Generation
into target machine code.
3
Language Description & Lexical Analysis
Introduction to lexical analysis
 Lexical analysis is a fundamental step in the compilation process, responsible for breaking
down the source code into a sequence of meaningful units called tokens.
 This process lays the groundwork for the subsequent phases of parsing, semantic analysis,
and code generation.
 By understanding the task of lexical analysis and the tools used to implement it, such as
regular expressions and finite automata, we can gain valuable insights into the inner
workings of compilers and their ability to transform high-level programming languages into
executable machine code.
Tokens

---------
---------
Lexemes
---------
--------- Lexical analyzer Syntax analyzer
----
Source code 4

Request for tokens


Language Description & Lexical Analysis
Introduction to lexical analysis
 Purpose:
 Tokenization: Divide the source code into tokens or lexemes, such as keywords,
identifiers, literals, and operators.
 Ignored Characters: Discard whitespace characters, comments, and other non-essential
elements from the source code.
 Process:
1. Input: Receive the source code as input, typically in the form of a character stream.
2. Scanning: Read characters from the input stream and group them into lexemes
according to predefined rules.
 Patterns: Define regular expressions or finite automata to recognize the lexical
patterns of tokens.
 Tokenization: Identify the type of each lexeme (e.g., keyword, identifier, operator)
and its position in the source code.
3. Output: Generate a stream of tokens, each representing a lexeme along with its
associated information (e.g., token type, value, position).
5
Language Description & Lexical Analysis
Introduction to lexical analysis
 Common Lexical Elements (Tokens):
 Keywords: Reserved words with special meanings in the language (e.g., if, else, while, int).
 Identifiers: Names given to variables, functions, classes, etc. (e.g., variable, functionName,
ClassName).
 Literals: Constants representing fixed values (e.g., 123, "hello", 3.14).
 Operators: Symbols representing operations or calculations (e.g., +, -, *, /).
 Delimiters: Symbols used to separate or group elements in the code (e.g., (), {}, ;, ,).
 Comments: Annotations or explanations ignored by the compiler
(e.g., // single-line comment, /* multi-line comment */).

6
Language Description & Lexical Analysis
Introduction to lexical analysis
 Techniques and Tools:
 Regular Expressions: Define patterns for recognizing tokens using regular expressions.
 Finite Automata: Construct finite automata to recognize lexical patterns efficiently.
 Lexical Analyzers (Lexers): Implement lexical analyzers using tools like Lex, Flex, or
hand-written code.
 Tokenization: Categorize lexemes into different token types and generate a token stream.
 Error Handling:
 Lexical Errors: Detect and report lexical errors such as invalid characters or unrecognized
tokens.
 Error Recovery: Implement strategies for recovering from errors and continuing the lexical
analysis process.

7
Language Description & Lexical Analysis
Introduction to lexical analysis
 Three major terms are used in lexical analysis:
 Lexical unit: a pair consisting of a lexical unit name and an optional attribute value
(optional).
 Pattern: is a description of the form that the lexemes in a lexical unit can take. It can
be simple or complex.
 Lexeme: is a sequence of characters in the source program that is recognized by the
lexical unit pattern.
Example
• Symbols (lexical units):
• Identifier, String, Numeric constant.
• Keyword (IF, WHILE, DO, ...)
• Operator (special symbol): <, >=,<=, ==, =, ...
• Analyze the following sentences:
• A+15+B ➔ A, +, 15, +, B 8

• test+3148 ➔ test, +, 3148


Language Description & Lexical Analysis
Introduction to lexical analysis

Symbol character Symbol code


by character Lexical Analyzer Symbol string

 Lexical analyzer input :


WHILE A>3*B DO A<A-1 END
 Lexical analyzer output :

Lexeme Lexical Unit


WHILE WHILE_TOKEN
A IDF_TOKEN
> SUP_TOKEN
… …
9
Language Description & Lexical Analysis
Introduction to lexical analysis
After lexical analysis, the code might be
represented as a sequence of tokens:
 Example: 1. Keyword: int
Consider the following C code snippet: 2. Identifier: main
3. Delimiter: (
int main() { 4. Delimiter: )
int x = 10; 5. Delimiter: {
return x * 2; 6. Keyword: int
} 7. Identifier: x
8. Operator: =
9. Literal: 10
10. Delimiter: ;
11. Keyword: return
12. Identifier: x
13. Operator: *
14. Literal: 2
15. Delimiter: ; 10

16. Delimiter: }
Language Description & Lexical Analysis
Terminology
 Alphabet
 The term "alphabet" refers to the set of symbols from which strings (sequences of
symbols) are formed.
 An alphabet is defined as a finite, non-empty set of symbols.
 Formally, let Σ be an alphabet.
Then: ∑= {𝑎1 , 𝑎2 , … , 𝑎𝑛 }
Where:
 𝑎1 , 𝑎2 , … , 𝑎𝑛 ​ are individual symbols or characters.
 𝑛 is the cardinality (number of elements) of the alphabet Σ, which must be finite.

Example
∑ = 𝑎, 𝑏, … , 𝑧 11
∑ = 𝛼, 𝛽, … , 𝜑, +,∗, −,/
Language Description & Lexical Analysis
Terminology
 Alphabet
 Properties:
1. Finite Set: The alphabet Σ contains a finite number of symbols.
2. Non-Empty: The alphabet Σ cannot be empty; it must contain at least one symbol.
 Examples:
1. Binary Alphabet: {0,1}
Contains two symbols: 0 and 1.
2. English Alphabet: {𝑎,𝑏,𝑐,...,𝑧}
Contains 26 lowercase letters from a to z.
3. Numeric Alphabet: {0,1,2,...,9}
Contains 10 digits from 0 to 9.
12
Language Description & Lexical Analysis
Terminology
 A "word" refers to a basic unit of language, typically representing a single entity within the
source code.
 Words are identified during lexical analysis (scanning) and are often referred to as tokens.
 Each word/token represents a specific syntactic construct in the programming language.

Example 1
• On the alphabet ∑ = {0, 1}, we can construct words like: 101001, 11, 100.
• 𝛆 is the empty word.
• The concatenation of two words is a word.

Example 2
Let ∑ = {0, 1}, if α = 100 and β = 1010 then αβ = 1001010.
α² = αα = 100100
α3 = ααα = 100100100 13
Language Description & Lexical Analysis
Terminology
Question
• How to formally define the symbols of a language?

Answer
• The best models defining lexical units which lexemes belong are regular
languages.
• Two ways to describe regular languages:
• Regular expressions
• Finite automata

14
Language Description & Lexical Analysis
Formal language

A language based on an alphabet Σ is a set of words built on Σ.

Example

• Σ* is the set of all words constructed on Σ.

Note

• A language constructed on an alphabet Σ is a part of Σ*.

• We distinguish two particular languages : ø, Σ

15
Language Description & Lexical Analysis
Formal language

Example of languages
Let ∑={a, …, z}

• L0 = {a}

• L1 = {aa, ab}

• L2 = {αaα / α ∈Σ *} = {a, bab, cdacd, ...}

• L3 = {α ∈Σ* / |α|a ≤ 10} : the set of all words whose number of occurrences of a ≤ 10.

| . |x∈Σ : Σ∗ →IN , | . |x calculate the number of occurrences of x in a word.

16
Language Description & Lexical Analysis
Formal language
Operations on formal languages
 Union
 𝐿1 ‫𝐿 ڂ‬2 = {𝑥 / 𝑥 ∈ 𝐿1 𝑜𝑟 𝑥 ∈ 𝐿2}
 Intersection
 𝐿1 ‫𝐿 ځ‬2 = {𝑥 / 𝑥 ∈ 𝐿1 𝑎𝑛𝑑 𝑥 ∈ 𝐿2}
 Difference
 𝐿1 − 𝐿2 = {𝑥 / 𝑥 ∈ 𝐿1 𝑎𝑛𝑑 𝑥 ∉ 𝐿2}
 𝐿1 = 𝛴 ∗ − 𝐿1
 Concatenation
 𝐿1𝐿2 = {𝑥𝑦 /𝑥 ∈ 𝐿1 𝑎𝑛𝑑 𝑦 ∈ 𝐿2} L1={a,b,c} L2={1,2}
 𝐿𝑛 = 𝐿𝐿 … 𝐿 L1L2={a1,a2,b1,b2,c1,c2} 17
L2L1={1a,1b,1c,2a,2b,2c}
𝑛 𝑡𝑖𝑚𝑒𝑠
Language Description & Lexical Analysis
Formal language

Kleene closure
 Let L be a language
 𝐿∗ = ‫≥𝐾ڂ‬0 𝐿𝑘

Example
𝐿1 = 𝑎, 𝑏
𝐿∗1 = 𝐿01 ‫𝐿 ڂ‬11 ‫𝐿 ڂ‬21 ‫𝐿 ڂ‬31 ‫𝜀 = … ڂ‬, 𝑎, 𝑏, 𝑎𝑎, 𝑎𝑏, 𝑏𝑏, …

 𝐿+ is called positive Kleene closure of 𝐿 : 𝐿+ = 𝐿∗ − {𝜀}


 Proposition
 The following identities are true for all languages (X, Y, Z) :
 X(YZ) = (XY)Z
 X𝜀 = 𝜀X = X
 X(Y ‫ ڂ‬Z) = XY ‫ ڂ‬XZ
 (X ‫ ڂ‬Y)Z = XZ ‫ ڂ‬YZ
18
 (X*)* = X*
Language Description & Lexical Analysis
Regular languages

Definition
 A regular language L on an alphabet Σ is defined recursively as follows:
 {𝜀} is a regular language on Σ.
 Let 𝑎 ∈ Σ then {a} is a regular language on Σ.
 If 𝑅 is a regular language, then 𝑅𝑘 and 𝑅∗ are regular languages on Σ.
 If 𝑅1 and 𝑅2 are two regular languages, then 𝑅1 ‫𝑅 ڂ‬2 and 𝑅1 𝑅2 are regular languages.

19
Language Description & Lexical Analysis
Regular expression (RE)

 Given an alphabet Σ, the regular expressions and the languages they describe are defined
as follows:
1. Basic Elements:
 Let a ∈ Σ, then a is a regular expression (RE) which describes the language {a}.
 ε (epsilon) is a regular expression that describes the language {ε}, where ε
denotes the empty string.

20
Language Description & Lexical Analysis
Regular expression (RE)

2. Operations on Regular Expressions:


 Union (Alternation): If α and β are two regular expressions describing
languages A and B respectively, then α+β (or α|β) is a regular expression
which describes the language A∪B. This denotes the union of languages A
and B.
 Concatenation: If α and β are regular expressions describing languages A
and B respectively, then αβ is a regular expression which describes the
concatenated language AB. This means strings formed by concatenating any
string from A with any string from B.
 Kleene Star: If α is a regular expression describing a language A, then α* is a
regular expression which describes the language A*. This denotes the Kleene
star operation, representing the set of strings that can be made by
concatenating zero or more strings from A.
21
Language Description & Lexical Analysis
Regular expression (RE)

 Example 1: Basic Elements


 Alphabet Σ: {a,b}
 Regular Expressions:
 a is a RE that describes the language {a}.
 b is a RE that describes the language {b}.
 ε is a RE that describes the language {ε}, which is the language containing
only the empty string.

22
Language Description & Lexical Analysis
Regular expression (RE)

 Example 2: Union (Alternation)


 Regular Expressions:
 a + b or a | b is a RE that describes the language {a, b}.
 Language Description:
 The language described by a + b is the set of strings that contains either a or b.

23
Language Description & Lexical Analysis
Regular expression (RE)

 Example 3: Concatenation
 Regular Expressions:
 ab is a RE that describes the language {ab}.
 a(b+c) or a(b|c) is a RE that describes the language {ab,ac}.
 Language Description:
 The language described by ab is the set containing the single string "ab".
 The language described by a(b+c) is the set of strings that starts with a and is
followed by either b or c.

24
Language Description & Lexical Analysis
Regular expression (RE)

 Example 4: Kleene Star


 Regular Expressions:
 a* is a RE that describes the language {ε,a,aa,aaa,…}.
 (ab)* is a RE that describes the language {ε,ab,abab,ababab,…}.
 Language Description:
 The language described by a* is the set of all strings consisting of zero or more
occurrences of the character a.
 The language described by (ab)* is the set of all strings consisting of zero or
more occurrences of the string "ab".

25
Language Description & Lexical Analysis
Regular expression (RE)

 Example 5: Combined Operations


 Regular Expressions:
 (a+b)c or (a|b)c is a RE that describes the language {ac,bc}.
 (a+b)*c or (a|b)*c is a RE that describes the language {c,ac,bc,aac,abc,bac,bbc,…}.
 Language Description:
 The language described by (a+b)c is the set of strings where either a or b is
followed by c.
 The language described by (a+b)*c is the set of strings consisting of zero or more
occurrences of a or b followed by c.

26
Language Description & Lexical Analysis
Regular expression (RE)

 Example 6: Complex Regular Expression


 Regular Expressions:
 (a*b*)c is a RE that describes the language {c, ac, bc, aac, abc, abbc, aabbc, …}.
 Language Description:
 The language described by (a*b*)c is the set of strings where zero or more
occurrences of a followed by zero or more occurrences of b are followed by c.

27
Language Description & Lexical Analysis
Regular expression (RE)

Exercice 1
Give the regular expressions which describe:
1. The language of words on Σ={a, b} that start with a and end by b.
2. The language of all words on {a, b} concatenated with words on {c, d}:
ab, aab, abb, aaab, abab, abababbab, , a, b, c,d,ac,abc, abcd, ababcdcdc.

Answer
• a(a|b)*b
• (a|b)*(c|d)*

28
Language Description & Lexical Analysis
Regular expression (RE)

Exercice 2
1. Does the word w belong to the language described by the RE r in the
following cases:
• w = 10100010 r =(0+10)*
• w= 01110110 r =(0+(11)*)*
• w= 000111100 r = ((011+11)*(00)*)*
2. Simplify the following REs:
• 𝜀 + ab + ab + abab(ab)*
• aa(b* + a) + a(ab* + aa)
• a(a+b)* + aa(a+b)*+ aaa(a+b)*

29
Language Description & Lexical Analysis
Regular expression (RE)

Answer
• 𝜀 + ab + abab(ab)* = 𝜀 + ab + ab(ab)*
= 𝜀 + ab(𝜀 + (ab)*)
= 𝜀 + ab(ab)*
= 𝜀 + (ab)*
= (ab)*
• aa(b* + a) + a(ab* + aa) = aa(b* + a)+ aa(b* + a)
= aa(b* + a)
• a(a + b)*+ aa(a + b)*+ aaa(a + b)* = (a + aa + aaa)(a + b)*

30
Language Description & Lexical Analysis
Regular expression (RE)

Exercice 3
1. Prove the following equalities:
• b + ab* + aa*b + aa*ab* = a*(b + ab*)
• a*(b+ab*) = b + aa*b*
2. Give an expression r of the language formed on {a,b} having at most 3a.

31
Language Description & Lexical Analysis
Regular expression (RE)

Answer
1. Verification
• b + ab* + aa*b + aa*ab* = a*(b+ab*)
 (b + ab*) + a+(b + ab*) = (𝜀 + a+)(b + ab*)
= a*(b+ab*)
• a*(b + ab*) = a*b + a*ab*
= (𝜀 + a+)b + a+b*
= b + a+b + a+b*
= b + a+ (b + b*)
= b + a+b*
= b+aa*b*
2. b*(a+𝜀)b*(a+𝜀)b*(a+𝜀)b*

32
Language Description & Lexical Analysis
Introduction to automata

 Example :
 A coffee machine delivers a cup of coffee for 3 DH. This distributor accepts
1DH and 2DH coins but does not give change. A C button allows you to get
coffee. At first the C button is blocked. If you insert enough coins, the button C
becomes free and you cannot insert coins.
 Model the operation of the distributor.

Automata:
Initial state e0: button blocked
State e1: when we insert 1DH
State e2: when we insert 2DH
State e3: button C is released
33
Language Description & Lexical Analysis
Introduction to automata

→ : Initial state

→ : final state

Actual State
Entry ribbon

a e D 2 i N S

Transition table
Reading head 34

Finit Automata
Language Description & Lexical Analysis
Introduction to automata

 An automaton reads a word written on its input ribbon. It starts from an initial
state and with each letter read, it changes state. If at the end of the word, it is in a
final state, we say that it accepts the word.
 Word belongs to language ➔ last state is a final state.
 Word does not belong to language ➔ Otherwise.

35
Language Description & Lexical Analysis
Deterministic finite automaton (DFA)

 Deterministic finite automaton (DFA)


Definition
A DFA A is defined as a 5-tuple 𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > :

• Q is the set of states,

• Σ is an alphabet,

• 𝜹: Q × Σ → 𝑸 is the transition function,

• 𝒒𝟎 is the initial state (𝒒𝟎 ∈ 𝑸 ),

• 𝑭 is the set of final states (𝑭 ⊆ Q ).


36
Language Description & Lexical Analysis
Deterministic finite automaton (DFA)
 Automata can be represented by graphs where the states are the nodes and the
edges represent the transition function.

Initial state final state

a
 If 𝜹 𝒒𝟎 , 𝒂 = 𝒒𝟏 ➔ 𝒒𝟎 𝒒𝟏

Exercice
Represent the following automaton graphically:
𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > with Σ={0,1} , 𝑸 = {𝒒𝟎 , 𝒒𝟏 } , 𝒒𝟎 is the initial state and 𝑭 = 𝒒𝟏 .

𝜹 0 1
𝒒𝟎 𝒒𝟎 𝒒𝟏 37

𝒒𝟏 𝒒𝟎 𝒒𝟏
Language Description & Lexical Analysis
Language accepted by DFA

 We say that an automaton 𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > accepts a word 𝑥 = 𝑥1 𝑥2 … 𝑥𝑛 , if


and only if 𝜹(… 𝜹 𝜹 𝜹 𝒒𝟎 , 𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 … ), 𝒙𝒏 ) ∈ F .

Example of verification
• 11111 (Yes)
𝛿(𝛿 𝛿 𝛿 𝛿 𝑞0 , 1 , 1 , 1 , 1 , 1) ∈ F
• 10010 (No)
𝛿(𝛿 𝛿 𝛿 𝛿 𝑞0 , 1 , 0 , 0 , 1 , 0) ∉ F

38
Language Description & Lexical Analysis
Nondeterministic finite automaton (NFA)
 An NFA is defined by a 5-tuple M = < Q, Σ , δ , q0, F >, where:
 Q is a finite set of states.
 Σ is a finite set of input symbols (alphabet).
 δ is the transition function, defined as δ : Q × ( Σ ∪ { ε } ) → 𝟐𝑸 .
 This function takes a state and an input symbol (or epsilon, which represents an
empty string) and returns a set of states.
 q0 ∈ Q is the initial state.
 F ⊆ Q is the set of accepting (or final) states.

39
Language Description & Lexical Analysis
Nondeterministic finite automaton (NFA)
 Characteristics of an NFA
1. Multiple Transitions: From a given state and input symbol, the NFA can
transition to any number of possible next states.
2. Epsilon Transitions: The NFA can transition from one state to another
without consuming any input symbols (using epsilon transitions).
3. Acceptance of Strings: An NFA accepts an input string if there exists at least
one sequence of transitions (including epsilon transitions) starting from the
initial state and ending in an accepting state, such that the entire input string is
consumed.

40
Language Description & Lexical Analysis
Nondeterministic finite automaton (NFA)
a a
 Example of an NFA
 Consider an NFA M = < Q, Σ , δ , q0, F > with the following components: a b
q0 q1 q2
 States: Q={q0,q1,q2}
 Alphabet: Σ={a,b}
 Transition function δ: b
 δ(q0,a)={q0,q1} (on input a, from state q0​, the NFA can stay in q0​ or
move to q1)
 δ(q0,b)={q0} (on input b, from state q0​, the NFA can stay in q0​)
 δ(q1,b)={q2} (on input b, from state q1​, the NFA can move to q2​)
 δ(q2,a)={q2} (on input a, from state q2​, the NFA can stay in q2​)
 Initial state: q0
41
 Accepting states: F={q2}
Language Description & Lexical Analysis
Nondeterministic finite automaton (NFA)
 Example Execution:
 To check if the NFA accepts the string "ab":
 Start at q0​.
 On input a, move to q0​ or q1​.
 From q0​ on input b, stay in q0​ (this path does not lead to acceptance).
 From q1​ on input b, move to q2​ (this path leads to acceptance).
 Since there exists a path where the NFA ends in an accepting state (q2), the NFA
accepts the string "ab". a a

a b
q0 q1 q2

42

b
Language Description & Lexical Analysis
Nondeterministic finite automaton (NFA)
 It is difficult to move directly from a RE to a DFA.
 It’s easy to move from a RE to a NFA.

Definition
A NFA 𝑨 =< 𝑸, Σ, 𝜹, 𝒒𝟎 , 𝑭 > is characterized by :
𝜹: 𝑸 × Σ ‫𝜀 ڂ‬ → 𝑷(𝑸)

 𝑃({𝑎, 𝑏, 𝑐}){∅, {𝑎} , {𝑏} , {𝑐} , {𝑎, 𝑏} , {𝑎, 𝑐} , {𝑏, 𝑐} , {𝑎, 𝑏, 𝑐}}
 Several arcs, recognizing the same symbol, can exit the same state.
 There may be 𝜀 −transitions

𝜀 is empty word, not 43

belong on the
alphabet
Language Description & Lexical Analysis
Nondeterministic finite automaton (NFA)

Example
Represent with a Graph the following NFA:
A=<Q, Σ, δ, q0 , F> with Σ={0, 1, 2, 3, 4, 5, 6, 7, 8, 9, +, -, .} and
Q={q0 , q1 , q2 , q3 , q4 , q5 } q0 is the initial state and F={q5}

44
Language Description & Lexical Analysis
Transformation of NFA into DFA

 The transformation of a Nondeterministic Finite Automaton (NFA) into a


Deterministic Finite Automaton (DFA) is known as the subset construction or
powerset construction method.
 This process involves creating states in the DFA that correspond to sets of NFA
states.

45
Language Description & Lexical Analysis
Transformation of NFA into DFA

1. Definitions
 NFA: A 5-tuple (Q,Σ,δ,q0,F)
 Q: Finite set of states
 Σ: Alphabet
 δ: Transition function δ:Q×Σ→2𝑄
 q0​: Initial state
 F: Set of accepting states
 DFA: A 5-tuple (Q′,Σ,δ′,q0′,F′)
 Q′: Finite set of states
 Σ: Alphabet
 δ′: Transition function δ′:Q′×Σ→Q′
 q0′​: Initial state
46
 F′: Set of accepting states
Language Description & Lexical Analysis
Transformation of NFA into DFA

2. Subset Construction Algorithm


1. Initialization:
 The initial state of the DFA, q0′​, is the epsilon-closure of the NFA's initial state q0​.
The epsilon-closure of a state is the set of states reachable from that state by
epsilon transitions alone.
2. Processing States:
 For each state in the DFA, which is a set of NFA states, calculate its transitions for
each symbol in Σ.
 For a state S in the DFA and an input symbol a:
 Compute the set of NFA states reachable from any state in S on input a.
 Take the epsilon-closure of this set to form the new DFA state.
3. Mark Accepting States:
47
 A state in the DFA is an accepting state if it includes at least one accepting state of
the NFA.
Language Description & Lexical Analysis
Transformation of NFA into DFA

3. Example
 Consider the following NFA:
 States: Q={q0,q1,q2}
a a
 Alphabet: Σ={a,b}
 Transition function δ: a b
q0 q1 q2
 δ(q0,a)={q0,q1}
 δ(q0,b)={q0}
 δ(q1,b)={q2} b
 δ(q2,a)={q2}
 Initial state: q0​
48
 Accepting state: F={q2}
Language Description & Lexical Analysis
Transformation of NFA into DFA a a

a b
q0 q1 q2

3. Example b
b
Step-by-Step Construction of the DFA:
1. Initial State:
 q0′=ϵ-closure({q0})={q0} a
q0 q0,q1
2. Transitions from {q0} :
 On a:
 δ({q0},a)={q0,q1}
 New state in DFA: {q0,q1}
 On b:
 δ({q0},b)={q0}
 New state in DFA: {q0}
49
Language Description & Lexical Analysis
Transformation of NFA into DFA a a

a b
q0 q1 q2

3. Example
a
b
b
Step-by-Step Construction of the DFA:
3. Transitions from {q0,q1} :
 On a: a
q0 q0,q1
 δ({q0,q1},a)={q0,q1} (reachable from both b
q0​ and q1 on a)
 New state in DFA: {q0,q1} (already exists)
 On b:
q0,q2
 δ({q0,q1},b)={q0,q2} (reachable from q0​ on
b is q0​, and from q1​ on b is q2​)
 New state in DFA: {q0,q2}

50
Language Description & Lexical Analysis
Transformation of NFA into DFA a a

a b
q0 q1 q2

3. Example
a
b
b
Step-by-Step Construction of the DFA:
4. Transitions from {q0,q2} :
 On a: a
q0 q0,q1
 δ({q0,q2},a)={q0,q1,q2} b
 New state in DFA: {q0,q1,q2} b
 On b:
a
 δ({q0,q2},b)={q0} q0,q1,q2 q0,q2
 New state in DFA: {q0} (already exists)

51
Language Description & Lexical Analysis
Transformation of NFA into DFA a a

a b
q0 q1 q2

3. Example
a
b
b
Step-by-Step Construction of the DFA:
5. Transitions from {q0,q1,q2} :
 On a: a
q0 q0,q1
 δ({q0,q1,q2},a)={q0,q1,q2} b
 New state in DFA: {q0,q1,q2} (already exists) b
 On b:
a
 δ({q0,q1,q2},b)={q0,q2} q0,q1,q2 q0,q2
 New state in DFA: {q0,q2} (already exists)

a b
52
Language Description & Lexical Analysis
Transformation of NFA into DFA a a

a b
q0 q1 q2

3. Example
a
b

DFA States: b
 Q′={{q0},{q0,q1},{q0,q2},{q0,q1,q2}}
DFA Transitions:
a
 δ′({q0},a)={q0,q1}
q0 q0,q1
 δ′({q0},b)={q0} b
 δ′({q0,q1},a)={q0,q1}
b
 δ′({q0,q1},b)={q0,q2}
 δ′({q0,q2},a)={q0,q1,q2} a
 δ′({q0,q2},b)={q0} q0,q1,q2 q0,q2
 δ′({q0,q1,q2},a)={q0,q1,q2}
 δ′({q0,q1,q2},b)={q0,q2}
DFA Initial State:
 q0′={q0}
a b
DFA Accepting States: 53

 F′={{q0,q2},{q0,q1,q2}} (any DFA state containing q2​)


Language Description & Lexical Analysis
Transformation of NFA into DFA

 Exercise:
 Given the following NFA, convert it into an equivalent DFA using the subset construction method.
 NFA Definition
 States: Q={q0,q1,q2}
 Alphabet: Σ={a,b}
 Transition function δ:
 δ(q0,a)={q0,q1}
 δ(q0,b)={q0}
 δ(q1,a)={q2}
 δ(q1,b)={q2}
 δ(q2,a)={}
 δ(q2,b)={q2}
 Initial state: q0​ 54

 Accepting state: F={q2}


Language Description & Lexical Analysis
Minimization of a DFA: Hopcroft's Algorithm

 The process of minimizing a DFA involves reducing the number of states while
preserving the language it accepts.

 Hopcroft's algorithm is a well-known method for minimizing a DFA efficiently.

55
Language Description & Lexical Analysis
Minimization of a DFA: Hopcroft's Algorithm

 Steps of Hopcroft's Algorithm


1. Initialization:
 Divide the states of the DFA into two groups: accepting states and non-accepting states.
2. Refinement:
 Iteratively refine these groups (partitions) until no further refinement is possible.
 For each group and each input symbol, check if the transitions under that symbol lead to
states in different groups. If they do, split the group accordingly.
3. Construct the Minimized DFA:
 Create new states for each of the final groups.
 Define the transitions based on the transitions in the original DFA.
 The initial state of the minimized DFA corresponds to the group containing the initial state of
the original DFA.
 The accepting states of the minimized DFA correspond to the groups containing accepting
56
states of the original DFA.
Language Description & Lexical Analysis
Minimization of a DFA: Hopcroft's Algorithm

 Example
a
 Consider a DFA with the following components:
b
 States: Q={q0,q1,q2,q3,q4}
 Alphabet: Σ={a,b} a
q0 q1 q3
 Transition function δ:
 δ(q0,a)=q1​  δ(q2,b)=q0​ b b a
 δ(q0,b)=q2  δ(q3,a)=q4​
a a
b
 δ(q1,a)=q0  δ(q3,b)=q1​
 δ(q1,b)=q3​  δ(q4,a)=q3​ q2 q4

 δ(q2,a)=q4  δ(q4,b)=q2
 Initial state: q0​ b 57

 Accepting states: F={q0,q3}


a

Language Description & Lexical Analysis a


b

Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3

b b a
a a

 Step-by-Step Minimization b
P0
1. Initialization: q2 q4

 Partition the states into two groups: q3


q0 b

 Accepting states: P0={q0,q3}


 Non-accepting states: P1={q1,q2,q4}

q1 P1
q4

q2

58
a

Language Description & Lexical Analysis a


b

Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3

b b a
a a

 Step-by-Step Minimization b
P0
2. Refinement: q2 q4

 Iteratively refine the partitions based on the


transitions. q0 q3 b

 Iteration 1:
 For each group and each input symbol,
check if transitions lead to different groups.
 Checking P0={q0,q3} :
 On a:
q1 P1
q4
 δ(q0,a)=q1​ (in P1​)
 δ(q3,a)=q4 (in P1​)
q2
 No split needed.
 On b:
 δ(q0,b)=q2​ (in P1​)
 δ(q3,b)=q1 (in P1​) 59

 No split needed.
a

Language Description & Lexical Analysis a


b

Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3

b b a
a a

 Step-by-Step Minimization b
P0
2. Refinement: q2 q4

 Iteration 1:
... q0 q3 b

 Checking P1={q1,q2,q4} :
 On a:
 δ(q1,a)=q0 (in P0​)
 δ(q2,a)=q4​ (in P1​) P1
 δ(q4,a)=q3 (in P0​) P1
q1 q4
 Split P1​ into {q2} and {q1,q4}.
 Updated partitions:
 P0={q0,q3}
 P1={q1,q4}
 P2={q2} P2
q2
60
a

Language Description & Lexical Analysis a


b

Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3

 Step-by-Step Minimization P0 b
a
b
a
a

2. Refinement: b

 Iteration 2: q2 q4
... q0 q3
 Checking P1={q1,q4} :
b
 On a:
 δ(q1,a)=q0 (in P0​)
 δ(q4,a)=q3 (in P0​)
 No split needed. P1
 On b:
q1 q4
 δ(q1,b)=q3 (in P0​)
 δ(q4,b)=q2​ (in P2​)
 Split P1​ No split needed.
 Updated partitions:
 P0={q0,q3} P2
q2
 P1={q1, q4}
 P2={q2}
a

Language Description & Lexical Analysis a


b

Minimization of a DFA: Hopcroft's Algorithm q0 q1 q3

Step-by-Step Minimization
b b a
 a a

3. Construct the Minimized DFA: b

 New states: Each partition becomes a new state. q2 q4


 P0={q0,q3} becomes q0′
 P1={q1, q4} becomes q1′​
b
 P2={q2} becomes q2′​
 Transition function:
 δ′(q0′,a)=q1′​ (since δ(q0,a)=q1​ and δ(q3,a)=q4)
 δ′(q0′,b)=q2′ (since δ(q0,b)=q2 and δ(q3,b)=q1)
 δ′(q1′,a)=q0′​ (since δ(q1,a)=q0 and δ(q4,a)=q3)
 δ′(q1′,b)=q0′ (since δ(q1,b)=q3​ and δ(q4,b)=q2)
 δ′(q2′,a)=q3′​ (since δ(q2,a)=q4​)
 δ′(q2′,b)=q0′​ (since δ(q2,b)=q0​)
 Initial state: q0′
 Accepting states: {q0′} (since q0′​ contains q0​ and q3​)
62
Language Description & Lexical Analysis
Minimization of a DFA: Hopcroft's Algorithm
b
 Minimized DFA
 States:
 Q′={q0′,q1′,q2′} a
 Alphabet:
 Σ={a,b}
 Transition Function: q0’ q1’
 δ′(q0′,a)=q1′​
 δ′(q0′,b)=q2′​
 δ′(q1′,a)=q0′​
a
 δ′(q1′,b)=q0′ b
 δ′(q2′,a)=q3′​ b
q2’
 δ′(q2′,b)=q0′​
 Initial State:
q0′​

 Accepting States:
 F′={q0′} 63
Language Description & Lexical Analysis
Minimization of a DFA: Hopcroft's Algorithm
 Exercise: Minimize a DFA Using Hopcroft's Algorithm
 Given the following DFA, use Hopcroft's algorithm to minimize it.
 DFA Definition
 States: Q={q0,q1,q2,q3,q4,q5}
 Alphabet: Σ={a,b}
 Transition function δ:
 δ(q0,a)=q1​  δ(q3,a)=q0​
 δ(q0,b)=q2​  δ(q3,b)=q3​
 δ(q1,a)=q0​  δ(q4,a)=q4​
 δ(q1,b)=q3​  δ(q4,b)=q5​
 δ(q2,a)=q4​  δ(q5,a)=q4​
 δ(q2,b)=q5​  δ(q5,b)=q5
 Initial state: q0​ 64

 Accepting states: F={q0,q3}


Language Description & Lexical Analysis
Equivalence of finite automata to regular expressions

Theorem
A language is regular if and only if it is generated by a finite automaton.

65
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 The construction of a Nondeterministic Finite Automaton (NFA) from a regular


expression (RE) can be systematically achieved using Thompson's Construction
algorithm.
 This method allows us to build an NFA that recognizes the same language as the
given regular expression.

66
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Thompson's Construction Algorithm


1. Basic Operations:
 Single Symbol: For a single symbol a, the NFA is constructed as:

 Epsilon (ε): For the epsilon symbol, the NFA is:

 Empty String: The NFA accepts only the empty string.

67
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Thompson's Construction Algorithm


2. Concatenation (AB):
 If A and B are two NFAs constructed for regular expressions R and S respectively, the NFA for
RS is constructed by connecting the final state of A to the initial state of B using epsilon
transitions.

68
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Thompson's Construction Algorithm


3. Union (A | B):
 If A and B are two NFAs for regular expressions R and S respectively, the NFA for R ∣ S is
constructed by introducing a new initial state with epsilon transitions to the initial states of both
A and B, and new final state with epsilon transitions from the final states of both A and B.

69
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Thompson's Construction Algorithm


3. Kleene Star (A)* :
 If A is an NFA for regular expression R, the NFA for R* is constructed by adding epsilon
transitions from the final state of A back to its initial state, and from a new initial state to the
initial state of A and from the final state of A to a new final state.

70
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Example
 Let's construct an NFA from the regular expression : (a|b)*abb
1. Construct NFA for a and b:
 For a:
a
q0 q1

 For b:

b
q2 q3

71
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Example
 Let's construct an NFA from the regular expression : (a|b)*abb
2. Construct NFA for a|b:
 Using union:
 Combined NFA for a|b:

a
ε q0 q1 ε

q4 q5

ε b ε
q2 q3

72
73

Language Description & Lexical Analysis


Construction of an NFA from a Regular Expression

 Example
 Let's construct an NFA from the regular expression : (a|b)*abb
3. Construct NFA for (a|b)*: ε
 Using Kleene star:
 Combined NFA for (a|b)*:
a
ε q0 q1 ε
ε
q6 q4 q5 q7
ε
ε b ε
q2 q3

ε
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Example
 Let's construct an NFA from the regular expression : (a|b)*abb
4. Construct NFA for abb:
 Concatenation: ε

a
ε q0 q1 ε
ε a b b
q6 q4 q5 q7 q8 q9 q10 q11
ε
ε b ε
q2 q3

ε
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Example
 Let's construct an NFA from the regular expression : (a|b)*abb
4. Combine (a|b)* with abb using concatenation:
 Concatenation: ε

a
ε q0 q1 ε
ε ε a b b
q6 q4 q5 q7 q8 q9 q10 q11
ε
ε b ε
q2 q3

ε
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Example
 Let's construct an NFA from the regular expression : (a|b)*abb
 Full NFA for (a|b)*abb:
1. States:
 Q={q0,q1,q2,q3,q4,q5,q6,q7,q8,q9,q10,q11}

2. Alphabet:
 Σ={a,b}

3. Transitions:
 δ(q6, ε) = {q4, q7}  δ(q2, b) = {q3}  δ(q8, a) = {q9}
 δ(q4, ε) = {q0, q2}  δ(q3, ε) = {q5}  δ(q9, b) = {q10}
 δ(q0, a) = {q1}  δ(q5, ε) = {q4, q7}  δ(q10, b) = {q11}
 δ(q1, ε) = {q5}  δ(q7, ε) = {q8}
4. Initial State:
 q6

5. Accepting State: 76

 F={q11}
Language Description & Lexical Analysis
Construction of an NFA from a Regular Expression

 Exercice:

 Construct an NFA for the regular expression ((ab)*|a*) over the alphabet Σ = {a, b}

77
Language Description & Lexical Analysis
Flex tool

 Flex (Fast Lexical Analyzer) is a tool for generating scanners, also known as lexical analyzers.
 A scanner reads input text and breaks it into tokens, which are the basic building blocks for
syntax analysis in the compilation process.
 Flex is widely used in conjunction with Bison, a parser generator, to create a complete
compiler or interpreter.
 Overview of Flex
 Input: A specification file containing patterns and corresponding actions.
 Output: A C source file (typically lex.yy.c) containing the code for the lexical analyzer.
 Usage: The generated lexical analyzer reads input text, matches it against the specified
patterns, and performs the associated actions.

78
Language Description & Lexical Analysis
Flex tool

79
Language Description & Lexical Analysis
Flex tool

 Structure of a Flex Specification File


 A Flex specification file consists of three sections separated by %%:

< DEFINITIONS >


%%
< RULES>
%%
< USER SUB-PROGRAMS >

 Only the first "%%" is required

80
Language Description & Lexical Analysis
Flex tool

 Structure of a Flex Specification File


 Definitions Section: Contains declarations and definitions, such as C code,
#include directives, and macro definitions.
 Rules Section: Contains patterns and associated actions. Each pattern is a
regular expression, and each action is a block of C code to execute when the
pattern is matched.
 User Code Section: Contains additional C code, such as helper functions, that
are used by the actions.

81
Language Description & Lexical Analysis
Flex tool

Defining Regular Expressions in Flex


 Regular expressions in Flex are used to describe patterns in the input text. These
patterns are matched against the input to identify tokens.
 Flex uses a syntax similar to traditional regular expressions with some additional
features tailored for lexical analysis.
Basic Elements of Regular Expressions in Flex
1. Literals:
 Matches the exact character(s).
 Example: a matches the character 'a’.
2. Character Classes:
 Matches any one of the characters in the set.
 Example: [abc] matches 'a', 'b', or 'c'.
82
Language Description & Lexical Analysis
Flex tool

Basic Elements of Regular Expressions in Flex


3. Ranges:
 Matches any character within the specified range.
 Example: [a-z] matches any lowercase letter.
4. Negated Character Classes:
 Matches any character not in the set.
 Example: [^a-z] matches any character except lowercase letters.
5. Predefined Character Classes:
 . matches any character except newline.
 \d matches any digit (equivalent to [0-9]).
 \w matches any word character (equivalent to [a-zA-Z0-9_]).
 \s matches any whitespace character (space, tab, newline, etc.).
83
Language Description & Lexical Analysis
Flex tool

Basic Elements of Regular Expressions in Flex


6. Quantifiers:
 * matches zero or more occurrences.
 + matches one or more occurrences.
 ? matches zero or one occurrence.
 {n} matches exactly n occurrences.
 {n,} matches n or more occurrences.
 {n,m} matches between n and m occurrences.
7. Anchors:
 ^ matches the beginning of a line.
 $ matches the end of a line.
8. Grouping and Alternation:
84
 (abc) groups the characters 'abc' together.
 a|b matches either 'a' or 'b'.
Language Description & Lexical Analysis
Flex tool

Basic Elements of Regular Expressions in Flex


Flex-Specific Extensions
1. Named Patterns:
 Define reusable patterns using the name syntax.
 Example: DIGIT [0-9] defines a named pattern DIGIT.
2. Action Blocks:
 { ... } is used to enclose C code that is executed when the pattern is matched.
 Example: DIGIT { printf("Digit: %s\n", yytext); }
3. State Transitions:
 <STATE> matches a pattern only in the specified start condition state.
 Example: <INITIAL>hello matches "hello" only in the INITIAL state.

85
Language Description & Lexical Analysis
Flex tool

Predefined variables and functions


Flex, the fast lexical analyzer generator, provides several predefined variables and functions that help in writing
lexical analyzers.
 Predefined Variables
1. yytext:
 Type: char *

 Description: A pointer to the matched text. It points to the start of the current token in the input stream.

 Usage: Access the matched text in actions, e.g., printf("Matched text: %s\n", yytext);

2. yyleng:
 Type: int

 Description: The length of the matched text, excluding the null terminator.

 Usage: Get the length of the current token, e.g., printf("Length of matched text:

%d\n", yyleng);

86
Language Description & Lexical Analysis
Flex tool
Predefined variables and functions
 Predefined Variables
3. yylval:
 Type: Depends on the user definition

 Description: Used to return values from lexical analyzer to the parser. Typically used in conjunction

with Yacc/Bison.
 Usage: Set semantic values for tokens, e.g., yylval = atoi(yytext);

4. yyin:
 Type: FILE *

 Description: The file pointer from which Flex reads input. By default, it is stdin.

 Usage: Change the input source, e.g., yyin = fopen("input.txt", "r");

5. yyout:
 Type: FILE *

 Description: The file pointer to which Flex writes its output. By default, it is stdout.

 Usage: Change the output destination, e.g., yyout = fopen("output.txt", "w");


Language Description & Lexical Analysis
Flex tool
Predefined variables and functions
 Predefined Functions
1. yylex():
 Type: int

 Description: The main lexical analyzer function. It matches the next token in the input and executes

the corresponding action.


 Usage: Called in the main function or by a parser to get the next token, e.g., int token =

yylex();
2. yywrap():
 Type: int

 Description: Called when the end of the input file is reached. By default, it returns 1, indicating no

more input. Returning 0 allows further input.


 Usage: Override to handle end-of-file conditions, e.g.,

int yywrap() {
return 1; // Indicate end of input
}
Language Description & Lexical Analysis
Flex tool
Predefined variables and functions
 Predefined Functions
3. yyrestart(FILE *new_file):
 Type: void

 Description: Switches the input stream to new_file.

 Usage: Change input source at runtime, e.g., yyrestart(fopen("new_input.txt", "r"));

4. yy_scan_string(const char *str):


 Type: YY_BUFFER_STATE

 Description: Scans a string as if it were input from a file.

 Usage: Scan an in-memory string, e.g., yy_scan_string("some text");

5. yy_delete_buffer(YY_BUFFER_STATE buffer):
 Type: void

 Description: Deletes a buffer created by yy_scan_string or similar functions.

 Usage: Clean up buffer, e.g.,

YY_BUFFER_STATE buffer = yy_scan_string("some text");


yy_delete_buffer(buffer);
Language Description & Lexical Analysis
Flex tool
Example:
%{
#include <stdio.h>
%}
%%
[0-9]+ { printf("Integer: %s\n", yytext); }
[a-zA-Z_][a-zA-Z0-9_]* { printf("Identifier: %s\n", yytext); }
" " { /* Ignore whitespace */ }
\n { /* Ignore newlines */ }
. { printf("Unknown character: %s\n", yytext); }
%%
int main(void) {
printf("Enter input:\n");
yylex(); // Start the lexical analysis
return 0;
}
int yywrap() {
return 1; // Indicate end of input
}
Language Description & Lexical Analysis
Flex tool %{
#include <stdio.h>
%}
NUMBER_TOKEN [0-9]*
ID_TOKEN [a-zA-Z_][a-zA-Z0-9_]*
PLUS_TOKEN "+"
Example of a Flex Specification File MINUS_TOKEN "-"
TIMES_TOKEN "*"
 Here is an example of a Flex DIV_TOKEN "/"
specification file for a simple lexical MOD_TOKEN "%"
SPACE_TOKEN [ \t\n]+
analyzer that recognizes arithmetic %%
expressions with numbers, identifiers, {NUMBER_TOKEN} {printf("NUMBER_TOKEN\n");}
and operators. {ID_TOKEN} {printf("ID_TOKEN\n");}
{PLUS_TOKEN} {printf("PLUS_TOKEN\n");}
{MINUS_TOKEN} {printf("MINUS_TOKEN\n");}
{TIMES_TOKEN} {printf("TIMES_TOKEN\n");}
{DIV_TOKEN} {printf("DIV_TOKEN\n");}
{MOD_TOKEN} {printf("MOD_TOKEN\n");}
{MOD_TOKEN} {printf("MOD_TOKEN\n");}
. {printf("ERROR\n");}
%%
int main(int argc, char ** argv){
if(argc >= 1){
yyin = fopen(argv[1], "r");
yylex(); 91
}else{ printf("Insufficient number of arguments\n"); }
}
Language Description & Lexical Analysis
Flex tool

Example of a Flex Specification File


 Explanation of the Example
 Definitions Section:
%{
#include <stdio.h>
%}

 This section, enclosed in %{ and %}, contains C code that will be copied
verbatim to the top of the generated C file. Here, it includes the standard I/O
library for printing.

92
Language Description & Lexical Analysis
Flex tool
Example of a Flex Specification File
 Explanation of the Example
 Token Definitions
NUMBER_TOKEN [0-9]*
ID_TOKEN [a-zA-Z_][a-zA-Z0-9_]*
...

 These lines define named patterns (macros) for the various tokens using regular
expressions:
 NUMBER_TOKEN: Matches zero or more digits ([0-9]*).
 ID_TOKEN: Matches an identifier, which starts with a letter or underscore and is
followed by letters, digits, or underscores ([a-zA-Z_][a-zA-Z0-9_]*).
 PLUS_TOKEN: Matches the plus sign ("+").
 MINUS_TOKEN: Matches the minus sign ("-").
 TIMES_TOKEN: Matches the asterisk ("*") used for multiplication.
 DIV_TOKEN: Matches the forward slash ("/") used for division.
 MOD_TOKEN: Matches the percent sign ("%") used for modulo operation.
 SPACE_TOKEN: Matches one or more whitespace characters ([ \t\n]+).
Language Description & Lexical Analysis
Flex tool
Example of a Flex Specification File
 Explanation of the Example
 Rules Section
%%
{NUMBER_TOKEN} {printf("NUMBER_TOKEN\n");}
...
. {printf("ERROR\n");}
%%

 This section, enclosed between %% markers, contains the rules for matching the patterns defined
above and the actions to perform when a match is found:
 {NUMBER_TOKEN}: If the input matches NUMBER_TOKEN, it prints "NUMBER_TOKEN".
 {ID_TOKEN}: If the input matches ID_TOKEN, it prints "ID_TOKEN".
 {PLUS_TOKEN}: If the input matches PLUS_TOKEN, it prints "PLUS_TOKEN".
 {MINUS_TOKEN}: If the input matches MINUS_TOKEN, it prints "MINUS_TOKEN".
 {TIMES_TOKEN}: If the input matches TIMES_TOKEN, it prints "TIMES_TOKEN".
 {DIV_TOKEN}: If the input matches DIV_TOKEN, it prints "DIV_TOKEN".
 {MOD_TOKEN}: If the input matches MOD_TOKEN, it prints "MOD_TOKEN".
 {SPACE_TOKEN}: If the input matches SPACE_TOKEN, it prints “SPACE_TOKEN".
 .: If the input matches any other character not specified above, it prints "ERREUR".
Language Description & Lexical Analysis
Flex tool
Example of a Flex Specification File
 Explanation of the Example
 Main Function
int main(int argc, char ** argv){
if(argc >= 1){
yyin = fopen(argv[1], "r");
yylex();
}else{
printf("Insufficient number of arguments\n");
}
}

 The main function:


 Checks if there is at least one command-line argument (which should be the
name of the input file).
 If so, it opens the file and sets yyin to read from it.
 Calls yylex(), which starts the lexical analysis and tokenization process.
 If no input file is provided, it prints "Insufficient number of arguments".
Language Description & Lexical Analysis
Flex tool

Example of a Flex Specification File


 Generating and Compiling the Lexical Analyzer
1. Save the code in a file (e.g., lexer.l).
2. Generate the Lexical Analyzer:
 Run Flex on the specification file to generate the C source file for the lexical analyzer.
flex lexer.l
3. Compile the Lexical Analyzer:
 Use a C compiler to compile the generated C source file (lex.yy.c) along with any other
necessary files.
gcc -o lexer lex.yy.c –lfl
4. Run the lexer:
96
./lexer input_file
Language Description & Lexical Analysis
Flex tool

Example of a Flex Specification File


 Example Input and Output
 Suppose you have an input file input_file with the following content:
123 + abc - 456 * def / ghi % jkl

 Running the lexer on this file will produce the following output:
NUMBER_TOKEN
PLUS_TOKEN
ID_TOKEN
MINUS_TOKEN
NUMBER_TOKEN
TIMES_TOKEN
ID_TOKEN
DIV_TOKEN
ID_TOKEN
MOD_TOKEN 97
ID_TOKEN
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code
 A Comprehensive example ‘lexer.c’ of a lexical analyzer in C that recognizes arithmetic
expressions with numbers, identifiers, and operators, including some basic error handling.
#include <stdio.h>
#include <ctype.h>
#include <string.h>

typedef enum {
TOKEN_IDENTIFIER,
TOKEN_NUMBER,
TOKEN_OPERATOR,
TOKEN_PAREN_OPEN,
TOKEN_PAREN_CLOSE,
TOKEN_UNKNOWN,
TOKEN_END
} TokenType;

typedef struct {
TokenType type; 98
char text[100];
} Token;
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code

...
const char *tokenTypeToString(TokenType type) {
switch (type) {
case TOKEN_IDENTIFIER: return "Identifier";
case TOKEN_NUMBER: return "Number";
case TOKEN_OPERATOR: return "Operator";
case TOKEN_PAREN_OPEN: return "Paren Open";
case TOKEN_PAREN_CLOSE: return "Paren Close";
case TOKEN_UNKNOWN: return "Unknown";
case TOKEN_END: return "End";
default: return "Invalid";
}
}

void printToken(Token token) {


printf("Token: %-12s Text: %s\n", tokenTypeToString(token.type), token.text);
}
...
99
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code

...
Token getNextToken(const char **input) {
while (isspace(**input)) (*input)++; // Skip whitespace

if (**input == '\0') {
return (Token){TOKEN_END, ""};
}
Token token;
if (isalpha(**input)) {
// Identifier
token.type = TOKEN_IDENTIFIER;
int length = 0;
while (isalnum(**input)) {
token.text[length++] = *(*input)++;
}
token.text[length] = '\0';
} else if (isdigit(**input)) {
...
100
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code

...
// Number
token.type = TOKEN_NUMBER;
int length = 0;
while (isdigit(**input)) {
token.text[length++] = *(*input)++;
}
token.text[length] = '\0';
} else if (strchr("+-*/=", **input)) {
// Operator
token.type = TOKEN_OPERATOR;
token.text[0] = *(*input)++;
token.text[1] = '\0';
} else if (**input == '(') {
// Opening Parenthesis
token.type = TOKEN_PAREN_OPEN;
token.text[0] = *(*input)++;
token.text[1] = '\0';
101
} else if (**input == ')’) {
...
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code
...
// Closing Parenthesis
token.type = TOKEN_PAREN_CLOSE;
token.text[0] = *(*input)++;
token.text[1] = '\0';
} else {
// Unknown token
token.type = TOKEN_UNKNOWN;
token.text[0] = *(*input)++;
token.text[1] = '\0';
}

return token;
}
...

102
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code

...
int main() {
const char *input = "var1 = 42 + (var2 * 3) / num3";
printf("Input: %s\n", input);
Token token;
while ((token = getNextToken(&input)).type != TOKEN_END) {
printToken(token);
}
return 0;
}

103
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code
 Explanation of the Code
 TokenType Enumeration: Defines the different types of tokens: TOKEN_IDENTIFIER,
TOKEN_NUMBER, TOKEN_OPERATOR, TOKEN_PAREN_OPEN, TOKEN_PAREN_CLOSE,
TOKEN_UNKNOWN, and TOKEN_END.
 Token Structure: Represents a token with a type and a text value.
 tokenTypeToString Function: Converts a TokenType to a string for printing purposes.
 printToken Function: Prints a token's type and text value.
 getNextToken Function:
 Takes a pointer to the input string and returns the next token.
 Skips whitespace.
 Recognizes identifiers (alphanumeric strings starting with a letter).
 Recognizes numbers (strings of digits).
 Recognizes operators (+, -, *, /, =).
 Recognizes parentheses (( and )).
 Recognizes unknown tokens (any other single character).
 main Function:
 Defines the input string to be tokenized.
104
 Calls getNextToken in a loop to tokenize the entire input string and print each token.
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code
 Compilation and Execution
 To compile and run the lexical analyzer, follow these steps:
1. Save the code to a file named lexer.c.
2. Compile the code using a C compiler, e.g., gcc:
gcc -o lexer lexer.c
3. Run the executable:
./lexer

105
Language Description & Lexical Analysis
Lexical analyzer in C Code
Example of Lexical Analyzer in C Code
 Example Output
 When you run the program with the input string "var1 = 42 + (var2 * 3) / num3", the output will be:

Input: var1 = 42 + (var2 * 3) / num3


Token: Identifier Text: var1
Token: Operator Text: =
Token: Number Text: 42
Token: Operator Text: +
Token: Paren Open Text: (
Token: Identifier Text: var2
Token: Operator Text: *
Token: Number Text: 3
Token: Paren Close Text: )
Token: Operator Text: /
Token: Identifier Text: num3 106
Language Description &
Lexical Analysis

The End

You might also like