2024 CD-Ch02 Lexical Analysis
Institute of Technology
Department of Computer Science
Course Title: Compiler Design (CoSc4103)
Chapter Two: Lexical Analysis and Lex
2nd task: removing any comments and whitespace from the source code, in the form of blank,
tab, and newline characters.
Another task: it generates error messages if it finds an invalid token in the source program.
It identifies valid lexemes from the program and returns tokens to the syntax analyzer,
one after the other, in response to each getNextToken command from the syntax
analyzer.
[Diagram: the Lexical Analyzer reads the source program character by character (read char / put back char) and, on each getNextToken request from the Parser, returns a token and its token value, which then flow on to semantic analysis.]
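The lexer-parser interface just described can be sketched in C. The Token struct, the TOK_* names, and the string-pointer signature of getNextToken below are illustrative assumptions, not taken from any real compiler.

```c
#include <ctype.h>
#include <string.h>

/* Illustrative sketch of the lexer-parser interface: the parser calls
   getNextToken each time it needs the next token. Token kinds and
   names here are assumptions, not from a real compiler. */
typedef enum { TOK_ID, TOK_NUM, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];   /* the matched characters */
} Token;

/* Reads one token from *src, advancing the pointer past the lexeme;
   whitespace (blank, tab, newline) is skipped, as described above. */
Token getNextToken(const char **src) {
    const char *p = *src;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                                  /* skip whitespace */
    Token t = { TOK_EOF, "" };
    int n = 0;
    if (isalpha((unsigned char)*p)) {         /* identifier: letter (letter|digit)* */
        t.kind = TOK_ID;
        while (isalnum((unsigned char)*p) && n < 31)
            t.lexeme[n++] = *p++;
    } else if (isdigit((unsigned char)*p)) {  /* number: digit digit* */
        t.kind = TOK_NUM;
        while (isdigit((unsigned char)*p) && n < 31)
            t.lexeme[n++] = *p++;
    }
    t.lexeme[n] = '\0';
    *src = p;
    return t;
}
```

Calling getNextToken repeatedly on the input "x1 42" yields an id token with lexeme "x1", then a number token with lexeme "42", then TOK_EOF.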
When you work on lexical analysis, there are three important terms to know:
Lexemes, Patterns, and Tokens.
Tokens: sets of strings defining an atomic element with a defined meaning.
A token is a pre-defined sequence of characters that cannot be broken down further.
But here are some questions raised by the tasks of LA:
How does the lexical analyzer read the input string and break it into lexemes?
How can it understand the patterns and check if the lexemes are valid?
Attributes of Token
In a program, sometimes more than one lexeme matches the pattern corresponding to one token,
so the lexical analyzer must provide additional information about the particular lexeme,
because the rest of the phases need this information about the lexeme to perform
different operations.
Lexical analyzer collects information about tokens into their associated attributes and sends a
sequence of tokens with their information to the next phase.
i.e., the tokens are sent as pairs of <Token name, Attribute value> to the Syntax
analyzer.
A lexeme is like an instance of a token, and the attribute column is used to show which lexeme
of the token is used.
For every lexeme, the 1st and 2nd columns of the above table are sent to the Syntax Analyzer.
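The <token name, attribute value> pairing can be sketched in C. The TokenPair type, the TOK_* names, and the two-entry symbol table below are hypothetical, made up only to illustrate how a symbol-table index can serve as an id token's attribute.

```c
#include <string.h>

/* Illustrative only: a <token name, attribute value> pair and a tiny
   hypothetical symbol table whose index serves as the id attribute. */
typedef enum { TOK_ID, TOK_NUM, TOK_PLUS, TOK_ASSIGN } TokenName;

typedef struct {
    TokenName name;
    int attribute;   /* symbol-table index for id, literal value for num */
} TokenPair;

static const char *symtab[] = { "x", "y" };   /* assumed entries */

/* Returns the symbol-table index used as the id token's attribute,
   or -1 if the lexeme is not in the table. */
int lookup(const char *lexeme) {
    for (int i = 0; i < 2; i++)
        if (strcmp(symtab[i], lexeme) == 0)
            return i;
    return -1;
}

/* The statement  x = y + 2  would then reach the syntax analyzer as:
   <TOK_ID, lookup("x")> <TOK_ASSIGN> <TOK_ID, lookup("y")>
   <TOK_PLUS> <TOK_NUM, 2>
   (operators carry no attribute value).                              */
```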
2. Strings
Any finite sequence of alphabets (characters) is called a string.
A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
11/28/202 WCU-CS Compiled by TM. 8
Tokens cont’d……
In language theory, the terms sentence and word are often used as synonyms for the term
"string."
The length of a string S is the total number of occurrences of symbols in it, and it is denoted by |S|.
A string having no symbols, i.e., a string of zero length, is known as an empty string and is
denoted by ε (epsilon).
3. Special symbols
A typical high-level language contains a number of special symbols, such as punctuation marks and operators.
Computer languages are considered as finite sets, and mathematically set operations can be
performed on them.
Finite languages can be described by means of regular expressions.
5. Regular Expressions
Regular expressions are an important notation to specify lexeme patterns for a token.
Each pattern matches a set of strings, so regular expressions serve as names for a set of
strings.
Regular expressions are used to represent the language for the lexical analyzer.
The lexical analyzer needs to scan and identify only a finite set of valid strings/tokens/lexemes
that belong to the language at hand.
It searches for the patterns defined by the language rules.
Union of two languages L and M is written as;
L U M = {s | s is in L or s is in M}
Concatenation of two languages L and M is written as;
LM = {st | s is in L and t is in M}
Kleene closure of a language L is written as;
L* = L^0 U L^1 U L^2 U …, i.e., zero or more concatenations of L (note: L^0 = {ε})
Example: let L = {0, 1} and S = {a, b, c}
Union : L U S = {0, 1, a, b, c}
Concatenation : L.S = {0a, 1a, 0b, 1b, 0c, 1c}
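Concatenation of two small languages can be computed mechanically, as this sketch shows; the function name and the fixed-size output buffer are assumptions made for the example.

```c
#include <stdio.h>
#include <string.h>

/* Illustrates the concatenation L.M of two languages: every string of
   L followed by every string of M. Buffer sizes are assumptions that
   fit the small example languages used here. */
void concat_languages(const char *L[], int nl, const char *M[], int nm,
                      char out[][8], int *count) {
    *count = 0;
    for (int i = 0; i < nl; i++)
        for (int j = 0; j < nm; j++) {
            snprintf(out[*count], 8, "%s%s", L[i], M[j]);  /* st for s in L, t in M */
            (*count)++;
        }
}
```

With L = {0, 1} and S = {a, b, c}, this produces the six strings of L.S listed above.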
digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]
sign = [ + | - ]
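A hand-coded recognizer for the pattern built from these definitions, sign? digit digit*, might look like the following sketch; the function name is an assumption.

```c
#include <ctype.h>

/* Recognizer for  sign? digit digit*  built from the regular
   expressions above (digit = [0-9], sign = [+|-]). */
int matches_signed_number(const char *s) {
    if (*s == '+' || *s == '-')
        s++;                                /* optional sign            */
    if (!isdigit((unsigned char)*s))
        return 0;                           /* need at least one digit  */
    while (isdigit((unsigned char)*s))
        s++;                                /* digit*                   */
    return *s == '\0';                      /* whole string must match  */
}
```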
However, the only problem left with the lexical analyzer is how to verify the validity of a
regular expression used in specifying the patterns of keywords of a language.
A well-accepted solution to this problem is to use finite automata for verification.
To recognize and verify the tokens, the lexical analyzer builds Finite Automata for every pattern.
Transition diagrams can be built and converted into programs as an intermediate step.
The programs built from Automata can consist of switch statements to keep track of the state of the
lexeme. The lexeme is verified to be a valid token if it reaches the final state.
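The switch-statement technique just described can be sketched for a simple pattern, identifiers of the form letter (letter|digit)*; the state names are made up for the example.

```c
#include <ctype.h>

/* A DFA for identifiers (letter (letter|digit)*) written as a switch
   over explicit states, as described above. */
enum State { START, IN_ID, DEAD };

int is_identifier(const char *s) {
    enum State state = START;
    for (; *s; s++) {
        switch (state) {
        case START:                 /* first character must be a letter */
            state = isalpha((unsigned char)*s) ? IN_ID : DEAD;
            break;
        case IN_ID:                 /* then letters or digits           */
            state = isalnum((unsigned char)*s) ? IN_ID : DEAD;
            break;
        default:                    /* DEAD: no transition leaves here  */
            return 0;
        }
    }
    return state == IN_ID;          /* IN_ID is the final state         */
}
```

The lexeme is a valid identifier token only if the loop ends in the final state.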
2.3. Lexical Error Recovery
Lexical errors:
are a type of error that can be detected during the lexical analysis phase
occur when a sequence of characters does not match the pattern of any token, i.e., it is not
possible to scan it into any valid token
are thrown by the lexer when it is unable to continue, i.e., when there is no way to recognize a
lexeme as a valid token.
Lexical errors are not very common, but they should be handled by the scanner.
Some common errors in the lexical phase are:
Spelling errors in identifiers, operators, keywords, etc.
Lexical Error cont’d……
Example: see this C code:
void main() {
    int x = 10, y = 20;
    char *a;
    a = &x;
    x = 1xab;
}
In this code, 1xab is neither a number nor an identifier, so this code will show a lexical error.
Lexical error recovery: there are some recovery mechanisms to remove lexical errors.
Some possible error-recovery actions, with examples for “cout”, are:
i. deleting an extra character, e.g. coutt → cout
ii. inserting a missing character, e.g. cot → cout
iii. replacing an incorrect character with the correct one, e.g. couf → cout
iv. transposing two adjacent characters, e.g. ocut → cout
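The four recovery actions above amount to asking whether a bad lexeme is one edit away from an intended token. A sketch of that check, with a function name invented for the example:

```c
#include <string.h>

/* Checks whether one deletion, insertion, replacement, or
   transposition turns `lexeme` into `target` (e.g. "cout"),
   covering the four recovery actions listed above. */
int one_edit_away(const char *lexeme, const char *target) {
    size_t n = strlen(lexeme), m = strlen(target);
    if (n == m + 1) {                        /* deletion: coutt -> cout */
        for (size_t i = 0; i <= m; i++)
            if (strncmp(lexeme, target, i) == 0 &&
                strcmp(lexeme + i + 1, target + i) == 0)
                return 1;
    } else if (n + 1 == m) {                 /* insertion: cot -> cout  */
        return one_edit_away(target, lexeme);    /* symmetric case      */
    } else if (n == m) {
        size_t diff = 0;
        for (size_t i = 0; i < n; i++)
            diff += (lexeme[i] != target[i]);
        if (diff == 1)                       /* replacement: couf -> cout */
            return 1;
        if (diff == 2)                       /* transposition: ocut -> cout */
            for (size_t i = 0; i + 1 < n; i++)
                if (lexeme[i] == target[i + 1] && lexeme[i + 1] == target[i] &&
                    strncmp(lexeme, target, i) == 0 &&
                    strcmp(lexeme + i + 2, target + i + 2) == 0)
                    return 1;
    }
    return 0;
}
```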
However, a few errors are beyond the power of the lexical analyzer to recognize, because a lexical
analyzer has a very localized view of the source program; some other phase of the compiler
handles these errors.
For instance, suppose the string fi is encountered in a C/C++ program for the first time in the
context of:
    fi (a == b) …
Here a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
2.4. Automata: NFA to DFA Conversion
A finite automaton is a state machine that takes a string of symbols as input and changes its state
accordingly.
A finite automaton is a recognizer for regular expressions.
When a string is fed into the finite automaton, the automaton changes its state for each literal.
If the input string is successfully processed and the automaton reaches a final state, the string is
accepted,
i.e., the string that was fed in is said to be a valid token of the language at hand.
A finite automaton consists of:
A set of states S
A start state n
A set of accepting states
Transitions between states, each labeled with an input symbol (e.g., a)
Automata: NFA to DFA cont’d……
A finite automaton accepts a string if we can follow transitions labeled with the characters in the
string from the start state to some accepting state.
Another example: a finite automaton accepting any number of 1's followed by a single 0.
Alphabet: {0, 1}
[Diagram: the start state loops on input 1 and moves on input 0 to the accepting state.]
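This automaton for the language 1*0 can be written directly in C; the function name and numeric state encoding are assumptions made for the sketch.

```c
/* DFA for the language 1*0 over the alphabet {0,1}: any number of 1's
   followed by a single 0. */
int accepts_ones_then_zero(const char *s) {
    int state = 0;                 /* 0 = start, 1 = accepting, 2 = dead */
    for (; *s; s++) {
        if (state == 0 && *s == '1')
            state = 0;             /* loop on 1 at the start state       */
        else if (state == 0 && *s == '0')
            state = 1;             /* the single 0 reaches accepting     */
        else
            state = 2;             /* no transition: reject              */
    }
    return state == 1;
}
```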
Epsilon Moves
Another kind of transition: ε-moves.
With an ε-move from state A to state B, the machine can move from A to B without
reading any input.
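ε-moves are what the NFA-to-DFA construction must account for, via the ε-closure of a state: all states reachable through ε-moves alone. A sketch, using a made-up 3-state NFA (0 →ε 1 →ε 2) as the example:

```c
/* Sketch of computing the epsilon-closure of a state: every state
   reachable from it without reading input. The 3-state NFA below
   (0 -eps-> 1 -eps-> 2) is a made-up example. */
#define N 3

int eps[N][N] = {            /* eps[a][b] = 1 if a -eps-> b */
    {0, 1, 0},
    {0, 0, 1},
    {0, 0, 0},
};

/* Marks in `closure` every state reachable from `state` via ε-moves. */
void eps_closure(int state, int closure[N]) {
    closure[state] = 1;      /* a state is always in its own closure */
    for (int b = 0; b < N; b++)
        if (eps[state][b] && !closure[b])
            eps_closure(b, closure);   /* follow ε-moves transitively */
}
```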
Automata: NFA to DFA cont’d……
Types of Finite Automata
i. Non-Deterministic Automata (NFA).
ii. Deterministic Automata (DFA)
i. Nondeterministic Finite Automata (NFA)
Can have multiple transitions for one input symbol in a given state, and may also use ε-moves
Reading assignment
Execution of finite automata
Details of NFA vs. DFA
How a regular expression is converted into a minimized DFA
Regular expressions to finite automata
NFA to DFA conversion
Implementation of DFA