
Wachemo University

Institute of Technology
Department of Computer Science
Course Title: Compiler Design (CoSc4103)
Chapter Two: Lexical Analysis and Lex

By: Tseganesh M.(MSc.)

Subscribe on Yadah Academy YouTube channel


Compiler Design (CoSc4103)
Chapter Two
Lexical Analysis and Lex

Outline
2.1. The role of the lexical analyzer
2.2. Token: Specification and Recognition of Tokens
2.3. Lexical Error Recovery
2.4. Finite Automata: NFA to DFA Conversion
2.5. A Typical Lexical Analyzer Generator

By: Tseganesh M. (MSc.)
2.1. The role of the Lexical Analyzer
 Lexical analysis is the first phase of a compiler.
A lexical analyzer is also called a "Scanner".
 The input to a lexical analyzer is the pure high-level code from the preprocessor.

 Main functions of the lexical analyzer
 1st task: read the given source code from left to right, character by character, and produce a
sequence of tokens that are used for syntax analysis.
 i.e., the output of lexical analysis is a stream of tokens, which is the input to the parser
 2nd task: remove any comments and whitespace from the source code, in the form of blank,
tab, and newline characters.
 Another task: generate error messages if an invalid token is found in the source program.

 It identifies valid lexemes from the program and returns tokens to the syntax analyzer,
one after the other, corresponding to the getNextToken command from the syntax
analyzer
[Figure: the Lexical Analyzer reads characters from the source program (and can put a character
back), returns a <token, token value> pair to the Parser on each getNextToken request, and both
components consult the Symbol Table; the parser's output goes on to semantic analysis.]
Lexical Analyzer cont’d……
 The lexical analyzer works closely with the syntax analyzer.
 But, there are several reasons for separating lexical analysis from parsing:
 Simplicity of design

 Improving compiler efficiency

 Enhancing compiler portability (e.g., Linux to Windows)

 When you work on lexical analysis, there are three important terms to know:
 Token, Pattern, and Lexeme


 Lexeme: is a sequence of characters (alphanumeric) in the source program that matches the
pattern of a token.
 Pattern: is the set of rules for a lexeme that the scanner follows to identify it as a valid token.
 A pattern explains what can be a token, and

 These patterns can be defined by means of regular expressions

 Tokens: are a set of strings defining an atomic element with a defined meaning
 It is a pre-defined sequence of characters that cannot be broken down further

 A token can have a token name and an optional token/attribute value



Lexical Analyzer cont’d……
 Some examples of tokens, lexemes, and patterns:

Token        Lexeme   Pattern
Keyword      while    w-h-i-l-e
Relop        <        <, >, >=, <=, !=, ==
Integer      7        [0-9]+ (a sequence of digits with at least one digit)
String       "Hi"     characters enclosed by " "
Punctuation  ,        , ; . ! etc.
Identifier   number   [A-Za-z][A-Za-z0-9]* (a sequence of letters and digits beginning with a letter)

 But, here are some questions raised by the tasks of the lexical analyzer:
 How does the lexical analyzer read the input string and break it into lexemes?

 How can it understand the patterns and check if the lexemes are valid?

 What does the Lexical Analyzer send to the next phase?



2.2. Token: Specification and Recognition of Tokens
 In a programming language, keywords, constants, identifiers, strings, numbers, whitespace,
operators, and punctuation symbols are considered tokens.
 For example, in C or C++ language, the variable declaration line

 int value = 100;

 contains the tokens:

 int (keyword), value (identifier), = (operator), 100 (constant) and ; (symbol).

 Attributes of Token
 In a program, sometimes more than one lexeme matches the pattern of a single token,

 So, the lexical analyzer must provide additional information about the particular lexeme.

 This is because the remaining phases need additional information about the lexeme to perform
different operations.
 Lexical analyzer collects information about tokens into their associated attributes and sends a
sequence of tokens with their information to the next phase.
 i.e., the tokens are sent as a pair of <Token name, Attribute value> to the Syntax
analyzer



Tokens cont’d……
 Example: the tokens and associated attribute values for the following FORTRAN statement
 E = M * C ** 2 are written below as a sequence of pairs:

 <id, pointer to symbol table entry for E>
 <assign-op>
 <id, pointer to symbol table entry for M>
 <mult-op>
 <id, pointer to symbol table entry for C>
 <exp-op>
 <number, integer value 2>

Token   Attribute
ID      index to symbol table entry for E
=
ID      index to symbol table entry for M
*
ID      index to symbol table entry for C
**
NUM     2

 A lexeme is like an instance of a token, and the attribute column is used to show which lexeme
of the token is used.

 For every lexeme, the 1st and 2nd columns of the above table are sent to the Syntax Analyzer.
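
 As a concrete illustration (our own sketch, not from the slides), such <token name, attribute
value> pairs can be represented in C roughly as follows; the type and field names here are
hypothetical:

#include <stdio.h>

/* Hypothetical token representation: a token name plus an optional
   attribute value (a symbol-table index for identifiers, the literal
   value for numbers, unused for operators). */
enum TokenName { ID, ASSIGN_OP, MULT_OP, EXP_OP, NUM };

struct Token {
    enum TokenName name;
    int attribute;
};

int main(void) {
    /* E = M * C ** 2 as a stream of <name, attribute> pairs; the
       symbol-table indices 0, 1, 2 are made up for the example. */
    struct Token stream[] = {
        { ID, 0 }, { ASSIGN_OP, 0 }, { ID, 1 }, { MULT_OP, 0 },
        { ID, 2 }, { EXP_OP, 0 }, { NUM, 2 },
    };
    printf("%zu tokens\n", sizeof stream / sizeof stream[0]);
    return 0;
}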



Tokens cont’d……
 Specifications of Tokens
 To answer the question of how the lexical analyzer can check the validity of lexemes against
tokens, it is critical to know the following specifications of tokens:
1) Alphabet
2) Strings
3) Special symbols
4) Language
5) Regular expression
6) etc……
 Let us understand how the language theory undertakes these terms:
1. Alphabets
 Any finite set of symbols

 {0,1} is a set of binary alphabets,

 {0,1,2,3,4,5,6,7,8,9,A,B,C,D,E,F} is a set of Hexadecimal alphabets,

 {a-z, A-Z} is a set of English language alphabets.

2. Strings
 Any finite sequence of alphabets (characters) is called a string.

 A string over some alphabet is a finite sequence of symbols drawn from that alphabet.
Tokens cont’d……
 In language theory, the terms sentence and word are often used as synonyms for the term
"string."
 The length of a string S is the total number of occurrences of symbols in it, denoted by |S|

 e.g., the length of the string compiler is 8, denoted |compiler| = 8

 A string having no symbols, i.e. a string of zero length, is known as the empty string and is
denoted by ε (epsilon).
3. Special symbols
 A typical high-level language contains the following special symbols:

Arithmetic symbols    Addition (+), Subtraction (-), Modulo (%), Multiplication (*), Division (/)
Punctuation           Comma (,), Semicolon (;), Dot (.), Arrow (->)
Assignment            =
Special assignment    +=, /=, *=, -=
Comparison            ==, !=, <, <=, >, >=
Preprocessor          #
Location specifier    &
Logical               &, &&, |, ||, !
Shift operators       >>, >>>, <<, <<<



Tokens cont’d……
4. Language
 A language is considered a finite set of strings over some finite, fixed alphabet.

 Computer languages are considered finite sets, and mathematical set operations can be
performed on them.
 Finite languages can be described by means of regular expressions.

5. Regular Expressions
 Regular expressions are an important notation to specify lexeme patterns for a token.

 Each pattern matches a set of strings, so regular expressions serve as names for a set of
strings.
 Regular expressions are used to represent the language for the lexical analyzer
 The lexical analyzer needs to scan and identify only the finite set of valid strings/tokens/lexemes
that belong to the language in hand.
 It searches for the pattern defined by the language rules.

 A grammar defined by regular expressions is known as regular grammar


 The language defined by regular grammar is known as regular language.
Tokens cont’d……
Programming language tokens can be described by regular languages.
 There are a number of algebraic laws that are obeyed by regular expressions; they are stated in
terms of operations on languages
 Operations on languages

 There are several important operations that can be applied to languages.

 Union of two languages L and M is written as:

 L U M = {s | s is in L or s is in M}
 Concatenation of two languages L and M is written as:

 LM = {st | s is in L and t is in M}
 Kleene closure of a language L is written as:

 L* = zero or more occurrences of language L


 Example: the following shows these operations applied to two small languages:

 Let L = {0,1} and S = {a,b,c}

 Union : L U S={0,1,a,b,c}

 Concatenation : L.S={0a,1a,0b,1b,0c,1c}

 Kleene closure : L*={ ε,0,1,00….}

 Positive closure : L+={0,1,00….}
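
 As a rough illustration (our own sketch, not from the slides), the union and concatenation of
small finite languages can be spelled out in C like this:

#include <stdio.h>

static const char *L[] = { "0", "1" };
static const char *S[] = { "a", "b", "c" };

int main(void) {
    /* Union: every string that is in L or in S (here the two sets are
       disjoint, so printing both is enough). */
    printf("L U S = { ");
    for (int i = 0; i < 2; i++) printf("%s ", L[i]);
    for (int i = 0; i < 3; i++) printf("%s ", S[i]);
    printf("}\n");

    /* Concatenation: every string st with s in L and t in S. */
    printf("L.S = { ");
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 3; j++)
            printf("%s%s ", L[i], S[j]);
    printf("}\n");
    return 0;
}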


Tokens cont’d……
 In lexical analysis, by using regular expressions it is possible to represent:
i. valid tokens of a language,
ii. occurrences of symbols, and
iii. language tokens;

i. Representing valid tokens of a language in regular expression


 If x is a regular expression, then:

 x* means zero or more occurrences of x.

 i.e., it can generate { ε, x, xx, xxx, xxxx, … }

 x+ means one or more occurrences of x.

 i.e., it can generate { x, xx, xxx, xxxx, … }; equivalently, x.x*

 x? means at most one occurrence of x.

 i.e., it can generate either {x} or {ε}.

 [a-z] is the set of all lower-case letters of the English alphabet.

 [A-Z] is the set of all upper-case letters of the English alphabet.

 [0-9] is the set of all decimal digits.



Tokens cont’d……
ii. Representation of occurrence of symbols using regular expressions
 letter = [a-z] or [A-Z]

 digit = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 or [0-9]

 sign = [ + | - ]

iii. Representation of language tokens using regular expressions

 Decimal = (sign)? (digit)+

 Identifier = (letter)(letter | digit)*
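
 To make this concrete, here is a small sketch of our own (not from the slides) that checks
lexemes against the identifier pattern using the POSIX regex library; the pattern string mirrors
(letter)(letter | digit)*:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t re;
    /* ^ and $ anchor the match so the whole lexeme must fit the pattern. */
    if (regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB) != 0)
        return 1;

    const char *lexemes[] = { "number", "x1", "1xab" };
    for (int i = 0; i < 3; i++)
        printf("%-6s -> %s\n", lexemes[i],
               regexec(&re, lexemes[i], 0, NULL, 0) == 0
                   ? "valid identifier" : "not an identifier");

    regfree(&re);
    return 0;
}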

 However, the only problem left for the lexical analyzer is how to verify the validity of a
regular expression used in specifying the patterns of keywords of a language.
 A well-accepted solution to this problem is to use finite automata for verification.

 To recognize and verify the tokens, the lexical analyzer builds Finite Automata for every pattern.

 Transition diagrams can be built and converted into programs as an intermediate step.

 Each state in the transition diagram represents a piece of code.

 Every identified lexeme walks through the Automata.

 The programs built from automata can consist of switch statements to keep track of the state of the
lexeme; the lexeme is verified to be a valid token if it reaches the final state (see the sketch below).
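
 A minimal sketch of this idea (our own, not from the slides): the transition diagram for the
identifier pattern letter(letter | digit)* becomes a switch over the current state:

#include <ctype.h>
#include <stdio.h>

/* Each case of the switch is one state of the transition diagram;
   returning 0 models "no transition possible: reject". */
int is_identifier(const char *lexeme) {
    int state = 0;                          /* start state */
    for (const char *p = lexeme; *p != '\0'; p++) {
        switch (state) {
        case 0:                             /* expecting the first character */
            if (isalpha((unsigned char)*p)) state = 1;
            else return 0;
            break;
        case 1:                             /* inside the identifier */
            if (isalnum((unsigned char)*p)) state = 1;
            else return 0;
            break;
        }
    }
    return state == 1;                      /* did we end in the final state? */
}

int main(void) {
    printf("%d %d\n", is_identifier("number"), is_identifier("1xab")); /* 1 0 */
    return 0;
}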
2.3. Lexical Error Recovery
 Lexical errors:
 are a type of error that can be detected during the lexical analysis phase

 a lexical error is a sequence of characters that does not match the pattern of any token, and so
cannot be scanned into any valid token
 lexical errors are thrown by the lexer when it is unable to continue, i.e., when there is no way to
recognize a lexeme as a valid token.
 Lexical errors are not very common, but they should be managed by the scanner
 Some common lexical errors in the lexical phase are:
 Spelling errors in identifiers, operators, keywords, etc.

 Appearance of some illegal character

 Exceeding the length limit of identifiers or numeric constants

 Removal of a character that should be present

 Replacement of a character with an incorrect character

 Transposition of two characters

Lexical Error cont’d……
 Example: consider this C code:

void main() {
    int x = 10, y = 20;
    char *a;
    a = &x;
    x = 1xab;
}

 In this code, 1xab is neither a number nor an identifier, so this code will produce a lexical error.
 Lexical error recovery: there are some recovery mechanisms to remove lexical errors
 Some possible error-recovery actions, using "cout" as the example (see the sketch after this list), are:
i. deleting an unnecessary character, e.g. coutt → cout
ii. inserting a missing character, e.g. cot → cout
iii. replacing an incorrect character with the correct character, e.g. couf → cout
iv. transposing two adjacent characters, e.g. ocut → cout
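
 As a hedged illustration of these four repairs (our own sketch, not from the slides; the helper
name one_edit_away is hypothetical), a scanner could test whether an invalid lexeme is a single
edit away from a known keyword:

#include <stdio.h>
#include <string.h>

/* Is `lexeme` exactly one delete, insert, replace, or adjacent
   transposition away from `keyword`? */
static int one_edit_away(const char *lexeme, const char *keyword) {
    size_t n = strlen(lexeme), m = strlen(keyword);
    if (n == m) {
        size_t diff = 0, first = 0;
        for (size_t i = 0; i < n; i++)
            if (lexeme[i] != keyword[i]) { if (diff++ == 0) first = i; }
        if (diff == 1) return 1;            /* one replacement */
        return diff == 2 && first + 1 < n   /* one adjacent transposition */
            && lexeme[first] == keyword[first + 1]
            && lexeme[first + 1] == keyword[first];
    }
    if (n == m + 1 || m == n + 1) {         /* one deletion or insertion */
        const char *longer  = n > m ? lexeme : keyword;
        const char *shorter = n > m ? keyword : lexeme;
        size_t i = 0, j = 0, skipped = 0;
        while (shorter[j] != '\0') {
            if (longer[i] != shorter[j]) { if (skipped++) return 0; i++; }
            else { i++; j++; }
        }
        return 1;
    }
    return 0;
}

int main(void) {
    const char *tries[] = { "coutt", "cot", "couf", "ocut", "xyz" };
    for (int i = 0; i < 5; i++)
        printf("%-5s -> %s\n", tries[i],
               one_edit_away(tries[i], "cout") ? "repairable to cout"
                                               : "no single-edit repair");
    return 0;
}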
 However, a few errors are beyond the power of the lexical analyzer to recognize, because a lexical
analyzer has a very localized view of the source program. So, some other phase of the compiler
handles such errors.
 For instance, suppose the string fi is encountered in a C/C++ program for the first time in the
context of:

fi (a == b) …

 Here, a lexical analyzer cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
2.4. Finite Automata: NFA to DFA Conversion
 A finite automaton is a state machine that takes a string of symbols as input and changes its state
accordingly.
 A finite automaton is a recognizer for regular expressions.
 When a regular expression string is fed into a finite automaton, it changes its state for each literal.
 If the input string is successfully processed and the automaton reaches its final state, the string is
accepted,
 i.e., the string that was fed in is a valid token of the language in hand

 Regular expressions = the specification

 Finite automata = the implementation

 A finite automaton consists of

 An input alphabet Σ

 A set of states S

 A start state n

 A set of accepting states F ⊆ S

 A set of transitions of the form state →(input) state



Automata: NFA to DFA cont’d……
 Transition: s1 →(a) s2
 This can be read as: in state s1, on input "a", go to state s2
 If at the end of input:
 if in an accepting state => accept, otherwise => reject

 If no transition is possible => reject

 Finite automata state graphs can be built up using:

 A state

 The start state

 An accepting state

 A transition (an edge labeled with an input symbol such as a)

 Simple example: a finite automaton that accepts only "1"

 [Figure: a start state with a single transition on 1 to an accepting state.]

Automata: NFA to DFA cont’d……
 A finite automaton accepts a string if we can follow transitions labeled with the characters in the
string from the start to some accepting state
 Another example: a finite automaton accepting any number of 1's followed by a single 0
 Alphabet: {0,1}
 [Figure: the start state loops to itself on 1 and moves to the accepting state on 0.]
 Check that "1110" is accepted by this finite automaton


 Exercise: given the alphabet {0,1}, what language is recognized by this automaton?
 [Figure: an automaton with several states and transitions labeled 0 and 1.]
 Epsilon Moves
 Another kind of transition: ε-moves

 [Figure: an edge labeled ε from state A to state B.]
 Here the machine can move from state A to state B without reading any input.
Automata: NFA to DFA cont’d……
 Types of Finite Automata
i. Nondeterministic Finite Automata (NFA)
ii. Deterministic Finite Automata (DFA)
i. Nondeterministic Finite Automata (NFA)
 Can have multiple transitions for one input in a given state

 Can have ε-moves

 An NFA accepts if it can get into a final state

ii. Deterministic Finite Automata (DFA):

 A DFA is a special case of an NFA in which:

 It has at most one transition per input from any state

 It has no ε-moves, i.e., no transitions on input ε

 A DFA is formally defined by the 5-tuple notation M = (Q, ∑, δ, qo, F), where

 Q is a finite "set of states", which is non-empty.
 ∑ is the "input alphabet", indicating the input set.
 qo is the "initial state", and qo is in Q.
 F is the set of "final states", with F a subset of Q.
 δ is the "transition function" or mapping function; using this function the next state
can be determined.
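
 To make the 5-tuple concrete, here is a small table-driven sketch of our own (not from the
slides) that stores M = (Q, ∑, δ, qo, F) as plain C data and runs the earlier example automaton
(any number of 1's followed by a single 0):

#include <stdio.h>

/* Q = {0, 1, 2}, where 2 is a dead/reject state; Sigma = {'0','1'};
   q0 = 0; F = {1}; delta is a table indexed by [state][symbol]. */
enum { DEAD = 2, NSTATES = 3 };

static const int delta[NSTATES][2] = {   /* columns: input '0', input '1' */
    { 1,    0    },                      /* q0: loop on 1, accept on 0   */
    { DEAD, DEAD },                      /* q1: no further input allowed */
    { DEAD, DEAD },                      /* dead state                   */
};
static const int is_final[NSTATES] = { 0, 1, 0 };   /* F = {1} */

int run_dfa(const char *input) {
    int state = 0;                       /* start in q0 */
    for (; *input != '\0'; input++) {
        if (*input != '0' && *input != '1') return 0;   /* not in Sigma */
        state = delta[state][*input - '0'];
    }
    return is_final[state];
}

int main(void) {
    printf("1110 -> %s\n", run_dfa("1110") ? "accept" : "reject");
    printf("1101 -> %s\n", run_dfa("1101") ? "accept" : "reject");
    return 0;
}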
Automata: NFA to DFA cont’d……
Reading assignment
 Execution of finite automata
 Details of NFA vs. DFA
 How a regular expression is converted into a minimized DFA
 Regular expressions to finite automata
 NFA to DFA conversion
 Implementation of DFA

 You can refer to further material for detailed elaboration.


2.5. Lexical Analyzer Generator
 Creating a lexical analyzer with Lex:
 First, a lexical analyzer is prepared by creating a program lex.l in the Lex language.
 Then, lex.l is run through the Lex compiler to produce a C program lex.yy.c.
 Finally, lex.yy.c is run through the C compiler to produce an object program a.out.
 a.out is the lexical analyzer that transforms an input stream into a sequence of tokens.



Lexical Analyzer cont’d……
■ Lex Specification: a Lex program consists of three parts:

{ definitions }
%%
{ rules }
%%
{ user subroutines }

■ For example, the first two parts of a Lex program that counts vowels and consonants look like:

%{
int vowels = 0, cons = 0;
%}
%%
[aeiouAEIOU] {vowels++;}
[a-zA-Z] {cons++;}
%%

where,
■ Definitions include declarations of variables, constants, and regular definitions
■ Rules are statements of the form p1{action1}p2{action2}… pn{actionn}
■ where pi is regular expression and
■ action describes what action the lexical analyzer should take when pattern pi matches a
lexeme.
■ Actions are written in C code.
■ User subroutines are auxiliary procedures needed by the actions.
■ These can be compiled separately and loaded with the lexical analyzer.



Lexical Analyzer cont’d……
■ Consider the following Lex program that counts vowels and consonants:
%{
int vowels = 0;
int cons = 0;
%}
%%
[aeiouAEIOU] {vowels++;}
[a-zA-Z] {cons++;}
%%
int yywrap() {
    return 1;
}
int main() {
    printf("Enter any string to count vowels and consonants; at the end press ^d\n");
    yylex();
    printf("no. of vowels: %d\n", vowels);
    printf("no. of consonants: %d\n", cons);
    return 0;
}

 Steps to execute this Lex program:
 First write the source code in a Lex editor ("EditPlusPortable" or any other editor), then:
 Tools -> 'Lex File Compiler'
 Tools -> 'Lex Build'
 Tools -> 'Open CMD'
 Then in the command prompt type 'name_of_file.exe', e.g. 'lex2.exe', and press Enter
 Then enter your whole input and press Enter
 Finally press Ctrl + Z and press Enter; then you see the output
Lexical Analyzer cont’d……
■ The output of the above program: [Figure: a sample run showing the counts of vowels and consonants.]



Next class
Chapter 3: Syntax Analysis

Outline
3.1. Role of a parser
3.2. Parsing
3.3. Types of parsing
3.4. Parser Generator: Yacc

Subscribe Yadah Academy on YouTube


Click https://youtube.com/@yadahacademy-
educationalco8575
