Compiler Construction

(CSC-320)
Lecture # 05
Course Instructor: M. Ramzan Shahid Khan

Department of Computer Science,


Namal University Mianwali
Fall Semester, 2024
Topics
• Lexical Analysis Phase (Also Known As Scanner)

• Flex (Notations),

• Why NFA To DFA?

2
Lexical Analysis
• Input: Pre-Processed Code (without pre-processor directives), i.e., the output of the Pre-processor
• Pure High-Level Language

• Output: Valid Tokens

3
Lexical Analysis
• Also called Scanner

• It reads a stream of characters from the source code, left to right, and produces a stream of valid tokens.

• If it encounters an invalid token in the source code, it generates an error message indicating the line that contains the invalid token(s).

4
Valid or Invalid Tokens
• Whenever a language is designed, a set of rules is written down for each type of token present in the language. The tokens may be:
• Constants
• Identifiers
• Punctuations
• Operators
• Keywords
• These rules are known as Patterns

• A DFA is used to differentiate between Valid and Invalid Tokens
Valid or Invalid Tokens
Constants, Identifiers, Punctuations, Operators, Keywords → Patterns → Regular Expressions → DFA

6
Valid or Invalid Tokens
• The Patterns are used to develop Regular Expressions

• The DFAs are built on the basis of Regular Expressions.

7
Valid or Invalid Tokens
• A DFA is a type of machine which differentiates between Valid and Invalid Tokens.

• Tokens which are accepted by the DFA are Valid Tokens

• Tokens which are rejected by the DFA are Invalid Tokens

8
Valid or Invalid Tokens – Example
Identifiers
• Set of Rules (Pattern)
• Can start with _ (underscore)
• Can start with a letter
• Can’t start with a digit or any other special symbol

Valid or Invalid Tokens – Example
• R.E (Regular Expression)
id → letter (letter | digit)*
letter → a | b | c | … | z | A | B | C | … | Z | _
digit → 0 | 1 | 2 | … | 9
• DFA (Deterministic Finite Automaton)
[Figure: the start state (−) moves to the accepting state (+) on letter; the accepting state loops on letter/digit]
• E.g., the inputs x and _ab lead to the Accepting State; the input 2x is rejected by the DFA
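
The identifier DFA above can be simulated in a few lines. This is a minimal sketch in Python, not part of the lecture; the state names START and ACCEPT and the function name is_identifier are assumptions chosen for illustration.

```python
import string

LETTERS = set(string.ascii_letters + "_")   # letter -> a..z | A..Z | _
DIGITS = set(string.digits)                 # digit  -> 0..9

def is_identifier(text):
    """Simulate the two-state DFA: START --letter--> ACCEPT, ACCEPT loops on letter/digit."""
    state = "START"
    for ch in text:
        if state == "START" and ch in LETTERS:
            state = "ACCEPT"
        elif state == "ACCEPT" and (ch in LETTERS or ch in DIGITS):
            state = "ACCEPT"
        else:
            return False          # no transition is defined, so the DFA rejects
    return state == "ACCEPT"      # the empty string never leaves START
```

Calling is_identifier on x, _ab and 2x reproduces the accept/reject decisions listed on the slide.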

12
Lexical Analysis
• Ignore or skip whitespace characters
• blanks,
• spaces,
• tab,
• new line

• Also ignore comments

13
Lexical Analysis

[Figure: input (sequence of characters) → Scanner → output (stream of tokens); the Scanner also produces an Error Message]

14
Lexical Analysis
• An error generated by the Lexical Analysis phase of a compiler is called a Lexical Error.

15
Lexical Analysis – Example 1
input:
while (i<5)
{
i = i + 1;
}

Scanner (reading from left to right)

output:
while ( i < 5 ) { i = i + 1 ; }

16
Lexical Analysis – Example 1
Tokens
while keyword
( operator
i identifier
+ add operator
1 constant
Input: Stream of Characters
Output: Valid Tokens

17
Lexical Analysis – Example 2
input:
while (2i<5)
{
i = 2i + 1;
}

Scanner (reading from left to right)

output:
Error Message, as 2i is an Invalid Token
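
The behaviour in Examples 1 and 2 can be sketched with a small regex-driven scanner. This is an illustrative Python sketch, not the lecture's implementation: the token class names, the keyword list and the function name scan are all assumptions.

```python
import re

# Hypothetical token classes; order matters because the first matching alternative wins.
TOKEN_SPEC = [
    ("INVALID",     r"\d+[A-Za-z_]\w*"),        # digits running into letters, e.g. 2i
    ("CONSTANT",    r"\d+"),
    ("KEYWORD",     r"\b(?:while|if|else)\b"),  # assumed keyword list
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("OPERATOR",    r"[<>=+\-*/]"),
    ("PUNCTUATION", r"[(){};]"),
    ("SKIP",        r"\s+"),                    # whitespace is ignored
    ("UNKNOWN",     r"."),                      # any other character
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def scan(source):
    """Return (kind, lexeme) pairs, or raise ValueError naming the offending line."""
    tokens = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in MASTER.finditer(line):
            kind, lexeme = match.lastgroup, match.group()
            if kind == "SKIP":
                continue
            if kind in ("INVALID", "UNKNOWN"):
                raise ValueError(f"lexical error on line {line_no}: invalid token {lexeme!r}")
            tokens.append((kind, lexeme))
    return tokens
```

Scanning the input of Example 1 yields the token stream shown there, while the input of Example 2 raises an error that names 2i. A statement with a missing semicolon still scans without error, because that check belongs to Syntax Analysis.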

18
Lexical Analysis – Example 3
input:
while (i<5)
{
i=i+1
}

Scanner (reading from left to right)

output:
while ( i < 5 ) { i = i + 1 }

• No Error is reported, as the Missing Semi-Colon is detected in Syntax Analysis
• The Lexical Analyzer only identifies valid and invalid tokens
FLEX
• Also called Fast Lexical Analyzer Generator

• A tool which helps us in constructing a Scanner: it generates the Scanner for us

• It takes R.E as input, converts the R.E to an NFA, and then converts the NFA to a DFA. The generated Scanner uses this DFA to differentiate between Valid and Invalid Tokens

21
FLEX
• It is a tool used for generating scanners.

• We don’t have to write things from scratch

• You only need to do the following 2 things:
• Identify the vocabulary of the language
• Write the Regular Expressions
• It will generate the Scanner for you

22
FLEX
• Let’s discuss the Notations which can be used to write Regular Expressions

23
Flex Regular Expression Symbols
""      Anything in quotes is matched exactly.
[]      Matches any single character listed in the brackets. For example, [abc] matches one a, one b, or one c.
[^]     If there is a ^ character right after the opening bracket, it matches any single character except those in the brackets. For example, [^abc] matches any character except a, b, or c.
.       Matches any character except the newline.
\n      Matches a newline.
^       Matches the beginning of a line.
$       Matches the end of a line.
*       Matches zero or more copies of the preceding expression.
+       Matches one or more copies of the preceding expression.
?       Matches zero or one copy of the preceding expression.
|       Matches either the preceding expression or the following one.
()      Parentheses are used for grouping. For example, a(bc|de) matches abc or ade.
\*      A literal * character.
\"      A literal " character.
\^      A literal ^ character.

27
Regular Expression - Examples
• [a-zA-Z] matches any single letter.

• [a-zA-Z]+ matches any word.

• "hello" matches only the word hello.

• ^.*$ matches one entire line.

28
Regular Expression - Examples
• [a-z]
• we can specify a range of lowercase letters from a to z

• This will match exactly one lowercase character.

29
Regular Expression - Examples
• [A-Za-z0-9]
• The above expression matches exactly one character, which may be
• an uppercase letter,
• a lowercase letter, or
• a digit from 0 to 9.

• The brackets ([]) in the above expressions have a special meaning, i.e., they are used to specify a set or range of characters.

• If you want to include a bracket as part of an expression, then you will need to escape it.

30
Regular Expression - Examples
• [\[0-9]
• The above expression matches
• an opening bracket OR
• a digit in the range 0 to 9.

• But note that as we are programming in C++, the backslash itself must be escaped inside a string literal, so the regex is written as follows:
• [\\[0-9]

31
Regular Expression - Examples
• [a-z]+
• matches strings like a, aaa, abcd, softwaretesting, etc.
• Note that it will never match an empty string.

• [a-z]*
• will match an empty string or
• any of the above strings.
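
The difference between + and * can be checked directly. A small sketch using Python's re module (an assumption; the lecture itself tests its patterns on regexr.com), with fullmatch so that the whole string must match:

```python
import re

one_or_more = re.compile(r"[a-z]+")    # at least one lowercase letter
zero_or_more = re.compile(r"[a-z]*")   # also accepts the empty string

for text in ["a", "aaa", "abcd", "softwaretesting", ""]:
    # Print whether each pattern matches the whole string
    print(repr(text), bool(one_or_more.fullmatch(text)), bool(zero_or_more.fullmatch(text)))
```

The two patterns agree on every non-empty lowercase string and differ only on the empty string.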

32
Regular Expression - Examples
• (Xyz)+
• If you want to specify a group of characters to match one or more times, then
you can use the parentheses as above.

• The above expression will match Xyz, XyzXyz, and XyzXyzXyz, etc.

• Examples Implemented using:
• https://regexr.com/

33
Regular Expression – Example

• ^[a-zA-Z_][a-zA-Z_0-9]*\.[a-zA-Z0-9]+$

34
Regular Expression – Example 1
• C++ regex Example

• Consider a regular expression that matches an MS-DOS filename as shown below.

35
Regular Expression – Example 1
• char regex_filename[] = "[a-zA-Z_][a-zA-Z_0-9]*\\.[a-zA-Z0-9]+";

• The above regex can be interpreted as follows:

1. Match a letter (lowercase or uppercase) or an underscore.
2. Then match zero or more characters, each of which may be a letter, an underscore, or a digit.
3. Then match a literal dot (.).
4. After the dot, match one or more characters, each of which may be a letter or a digit, indicating the file extension.
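
The four steps above can be tried out on a few sample names. This is a hedged sketch in Python rather than C++; the function name is_filename and the sample names are assumptions, and in a Python raw string the dot is escaped with a single backslash.

```python
import re

# Same pattern as regex_filename, written as a Python raw string
FILENAME = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*\.[a-zA-Z0-9]+")

def is_filename(name):
    """True when the whole name has the form <letter or _><letters, _, digits>.<extension>."""
    return FILENAME.fullmatch(name) is not None
```

For instance, report.txt and _tmp1.c are accepted, while 1file.txt (starts with a digit) and noextension (no dot) are rejected.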

36
Regular Expression – Example 2
• Define a C++ language int literal using regular expression.

• ^(0|[1-9][0-9]*)$
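
A quick check of this pattern, sketched in Python (the function name is_int_literal is an assumption). Note that the pattern covers only unsigned decimal literals; octal, hexadecimal and suffixed forms are outside its scope.

```python
import re

INT_LITERAL = re.compile(r"^(0|[1-9][0-9]*)$")

def is_int_literal(text):
    """True for 0 or for a digit string that does not start with 0."""
    return INT_LITERAL.fullmatch(text) is not None
```

The alternation keeps a lone 0 valid while rejecting decimal numbers with leading zeros such as 007.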

37
Regular Expression – Example 3
• Define a C++ language float literal using regular expression:

• A float literal in C++ has an optional exponent part.

• If a float literal is written without an exponent part, then it must have a decimal point, which can appear at the start, at the end, or in the middle of the digits, as in the following examples:

• 123.456
• .456
• 456.

39
Regular Expression – Example 3
• [+-]?([0-9]*[.])?[0-9]+

• This will match:

• 123
• 123.456
• .456

40
Regular Expression – Example 3
• If you also want to match 123. (a period with no decimal part), then you'll need a slightly longer expression:

• [+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)
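
The two expressions can be compared side by side. A minimal Python sketch; the names SIMPLE, EXTENDED and compare are assumptions introduced for illustration.

```python
import re

SIMPLE = re.compile(r"[+-]?([0-9]*[.])?[0-9]+")               # needs digits after the point
EXTENDED = re.compile(r"[+-]?([0-9]+([.][0-9]*)?|[.][0-9]+)")  # also allows a trailing point

def compare(text):
    """Return (matches SIMPLE, matches EXTENDED) for the whole string."""
    return (SIMPLE.fullmatch(text) is not None,
            EXTENDED.fullmatch(text) is not None)
```

Both patterns accept 123, 123.456 and .456; only the longer one accepts 123., and neither accepts a lone period.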

41
Regular Expression – Example 3
• If a float literal is written with an exponent, then the decimal point in the mantissa part is optional, and the exponent is a whole number with an optional sign, as in the following examples:

• 123e78
• 123e+78
• 123e-78
• 123.456e78
• .456e78
• 456.e78

42
Regular Expression – Example 3
• [+-]?(\d+([.]\d*)?([eE][+-]?\d+)?|[.]\d+([eE][+-]?\d+)?)
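
The expression can be checked against the exponent examples listed earlier. A short Python sketch; the function name is_float_literal is an assumption.

```python
import re

FLOAT_LITERAL = re.compile(r"[+-]?(\d+([.]\d*)?([eE][+-]?\d+)?|[.]\d+([eE][+-]?\d+)?)")

def is_float_literal(text):
    """True when the whole string matches the float pattern with optional exponent."""
    return FLOAT_LITERAL.fullmatch(text) is not None
```

An exponent marker with no digits after it, as in 123e, is rejected because the exponent group requires at least one digit.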

43
Regular Expression – Example 4
• Define a C++ language string literal using regular expression.

• A string literal in C++ can contain escape sequences.

44
Regular Expression – Example 4
• If a string literal is written like this:

cout << "This\nis\na\ntest\n\nShe said, \"Sells she seashells on the seashore?\"\n";

• It will give you the results as:

This
is
a
test

She said, "Sells she seashells on the seashore?"


45
Regular Expression – Example 4
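• ^([^"\\]|\\.)*$
• This matches the text between the enclosing double quotes: each item is either a character other than " and \, or a backslash followed by one more character (an escape sequence).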
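
A sketch of how this pattern behaves, written in Python. BODY is the slide's expression applied to the text between the quotes; STRING_LITERAL, which adds the surrounding quotes, is an assumed variant and not something the slide states.

```python
import re

BODY = re.compile(r'^([^"\\]|\\.)*$')            # contents of the literal, without the quotes
STRING_LITERAL = re.compile(r'"([^"\\]|\\.)*"')  # assumed variant including the quotes

def is_string_body(text):
    return BODY.fullmatch(text) is not None

def is_string_literal(text):
    return STRING_LITERAL.fullmatch(text) is not None
```

An unescaped double quote or a trailing lone backslash inside the body makes the match fail, while \" and \n are consumed as two-character escape sequences.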

46
Extended Regular Exp Notations
• To describe Regular Languages, we write down the Regular
Expressions

• Strings generated from the Regular Expressions are the Valid Strings
of that Language

47
Extended Regular Exp Notations
1. a?
• A question mark after any pattern or character means
• Zero or one
• It makes that pattern or character optional: it may occur zero times or once
2. [A-Z]
• Shows the Range
• Ranges from A to Z (A or B or C up to Z)
3. R+
• One or More
4. (X|Y|Z)
• Either ‘X’ or ‘Y’ or ‘Z’ (inside square brackets, | would be matched as a literal character, so the bracket form is [XYZ])
48
Extended Regular Exp Notations
5. [^ab]
• Caret Sign (Circumflex) ^
• Everything except ‘a’ and ‘b’
• The characters after the Caret Sign are excluded
• A character class such as [^ab] matches a single character that is not in the set of characters (the ^ is the negating part).

• To match a string which does not contain the multi-character sequence ab, you want to use a negative lookahead:
• ^(?:(?!ab).)+$
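
The difference between the negated class and the negative lookahead can be seen on a few strings. A small Python sketch; the names NOT_A_OR_B and NO_AB_SEQUENCE are assumptions.

```python
import re

NOT_A_OR_B = re.compile(r"[^ab]")               # exactly one character that is neither a nor b
NO_AB_SEQUENCE = re.compile(r"^(?:(?!ab).)+$")  # non-empty string that never contains "ab"

def single_char_not_a_or_b(text):
    return NOT_A_OR_B.fullmatch(text) is not None

def has_no_ab(text):
    return NO_AB_SEQUENCE.fullmatch(text) is not None
```

The string ba contains both letters but not the sequence ab, so the lookahead version accepts it, whereas the character class only ever matches one character at a time.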

49
Extended Regular Exp Notations
6. R*
• Zero or More

50
Extended Regular Exp Notations
No.  Notation   Meaning
1    a?         Zero or one a
2    [A-Z]      Ranges from A to Z (A or B or C up to Z)
3    R+ = RR*   One or More
4    (X|Y|Z)    Either ‘X’ or ‘Y’ or ‘Z’
5    [^ab]      Everything except ‘a’ and ‘b’
6    R*         Zero or More

51
Extended Regular Exp Notations – Example 1
• Any variable name must start with a letter, followed by any number of letters or digits.

• Regular Expression
letter(letter|digit)*

• letter and digit are the non-terminals

• Further definitions are required to state what letter and digit can be replaced by

52
Extended Regular Exp Notations – Example 1
• Regular Expression

letter (letter | digit)*
letter → [A-Za-z]
digit → [0-9]

53
Extended Regular Exp Notations – Example 2
• Each string has exactly two a’s (no restriction on b’s)

• L = {aa, bbaa, aba, …}

• R.E = b*ab*ab*
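
One way to gain confidence in this expression is to compare it with a direct count of a's over all short strings. A Python sketch; the function names and the length bound of 6 are assumptions.

```python
import re
from itertools import product

EXACTLY_TWO_AS = re.compile(r"b*ab*ab*")

def has_exactly_two_as(text):
    return EXACTLY_TWO_AS.fullmatch(text) is not None

def agrees_with_counting(max_len=6):
    """Check the regex against s.count('a') == 2 for every string over {a, b} up to max_len."""
    for n in range(max_len + 1):
        for chars in product("ab", repeat=n):
            s = "".join(chars)
            if has_exactly_two_as(s) != (s.count("a") == 2):
                return False
    return True
```

An exhaustive check like this is not a proof, but it catches the common mistake of writing a pattern that allows at least two a's instead of exactly two.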

54
Next…
• How Regular Expressions can be converted to NFA?

• Conversion of NFA to DFA

55
Regular Exp to NFA
1. a
• The NFA constructed contains two states, i.e., the initial state and the final state
• This NFA accepts only ‘a’, nothing else

[Figure: q0 --a--> qf]
56
Regular Exp to NFA
2. ab
• The NFA constructed contains three states

[Figure: q0 --a--> q1 --b--> qf]

57
Regular Exp to NFA
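3. a|b = a+b = a∪b
• Alternation: either ‘a’ or ‘b’ is accepted at a time
• The NFA is constructed with the help of Null Transitions (ϵ or λ)

[Figure: from q0, ϵ-transitions lead to two branches, one through q1 reading a and one through q2 reading b; both branches reach qf through ϵ-transitions]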
Regular Exp to NFA
4. a*
• Any number of a’s can be generated from this NFA
• a* = {ϵ, a, aa, aaa, …}

[Figure: NFA for a* with states q0, q1, q2 and qf; q1 --a--> q2, and all remaining transitions are ϵ-transitions]
59
Regular Exp to NFA
5. (a+b)*
• Can be divided into two parts
• First construct the NFA for (a+b), as constructed in the 3rd example
• Then apply * to (a+b), in the same way as for a*

60
Regular Exp to NFA
[Figure: NFA for (a+b)* with states q0, q1, q2, q3, q4, q5 and qf; q2 --a--> q4, q3 --b--> q5, and all remaining transitions are ϵ-transitions]
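
An NFA with ϵ-transitions can be run by tracking the set of states it might be in. The Python sketch below encodes one possible NFA for (a+b)*; the state numbers 0 to 7 are hypothetical and do not reproduce the slide's q0 … qf labelling.

```python
EPS = None  # marker used for ϵ-transitions

# Hypothetical state numbering: 0 is the start state, 7 is the final state.
NFA = {
    0: {EPS: {1, 7}},   # enter the loop, or skip it (zero repetitions)
    1: {EPS: {2, 4}},   # choose the a-branch or the b-branch
    2: {"a": {3}},
    3: {EPS: {6}},
    4: {"b": {5}},
    5: {EPS: {6}},
    6: {EPS: {1, 7}},   # repeat, or leave the loop
    7: {},
}
START, ACCEPT = 0, 7

def epsilon_closure(states):
    """All states reachable from the given states using ϵ-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        state = stack.pop()
        for nxt in NFA[state].get(EPS, ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return closure

def accepts(text):
    current = epsilon_closure({START})
    for ch in text:
        moved = set()
        for state in current:
            moved |= NFA[state].get(ch, set())
        current = epsilon_closure(moved)
    return ACCEPT in current
```

Because several states are active at once, each input character costs work proportional to the number of active states; this is the overhead that a DFA avoids.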
61
Why NFA to DFA? Next…
• NFA to DFA Conversion, Because

• The Scanner generator takes the Regular Expressions

• It converts each Regular Expression to an NFA

• Then it converts the NFA to a DFA

• After that, the Scanner is able to differentiate between Valid and Invalid Tokens
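
The NFA to DFA step is commonly done with the subset construction, where each DFA state stands for a set of NFA states. The Python sketch below applies it to a hypothetical NFA for (a+b)*; the state numbers and function names are assumptions, and the lecture does not specify which conversion algorithm it will use.

```python
from collections import deque

EPS = None  # marker used for ϵ-transitions
NFA = {     # hypothetical NFA for (a+b)*: start state 0, final state 7
    0: {EPS: {1, 7}}, 1: {EPS: {2, 4}}, 2: {"a": {3}}, 3: {EPS: {6}},
    4: {"b": {5}},    5: {EPS: {6}},    6: {EPS: {1, 7}}, 7: {},
}
START, ACCEPT, ALPHABET = 0, 7, "ab"

def epsilon_closure(states):
    stack, closure = list(states), set(states)
    while stack:
        state = stack.pop()
        for nxt in NFA[state].get(EPS, ()):
            if nxt not in closure:
                closure.add(nxt)
                stack.append(nxt)
    return frozenset(closure)   # hashable, so it can serve as a DFA state

def move(states, symbol):
    result = set()
    for state in states:
        result |= NFA[state].get(symbol, set())
    return result

def nfa_to_dfa():
    """Subset construction: returns (start state, transition table, accepting states)."""
    start = epsilon_closure({START})
    dfa, queue = {}, deque([start])
    while queue:
        subset = queue.popleft()
        if subset in dfa:
            continue                      # this DFA state was already expanded
        dfa[subset] = {}
        for symbol in ALPHABET:
            target = epsilon_closure(move(subset, symbol))
            if target:                    # an empty set means no transition
                dfa[subset][symbol] = target
                queue.append(target)
    accepting = {subset for subset in dfa if ACCEPT in subset}
    return start, dfa, accepting

def dfa_accepts(text):
    state, dfa, accepting = nfa_to_dfa()
    for ch in text:
        if ch not in dfa[state]:
            return False
        state = dfa[state][ch]            # exactly one next state per character
    return state in accepting
```

Once the table is built, the scanner follows a single transition per input character, with no sets of states and no ϵ-moves to track at run time.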