Week 5-6
OUTLINE OF THE TOPICS TO BE COVERED TODAY
CHOMSKY'S CLASSIFICATION OF GRAMMARS
LEXICAL ANALYSIS
For example:
Suppose a source program contains the assignment statement
position = initial + rate * 60 … (1)
After lexical analysis, statement (1) is represented as the sequence of tokens
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60> … (2)
A visual representation of statement (1) is shown next.
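The token sequence for statement (1) can be reproduced with a small sketch. It assumes a hypothetical three-pattern lexer (identifiers, integers, and the operators =, +, *) and a symbol table that is simply a list, so that the attribute of each id token is the 1-based position of the name in that list. The names TOKEN_RE and tokenize are illustrative only.

```python
import re

# Hypothetical patterns covering only what statement (1) needs; not a full lexer.
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<number>\d+)|(?P<op>[=+*]))")

def tokenize(source):
    """Return (tokens, symbol_table) for a simple assignment statement."""
    symbol_table = []   # entry i holds the (i+1)-th distinct identifier seen
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        pos = match.end()
        if match.group("id"):
            name = match.group("id")
            if name not in symbol_table:
                symbol_table.append(name)
            # The attribute is the identifier's symbol-table entry number.
            tokens.append(f"<id, {symbol_table.index(name) + 1}>")
        elif match.group("number"):
            tokens.append(f"<{match.group('number')}>")
        else:
            tokens.append(f"<{match.group('op')}>")
    return tokens, symbol_table

tokens, table = tokenize("position = initial + rate * 60")
print(" ".join(tokens))   # <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
print(table)              # ['position', 'initial', 'rate']
```

The operators and the number carry no attribute here, which matches the abbreviated form used in (2).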
LEXICAL ANALYSIS
[Figure: visual representation of all stages during compilation of the code]
LEXICAL ANALYSIS
Examples of Tokens and Lexemes
LEXICAL ANALYSIS
Categorization of Tokens:
In many programming languages, the following classes cover most or all of the tokens:
1. One token for each keyword. The pattern for a keyword is the same as the keyword itself.
2. Tokens for the operators, either individually or in classes such as the token comparison.
3. One token representing all identifiers.
4. One or more tokens representing constants, such as numbers and literal strings.
5. Tokens for each punctuation symbol, such as left and right parentheses, comma, and semicolon.
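These token classes can be sketched as a list of named patterns. The keyword set, the operator set, and the names TOKEN_SPEC, MASTER_RE, and classify are assumptions made for illustration; a real language would have many more keywords and operators (assignment, for instance, is left out here).

```python
import re

# Hypothetical, simplified token classes. Order matters: the first alternative
# that matches wins, so keywords must be tried before identifiers.
TOKEN_SPEC = [
    ("keyword",     r"\b(?:if|else|while|return)\b"),
    ("comparison",  r"<=|>=|==|!=|<|>"),
    ("id",          r"[A-Za-z_]\w*"),
    ("number",      r"\d+(?:\.\d+)?"),
    ("literal",     r'"[^"\n]*"'),
    ("punctuation", r"[(),;]"),
    ("skip",        r"\s+"),
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def classify(source):
    """Return a list of (token_class, lexeme) pairs, dropping whitespace."""
    result = []
    pos = 0
    while pos < len(source):
        match = MASTER_RE.match(source, pos)
        if match is None:
            raise ValueError(f"no token class matches at position {pos}")
        if match.lastgroup != "skip":
            result.append((match.lastgroup, match.group()))
        pos = match.end()
    return result

for token_class, lexeme in classify('if (x <= 10) return "ok";'):
    print(token_class, lexeme)
```

Here the token comparison stands for a whole class of operators, while each punctuation symbol would normally get its own token; the sketch groups them only to keep the table short.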
LEXICAL ANALYSIS
Example 2:
Give the token names and associated attribute values for the Fortran statement
E = M * C ** 2
Exercise: write this statement out as a sequence of tokens.
LEXICAL ANALYSIS
Lexical Errors:
It is hard for a lexical analyser to tell, without the aid of other components, that there is a source-code error.
For instance, if the string fi is encountered for the first time in a C program in the context:
fi ( a == f(x)) ...
a lexical analyser cannot tell whether fi is a misspelling of the keyword if or an undeclared function identifier.
Since fi is a valid lexeme for the token id, the lexical analyser must return the token id to the parser and let some other phase of the compiler, probably the parser in this case, handle an error due to transposition of the letters.
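A minimal sketch of why the lexer cannot flag fi: it only checks whether a word is in the keyword set, and anything else that looks like a name is a valid identifier. KEYWORDS is a hypothetical subset of the C keywords, and word_token is an illustrative helper.

```python
KEYWORDS = {"if", "else", "while", "return"}   # hypothetical subset of C keywords

def word_token(lexeme):
    """Classify an alphabetic lexeme as a keyword token or as the token id."""
    if lexeme in KEYWORDS:
        return (lexeme, None)    # one token per keyword, no attribute needed
    return ("id", lexeme)        # any other word is a legal identifier

print(word_token("if"))   # ('if', None)
print(word_token("fi"))   # ('id', 'fi')
```

The lexer reports no error for fi; whether the statement is malformed only becomes visible once a later phase looks at the surrounding tokens.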
REGULAR EXPRESSIONS AND DFA
Specifications of Tokens:
Regular expressions are an important notation for specifying lexeme patterns.
Although they cannot express all possible patterns, they are very effective in specifying the kinds of patterns that we actually need for tokens.
We shall study the formal notation for regular expressions.
Strings and languages
An alphabet is any finite set of symbols. Typical examples of symbols are letters, digits, and
punctuation.
The set {0, 1} is the binary alphabet. ASCII is an important example of an alphabet; it is used in many
software systems. Unicode is another example.
REGULAR EXPRESSIONS AND DFA’S
A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
In language theory, the terms “sentence” and “word” are often used as synonyms for “string.”
The length of a string s, usually written |s|, is the number of occurrences of symbols in s.
For example, banana is a string of length six. The empty string, denoted ε, is the string of length zero.
REGULAR EXPRESSIONS AND DFA’S
A language is any countable set of strings over some fixed alphabet. This definition is
very broad.
Abstract languages like ∅, the empty set, or {ε}, the set containing only the empty string, are languages under this definition.
Note that the definition of “language” does not require that any meaning be ascribed to the strings in the language.
REGULAR EXPRESSIONS AND DFA’S
Operations on languages
In lexical analysis, the most important operations on languages are union, concatenation, and closure.
Union is the familiar operation on sets. The concatenation of languages is all strings formed by taking a string from the first language and a string from the second language, in all possible ways, and concatenating them.
The (Kleene) closure of a language L, denoted L*, is the set of strings you get by concatenating L zero or more times.
Note that L^0, the “concatenation of L zero times,” is defined to be {ε}.
Finally, the positive closure, denoted L+, is the same as the Kleene closure, but without the term L^0. That is, ε will not be in L+ unless it is in L itself.
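These operations can be tried out on small finite languages represented as Python sets of strings. Since the closures are infinite in general, the sketch below computes only a finite approximation up to a chosen power; the function names and the max_power parameter are assumptions made for this illustration.

```python
from itertools import product

def union(L, M):
    return L | M

def concat(L, M):
    """All strings st with s taken from L and t taken from M."""
    return {s + t for s, t in product(L, M)}

def power(L, n):
    """L^n: L concatenated with itself n times; L^0 is the set holding only ''."""
    result = {""}
    for _ in range(n):
        result = concat(result, L)
    return result

def kleene_closure(L, max_power):
    """Finite approximation of L*: the union of L^0, L^1, ..., L^max_power."""
    result = set()
    for n in range(max_power + 1):
        result |= power(L, n)
    return result

def positive_closure(L, max_power):
    """Finite approximation of L+: the same union, but starting from L^1."""
    result = set()
    for n in range(1, max_power + 1):
        result |= power(L, n)
    return result

L = {"a", "b"}
D = {"0", "1"}
print(sorted(union(L, D)))            # ['0', '1', 'a', 'b']
print(sorted(concat(L, D)))           # ['a0', 'a1', 'b0', 'b1']
print(sorted(kleene_closure(L, 2)))   # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
print("" in positive_closure(L, 2))   # False
```

The empty Python string plays the role of the empty string here, so the last line shows that it is missing from the positive closure when it is not in L itself.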
REGULAR EXPRESSIONS AND DFA’S
Regular Expressions
Here are the rules that define the regular expressions over some alphabet Σ and the languages that those expressions denote.
BASIS: There are two rules that form the basis:
1. ε is a regular expression, and L(ε) is {ε}, that is, the language whose sole member is the empty string.
2. If a is a symbol in Σ, then a is a regular expression, and L(a) = {a}, that is, the language with one string, of length one, with a in its one position.
Note that by convention, we use italics for symbols, and boldface for their corresponding regular expression.
REGULAR EXPRESSIONS AND DFA’S
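INDUCTION:
There are four parts to the induction whereby larger regular expressions are built from smaller ones.
Suppose r and s are regular expressions denoting languages L(r) and L(s), respectively.
1. (r)|(s) is a regular expression denoting the language L(r) ∪ L(s).
2. (r)(s) is a regular expression denoting the language L(r)L(s).
3. (r)* is a regular expression denoting (L(r))*.
4. (r) is a regular expression denoting L(r).
This last rule says that we can add additional pairs of parentheses around expressions without changing the language they denote.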
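The four inductive cases (union, concatenation, closure, and parenthesization) can be checked on small examples. The sketch below uses Python's re module, whose syntax is a superset of the formal notation, and lists every string over {a, b} up to a length bound that a given expression matches in full. The helper name language and its parameters are assumptions made for this illustration.

```python
import re
from itertools import product

def language(regex, alphabet="ab", max_len=2):
    """Strings over `alphabet` of length <= max_len that `regex` matches in full."""
    words = ("".join(p) for n in range(max_len + 1) for p in product(alphabet, repeat=n))
    return [w for w in words if re.fullmatch(regex, w)]

print(language(r"(a)|(b)"))      # union:         ['a', 'b']
print(language(r"(a|b)(a|b)"))   # concatenation: ['aa', 'ab', 'ba', 'bb']
print(language(r"(a)*"))         # closure:       ['', 'a', 'aa']
print(language(r"((a))"))        # parentheses:   ['a']
```

The last call returns the same list as the plain expression a would, which is the point of the parenthesization rule.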
REGULAR EXPRESSIONS AND DFA’S
Transition Diagrams
As an intermediate step in the construction of a lexical analyser, we first convert patterns into stylized flowcharts, called “transition diagrams.”
We perform the conversion from regular-expression patterns to transition diagrams.
Transition diagrams have a collection of nodes or circles, called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
We may think of a state as summarizing all we need to know about what characters we have seen between the lexemeBegin pointer and the forward pointer.
REGULAR EXPRESSIONS AND DFA’S
DFA
We shall assume that all our transition diagrams are deterministic, meaning that there is never more than one edge out of a given state with a given symbol among its labels.
Conventions for the transition diagrams
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a lexeme has been found, although the actual lexeme may not consist of all positions between the lexemeBegin and forward pointers.
2. In addition, if it is necessary to retract the forward pointer one position (i.e., the lexeme does not include the symbol that got us to the accepting state), then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge, labeled “start,” entering from nowhere.
REGULAR EXPRESSIONS AND DFA’S
Example:
A transition diagram that recognizes the lexemes matching the token relop. We begin in state 0, the start state.
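A transition diagram of this kind can be simulated directly in code. The sketch below assumes the relational operators <, <=, <>, =, >, and >=, and assumes state numbers 1 and 6 for "seen <" and "seen >"; only state 0 as the start state is taken from the text. The accepting states are folded into the return statements, and the function name get_relop and the attribute names (LT, LE, and so on) are illustrative.

```python
def get_relop(text, lexeme_begin=0):
    """Simulate a relop transition diagram starting at index lexeme_begin.

    Returns (attribute, forward), where forward is the index just past the
    lexeme, or None if no relational operator starts at lexeme_begin.
    """
    state = 0                  # start state
    forward = lexeme_begin
    while True:
        # "" stands for end of input and behaves like any "other" character.
        ch = text[forward] if forward < len(text) else ""
        if state == 0:
            if ch == "<":
                state = 1
            elif ch == "=":
                return ("EQ", forward + 1)
            elif ch == ">":
                state = 6
            else:
                return None    # not a relop; another diagram would be tried
        elif state == 1:       # we have seen "<"
            if ch == "=":
                return ("LE", forward + 1)
            if ch == ">":
                return ("NE", forward + 1)
            return ("LT", forward)   # starred state: retract, ch is not consumed
        elif state == 6:       # we have seen ">"
            if ch == "=":
                return ("GE", forward + 1)
            return ("GT", forward)   # starred state: retract, ch is not consumed
        forward += 1

print(get_relop("<= b"))   # ('LE', 2)
print(get_relop("<x"))     # ('LT', 1)
print(get_relop("abc"))    # None
```

In the two retracting cases the returned forward index points at the character that caused the transition, so that character is left for the next token.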
Thanks