Lex Analysis

1. The document discusses phases of syntax analysis, including lexical analysis and parsing. Lexical analysis converts input characters into tokens, while parsing identifies sentence structure by constructing parse trees from tokens.
2. It describes how regular expressions are used to represent sets of strings for lexical analysis. Finite state automata (NFAs and DFAs) are constructed to recognize strings that match a regular expression.
3. The subset construction algorithm is used to convert an NFA to a DFA, where each DFA state corresponds to a set of NFA states. This allows deterministic recognition of strings.


Phases of Syntax Analysis

1. Identify the words: Lexical Analysis. Converts a stream of characters (the input program) into a stream of tokens. Also called Scanning or Tokenizing.
2. Identify the sentences: Parsing. Derive the structure of sentences: construct parse trees from a stream of tokens.

Lexical Analysis
Convert a stream of characters into a stream of tokens.
- Simplicity: Conventions about words are often different from conventions about sentences.
- Efficiency: The word identification problem has a much more efficient solution than the sentence identification problem.
- Portability: Character set, special characters, device features.

Terminology
Token: Name given to a family of words.
e.g., integer constant

Lexeme: Actual sequence of characters representing a word.


e.g., 32894

Pattern: Notation used to identify the set of lexemes represented by a token.


e.g., [0-9]+

Terminology
A few more examples:

Token              Sample Lexemes     Pattern
while              while              while
integer constant   32894, -1093, 0    [0-9]+
identifier         buffer_size        [a-zA-Z]+

Patterns
How do we compactly represent the set of all lexemes corresponding to a token? For instance:
The token integer constant represents the set of all integers: that is, all sequences of digits (0-9), preceded by an optional sign (+ or -).

Obviously, we cannot simply enumerate all lexemes. Use Regular Expressions.

Regular Expressions
Notation to represent (potentially) infinite sets of strings over an alphabet Σ.
- a: stands for the set {a} that contains the single string a.
- (a | b): stands for the set {a, b}. Analogous to union.
- ab: stands for the set {ab} that contains the single string ab. Analogous to product.
  (a | b)(a | b): stands for the set {aa, ab, ba, bb}.
- a*: stands for the set {ε, a, aa, aaa, ...} that contains all strings of zero or more a's. Analogous to closure of the product operation (Kleene closure).

Regular Expressions
Examples of regular expressions over {a, b}:
- (a | b)*: set of strings with zero or more a's and zero or more b's:
  {ε, a, b, aa, ab, ba, bb, aaa, aab, ...}
- (a* b*): set of strings with zero or more a's and zero or more b's such that all a's occur before any b:
  {ε, a, b, aa, ab, bb, aaa, aab, abb, ...}
- (a* b*)*: set of strings with zero or more a's and zero or more b's:
  {ε, a, b, aa, ab, ba, bb, aaa, aab, ...}

Language of Regular Expressions

Let R be the set of all regular expressions over Σ. Then:
- Empty string: ε ∈ R
- Unit strings: α ∈ Σ ⇒ α ∈ R
- Concatenation: r1, r2 ∈ R ⇒ r1 r2 ∈ R
- Alternative: r1, r2 ∈ R ⇒ (r1 | r2) ∈ R
- Kleene closure: r ∈ R ⇒ r* ∈ R

Regular Expressions
Example: (a | b)*

L0 = {ε}
L1 = L0 {a, b} = {ε} {a, b} = {a, b}
L2 = L1 {a, b} = {a, b} {a, b} = {aa, ab, ba, bb}
L3 = L2 {a, b}
...

L = ∪ (i = 0 to ∞) Li = {ε, a, b, aa, ab, ba, bb, ...}

Semantics of Regular Expressions

Semantic function L: maps regular expressions to sets of strings.

L(ε)       = {ε}
L(α)       = {α}            (α ∈ Σ)
L(r1 | r2) = L(r1) ∪ L(r2)
L(r1 r2)   = L(r1) L(r2)
L(r*)      = {ε} ∪ (L(r) L(r*))

Computing the Semantics


L(a)              = {a}
L(a | b)          = L(a) ∪ L(b) = {a} ∪ {b} = {a, b}
L(ab)             = L(a) L(b) = {a} {b} = {ab}
L((a | b)(a | b)) = L(a | b) L(a | b) = {a, b} {a, b} = {aa, ab, ba, bb}

Computing the Semantics of Closure


Example: L((a | b)*) = {ε} ∪ (L(a | b) L((a | b)*))

L0 = {ε}                                                  (base case)
L1 = {ε} ∪ ({a, b} L0) = {ε} ∪ ({a, b} {ε}) = {ε, a, b}
L2 = {ε} ∪ ({a, b} L1) = {ε} ∪ ({a, b} {ε, a, b}) = {ε, a, b, aa, ab, ba, bb}
...

L((a | b)*) = L∞ = {ε, a, b, aa, ab, ba, bb, ...}

Another Example
L((a* b*)*):

L(a*)       = {ε, a, aa, ...}
L(b*)       = {ε, b, bb, ...}
L(a* b*)    = {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, ...}
L((a* b*)*) = {ε} ∪ (L(a* b*) L((a* b*)*))
            = {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, ...}

Regular Definitions
Assign names to regular expressions. For example,

digit   → 0 | 1 | ... | 9
natural → digit digit*

Shorthands:
- a+: set of strings with one or more occurrences of a.
- a?: set of strings with zero or one occurrence of a.

Example:

integer → (+|-)? digit+

Regular Definitions: Examples

float            → integer . fraction
integer          → (+|-)? no_leading_zero
no_leading_zero  → (nonzero_digit digit*) | 0
fraction         → no_trailing_zero exponent?
no_trailing_zero → (digit* nonzero_digit) | 0
exponent         → (E | e) integer
digit            → 0 | 1 | ... | 9
nonzero_digit    → 1 | 2 | ... | 9

Regular Definitions and Lexical Analysis

Regular expressions and definitions specify sets of strings over an input alphabet. They can hence be used to specify the set of lexemes associated with a token, i.e., they serve as the pattern language. How do we decide whether an input string belongs to the set of strings specified by a regular expression?

Using Regular Definitions for Lexical Analysis

Q: Is ababbaabbb ∈ L((a* b*)*)?
A: Hmm. Well. Let's see.

L((a* b*)*) = {ε}
            ∪ {ε, a, b, aa, ab, bb, aaa, aab, abb, bbb, ...}
            ∪ {ε, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, ...}
            ∪ ...
            = ???

Recognizers
Construct automata that recognize strings belonging to a language.
- Finite State Automata ↔ Regular Languages
- Push Down Automata ↔ Context-free Languages (a stack is used to maintain a counter, but only one counter can go arbitrarily high)

Recognizing Finite Sets of Strings

Identifying words from a small, finite, fixed vocabulary is straightforward. For instance, consider a stack machine with push, pop, and add operations and two constants: 0 and 1. We can use the automaton:

[Automaton diagram: branching paths spell p-u-s-h, p-o-p, and a-d-d into final states for push, pop, and add; transitions on 0 and 1 lead to a final state for integer_constant.]

Finite State Automata


Represented by a labeled directed graph:
- A finite set of states (vertices).
- Transitions between states (edges).
- Labels on transitions, drawn from Σ ∪ {ε}.
- One distinguished start state.
- One or more distinguished final states.

Finite State Automata: An Example


Consider the regular expression (a | b)* a(a | b).
L((a | b)* a(a | b)) = {aa, ab, aaa, aab, baa, bab, aaaa, aaab, abaa, abab, baaa, ...}.
The following automaton determines whether an input string belongs to L((a | b)* a(a | b)):

[NFA diagram: state 1 (start) loops to itself on a and b, and moves to state 2 on a; state 2 moves to state 3 (final) on a or b.]

Determinism
(a | b)* a(a | b):

Nondeterministic (NFA):
[State 1 (start) loops on a and b, and also moves to state 2 on a; state 2 moves to state 3 (final) on a or b.]

Deterministic (DFA):
[State 1 (start): a → 2, b → 1; state 2: a → 3, b → 4; state 3: a → 3, b → 4; state 4: a → 2, b → 1; states 3 and 4 are final.]

Acceptance Criterion
A finite state automaton (NFA or DFA) accepts an input string x ... if beginning from the start state ... we can trace some path through the automaton ... such that the sequence of edge labels spells x ... and the path ends in a final state.

Recognition with an NFA


Is abab ∈ L((a | b)* a(a | b))?

[NFA: state 1 (start) loops on a and b and moves to state 2 on a; state 2 moves to state 3 (final) on a or b.]

Input:        a   b   a   b
Path 1:   1   1   1   1   1
Path 2:   1   1   1   2   3   Accept
Path 3:   1   2   3   -   -   (stuck)

The string is accepted because at least one path (Path 2) ends in final state 3.


Recognition with a DFA


Is abab ∈ L((a | b)* a(a | b))?

[DFA: state 1 (start): a → 2, b → 1; state 2: a → 3, b → 4; state 3: a → 3, b → 4; state 4: a → 2, b → 1; states 3 and 4 are final.]

The unique path on abab is 1 → 2 → 4 → 2 → 4, which ends in final state 4: Accept.

NFA vs. DFA


- For every NFA, there is a DFA that accepts the same set of strings.
- An NFA may have transitions labeled by ε (spontaneous transitions); all transition labels in a DFA belong to Σ.
- For some string x, there may be many accepting paths in an NFA; for any string x, there is at most one accepting path in a DFA.
- Usually, an input string can be recognized faster with a DFA.
- NFAs are typically smaller than the corresponding DFAs.

Regular Expressions to NFA


Thompson's Construction: for every regular expression r, derive an NFA N(r) with a unique start state and a unique final state.

[Diagram for (r1 | r2): a new start state has ε-transitions into N(r1) and N(r2); their final states have ε-transitions into a new final state.]

Regular Expressions to NFA (contd.)

[Diagram for r1 r2: the final state of N(r1) is joined by an ε-transition to the start state of N(r2).]
[Diagram for r*: N(r) is wrapped with a new start state and a new final state; ε-transitions allow skipping N(r) entirely or traversing it repeatedly.]

Example
(a | b)* a(a | b):

[NFA built by Thompson's construction: an (a | b)* block, then an a-transition, then an (a | b) block, joined by ε-transitions.]

Recognition with an NFA


Is abab ∈ L((a | b)* a(a | b))?

[Same NFA: state 1 (start) loops on a and b and moves to 2 on a; 2 moves to 3 (final) on a or b.]

Input:           a      b      a      b
All paths: {1}   {1,2}  {1,3}  {1,2}  {1,3}

The final set {1, 3} contains final state 3: Accept.

Recognition with an NFA (contd.)


Is aaab ∈ L((a | b)* a(a | b))?

Input:           a      a        a        b
All paths: {1}   {1,2}  {1,2,3}  {1,2,3}  {1,3}

The final set {1, 3} contains final state 3: Accept.

Recognition with an NFA (contd.)


Is aabb ∈ L((a | b)* a(a | b))?

Input:           a      a        b      b
All paths: {1}   {1,2}  {1,2,3}  {1,3}  {1}

The final set {1} does not contain final state 3: REJECT.

Converting NFA to DFA


Subset construction. Given a set S of NFA states:
- compute ε-closure(S): the set of all NFA states reachable from S by zero or more ε-transitions;
- compute goto(S, α): let T be the set of all NFA states reachable from S by a transition labeled α; then goto(S, α) = ε-closure(T).

Converting NFA to DFA (contd.)

- Each state of the DFA corresponds to a set of states of the NFA.
- Start state of DFA = ε-closure(start state of NFA).
- From a state s of the DFA corresponding to a set S of NFA states: for each α, add a transition labeled α to the state s' corresponding to the non-empty set S' of NFA states such that S' = goto(S, α).
- s is a final state of the DFA if S contains a final state of the NFA.

NFA → DFA: An Example

[NFA: state 1 (start) loops on a and b and moves to 2 on a; 2 moves to 3 (final) on a or b.]

ε-closure({1})     = {1}
goto({1}, a)       = {1, 2}
goto({1}, b)       = {1}
goto({1, 2}, a)    = {1, 2, 3}
goto({1, 2}, b)    = {1, 3}
goto({1, 2, 3}, a) = {1, 2, 3}
...

NFA → DFA: An Example (contd.)

ε-closure({1})     = {1}
goto({1}, a)       = {1, 2}
goto({1}, b)       = {1}
goto({1, 2}, a)    = {1, 2, 3}
goto({1, 2}, b)    = {1, 3}
goto({1, 2, 3}, a) = {1, 2, 3}
goto({1, 2, 3}, b) = {1, 3}
goto({1, 3}, a)    = {1, 2}
goto({1, 3}, b)    = {1}

NFA → DFA: An Example (contd.)

[Resulting DFA: start state {1}: a → {1,2}, b → {1}; {1,2}: a → {1,2,3}, b → {1,3}; {1,2,3}: a → {1,2,3}, b → {1,3}; {1,3}: a → {1,2}, b → {1}. Final states: {1,2,3} and {1,3}.]

NFA vs. DFA


R = size of the regular expression; N = length of the input string.

                      NFA       DFA
Size of automaton     O(R)      O(2^R)
Recognition time      O(N × R)  O(N)

Lexical Analysis
- Regular expressions and definitions are used to specify the set of strings (lexemes) corresponding to a token.
- An automaton (DFA/NFA) is built from the above specifications.
- Each final state is associated with an action: emit the corresponding token.

Specifying Lexical Analysis


Consider a recognizer for integers (sequences of digits) and floats (sequences of digits separated by a decimal point).

[0-9]+            { emit(INTEGER_CONSTANT); }
[0-9]+"."[0-9]+   { emit(FLOAT_CONSTANT); }

[Combined automaton: from the start state, digits 0-9 lead to a looping final state that emits INTEGER_CONSTANT; from there, "." followed by one or more digits 0-9 leads to a looping final state that emits FLOAT_CONSTANT.]

Lex
Tool for building lexical analyzers.
- Input: lexical specifications (.l file)
- Output: a C function (yylex) that returns a token on each invocation.

%%
[0-9]+            { return(INTEGER_CONSTANT); }
[0-9]+"."[0-9]+   { return(FLOAT_CONSTANT); }

Tokens are simply integers (#defines).

Lex Specifications

%{
  C header statements for inclusion
%}
Regular definitions, e.g.:
  digit   [0-9]
%%
Token specifications, e.g.:
  {digit}+   { return(INTEGER_CONSTANT); }
%%
Support functions in C

Regular Expressions in Lex

- Range: [0-7]: integers 0 through 7 (inclusive); [a-nx-zA-Q]: letters a through n, x through z, and A through Q.
- Exception: [^/]: any character other than /.
- Definition: {digit}: use the previously specified regular definition digit.
- Special characters: connectives of regular expressions and convenience features, e.g.: | * ^

Special Characters in Lex


| * + ? ( )    Same as in regular expressions
[ ]            Enclose ranges and exceptions
{ }            Enclose names of regular definitions
^              Negate a specified range (in an exception)
.              Match any single character except newline
\              Escape the next character
\n, \t         Newline and tab

For literal matching, enclose special characters in double quotes ("), e.g.: "*". Or use \ to escape, e.g.: \".

Examples
for                                      Sequence of f, o, r
"||"                                     C-style OR operator (two vertical bars)
.*                                       Sequence of non-newline characters
[^*/]+                                   Sequence of characters except * and /
\"[^"]*\"                                Sequence of non-quote characters beginning and ending with a quote
({letter}|"_")({letter}|{digit}|"_")*    C-style identifiers

A Complete Example
%{
#include <stdio.h>
#include "tokens.h"
%}
digit    [0-9]
hexdigit [0-9a-f]
%%
"+"                  { return(PLUS); }
"-"                  { return(MINUS); }
{digit}+             { return(INTEGER_CONSTANT); }
{digit}+"."{digit}+  { return(FLOAT_CONSTANT); }
.                    { return(SYNTAX_ERROR); }
%%

Actions
Actions are attached to final states.
- They distinguish the different final states.
- They can be used to set attribute values.
- Each action is a fragment of C code (a block enclosed by { and }).

Attributes
Additional information about a token's lexeme.
- Stored in the variable yylval.
- The type of attributes (usually a union) is specified by YYSTYPE.
Additional variables:
- yytext: the lexeme (actual text string)
- yyleng: length of the string in yytext
- yylineno: current line number (number of \n seen thus far); enabled by %option yylineno

Priority of matching
What if an input string matches more than one pattern?

"if"       { return(TOKEN_IF); }
{letter}+  { return(TOKEN_ID); }
"while"    { return(TOKEN_WHILE); }

- The pattern that matches the longest string is chosen. Example: if1 is matched as an identifier, not as the keyword if.
- Of the patterns that match strings of the same length, the first (from the top of the file) is chosen. Example: while is matched as an identifier, not as the keyword while.

Constructing Scanners using (f)lex


Scanner specifications: specifications.l

specifications.l --(f)lex--> lex.yy.c (the generated scanner) --(g)cc--> executable

yywrap(): hook for signalling end of file. Use the -lfl (flex) or -ll (lex) flag at link time to include the default yywrap(), which always returns 1.

Implementing a Scanner
transition : state × Σ → state

algorithm scanner() {
    current_state = start_state;
    while (1) {
        c = getc();   /* on end of file, ... */
        if defined(transition(current_state, c))
            current_state = transition(current_state, c);
        else
            return s;
    }
}

Implementing a Scanner (contd.)


Implementing the transition function:
- Simplest: a 2-D array. Space inefficient.
- Traditionally compressed using row/column equivalence (the default in (f)lex). Good space-time tradeoff.
- Further table compression using various techniques, e.g., RDM (Row Displacement Method): store rows in an overlapping manner using two 1-D arrays. Smaller tables, but longer access times.

Lexical Analysis: A Summary


Convert a stream of characters into a stream of tokens. Additionally:
- Make the rest of the compiler independent of the character set.
- Strip off comments.
- Recognize line numbers.
- Ignore white space characters.
- Process macros (definitions and uses).
- Interface with the symbol (name) table.
