Topic 3: Lexical Analysis & Scanner Generator

A. Formal Language
• Formal languages are critical to the study of programming
languages and compilers
• Formal Language: a language which can be defined by
a precise specification and is amenable for use with
computers
• E.g.: the syntax of Java, Pascal, etc.
• Natural Language: a language used by people, which
cannot be defined perfectly with a precise
specification system
• E.g.: English, Malay, etc.
1. Language Elements
• Set
- Collection of unique objects
- May contain an infinite number of objects
- Empty set: set with no elements ({ } or Ø)
- Example: {boy, girl, animal} = {boy, girl, animal, boy}
• String
- List of characters from a given alphabet
- Characters need not be unique, but order is important
- Null string: string that contains no characters (ε)
- Example: abc, cba, abb, ab
• Language
- A set of strings from a given alphabet
- Example from alphabet {0,1}: {0,10,11,…}
- Example from the alphabet of characters on a keyboard: Malay syntax, Java syntax
2. Finite State Machine
• We have problems specifying the strings in an
infinite (or very large) language.
• For example:
• English lacks the precision necessary to
differentiate which strings are in the language,
and which are not.
• One solution is to use a mathematical or
hypothetical machine called a finite state
machine (or finite automaton).
2. Finite State Machine

Definition:
A theoretical machine consisting of a
finite set of states, a finite input
alphabet, and a state transition function
which specifies the machine’s next state,
given its present state and the current
input.
2. Finite State Machine

[Figure: example state graph with the starting state, the accepting (final) state, and the dead (trapped) state labelled]

Try reading these inputs: 0, 10, 100, 011, 1101
2. Finite State Machine
• How a Finite State Machine works:
1. Input: string of symbols from the input alphabet
2. Machine: initially in the starting state
3. Symbols are read from the input string
4. Machine changes state based on the transition
function and the input symbol
5. When all inputs have been read, the machine is
either in:
• Accepting state --> input string has been accepted
• Non-accepting state --> input string has been rejected
2. Finite State Machine

The set of all input strings which would be accepted by
the machine forms a language, and in this way the finite
state machine provides a precise specification of a language.
2. Finite State Machine

A Finite State Machine can be represented as:
• State graph
• State table
2. Finite State Machine
• State graph
• Components:
• State
• Transition function (an arc labelled with an input symbol x)
• Starting state
• Accepting state
2. Finite State Machine
• State table
• Components
• State names --> rows
• Input symbols --> columns
• Each entry --> next state of the machine
given an input
• Starting state --> the first row of the table
• Accepting state --> marked with an asterisk (*)
2. Finite State Machine
• We will only be using machines that
have exactly one arc leaving the state
for each possible input symbol
• This type of machine is called:
deterministic finite state machine

2. Finite State Machine
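• Example 1: this machine accepts any string of zeros and ones
which contains an even number of ones

State graph:
[Figure: states A (start, accepting) and B; each state loops to itself on 0, and a 1 moves the machine from A to B or from B to A]

State table:
       0    1
*A     A    B
 B     B    A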
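The Example 1 state table can be run directly as a lookup table. The sketch below is one possible Java rendering and is not part of the original slides; the class name EvenOnesDfa and the method accepts are hypothetical names chosen for illustration.

```java
// Table-driven simulation of the Example 1 machine (even number of ones).
// Names here are illustrative assumptions, not taken from the slides.
final class EvenOnesDfa {
    private static final int A = 0, B = 1;           // state numbers

    // NEXT[state][symbol]: rows are states, columns are the inputs 0 and 1.
    private static final int[][] NEXT = {
        {A, B},   // from A: 0 -> A, 1 -> B
        {B, A},   // from B: 0 -> B, 1 -> A
    };

    // Only A (the row marked with * in the state table) is accepting.
    private static final boolean[] ACCEPTING = {true, false};

    static boolean accepts(String input) {
        int state = A;                                // A is the starting state
        for (char c : input.toCharArray()) {
            if (c != '0' && c != '1') {
                return false;                         // symbol outside the alphabet
            }
            state = NEXT[state][c - '0'];
        }
        return ACCEPTING[state];
    }

    public static void main(String[] args) {
        for (String s : new String[] {"", "0", "11", "1011", "1"}) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```

Swapping in a different NEXT table and ACCEPTING array gives the machines of the other examples without changing the loop.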
2. Finite State Machine
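• Example 2: strings containing an odd number of zeros (input
alphabet is {0, 1})

State graph:
[Figure: states A (start) and B (accepting); each state loops to itself on 1, and a 0 moves the machine from A to B or from B to A]

State table:
       0    1
 A     B    A
*B     A    B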
2. Finite State Machine
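• Example 3: Strings containing three consecutive ones (the
input alphabet is {0,1})

State graph:
[Figure: states A (start), B, C and D (accepting); a 1 advances A to B, B to C and C to D; a 0 returns A, B or C to A; D loops to itself on 0 and 1]

State table:
       0    1
 A     A    B
 B     A    C
 C     A    D
*D     D    D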
Exercises
Show a finite state machine in either state graph or table
form for each of the following languages (in each case the
input alphabet is {0,1}):

1. Strings containing exactly three zeros.

2. Strings containing an odd number of zeros and an even
number of ones.

Exercises (continued)
3. Set of all strings that start with ‘0’.

4. Set of all strings of length 2.

5. Set of all strings containing an odd number of zeros.

6. Set of all strings containing 3 consecutive ones.

3. Regular Expressions
• Another method for specifying or describing
languages.
• Used to represent certain sets of strings in
algebraic form.
• The expressions use three possible
operations:

1. UNION
2. CONCATENATION
3. KLEENE*
3. Regular Expressions

UNION
• Union of two sets = set which contains all
elements in the two sets
• Designated with ‘+’
• { abc, ab, ba } + { ba, bb } = { abc, ab, ba, bb}
• L+{ }=L

3. Regular Expressions

CONCATENATION
• Combines each string of the first set with each string of the second set to form new strings
• Designated with ‘.’
• { ab, a, c } . { b, ε } = { ab.b, ab.ε, a.b, a.ε, c.b, c.ε } = { abb, ab, a, cb, c }
• L . { ε } = L
• L . { } = { }
3. Regular Expressions

KLEENE*
• Unary operation, often called closure
• Designated with ‘*’
• If L is a language:
• L^0 = { ε }
• L^1 = L
• L^2 = L . L^1
• L^n = L . L^(n-1)
• L* = L^0 + L^1 + L^2 + L^3 + L^4 + …
3. Regular Expressions
• Example 4: L = { 0, 1 }. Find L*

L^0 = { ε }
L^1 = { 0, 1 }
L^2 = L . L^1 = { 0, 1 } . { 0, 1 } = { 00, 01, 10, 11 }
L^3 = L . L^2 = { 0, 1 } . { 00, 01, 10, 11 } = { 000, 001, 010, 011, 100, 101, 110, 111 }
L* = { ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, … }
L* = the set of all strings of zeros and ones
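The powers in Example 4 can be computed mechanically from the definition of concatenation. The following is a minimal sketch, assuming languages are held as sets of Java strings with the empty string standing for ε; the class LanguagePowers and its methods are hypothetical names, and starUpTo only builds a finite prefix of L*, since L* itself is infinite.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative helper (not from the slides) for concatenation and powers.
final class LanguagePowers {
    // Concatenation: every string of left followed by every string of right.
    static Set<String> concat(Set<String> left, Set<String> right) {
        Set<String> out = new LinkedHashSet<>();
        for (String x : left) {
            for (String y : right) {
                out.add(x + y);
            }
        }
        return out;
    }

    // The n-th power of lang: power 0 is { ε }, power n is lang . power (n-1).
    static Set<String> power(Set<String> lang, int n) {
        Set<String> result = new LinkedHashSet<>(List.of(""));  // "" stands for ε
        for (int i = 0; i < n; i++) {
            result = concat(lang, result);
        }
        return result;
    }

    // Finite prefix of the closure: union of powers 0 through maxPower.
    static Set<String> starUpTo(Set<String> lang, int maxPower) {
        Set<String> out = new LinkedHashSet<>();
        for (int n = 0; n <= maxPower; n++) {
            out.addAll(power(lang, n));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> lang = new LinkedHashSet<>(List.of("0", "1"));
        System.out.println(power(lang, 2));
        System.out.println(starUpTo(lang, 3).size());
    }
}
```

For L = { 0, 1 } the powers 0 through 3 contain 1, 2, 4 and 8 strings, so the prefix up to power 3 holds 15 strings.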
3. Regular Expressions
• Shorthand Notation
• Regular expressions use a shorthand notation
• If x is a character in the input alphabet, then x
= {“x”} --> the character x represents the set
consisting of only one string, which is x.
• Example:
• 0 + 1 = { 0 } + { 1 } = { 0, 1 }
• 0 + ε = { 0, ε }
3. Regular Expressions
• Precedence (highest first)
1. ( )
2. Kleene*
3. Concatenation
4. Union
• Identities
• εR = Rε = R
• (R*)* = R*
• R*R* = R*
• R + R = R
• (P+Q)R = PR + QR
• (P+Q)* = (P*Q*)* = (P*+Q*)*

Arden’s Theorem:
If P does not contain ε, the equation R = Q + RP
has the unique solution R = QP*
3. Regular Expressions
• Example 5: 1 . ( 0 + 1 )* . 0

1 . ( 0 + 1 )* . 0
= { 1 } . { 0, 1 }* . { 0 }
= { 10, 100, 110, 1000, 1010, 1100, 1110, … }
= the set of all strings of zeros and ones which begin with a 1 and end with
a 0
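Membership in the language of Example 5 can be checked with Java's standard java.util.regex package. This is a small sketch, not the slides' own method; note that Java's regex syntax writes union as '|' rather than '+' and concatenation by juxtaposition. The class name and the method inLanguage are hypothetical.

```java
import java.util.regex.Pattern;

// Illustrative check for the language 1 . (0+1)* . 0 from Example 5.
final class BeginsWithOneEndsWithZero {
    // Same expression in java.util.regex syntax: union is '|'.
    private static final Pattern EXPRESSION = Pattern.compile("1(0|1)*0");

    static boolean inLanguage(String s) {
        // matches() requires the whole string to match, not just a prefix.
        return EXPRESSION.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"10", "1010", "1110", "1", "0110"}) {
            System.out.println(s + " -> " + inLanguage(s));
        }
    }
}
```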
Recap…
• What is Formal Language?
• One example of Formal Language?
• What is Natural Language?
• One example of Natural Language?
• How do we represent Finite State Machine?
• How do we represent accepting state for above?
• What are the three operations we can do on Regular
Expressions?
• List the precedence of Regular Expression operations

Exercise
Describe the languages of the regular expressions below and draw
their finite state machines.

1. a(bc)*
2. (a+b)*c
3. (a*b) + (cd*)

Try these: a+b, a.b, a*, ab*c, (a.b)*, a(ba)*, a(b+ca)*

Exercise
Produce the regular expression for the DFA below:

B. Lexical Tokens
• Lexical Analysis: attempts to isolate words in an
input string
• WORD: also known as lexeme, lexical item or lexical token –
a string of input characters taken as a unit and
passed on to the next phase of compilation

Lexical Analysis

Lexical Analysis – Symbol Table
• Symbol Table: A data structure used to store
identifiers and possibly other lexical entities
during compilation.
• When the scanner encounters identifiers, it constructs the symbol table.

Lexical Analysis – Symbol Table
Things to remember about the Symbol Table:

1. The symbol table stores each identifier once, no
matter how many times it appears in the code.
• The information it stores:
• identifier type
• associated run-time information
• e.g.: value assigned

Lexical Analysis – Symbol Table
Things to remember about the Symbol Table:

2. There can be problems with block-structured
languages: the same identifier can have different
declarations in different blocks
• Solution: all instances of the identifier must be
recorded, by either:
• setting up a separate symbol table for each block
• specifying block scopes in a single symbol table
Lexical Analysis – Symbol Table
Things to remember about the Symbol Table:

3. Numeric constants must be converted to an
appropriate internal form.
• E.g.: 3.4e+6 must be converted to 3400000
• So that the computer can do appropriate arithmetic
operations on it.

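As a small illustration of converting a numeric lexeme to an internal form, the sketch below uses Java's standard Double.parseDouble. Choosing double as the internal form is an assumption made here for illustration (a real compiler might pick a different representation per constant type), and the class ConstantConversion is a hypothetical name.

```java
// Illustrative conversion of a numeric lexeme to an internal (double) value.
final class ConstantConversion {
    // The lexeme is the exact text the scanner isolated, e.g. "3.4e+6".
    static double toInternal(String lexeme) {
        return Double.parseDouble(lexeme);
    }

    public static void main(String[] args) {
        double value = toInternal("3.4e+6");
        // Once converted, ordinary arithmetic works on the value.
        System.out.println(value == 3400000.0);
        System.out.println(value / 2 == 1700000.0);
    }
}
```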
Lexical Analysis
• Lexical analysis does not check for appropriate
syntax.
• E.g: } while if ( {
• lexical phase would put out five tokens
• If a source language is not case sensitive, scanner
must accommodate this feature.
• E.g: then = THEN = tHeN = Then

Lexical Analysis - Output

The output is a stream of tokens. Each token has two parts:
A. class: indicates which type of token
B. value: indicates which member of the class
Lexical Analysis - Output
Input:          while ( x33 <= 2.5e+33 - total ) calc ( x33 ) ; //!
Token classes:  1 6 2 3 4 3 2 6 2 6 2 6 6
(the comment //! produces no token)

Token Class    Token Value
1              [code for while]
6              [code for (]
2              [ptr to symbol table entry for x33]
3              [code for <=]
4              [ptr to constant table entry for 2.5e+33]
3              [code for -]
2              [ptr to symbol table entry for total]
6              [code for )]
2              [ptr to symbol table entry for calc]
6              [code for (]
2              [ptr to symbol table entry for x33]
6              [code for )]
6              [code for ;]
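The class sequence above can be reproduced with a very small scanner. The sketch below is an illustration only: the class numbers are copied from the slide, but their meanings (1 keyword, 2 identifier, 3 operator, 4 numeric constant, 6 special character) are inferred from the example, and the keyword list, the operator set and every name in the code are assumptions. Token values (codes and table pointers) are left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative scanner that emits only the token class numbers.
final class TokenClassDemo {
    // Assumed keyword list; the slide only shows "while".
    private static final Set<String> KEYWORDS = Set.of("while", "if", "for");

    // Alternatives are tried in order; "skip" covers whitespace and // comments.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<skip>\\s+|//.*)"
        + "|(?<word>[A-Za-z][A-Za-z0-9]*)"
        + "|(?<num>[0-9]+(\\.[0-9]+)?([eE][+-]?[0-9]+)?)"
        + "|(?<op><=|>=|==|[-+*/<>=])"
        + "|(?<special>[(){};,])");

    static List<Integer> classes(String source) {
        List<Integer> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        int pos = 0;
        while (pos < source.length()) {
            m.region(pos, source.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            if (m.group("word") != null) {
                out.add(KEYWORDS.contains(m.group("word")) ? 1 : 2);
            } else if (m.group("num") != null) {
                out.add(4);
            } else if (m.group("op") != null) {
                out.add(3);
            } else if (m.group("special") != null) {
                out.add(6);
            }
            // A "skip" match (whitespace or comment) emits nothing.
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(classes("while ( x33 <= 2.5e+33 - total ) calc ( x33 ) ; //!"));
    }
}
```

Like the lexical phase described earlier, this sketch does not check syntax: it would happily emit tokens for `} while if ( {`.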
C. Implementation with Finite State Machine
• To simplify lexical analysis, we can use finite state machines
• Example 1: Finite State Machine which accepts any identifier
beginning with a letter and followed by any number of letters
and digits. L = (a-z), D = (0-9)

[Figure: state graph for identifiers; arcs are labelled L and D, with an L,D loop on the accepting state]
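The identifier machine of Example 1 can be hand-coded with one case per state. This is a minimal sketch under the slide's definitions L = (a-z) and D = (0-9); the state names START, IN_ID and DEAD and the class IdentifierFsm are hypothetical labels, since the figure's own state names are not recoverable.

```java
// Illustrative hand-coded FSM for identifiers: a letter, then letters or digits.
final class IdentifierFsm {
    enum State { START, IN_ID, DEAD }

    static boolean isLetter(char c) { return c >= 'a' && c <= 'z'; }  // L = (a-z)
    static boolean isDigit(char c)  { return c >= '0' && c <= '9'; }  // D = (0-9)

    // Transition function: next state from the present state and current input.
    static State next(State s, char c) {
        switch (s) {
            case START: return isLetter(c) ? State.IN_ID : State.DEAD;
            case IN_ID: return (isLetter(c) || isDigit(c)) ? State.IN_ID : State.DEAD;
            default:    return State.DEAD;   // the dead state is never left
        }
    }

    static boolean accepts(String input) {
        State s = State.START;
        for (char c : input.toCharArray()) {
            s = next(s, c);
        }
        return s == State.IN_ID;             // IN_ID is the only accepting state
    }

    public static void main(String[] args) {
        for (String s : new String[] {"x33", "total", "3x", ""}) {
            System.out.println("\"" + s + "\" -> " + accepts(s));
        }
    }
}
```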
C. Implementation with Finite State Machine
• Example 2: Finite State Machine which accepts numeric
constants. D = (0-9).

[Figure: state graph for numeric constants; arcs are labelled D, '.', E and +,-, and a dead state is shown]

All unspecified transitions are to the "dead" state.
C. Implementation with Finite State Machine
• Example 3: Keyword recognizer (this example is not
completely specified)
Keywords:
if, import, int, for, float

[Figure: state graph that branches letter by letter: i-f, i-m-p-o-r-t, i-n-t, f-o-r, f-l-o-a-t]
Recap..
• What do you understand by the word “isolate” in lexical
analysis?
• What does the scanner do when it encounters identifiers?
• If there is a numeric constant such as 2.6e+7, what should
be done before it is stored?
• A programmer enters: while { for ++ . What does the lexical
analysis phase do?
• A - Print out a syntax error
• B - Put out four lexical tokens
• How do we simplify lexical analysis?
Exercise
1. Show a finite state machine that can accept these words: FOR,
FLOW, FRONT and FRIEND. Use a different accepting
state for each of these words.

2. Show a finite state machine which will recognize the
words RENT, RENEW, RED, RAID, RAG, and SENT. Use
a different accepting state for each of these words.

D. Lexical Tables
• One of the most important tasks in lexical
analysis: creation of tables.
• Tables could be:
• Symbol table
• Numeric constants table
• String constants table
• Statement labels table
• Line numbers
Implementation Techniques
1. SEQUENTIAL SEARCH
• Table organized as array or linked list
• Each time a word is encountered, the list is scanned; if the
word is not already in the list, it is added at the end.
• E.g.: frog, tree, hill, bird, bat, cat

Resulting table (in order of insertion):
frog
tree
hill
bird
bat
cat

Time required to build a table of n words: O(n^2)
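A sequential-search table can be sketched in a few lines. The code below is illustrative only (SequentialTable and lookupOrInsert are hypothetical names); it also counts comparisons, which shows where the O(n^2) build time comes from: inserting n distinct words costs 0 + 1 + ... + (n-1) comparisons.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sequential-search table with a comparison counter.
final class SequentialTable {
    private final List<String> words = new ArrayList<>();
    private long comparisons = 0;

    // Scans the list from the front; appends the word if it is not found.
    // Returns the index of the word in the table.
    int lookupOrInsert(String word) {
        for (int i = 0; i < words.size(); i++) {
            comparisons++;
            if (words.get(i).equals(word)) {
                return i;
            }
        }
        words.add(word);
        return words.size() - 1;
    }

    long comparisons() {
        return comparisons;
    }

    public static void main(String[] args) {
        SequentialTable table = new SequentialTable();
        for (String w : new String[] {"frog", "tree", "hill", "bird", "bat", "cat"}) {
            table.lookupOrInsert(w);
        }
        System.out.println(table.comparisons());   // 0+1+2+3+4+5 comparisons
    }
}
```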
Implementation Techniques
1. SEQUENTIAL SEARCH
• Advantage: easy to implement
• Disadvantage: not efficient when the number of words
becomes large
• Used for statement labels or constants, not used for
symbol tables

Implementation Techniques
2. BINARY SEARCH TREE
• Table organized as binary tree with:
• LEFT SUBTREE: any preceding word
• RIGHT SUBTREE: any following word
1. The tree starts empty; the first encountered word is
placed at the root
2. When a word w is encountered, w is compared with
the root. If:
• w < root --> left subtree
• w > root --> right subtree
• w = root --> already in tree
3. Repeat until w is found, or insert w at the last node reached.

Implementation Techniques

2. BINARY SEARCH TREE

• List 1: frog, tree, hill, bird, bat, cat
• List 2: bat, bird, cat, frog, hill, tree

List 1 (balanced tree):
          frog
         /    \
     bird      tree
    /    \     /
  bat    cat  hill
Time required to build: O(n log2 n)

List 2 (not balanced tree; each word becomes the right child of the previous one):
bat --> bird --> cat --> frog --> hill --> tree
Time required to build: O(n^2)
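The effect of insertion order on the tree shape can be checked with a short sketch. The class WordTree and its methods are hypothetical names; the code follows the three insertion steps listed earlier and reports the height of the resulting tree (number of nodes on the longest root-to-leaf path).

```java
// Illustrative unbalanced binary search tree of words.
final class WordTree {
    private static final class Node {
        final String word;
        Node left, right;
        Node(String word) { this.word = word; }
    }

    private Node root;

    // Returns true if the word was inserted, false if it was already present.
    boolean insert(String w) {
        if (root == null) {
            root = new Node(w);                    // first word goes at the root
            return true;
        }
        Node cur = root;
        while (true) {
            int cmp = w.compareTo(cur.word);
            if (cmp == 0) {
                return false;                      // already in tree
            } else if (cmp < 0) {
                if (cur.left == null) { cur.left = new Node(w); return true; }
                cur = cur.left;                    // preceding word: go left
            } else {
                if (cur.right == null) { cur.right = new Node(w); return true; }
                cur = cur.right;                   // following word: go right
            }
        }
    }

    int height() {
        return height(root);
    }

    private static int height(Node n) {
        return n == null ? 0 : 1 + Math.max(height(n.left), height(n.right));
    }

    // Convenience: build a tree from the words in order and return its height.
    static int heightOf(String... words) {
        WordTree tree = new WordTree();
        for (String w : words) {
            tree.insert(w);
        }
        return tree.height();
    }

    public static void main(String[] args) {
        System.out.println(heightOf("frog", "tree", "hill", "bird", "bat", "cat"));
        System.out.println(heightOf("bat", "bird", "cat", "frog", "hill", "tree"));
    }
}
```

List 1 gives a tree of height 3, while the already sorted List 2 degenerates into a chain of height 6, which is why its build time matches sequential search.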
Implementation Techniques
3. HASH TABLE
• Table organized as an array, or an array of linked lists
1. Start with an array of null pointers --> each is the head of a
linked list
2. When a word is to be added, a hash function is used to determine
in which list the word is stored
3. The chosen list is then searched sequentially until:
i. the word is found OR
ii. the end of the list is found --> the word is added to the list

Implementation Techniques
3. HASH TABLE
• Hash function: a function that takes the word as its argument
and returns a subscript into the array of pointers
• Example:
• take length of word + ASCII code of its first character
• divide by array size -> get remainder
• remainder is subscript to array
• Selecting a good hash function is very important
• Good hash function = high efficiency

Implementation Techniques
3. HASH TABLE
• take length of word + ASCII code of its first character
• divide by array size (6 here) -> get remainder
• remainder is subscript to array

hash(frog) = (4+102)%6 = 4
hash(tree) = (4+116)%6 = 0
hash(hill) = (4+104)%6 = 0
hash(bird) = (4+98)%6 = 0
hash(bat) = (3+98)%6 = 5
hash(cat) = (3+99)%6 = 0

Resulting table:
0: tree --> hill --> bird --> cat
1: (empty)
2: (empty)
3: (empty)
4: frog
5: bat
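The worked example can be reproduced with a chained hash table. This is a sketch under the reading that the "ascii code" in the example is that of the word's first character (which matches the arithmetic shown, e.g. 102 for 'f'); ChainedHashTable and its methods are hypothetical names, and words are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Illustrative hash table with one linked list (chain) per array slot.
final class ChainedHashTable {
    private final List<List<String>> buckets = new ArrayList<>();

    ChainedHashTable(int size) {
        for (int i = 0; i < size; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    // Example hash: (word length + code of first character) mod array size.
    // Assumes word is non-empty.
    static int hash(String word, int size) {
        return (word.length() + word.charAt(0)) % size;
    }

    // Returns true if the word was added, false if it was already stored.
    boolean insert(String word) {
        List<String> chain = buckets.get(hash(word, buckets.size()));
        if (chain.contains(word)) {        // sequential search of the chosen list
            return false;
        }
        chain.add(word);                   // end of list reached: append
        return true;
    }

    List<String> bucket(int index) {
        return buckets.get(index);
    }

    public static void main(String[] args) {
        ChainedHashTable table = new ChainedHashTable(6);
        for (String w : new String[] {"frog", "tree", "hill", "bird", "bat", "cat"}) {
            table.insert(w);
        }
        for (int i = 0; i < 6; i++) {
            System.out.println(i + ": " + table.bucket(i));
        }
    }
}
```

Four of the six words land in slot 0, so lookups there degrade to sequential search; this is the sense in which a good hash function matters for efficiency.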
Exercise
1. Show the binary search tree which would be constructed
to store each of the following lists of identifiers:
(a) minsky, babbage, turing, ada, boole,
pascal, vonneuman
(b) ada, babbage, boole, minsky, pascal,
turing, vonneuman

2. Show the hash table which would result for the following
identifiers using the example hash function from the hash table
slides (word length + ASCII code of the first character, modulo the array size):
bog, cab, bc, cb, h33, h22, cater

E. Lexical Analysis with Scanner
Generator
• SableCC – a utility program to improve programmers’
productivity
• Generates compilers from a set of specifications
• SableCC advantages:
• Takes good advantage of Java – object-oriented, with
extensive use of class inheritance
• Compilation errors are easier to fix
• Generates modular software – each class in a separate file
• Generates syntax trees – atoms / code can be generated
from them
• Accommodates a wider class of languages than JavaCC
SableCC – Input File
• Input to SableCC: a text file with the “grammar” file type.
• E.g.: filename.grammar
• There are six sections in a grammar file:
1. Package declarations
2. Helper declarations
3. States declarations
4. Token declarations
5. Ignored tokens
6. Productions
• For lexical analysis we use the Helpers, States and Tokens sections
SableCC – Input File
• The grammar file will be arranged like this:

Package package-name;

Helpers
[ Helper declarations, if any, go here ]
States
[ State declarations, if any, go here ]
Tokens
[ Token declarations go here ]

• All names (of helpers, states and tokens) must be written in lower case letters and underscores
• Token declarations are compulsory
SableCC – Token Declarations
All lexical tokens must be declared (given a name) and defined
Token declarations format:
• Token-name = Token-definition ;
• For example: left_paren = '(' ;
Token definition can be:
A. A character in single quotes, e.g. 'w', '$', '9'
B. A number representing the ASCII code for a character, e.g. 10 for line feed (newline) or 13 for carriage return
C. A set of characters:
1) range: first & last placed in brackets, e.g. ['a'..'z'], ['A'..'Z'], ['0'..'9']
2) union of two sets, e.g. [['a'..'z'] + ['A'..'Z']] // all letters
3) difference of two sets, e.g. [[0..127] - ['\t' + '\n']] // all ASCII except tab & newline
D. A string of characters in single quotes, e.g. 'while'
E. A regular expression
SableCC – Token Declarations
• (cont.) Regular expression operators:
• (p) -> parentheses determine the order of operations
• pq -> concatenation
• p|q -> union
• p* -> Kleene*, 0 or more instances of p
• p+ -> 1 or more instances of p
• p? -> 0 or 1 instances of p
• When two token definitions match the input, the one matching the
longer input string is selected
• When two token definitions match an input string of the same length,
the token listed first is selected
• With identifier listed first, the input while is returned as an identifier;
with keyword listed first, it is returned as a keyword

Version 1 (identifier listed first):
Tokens
identifier = ['a'..'z']+ ;
keyword = 'while' | 'for' | 'class' ;

Version 2 (keyword listed first):
Tokens
keyword = 'while' | 'for' | 'class' ;
identifier = ['a'..'z']+ ;
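The two selection rules (longest match first, then first-listed on a tie) can be imitated outside SableCC. The sketch below is not SableCC's implementation; it uses java.util.regex patterns as stand-ins for token definitions, and LongestMatchDemo, definitions and classify are hypothetical names. One caveat: Matcher.lookingAt returns the match preferred by Java's backtracking engine, which is the longest prefix for the simple patterns used here but not for every regular expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative "longest match, then first listed" token selection.
final class LongestMatchDemo {
    // Builds an ordered name -> regex map from alternating name/pattern arguments.
    static Map<String, String> definitions(String... nameThenPattern) {
        Map<String, String> defs = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameThenPattern.length; i += 2) {
            defs.put(nameThenPattern[i], nameThenPattern[i + 1]);
        }
        return defs;
    }

    // Returns the name of the winning definition at the start of input,
    // or null if no definition matches.
    static String classify(Map<String, String> defs, String input) {
        String best = null;
        int bestLength = 0;
        for (Map.Entry<String, String> def : defs.entrySet()) {
            Matcher m = Pattern.compile(def.getValue()).matcher(input);
            // Strict '>' keeps the earlier definition when the lengths tie.
            if (m.lookingAt() && m.end() > bestLength) {
                best = def.getKey();
                bestLength = m.end();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, String> v1 = definitions("identifier", "[a-z]+", "keyword", "while|for|class");
        Map<String, String> v2 = definitions("keyword", "while|for|class", "identifier", "[a-z]+");
        System.out.println(classify(v1, "while"));    // tie: first listed wins
        System.out.println(classify(v2, "while"));    // tie: first listed wins
        System.out.println(classify(v2, "whilex"));   // longer match wins
    }
}
```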
SableCC – Helper Declarations
• Helpers: simplify token definition with macro capability
• Any helper which is defined in the Helpers section may be
used as part of a token definition in the Tokens section.
Helpers
digit = ['0'..'9'] ;
letter = [['a'..'z'] + ['A'..'Z']] ;
sign = '+' | '-' ;
newline = 10 | 13 ; // ascii codes
tab = 9 ; // ascii code for tab
Tokens
number = sign? digit+ ; // A number is an optional
                        // sign, followed by 1 or more
                        // digits.
identifier = letter (letter | digit | '_')* ;
                        // An identifier is a letter
                        // followed by 0 or more
                        // letters, digits,
                        // underscores.
space = ' ' | newline | tab ;
SableCC – State Declarations
• State: puts the lexical scanner in a different state
• Example: when the scanner sees a comment:
• // this is a comment
• We want the scanner to go into a different state when it
sees ‘//’ and change back to the original state at the end of the line.
• State declarations:
States
statename1, statename2, … ;

• Under the States section we just declare the names of the
states; the first state is the starting state.
• Changing states: use the transition operator ‘->’
SableCC – State Declarations
• We apply states in the Tokens section using curly braces.

{statename} token = def ;
// apply this definition only if the scanner is
// in state statename (and remain in that state)

{statename->newstate} token = def ;
// apply this definition only if the scanner is
// in statename, and change the state to
// newstate.

SableCC – State Declarations
The following example is taken from the SableCC web site. Its
purpose is to make the scanner toggle back and forth between
two states depending on whether it is at the beginning of a
line in the input.
States
bol, inline; // Declare the state names. bol is
             // the start state.
Tokens
{bol->inline, inline} char = [[0..0xffff] - [10 + 13]];
             // Scanning a non-newline char. Apply
             // this in either state. New state is
             // inline.
{bol, inline->bol} eol = 10 | 13 | 10 13;
             // Scanning a newline char. Apply this in
             // either state. New state is bol.
Example of SableCC Input File
lexing.grammar

Running SableCC
1. Prepare your files (refer to SableCC_Files.doc):
1) lexing.grammar
2) Lexing.java

2. Invoke SableCC as shown below:

sablecc lexing.grammar

This will produce a sub-directory with the same name as the
language being compiled. All the generated Java code is placed
in this sub-directory.

3. The second step required to generate the scanner is to compile
these Java classes. First copy the Lexing.java file to your lexing
sub-directory, then compile:

javac lexing/*.java
Running SableCC
4. We have now generated the scanner in lexing.Lexing.class. To execute
the scanner:

java lexing.Lexing

This will read from the standard input file (keyboard) and should display
tokens as they are recognized. Use the end-of-file character to terminate
the input (ctrl-z for Windows/DOS). A sample session is shown below:

java lexing.Lexing
sum = sum + salary ;

Identifier: sum
Unknown =
Identifier: sum
Arith Op: +
Identifier: salary
Unknown ;
Exercise 1

Exercise 2
