Topic 3
Scanner Generator
A. Formal Language
• Formal Language is critical to the study of programming
languages and compilers
• Formal Language: A language which can be defined by
a precise specification and is amenable for use with
computers
• E.g: the syntax of Java, Pascal, etc.
• Natural Language: A language used by people, which
cannot be defined perfectly with a precise
specification system
• E.g: languages in the world (English, Malay, etc)
1. Language Elements
- Collection of unique objects
2. Finite State Machine
Definition:
A theoretical machine consisting of a
finite set of states, a finite input
alphabet, and a state transition function
which specifies the machine's next state,
given its present state and the current
input.
[Figure: state graph with the starting state and the accepting (final) state labelled]
• How Finite State Machine works:
1. Input: string of symbols from input alphabet
2. Machine: initially in starting state
3. Symbols are read from input string
4. Machine changes state based on transition
function on input symbols
5. When all inputs have been read, machine is
either in:
• Accepting state → input string has been accepted
• Non-accepting state → input string has been rejected
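The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `run_fsm` and the dictionary layout are assumptions, and the sample machine is the even-number-of-ones machine of Example 1.

```python
def run_fsm(transitions, start, accepting, text):
    """Return True if the machine accepts `text`, False otherwise."""
    state = start                                # step 2: begin in the starting state
    for symbol in text:                          # step 3: read the input symbols in order
        state = transitions[(state, symbol)]     # step 4: apply the transition function
    return state in accepting                    # step 5: accepted only in an accepting state


# Transition function for the even-number-of-ones machine (states A and B).
EVEN_ONES = {
    ("A", "0"): "A", ("A", "1"): "B",
    ("B", "0"): "B", ("B", "1"): "A",
}

print(run_fsm(EVEN_ONES, "A", {"A"}, "0110"))   # two ones   -> True
print(run_fsm(EVEN_ONES, "A", {"A"}, "0111"))   # three ones -> False
```

Storing the transition function as a dictionary keyed by (state, symbol) mirrors the state table form: each key is one cell of the table.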
• State graph
• Components:
• State
• Transition function (an arc labelled with an input symbol x)
• Starting state
• Accepting state
• State table
• Components
• State names → rows
• Input symbols → columns
• Each entry → next state of the machine
given an input
• Starting state → the first row
• Accepting state → marked with an asterisk (*)
• We will only be using machines that
have exactly one arc leaving each state
for each possible input symbol
• This type of machine is called:
deterministic finite state machine
• Example 1: this machine accepts any string of zeros and ones
which contains an even number of ones
[Figure: state graph with states A (starting and accepting) and B; each state loops on 0, and arcs labelled 1 connect A to B and B to A]
State table:
      0   1
*A    A   B
 B    B   A
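• Example 2: strings containing an odd number of zeros (input
alphabet is {0, 1})
[Figure: state graph with states A (starting) and B (accepting); each state loops on 1, and arcs labelled 0 connect A to B and B to A]
State table:
      0   1
 A    B   A
*B    A   B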
• Example 3: Strings containing three consecutive ones (the
input alphabet is {0,1})
[Figure: state graph and state table with states A, B, C and D; arcs are labelled 0 and 1]
Exercises
Show a finite state machine in either state graph or table
form for each of the following languages (in each case the
input alphabet is {0,1}):
3. Set of all strings that start with ‘0’.
3. Regular Expressions
• Another method for specifying or describing
languages.
• Used to represent certain sets of strings in
algebraic form.
• The expressions use three possible
operations:
1. UNION
2. CONCATENATION
3. KLEENE*
UNION
• Union of two sets = set which contains all
elements in the two sets
• Designated with ‘+’
• { abc, ab, ba } + { ba, bb } = { abc, ab, ba, bb}
• L+{ }=L
CONCATENATION
• Combine two strings to form a new string
• Designated with ‘ . ’
• { ab, a, c } . { b, ε } = { ab.b, ab. ε, a.b, a. ε, c.b,
c. ε } = { abb, ab, a, cb, c }
• L.{ε}=L
• L.{ }={ }
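Union and concatenation can be tried out directly with Python sets of strings. The sketch below is only an illustration: the helper names `union` and `concat` are assumptions, and the empty string stands in for ε.

```python
EPSILON = ""   # the empty string plays the role of ε


def union(a, b):
    """L1 + L2: every string that is in either language."""
    return a | b


def concat(a, b):
    """L1 . L2: every string x followed by every string y."""
    return {x + y for x in a for y in b}


print(sorted(union({"abc", "ab", "ba"}, {"ba", "bb"})))
# ['ab', 'abc', 'ba', 'bb']
print(sorted(concat({"ab", "a", "c"}, {"b", EPSILON})))
# ['a', 'ab', 'abb', 'c', 'cb']
```

The two identities on the slide also fall out of the definitions: concatenating with {ε} leaves a language unchanged, while concatenating with the empty set leaves no pairs to combine, so the result is empty.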
KLEENE*
• Unary operation, often called closure
• Designated with ‘ * ’
• If L is a language:
• L^0 = { ε }
• L^1 = L
• L^2 = L . L^1
• L^n = L . L^(n-1)
• L* = L^0 + L^1 + L^2 + L^3 + L^4 + …
• Example 4: L = { 0, 1 }. Find L*
L^0 = { ε }
L^1 = { 0, 1 }
L^2 = L . L^1 = { 0, 1 } . { 0, 1 } = { 00, 01, 10, 11 }
L^3 = L . L^2 = { 0, 1 } . { 00, 01, 10, 11 } = { 000, 001, 010, 011, 100, 101, 110, 111 }
L* = { ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, 100, 101, 110, 111, 0000, … }
L* = the set of all strings of zeros and ones
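Because L* is infinite, a program can only enumerate it up to some power. The following sketch computes L^n by repeated concatenation and collects the union of L^0 through L^max_n; the function names and the cut-off parameter `max_n` are assumptions made for illustration.

```python
def concat(a, b):
    """L1 . L2 for two sets of strings."""
    return {x + y for x in a for y in b}


def power(lang, n):
    """L^n, using L^0 = {ε} and L^n = L . L^(n-1)."""
    result = {""}
    for _ in range(n):
        result = concat(lang, result)
    return result


def kleene_up_to(lang, max_n):
    """Finite approximation of L*: L^0 + L^1 + ... + L^max_n."""
    strings = set()
    for n in range(max_n + 1):
        strings |= power(lang, n)
    return strings


L = {"0", "1"}
print(sorted(power(L, 2)))        # ['00', '01', '10', '11']
print(len(kleene_up_to(L, 3)))    # 1 + 2 + 4 + 8 = 15 strings
```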
• Shorthand Notation
• Regular Expression uses shorthand notation
• If x is a character in the input alphabet, then
x = { "x" }: the character x represents the set
consisting of only one string, which is x.
• Example:
• 0 + 1 = { 0 } + { 1 } = { 0, 1 }
• 0 + ε = { 0, ε }
• Precedence (highest first)
1. ( )
2. Kleene*
3. Concatenation
4. Union
• Identities
εR = Rε = R
(R*)* = R*
R*R* = R*
R + R = R
(P+Q)R = PR + QR
(P+Q)* = (P*Q*)* = (P*+Q*)*
• Arden's Theorem:
if R = Q + RP (and P does not contain ε), then
R = QP*
• Example 5: 1 . ( 0 + 1 )* . 0
= { 1 } . { 0, 1 }* . { 0 }
= { 10, 100, 110, 1000, 1010, 1100, 1110, … }
= the set of all strings of zeros and ones which begin with a 1 and end with
a 0
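Example 5 can be cross-checked with Python's `re` module. Note that Python's regular expression syntax differs from the notation on these slides: union is written with a character class (or `|`) rather than `+`, and concatenation is written by juxtaposition. The helper name `matches_up_to` is an assumption for this sketch.

```python
import re
from itertools import product

# 1 . (0+1)* . 0 in Python's regex syntax
PATTERN = re.compile(r"1[01]*0")


def matches_up_to(max_len):
    """List every string over {0,1} of length <= max_len that the pattern accepts."""
    accepted = []
    for n in range(max_len + 1):
        for chars in product("01", repeat=n):
            candidate = "".join(chars)
            if PATTERN.fullmatch(candidate):
                accepted.append(candidate)
    return accepted


print(matches_up_to(4))
# ['10', '100', '110', '1000', '1010', '1100', '1110']
```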
Recap…
• What is Formal Language?
• One example of Formal Language?
• What is Natural Language?
• One example of Natural Language?
• How do we represent Finite State Machine?
• How do we represent accepting state for above?
• What are the three operations we can do on Regular
Expressions?
• List the precedence of Regular Expression operations
Exercise
Describe the language denoted by each regular expression below and draw
its finite state machine.
1. a(bc)*
2. (a+b)*c
3. (a*b) + (cd*)
B. Lexical Tokens
• Lexical Analysis: attempts to isolate words in an
input string
• WORD: aka lexeme, lexical item, lexical token:
a string of input characters taken as a unit and
passed on to the next phase of compilation
Lexical Analysis
Lexical Analysis – Symbol Table
• Symbol Table: A data structure used to store
identifiers and possibly other lexical entities
during compilation.
Things to remember about Symbol Table:
Lexical Analysis
• Lexical analysis does not check for appropriate
syntax.
• E.g: } while if ( {
• the lexical phase would put out five tokens without reporting an error
• If a source language is not case sensitive, scanner
must accommodate this feature.
• E.g: then = THEN = tHeN = Then
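Both points can be seen in a toy scanner. The sketch below is not the scanner used in this course: the keyword set, the token class names and the `case_sensitive` flag are all assumptions chosen for illustration. It isolates words and punctuation without checking whether their order makes sense, and it folds case before the keyword lookup when the language is not case sensitive.

```python
import re

KEYWORDS = {"while", "if", "then"}                 # hypothetical keyword set
TOKEN_RE = re.compile(r"\s*([A-Za-z]+|[{}()])")    # a word, or one bracket character


def scan(source, case_sensitive=True):
    """Split `source` into (class, value) tokens; no syntax checking is done."""
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        word = match.group(1)
        pos = match.end()
        key = word if case_sensitive else word.lower()
        if key in KEYWORDS:
            tokens.append(("keyword", key))
        elif word.isalpha():
            tokens.append(("identifier", word))
        else:
            tokens.append(("punctuation", word))
    return tokens


print(len(scan("} while if ( {")))            # 5 tokens, although the syntax is nonsense
print(scan("tHeN", case_sensitive=False))     # [('keyword', 'then')]
```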
Lexical Analysis - Output
• Output: a stream of tokens, each with:
A. class
B. value
C. Implementation with Finite
State Machine
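• Example 2: Finite State Machine which accepts numeric
constants. D = a digit (0-9).
[Figure: state graph with arcs labelled D, '.', E and '+, -', and a dead state]
[Figure: state graph recognizing the keywords import, int, for and float, one letter per arc]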
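A table-driven version of a numeric-constant machine might look like the sketch below. The exact states of the slide's figure are not recoverable from the text, so the state names and the set of accepting states here are assumptions: this is one plausible machine for constants of the form digits, optional fraction, optional exponent. Any missing transition leads to the dead state, from which there is no way back.

```python
def classify(ch):
    """Map an input character to the arc labels used on the slide."""
    if ch in "0123456789":
        return "D"
    if ch == ".":
        return "."
    if ch in "Ee":
        return "E"
    if ch in "+-":
        return "S"        # sign; the slide labels this arc '+, -'
    return "other"


# Hypothetical state names; every pair not listed here goes to "dead".
TRANSITIONS = {
    ("start", "D"): "int",
    ("int", "D"): "int", ("int", "."): "point", ("int", "E"): "exp",
    ("point", "D"): "frac",
    ("frac", "D"): "frac", ("frac", "E"): "exp",
    ("exp", "S"): "sign", ("exp", "D"): "expdigits",
    ("sign", "D"): "expdigits",
    ("expdigits", "D"): "expdigits",
}
ACCEPTING = {"int", "frac", "expdigits"}


def accepts(text):
    state = "start"
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)), "dead")
    return state in ACCEPTING


print(accepts("2.6E+7"))   # True
print(accepts("2."))       # False: a digit must follow the point
```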
Recap..
• What do you understand by the word “isolate” in lexical
analysis?
• What does scanner do when it encounters identifiers?
• If there’s numeric constant like this: 2.6e+7, what should
symbol table do?
• Programmer enters: while { for ++ , what does the lexical
analysis phase do?
• A - Print out syntax error
• B - Put out four lexical tokens
• How do we simplify lexical analysis?
Exercise
1. Show a finite state machine that accepts the words FOR,
FLOW, FRONT and FRIEND. Use a different accepting
state for each of these words.
D. Lexical Tables
• One of the most important tasks in lexical
analysis: creation of tables.
• Tables could be:
• Symbol table
• Numeric constants table
• String constants table
• Statement labels table
• Line numbers
Implementation Techniques
1. SEQUENTIAL SEARCH
• Table organized as array or linked list
• Each time a word is encountered, list is scanned and if
word is not already in the list it will be added at the end.
• E.g: frog, tree, hill, bird, bat, cat
• Resulting table, in order of arrival: frog, tree, hill, bird, bat, cat
• Time required to build a table of n words: O(n^2)
• Advantage: easy to implement
• Disadvantage: not efficient when the number of words
becomes large
• Used for statement labels or constants, not used for
symbol tables
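A sequential-search table is short to write, and counting comparisons shows where the O(n^2) build time comes from. This is a minimal sketch; the function name `build_table` and the comparison counter are assumptions added for illustration.

```python
def build_table(words):
    """Add each new word to the end of a list, scanning the list first."""
    table = []
    comparisons = 0
    for word in words:
        found = False
        for entry in table:          # scan the whole table so far
            comparisons += 1
            if entry == word:
                found = True
                break
        if not found:
            table.append(word)       # new word goes at the end
    return table, comparisons


table, comparisons = build_table(["frog", "tree", "hill", "bird", "bat", "cat"])
print(table)          # ['frog', 'tree', 'hill', 'bird', 'bat', 'cat']
print(comparisons)    # 0 + 1 + 2 + 3 + 4 + 5 = 15
```

For n distinct words the scan costs 0 + 1 + ... + (n-1) = n(n-1)/2 comparisons, which grows as n^2.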
2. BINARY SEARCH TREE
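• Table organized as binary tree with:
• LEFT SUBTREE: words that precede the word at the node
• RIGHT SUBTREE: words that follow the word at the node
1. The tree starts empty; the first word encountered is
placed at the root
2. When a word w is encountered, w is compared with the
root. If:
• w < root --> continue in the left subtree
• w > root --> continue in the right subtree
• w = root --> already in tree
3. The comparison repeats down the tree until w is found or
an empty position is reached, where w is inserted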
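The insertion rule can be sketched as a small recursive function. The class and function names below are assumptions, and the word list is borrowed from the sequential search example for continuity.

```python
class Node:
    def __init__(self, word):
        self.word = word
        self.left = None     # words that precede self.word
        self.right = None    # words that follow self.word


def insert(root, word):
    """Insert `word` below `root` and return the (possibly new) root."""
    if root is None:
        return Node(word)                        # empty position: place the word here
    if word < root.word:
        root.left = insert(root.left, word)
    elif word > root.word:
        root.right = insert(root.right, word)
    return root                                  # word == root.word: already in tree


def in_order(root):
    """Left subtree, node, right subtree: yields the words in sorted order."""
    if root is None:
        return []
    return in_order(root.left) + [root.word] + in_order(root.right)


root = None
for w in ["frog", "tree", "hill", "bird", "bat", "cat"]:
    root = insert(root, w)

print(root.word)         # frog
print(in_order(root))    # ['bat', 'bird', 'cat', 'frog', 'hill', 'tree']
```

The shape of the tree depends on the order in which the words arrive: words that arrive already sorted produce a tree that degenerates into a linked list.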
3. HASH TABLE
• Hash function: a function that takes the word as its argument
and returns a subscript into an array of pointers
• Example:
• take length of word + ASCII code of its first character
• divide by array size -> get remainder
• remainder is subscript to array
• Selecting a good hash function is very important
• Good hash function = high efficiency
hash(frog) = (4+102)%6 = 4
hash(tree) = (4+116)%6 = 0
hash(hill) = (4+104)%6 = 0
hash(bird) = (4+98)%6 = 0
hash(bat) = (3+98)%6 = 5
hash(cat) = (3+99)%6 = 0
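The example hash function translates directly into code. In the sketch below, `hash_word` follows the slide (length of the word plus the ASCII code of its first character, modulo the array size). Resolving collisions by chaining words in a list per slot is an assumption, suggested by the "array of pointers" wording, and the function names are illustrative.

```python
TABLE_SIZE = 6


def hash_word(word, size=TABLE_SIZE):
    """Length of the word + ASCII code of its first character, modulo array size."""
    return (len(word) + ord(word[0])) % size


def build_hash_table(words, size=TABLE_SIZE):
    """Array of buckets; words that collide share a bucket (chaining, assumed)."""
    table = [[] for _ in range(size)]
    for word in words:
        bucket = table[hash_word(word, size)]
        if word not in bucket:
            bucket.append(word)
    return table


table = build_hash_table(["frog", "tree", "hill", "bird", "bat", "cat"])
for slot, bucket in enumerate(table):
    print(slot, bucket)
# slot 0 holds tree, hill, bird and cat; slot 4 holds frog; slot 5 holds bat
```

Four of the six words land in slot 0, which illustrates why the choice of hash function matters: this one spreads the words poorly.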
Exercise
1. Show the binary search tree which would be constructed
to store each of the following lists of identifiers:
(a) minsky, babbage, turing, ada, boole,
pascal, vonneuman
(b) ada, babbage, boole, minsky, pascal,
turing, vonneuman
2. Show the hash table which would result for the following
identifiers using the example hash function from the HASH TABLE section:
bog, cab, bc, cb, h33, h22, cater
E. Lexical Analysis with Scanner
Generator
• SableCC: a utility program that improves programmers'
productivity
• Generates compilers from a set of specifications
• SableCC advantages:
• Takes good advantage of Java: object-oriented, with
extensive use of class inheritance
• Compilation errors are easier to fix
• Generates modular software, with each class in a separate file
• Generates syntax trees, from which atoms / code can be
generated
• Accommodates a wider class of languages than JavaCC
SableCC – Input File
• Input to SableCC: a text file with the ".grammar" file extension.
• E.g: filename.grammar
• There are six sections in grammar file:
1. Package declarations
2. Helper declarations
3. States declarations
4. Token declarations
5. Ignored tokens
6. Productions
• For lexical analysis we use: Helper, States and Token
• The grammar file will be arranged like this:
Package package-name;
Helpers
[ Helper declarations, if any, go here ]
States
[ State declarations, if any, go here ]
Tokens
[ Token declarations go here ]
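Two orderings of the same token declarations (when both definitions match the same input, the one listed first takes precedence):
Tokens
identifier = ['a'..'z']+ ;
keyword = 'while' | 'for' | 'class' ;

Tokens
keyword = 'while' | 'for' | 'class' ;
identifier = ['a'..'z']+ ;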
SableCC – Helper Declarations
• Helpers: simplify token definition with macro capability
• Any helper which is defined in the Helpers section may be
used as part of a token definition in the Tokens section.
Helpers
digit = ['0'..'9'] ;
letter = [['a'..'z'] + ['A'..'Z']] ;
sign = '+' | '-' ;
newline = 10 | 13 ; // ascii codes
tab = 9 ; // ascii code for tab
Tokens
number = sign? digit+ ; // A number is an optional
// sign, followed by 1 or more
// digits.
identifier = letter (letter | digit | '_')* ;
// An identifier is a letter
// followed by 0 or more
// letters, digits, or
// underscores.
space = ' ' | newline | tab ;
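For readers more familiar with conventional regular expression syntax, the three token definitions above correspond roughly to the Python patterns below. This is only a comparison aid, not SableCC output; the function name `classify` is an assumption.

```python
import re

NUMBER = re.compile(r"[+-]?[0-9]+")                  # sign? digit+
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_]*")    # letter (letter | digit | '_')*
SPACE = re.compile(r"[ \n\r\t]")                     # ' ' | newline | tab


def classify(lexeme):
    """Return the name of the first token definition that matches the whole lexeme."""
    if NUMBER.fullmatch(lexeme):
        return "number"
    if IDENTIFIER.fullmatch(lexeme):
        return "identifier"
    if SPACE.fullmatch(lexeme):
        return "space"
    return "unknown"


print(classify("-42"))      # number
print(classify("sum_1"))    # identifier
print(classify("1abc"))     # unknown: an identifier must start with a letter
```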
SableCC – State Declarations
• State: puts the lexical scanner in a different state
• Example: when the scanner sees a comment:
• // this is a comment
• We want the scanner to go into a different state when it
sees '//' and return to its original state at the end of the line.
• State declarations:
States
statename1, statename2, ... ;
The following example is taken from the SableCC web site. Its
purpose is to make the scanner toggle back and forth between
two states depending on whether it is at the beginning of a
line in the input.
States
bol, inline; // Declare the state names. bol is
// the start state.
Tokens
{bol->inline, inline} char = [[0..0xfff] - [10 + 13]];
// Scanning a non-newline char. Apply
// this in either state. New state is
// inline.
{bol, inline->bol} eol = 10 | 13 | 10 13;
// Scanning a newline char. Apply this in
// either state. New state is bol.
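The effect of the two state-qualified token definitions can be imitated by hand. The sketch below is a hand-written Python analogue, not code generated by SableCC: it records, for each token, the state the scanner was in when the token was recognized, and the function name `scan` is an assumption.

```python
def scan(text):
    """Return (token, state-when-scanned) pairs, toggling between bol and inline."""
    state = "bol"                      # bol is the start state
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in "\n\r":
            # eol = 10 | 13 | 10 13 : prefer the two-character form when present
            if text[i] == "\n" and text[i + 1:i + 2] == "\r":
                lexeme = text[i:i + 2]
            else:
                lexeme = text[i]
            tokens.append(("eol", state))
            state = "bol"              # after a newline: beginning of line
        else:
            lexeme = text[i]
            tokens.append(("char", state))
            state = "inline"           # after any other char: inside a line
        i += len(lexeme)
    return tokens


print(scan("ab\nc"))
# [('char', 'bol'), ('char', 'inline'), ('eol', 'inline'), ('char', 'bol')]
```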
Example of SableCC Input File
lexing.grammar
Running SableCC
1. Prepare your files (refer to SableCC_Files.doc):
1) lexing.grammar
2) Lexing.java
sablecc lexing.grammar
java lexing.Lexing
This will read from the standard input file (keyboard) and should display
tokens as they are recognized. Use the end-of-file character to terminate
the input ( ctrl-z for Windows/DOS). A sample session is shown below:
java lexing.Lexing
sum = sum + salary ;
Identifier: sum
Unknown =
Identifier: sum
Arith Op: +
Identifier: salary
Unknown ;
Exercise 1
Exercise 2