Compilers: CS414-2017S-01 Compiler Basics & Lexical Analysis
CS414-2017S-01
Compiler Basics & Lexical Analysis
David Galles
Simplified View
01-5: What is a compiler?
[Figure: Source File -> Lexical Analyzer -> Token Stream -> Parser -> Abstract Syntax Tree -> Semantic Analyzer -> Assembly Tree Generator -> Abstract Assembly Tree -> Code Generator -> Assembly -> Assembler -> Relocatable Object Code -> Linker (with Libraries) -> Machine code]
[Figure: compiler pipeline diagram repeated, with a "Back End" label]
01-7: What is a compiler?
[Figure: compiler pipeline diagram repeated, with a "Covered in this course" label]
01-8: Why Use Decomposition?
01-9: Why Use Decomposition?
Software Engineering!
void main() {
print(4);
}
01-11: Lexical Analysis
Converting input file to stream of tokens
...
01-13: Lexical Analysis
Brute-Force Approach
Break the input file into words, separated by
spaces or tabs
This can be tricky – not all tokens are separated
by whitespace
Use string comparison to determine tokens
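To see why the brute-force approach is tricky, the following minimal sketch splits a line on spaces and tabs only. The class and method names are invented for this illustration.

```java
import java.util.Arrays;

// Hypothetical sketch of the brute-force approach: split on spaces and tabs only.
final class WhitespaceSplit {
    static String[] words(String line) {
        return line.trim().split("[ \t]+");
    }

    public static void main(String[] args) {
        // "main()" comes back as a single word, although it holds three tokens.
        System.out.println(Arrays.toString(words("void main() {")));
        // "print(4);" is one word, although it holds five tokens.
        System.out.println(Arrays.toString(words("print(4);")));
    }
}
```

Splitting on whitespace leaves "main()" and "print(4);" as single words, so string comparison alone cannot identify the tokens inside them.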
01-14: Deterministic Finite Automata
Set of states
Initial State
Final State(s)
Transitions
Combine DFA
01-15: DFAs and Lexical Analyzers
Given a DFA, it is easy to create C code to
implement it
DFAs are easier to understand than C code
Visual – almost like structure charts
... However, creating a DFA for a complete lexical
analyzer is still complex
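As a small illustration of how directly a DFA maps to code, the sketch below hand-codes a DFA for strings of the form [a-z][a-z0-9]*. It is written in Java rather than C to match the later JavaCC material, and the state names and class name are assumptions made for this example.

```java
// Hypothetical example: a hand-coded DFA that accepts [a-z][a-z0-9]*.
final class IdentifierDfa {
    private static final int START = 0;  // initial state
    private static final int IN_ID = 1;  // final (accepting) state
    private static final int DEAD  = 2;  // reached when no transition is defined

    // One transition: (current state, input character) -> next state.
    static int step(int state, char c) {
        boolean letter = c >= 'a' && c <= 'z';
        boolean digit  = c >= '0' && c <= '9';
        switch (state) {
            case START: return letter ? IN_ID : DEAD;
            case IN_ID: return (letter || digit) ? IN_ID : DEAD;
            default:    return DEAD;
        }
    }

    // Run the DFA over the whole input; accept if it ends in the final state.
    static boolean accepts(String input) {
        int state = START;
        for (int i = 0; i < input.length(); i++) {
            state = step(state, input.charAt(i));
        }
        return state == IN_ID;
    }

    public static void main(String[] args) {
        System.out.println(accepts("x1"));  // true
        System.out.println(accepts("1x"));  // false
    }
}
```

Even for this one token the switch statement is harder to read than the two-state diagram it encodes, which is the point made above about DFAs versus code.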
01-16: Automatic Creation of DFAs
We’d like a tool:
For example:
{fire, truck, car} {car, dog} =
{firecar, firedog, truckcar, truckdog, carcar, cardog}
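The example above concatenates two finite languages: every string of the first set is followed by every string of the second. A minimal sketch of that operation, with an invented class name, could look like this:

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: concatenation of two finite languages.
final class LanguageConcat {
    // Returns { xy : x in l1, y in l2 }.
    static Set<String> concat(Set<String> l1, Set<String> l2) {
        Set<String> result = new LinkedHashSet<>();
        for (String x : l1) {
            for (String y : l2) {
                result.add(x + y);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> a = new LinkedHashSet<>(Arrays.asList("fire", "truck", "car"));
        Set<String> b = new LinkedHashSet<>(Arrays.asList("car", "dog"));
        // Prints [firecar, firedog, truckcar, truckdog, carcar, cardog]
        System.out.println(concat(a, b));
    }
}
```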
01-20: Kleene Closure
Given a formal language L:
L^0 = {ε}
L^1 = L
L^2 = LL
L^3 = LLL
L^4 = LLLL
L* = L^0 ∪ L^1 ∪ L^2 ∪ ... ∪ L^n ∪ ...
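The full closure is infinite for any language containing a non-empty string, so code can only enumerate a finite prefix of the union. The sketch below builds the union of the powers of L from 0 up to a chosen bound; the bound parameter and all names are assumptions made for this illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch: the union of the powers of L from 0 to maxPower.
final class KleeneClosure {
    static Set<String> closureUpTo(Set<String> l, int maxPower) {
        Set<String> power = new LinkedHashSet<>();
        power.add("");                       // power 0 of L is { empty string }
        Set<String> result = new LinkedHashSet<>(power);
        for (int n = 1; n <= maxPower; n++) {
            Set<String> next = new LinkedHashSet<>();
            for (String x : power) {
                for (String y : l) {
                    next.add(x + y);         // power n is (power n-1) concatenated with L
                }
            }
            power = next;
            result.addAll(power);
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> l = new LinkedHashSet<>(Arrays.asList("a", "b"));
        // Prints [, a, b, aa, ab, ba, bb]
        System.out.println(closureUpTo(l, 2));
    }
}
```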
01-21: Regular Expressions
Regular expressions are used to describe formal
languages over an alphabet Σ:
Kleene Closure *
Concatenation
Alternation |
ab*c|e = (a(b*)c) | e
01-23: Regular Expression Examples
[a,b,c,d] = (a|b|c|d)
[d-g] = [d,e,f,g] = (d|e|f|g)
[d-f,M-O] = [d,e,f,M,N,O]
= (d|e|f|M|N|O)
(α)? = Optionally α (i.e., (α | ε))
(α)+ = α(α)*
01-26: Regular Expressions & Unix
Many unix tools use regular expressions
Example: grep '<reg exp>' filename
Prints all lines that contain a match to the
regular expression
Special characters:
^ beginning of line
$ end of line
(grep examples on other screen)
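The behaviour described above can be imitated in a few lines of Java. This is only a sketch with invented names; java.util.regex syntax is not identical to grep's, although ^ and $ behave the same way on single lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of grep: keep the lines that contain a match.
final class MiniGrep {
    static List<String> grep(String regex, List<String> lines) {
        Pattern p = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {   // find() looks for a match anywhere in the line
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("abc", "cab", "b");
        System.out.println(grep("ab", lines));   // [abc, cab]
        System.out.println(grep("^ab", lines));  // [abc]
        System.out.println(grep("ab$", lines));  // [cab]
    }
}
```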
01-27: JavaCC Regular Expressions
All characters & strings must be in quotation marks
"else"
"+"
("a"|"b")
All regular expressions involving * must be
parenthesized
("a")*, not "a"*
01-28: JavaCC Shorthand
["a","b","c","d"] = ("a"|"b"|"c"|"d")
["d"-"g"] = ["d","e","f","g"] = ("d"|"e"|"f"|"g")
["d"-"f","M"-"O"] = ["d","e","f","M","N","O"]
= ("d"|"e"|"f"|"M"|"N"|"O")
(α)? = Optionally α (i.e., (α | ε))
(α)+ = α(α)*
(~["a","b"]) = Any character except “a” or “b”.
Can only be used with [] notation
~(a(a|b)*b) is not legal
01-29: r.e. Shorthand Examples
Regular Expression Language
{if}
Set of legal identifiers
Set of integer literals
(leading zeroes allowed)
Set of real literals
01-30: r.e. Shorthand Examples
Regular Expression                        Language
"if"                                      {if}
["a"-"z"](["0"-"9","a"-"z"])*             Set of legal identifiers
(["0"-"9"])+                              Set of integer literals
                                          (leading zeroes allowed)
((["0"-"9"])+ "." (["0"-"9"])*) |         Set of real literals
((["0"-"9"])* "." (["0"-"9"])+)
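To experiment with these languages outside JavaCC, the same expressions can be approximated with java.util.regex. The patterns below are a hand translation into Java regex syntax, not JavaCC syntax, and the class and method names are invented for this sketch.

```java
// Hypothetical sketch: the identifier, integer and real languages as Java regexes.
final class ShorthandPatterns {
    static final String IDENTIFIER = "[a-z][0-9a-z]*";
    static final String INTEGER    = "[0-9]+";
    static final String REAL       = "([0-9]+\\.[0-9]*)|([0-9]*\\.[0-9]+)";

    // String.matches succeeds only if the whole string is in the language.
    static boolean isIdentifier(String s) { return s.matches(IDENTIFIER); }
    static boolean isInteger(String s)    { return s.matches(INTEGER); }
    static boolean isReal(String s)       { return s.matches(REAL); }

    public static void main(String[] args) {
        System.out.println(isIdentifier("x9"));  // true
        System.out.println(isInteger("007"));    // true: leading zeroes allowed
        System.out.println(isReal("3."));        // true
        System.out.println(isReal(".5"));        // true
        System.out.println(isReal("."));         // false: a digit is required on one side
    }
}
```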
01-31: Lexical Analyzer Generator
JavaCC is a Lexical Analyzer Generator and a Parser
Generator
PARSER_BEGIN(foo)
PARSER_END(foo)
TOKEN_MGR_DECLS :
{
/* Declarations used by lexical analyzer */
}
TOKEN :
{
<TOKEN_NAME: RegularExpression>
}
TOKEN_NAME is the name of the token being
described
RegularExpression is a regular expression that
describes the token
01-34: Token Rules in JavaCC
Token rule examples:
TOKEN :
{
<ELSE: "else">
}
TOKEN :
{
<INTEGER_LITERAL: (["0"-"9"])+>
}
01-35: Token Rules in JavaCC
Several different tokens can be described in the
same TOKEN block, with token descriptions
separated by |.
TOKEN :
{
<ELSE: "else">
| <INTEGER_LITERAL: (["0"-"9"])+>
| <SEMICOLON: ";">
}
01-36: getNextToken
When we run javacc on the input file foo.jj, it
creates the class fooTokenManager
The class fooTokenManager contains the static
method getNextToken()
Every call to getNextToken() returns the next
token in the input stream.
01-37: getNextToken
When getNextToken is called, a regular
expression is found that matches the next
characters in the input stream.
What if more than one regular expression
matches?
TOKEN :
{
<ELSE: "else">
| <IDENTIFIER: (["a"-"z"])+>
}
01-38: getNextToken
When more than one regular expression matches
the input stream:
Use the longest match
“elsed” should match to IDENTIFIER, not to
ELSE followed by the identifier “d”
If two matches have the same length, use the
rule that appears first in the .jj file
“else” should match to ELSE, not
IDENTIFIER
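The two disambiguation rules can be sketched by hand. The code below is not how the JavaCC-generated token manager is implemented; it is a hypothetical illustration that tries every rule at the current position, keeps the longest match, and lets the earlier rule win ties.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of "longest match, then first rule" for two token rules.
final class LongestMatch {
    // Rules listed in the order they would appear in the .jj file.
    static final String[] NAMES = { "ELSE", "IDENTIFIER" };
    static final Pattern[] RULES = { Pattern.compile("else"), Pattern.compile("[a-z]+") };

    // Returns "NAME:image" for the token starting at pos, or null if no rule matches.
    static String nextToken(String input, int pos) {
        int best = -1;
        int bestLen = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            m.region(pos, input.length());
            if (m.lookingAt()) {               // match must start exactly at pos
                int len = m.end() - pos;
                if (len > bestLen) {           // strictly longer only, so earlier rules win ties
                    best = i;
                    bestLen = len;
                }
            }
        }
        return best < 0 ? null : NAMES[best] + ":" + input.substring(pos, pos + bestLen);
    }

    public static void main(String[] args) {
        System.out.println(nextToken("elsed", 0));  // IDENTIFIER:elsed
        System.out.println(nextToken("else", 0));   // ELSE:else
    }
}
```

For these two simple patterns the regex engine's match is also the longest possible match; a general lexer would need a real DFA to guarantee that.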
01-39: JavaCC Example
PARSER_BEGIN(simple)
public class simple {
}
PARSER_END(simple)
TOKEN :
{
<ELSE: "else">
| <SEMICOLON: ";">
| <FOR: "for">
| <INTEGER_LITERAL: (["0"-"9"])+>
| <IDENTIFIER: ["a"-"z"](["a"-"z","0"-"9"])*>
}
else;ford for
01-40: SKIP Rules
Tell JavaCC what to ignore (typically whitespace)
using SKIP rules
SKIP rule is just like a TOKEN rule, except that no
TOKEN is returned.
SKIP:
{
< regularexpression1 >
| < regularexpression2 >
| ...
| < regularexpressionn >
}
01-41: Example SKIP Rules
PARSER_BEGIN(simple2)
public class simple2 {
}
PARSER_END(simple2)
SKIP :
{
< " " >
| < "\n" >
| < "\t" >
}
TOKEN :
{
<ELSE: "else">
| <SEMICOLON: ";">
| <FOR: "for">
| <INTEGER_LITERAL: (["0"-"9"])+>
| <IDENTIFIER: ["A"-"Z"](["A"-"Z","0"-"9"])*>
}
01-42: JavaCC States
Comments can be dealt with using SKIP rules
How could we skip over 1-line C++ Style
comments?
// This is a comment
01-43: JavaCC States
Comments can be dealt with using SKIP rules
How we could skip over 1-line C++ Style
comments:
// This is a comment
Using a SKIP rule
SKIP :
{
< "//" (~["\n"])* "\n" >
}
01-44: JavaCC States
Writing a regular expression to match multi-line
comments (using /* and */) is much more difficult
Writing a regular expression to match nested
comments is impossible (take Automata Theory for
a proof :) )
What can we do?
Use JavaCC States
01-45: JavaCC States
We can label each TOKEN and SKIP rule with a
“state”
Unlabeled TOKEN and SKIP rules are assumed to
be in the default state (named DEFAULT,
unsurprisingly enough)
Can switch to a new state after matching a TOKEN
or SKIP rule using the : NEWSTATE notation
01-46: JavaCC States
SKIP :
{
< " " >
| < "\n" >
| < "\t" >
}
SKIP :
{
< "/*" > : IN_COMMENT
}
<IN_COMMENT>
SKIP :
{
< "*/" > : DEFAULT
| < ~[] >
}
TOKEN :
{
<ELSE: "else">
| ... (etc)
}
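The two-state idea behind these rules can be imitated in plain Java. The sketch below is a hypothetical illustration, not generated JavaCC code: it switches to IN_COMMENT on "/*", discards every character while in that state, and switches back to DEFAULT on "*/".

```java
// Hypothetical sketch: skipping /* ... */ comments with two lexical states.
final class CommentStates {
    enum State { DEFAULT, IN_COMMENT }

    static String stripComments(String input) {
        StringBuilder out = new StringBuilder();
        State state = State.DEFAULT;
        int i = 0;
        while (i < input.length()) {
            if (state == State.DEFAULT) {
                if (input.startsWith("/*", i)) {
                    state = State.IN_COMMENT;   // like  < "/*" > : IN_COMMENT
                    i += 2;
                } else {
                    out.append(input.charAt(i));
                    i++;
                }
            } else {
                if (input.startsWith("*/", i)) {
                    state = State.DEFAULT;      // like  < "*/" > : DEFAULT
                    i += 2;
                } else {
                    i++;                        // like  < ~[] > : skip any character
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripComments("else /* skip me */ for"));
    }
}
```

Like the rules above, this sketch does not handle nesting: the first "*/" ends the comment regardless of how many "/*" were seen.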
01-47: Actions in TOKEN & SKIP
We can add Java code to any SKIP or TOKEN rule
That code will be executed when the SKIP or
TOKEN rule is matched.
Any methods / variables defined in the
TOKEN_MGR_DECLS section can be used by
these actions
01-48: Actions in TOKEN & SKIP
PARSER_BEGIN(remComments)
public class remComments { }
PARSER_END(remComments)
TOKEN_MGR_DECLS :
{
public static int numcomments = 0;
}
SKIP :
{
< "/*" > : IN_COMMENT
}
SKIP :
{
< "//" (~["\n"])* "\n" > { numcomments++; }
}
01-49: Actions in TOKEN & SKIP
<IN_COMMENT>
SKIP :
{
< "*/" > { numcomments++; SwitchTo(DEFAULT);}
}
<IN_COMMENT>
SKIP :
{
< ~[] >
}
TOKEN :
{
<ANY: ~[]>
}
01-50: Tokens
Each call to getNextToken returns a “Token” object
Token class is automatically created by JavaCC.
Variables of type Token contain the following public
variables:
public int kind; The type of token. When
javacc is run on the file foo.jj, a file
fooConstants.java is created, which contains
the symbolic names for each constant
public interface simplejavaConstants {
int EOF = 0;
int CLASSS = 8;
int DO = 9;
int ELSE = 10;
...
01-51: Tokens
Each call to getNextToken returns a “Token” object
Token class is automatically created by JavaCC.
Variables of type Token contain the following public
variables:
public int beginLine, beginColumn,
endLine, endColumn; The location of the
token in the input file
01-52: Tokens
Each call to getNextToken returns a “Token” object
Token class is automatically created by JavaCC.
Variables of type Token contain the following public
variables:
public String image; The text that was
matched to create the token.
01-53: Generated TokenManager
class TokenTest {
  public static void main(String args[]) {
    Token t;
    java.io.InputStream infile;
    sjavaTokenManager tm;
    boolean loop = true;
    if (args.length < 1) {
      System.out.print("Enter filename as command line argument");
      return;
    }
    try {
      infile = new java.io.FileInputStream(args[0]);
    } catch (java.io.FileNotFoundException e) {
      System.out.println("File " + args[0] + " not found.");
      return;
    }
    tm = new sjavaTokenManager(new SimpleCharStream(infile));
01-54: Generated TokenManager
    t = tm.getNextToken();
    while (t.kind != sjavaConstants.EOF) {
      System.out.println("Token : " + t + " : ");
      System.out.println(sjavaConstants.tokenImage[t.kind]);
      t = tm.getNextToken();  // advance to the next token; otherwise the loop never ends
    }
  }
}
01-55: Lexer Project
Write a .jj file for simpleJava tokens
Need to handle all whitespace (tabs, spaces,
end-of-line)
Need to handle nested comments (to an arbitrary
nesting level)
01-56: Project Details
JavaCC is available at https://javacc.dev.java.net/
To compile your project
% javacc simplejava.jj
% javac *.java
To test your project
% java TokenTest <test filename>
To submit your program: Create a branch:
https://www.cs.usfca.edu/svn/<username>/cs414/lexer/