CH-05 Semantic Analysis
Jing-Shin Chang
Department of Computer Science &
Information Engineering
National Chi-Nan University
Goals
What is a Compiler? Why? Applications?
How to Write a Compiler by Hand?
Theories and Principles behind Compiler Construction: Parsing, Translation & Compiling
Techniques for Efficient Parsing
How to Write a Compiler with Tools
Table of Contents
1. Introduction: What, Why & Apps
2. How: A Simple Compiler
- What is A Better & Typical Compiler
3. Lexical Analysis:
- Regular Expression and Scanner
4. Syntax Analysis:
- Grammars and Parsing
5. Top-Down Parsing: LL(1)
6. Bottom-Up Parsing: LR(1)
7. Syntax-Directed Translation
8. Semantic Processing
9. Symbol Tables
10. Run-time Storage Organization
What is A Compiler?
- Functional blocks
- Forms of compilers
The Compiler
What is a compiler?
A program for translating programming
languages into machine languages
source language => target language
Why compilers?
Filling the gaps between a programmer and the
computer hardware
Compiler: A Bridge Between
PL and Hardware
[Figure: the Compiler sits above the Operating System and the Hardware (low-level language)]
Low-level code shown in the figure:
MOV A, C
MUL A, D
ADD A, B
MOV va, A
Compiler (2a) – Execution
Running the compiled code
[Figure: Source program and Input feed the Interpreter, which produces Output and Error Messages]
Interpreter: a single pass completes the work of both phases
- Each source statement is compiled and then executed immediately
- The next statement is then handled in the same way
Interpreter (2)
Compile and then execute each incoming statement
Does not save compiled code in executable files
Saves storage
Re-compiles the same statements when execution loops back
Slower
Detects errors (compilation & runtime) as they occur during execution
Compiler: detects syntax/semantic errors (“compilation errors”) at compilation time
Hybrid: Compiler + Interpreter?
[Figure: Source program → Compiler → Intermediate program; Intermediate program + Input → Interpreter (with/without JIT) → Output]
Intermediate program:
- without syntax/semantic errors
- machine independent
Interpreter:
- does not interpret the high-level source
- but the compiled low-level code
- easy to interpret + efficient
Hybrid Method & Virtual Machine
[Figure: Source program → Translator (Compiler) → Intermediate program; Intermediate program + Input → Virtual Machine (VM: interpreter with/without JIT) → Output]
Example: Java Compiler & Java VM
Java program (app.java)
Just-in-time (JIT) Compilation
Compile a new statement (only once) when it is first encountered
And save the compiled code
Executed by the virtual/real machine
Do not re-compile when execution loops back
Example:
Java VM (simple interpreter version, without JIT): high performance penalty due to interpretation
Java VM + JIT: performance improved by roughly a factor of 10
JIT: translates bytecodes at run time into the native instruction set of the target machine
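The compile-once idea can be sketched in a few lines. This is only an illustration under stated assumptions: it relies on Python's built-in compile() and exec(), which produce and run Python bytecode rather than native machine code, and the names CachingExecutor and run_loop are hypothetical.

```python
class CachingExecutor:
    """Hypothetical statement-level cache: compile each statement at most once."""

    def __init__(self):
        self.cache = {}          # source text -> compiled code object
        self.compile_count = 0   # how many times compile() actually ran

    def run(self, stmt, env):
        code = self.cache.get(stmt)
        if code is None:                      # first time this statement is seen
            code = compile(stmt, "<stmt>", "exec")
            self.cache[stmt] = code           # save the compiled code
            self.compile_count += 1
        exec(code, env)                       # later visits reuse the saved code


def run_loop(executor, body, iterations):
    """Execute the statements in body repeatedly, as a loop would."""
    env = {"total": 0}
    for i in range(iterations):
        env["i"] = i
        for stmt in body:
            executor.run(stmt, env)
    return env["total"]


jit = CachingExecutor()
result = run_loop(jit, ["total = total + i", "total = total * 1"], 5)
```

The loop body runs five times, but each of the two statements is compiled only on its first visit. A simple interpreter in the sense used here would call compile() on every visit instead.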
Comparison of Different
Compilation-and-Go Schemes
Normal Compilers
Generate code for all statements, whether or not they will be executed
Separate compilation and execution into two different phases
Syntax & semantic errors are detected at compilation time
Interpreters and JIT Compilers
Can generate code only for statements that are actually executed
This depends on the input: different execution flows mean different sets of executed code
Interpreter: syntax & semantic errors are detected at run/execution time
JIT vs. Simple Interpreter
JIT: saves the target machine code
• Can be re-used; each statement is compiled at most once
Interpreter: does not save the target machine code
• A statement may be compiled more than once
Register-Based Virtual Machine
for Android Phone – Dalvik VM
[Figure: Java Program → Java Compiler → Java Bytecodes (stack-based) → Java Virtual Machine]
Java VM (JVM) – Stack-based Instruction Set
Normally less efficient than RISC or CISC instructions
Limited memory organization
Requires too many swap and copy operations
Register-Based Virtual Machine
for Android Phone – Dalvik VM
[Figure: Java Program → Java Compiler → Java Bytecodes (stack-based) → dx (+compression) → Dalvik Bytecodes (register-based) → Dalvik Virtual Machine]
Dalvik VM (for Android OS) – Register-based Instruction Set
Smaller size
Better memory efficiency
Good for phone and other embedded systems
Generation and Execution of Dalvik byte codes
Compiled/Translated from Java byte code into a new byte code
app.java (Java source)
=|| javac (Java Compiler) ||=> app.class (executable by JVM)
=|| dx (in Android SDK tool) ||=> app.dex (Dalvik Executable)
=|| compression ||=> apps.apk (Android Application Package)
=|| Dalvik VM ||=> (execution)
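The difference between the two instruction-set styles can be illustrated with two toy evaluators for a := b + c * d. The instruction formats below are hypothetical and are not real JVM or Dalvik bytecodes; the sketch only shows why a stack machine needs extra push/store traffic while a register machine names its operands directly.

```python
def run_stack(program, env):
    """Toy stack machine: operands are pushed, operators pop and push."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(env[args[0]])
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "store":
            env[args[0]] = stack.pop()
    return env


def run_register(program, env):
    """Toy register machine: each instruction names destination and sources."""
    regs = dict(env)  # treat variables as virtual registers
    for op, dst, src1, src2 in program:
        if op == "mul":
            regs[dst] = regs[src1] * regs[src2]
        elif op == "add":
            regs[dst] = regs[src1] + regs[src2]
    return regs


stack_code = [("push", "b"), ("push", "c"), ("push", "d"),
              ("mul",), ("add",), ("store", "a")]
register_code = [("mul", "t", "c", "d"), ("add", "a", "b", "t")]

values = {"b": 2, "c": 3, "d": 4}
stack_result = run_stack(stack_code, dict(values))["a"]
register_result = run_register(register_code, values)["a"]
```

Both programs compute the same value, but the stack version takes six instructions where the register version takes two (each of which is wider, since it carries three operand names).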
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers
A Language-Processing System
Source Program → Preprocessor → Compiler → Assembler
Example of code produced by the Compiler:
T1 := C * D
T2 := B + T1
A := T2
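float position, initial, rate
position := initial + rate * 60
[Figure: parse tree or syntax tree for id1 := id2 + id3 * 60, with := at the root, id1 and + as its children, id2 and * under +, and id3 and 60 under *]
code optimizer → optimized code:
temp1 := id3 * 60.0
id1 := id2 + temp1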
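Producing temporaries such as T1 and T2 from a syntax tree can be sketched as a post-order walk. This is a minimal sketch, assuming the tree is given as nested tuples (op, left, right) with strings as leaves; the function names gen_tac and translate_assignment are hypothetical.

```python
import itertools


def gen_tac(node, code, temps):
    """Return the name holding node's value, appending three-address lines to code."""
    if isinstance(node, str):          # leaf: identifier or constant
        return node
    op, left, right = node
    left_name = gen_tac(left, code, temps)
    right_name = gen_tac(right, code, temps)
    temp = f"T{next(temps)}"           # fresh temporary for this operator
    code.append(f"{temp} := {left_name} {op} {right_name}")
    return temp


def translate_assignment(target, expr):
    """Translate `target := expr` into a list of three-address statements."""
    code = []
    temps = itertools.count(1)
    result = gen_tac(expr, code, temps)
    code.append(f"{target} := {result}")
    return code


tac = translate_assignment("A", ("+", "B", ("*", "C", "D")))
for line in tac:
    print(line)
```

Children are translated before their parent, so the multiplication is emitted before the addition that uses it. An optimizer could then remove the final copy, as in the temp1/id1 example.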
Syntax Analysis: Structure
Syntax Analysis (Parsing): match input tokens against a grammar of the language
To ensure that the input tokens form a legal sentence (statement)
To build the structure representation of the input tokens
So the structure can be used for translation (or code generation)
Knowledge source:
Grammar in CFG (Context-Free Grammar) form
Additional semantic rules for semantic checks and translation (in later phases)
Input: id1 := id2 + id3 * 60
Grammar:
S → id := e
S → …
e → id + t
e → …
t → id * n
t → …
[Figure: Syntax Analysis uses the Grammar to build the Parse Tree (concrete syntax tree): s at the root with children id1, := and e; e expands to id2 + t; t expands to id3 * 60]
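A recursive-descent parser for just the three rules that are spelled out (S → id := e, e → id + t, t → id * n) might look like the sketch below. The elided alternatives (S → …, e → …, t → …) are not covered, and the token representation as (kind, text) pairs is an assumption made for this illustration.

```python
def parse(tokens):
    """Parse a list of (kind, text) tokens and return a concrete syntax tree."""
    pos = 0

    def expect(kind):
        nonlocal pos
        if pos < len(tokens) and tokens[pos][0] == kind:
            text = tokens[pos][1]
            pos += 1
            return text
        found = tokens[pos][1] if pos < len(tokens) else "end of input"
        raise SyntaxError(f"expected {kind}, found {found}")

    def parse_t():                       # t -> id * n
        left = expect("id")
        expect("*")
        return ("t", left, "*", expect("n"))

    def parse_e():                       # e -> id + t
        left = expect("id")
        expect("+")
        return ("e", left, "+", parse_t())

    def parse_s():                       # S -> id := e
        target = expect("id")
        expect(":=")
        return ("s", target, ":=", parse_e())

    tree = parse_s()
    if pos != len(tokens):               # leftover input is not a legal sentence
        raise SyntaxError(f"unexpected token {tokens[pos][1]}")
    return tree


tokens = [("id", "id1"), (":=", ":="), ("id", "id2"), ("+", "+"),
          ("id", "id3"), ("*", "*"), ("n", "60")]
tree = parse(tokens)
```

Each nonterminal becomes one function, so the call structure mirrors the parse tree. Input that does not match the grammar raises SyntaxError instead of producing a tree.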
Grammar: Context Free Grammar
Context Free Grammar (CFG):
Specification for Structures & Constituency
[Figure: parse tree of an example sentence, with constituents labeled S, NP, VP and PP]
CFG: Example Grammar
Grammar Rules
S → NP VP
NP → Pron | Proper-Noun | Det Norm
Norm → Noun Norm | Noun
VP → Verb | Verb NP | Verb NP PP | Verb PP
PP → Prep NP
S: sentence, NP: noun phrase, VP: verb phrase
Pron: pronoun
Det: determiner, Norm: nominal
PP: prepositional phrase, Prep: preposition
Lexicon (in CFG form)
Noun → girl | park | desk
Verb → like | want | is | saw | walk
Prep → by | in | with | for
Det → the | a | this | these
Pron → I | you | he | she | him
Proper-Noun → IBM | Microsoft | Berkeley
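To see the grammar and lexicon at work, the rules can be loaded into a small backtracking recognizer. This is a sketch under stated assumptions: the dictionary encoding and the function names ends and accepts are hypothetical, and the approach only works because this grammar has no left recursion.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pron"], ["Proper-Noun"], ["Det", "Norm"]],
    "Norm": [["Noun", "Norm"], ["Noun"]],
    "VP": [["Verb"], ["Verb", "NP"], ["Verb", "NP", "PP"], ["Verb", "PP"]],
    "PP": [["Prep", "NP"]],
    # Lexicon, written as grammar rules whose right-hand sides are words
    "Noun": [["girl"], ["park"], ["desk"]],
    "Verb": [["like"], ["want"], ["is"], ["saw"], ["walk"]],
    "Prep": [["by"], ["in"], ["with"], ["for"]],
    "Det": [["the"], ["a"], ["this"], ["these"]],
    "Pron": [["I"], ["you"], ["he"], ["she"], ["him"]],
    "Proper-Noun": [["IBM"], ["Microsoft"], ["Berkeley"]],
}


def ends(symbol, words, start):
    """Positions where some derivation of symbol beginning at start can end."""
    if symbol not in GRAMMAR:            # terminal: must match the next word
        if start < len(words) and words[start] == symbol:
            return {start + 1}
        return set()
    result = set()
    for rhs in GRAMMAR[symbol]:          # try every alternative
        positions = {start}
        for part in rhs:
            positions = {e for p in positions for e in ends(part, words, p)}
        result |= positions
    return result


def accepts(sentence):
    """True if the whole sentence can be derived from S."""
    words = sentence.split()
    return len(words) in ends("S", words, 0)
```

For example, accepts("I saw the girl in the park") holds because S → NP VP, NP → Pron, and VP → Verb NP PP cover all seven words, while accepts("saw the girl") fails because no NP can start with a verb.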
Syntax vs. Semantic Analyses
Syntax:
What do the input tokens look like? Do they form a legal structure?
Analysis of relationships between elements
e.g., operator-operand relationships
Semantics:
What do they mean? And, thus, how do they act?
Analysis of detailed attributes of elements, and checking of constraints over them under the given syntax
Not all knowledge about the elements can be conveniently represented by a simple syntactic structure. Various kinds of attributes are associated with sub-structures in the given syntax
Syntax vs. Semantic Analyses
Examples:
int a, b, c, d; float f; char s1[], s2[];
a = b + c * d;
a = b + f * d; // OK, but not strictly right
a = b + s1 * s2; // BAD: * is undefined for strings
a = b + s1 * 3; // OK? if properly defined
[Figure: syntax tree shared by all of the statements: := at the root with children id1 and +; + has children id2 and *; * has children id3 and id4]
All the above statements have the same look
Convenient to represent them with the same syntactic structure (grammar/production rules)
But semantically …
Not all of them are meaningful (?? string * string ??)
• You have to check their other attributes for meanings
Not all meaningful statements will mean/act the same and have the same code (*: int * int, int * float, string * int)
• You have to generate different code according to other attributes of the tokens, since instructions are limited
• E.g., INT and FLOAT additions may use different machine instructions, like ADD and ADDF respectively.
[Figure: after the semantic analyzer, the tree for the mixed int/float statement has an inttoreal node inserted between * and id4]
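The checks described above can be sketched as a bottom-up pass that computes a type for every subtree, rejects meaningless combinations, and inserts inttoreal where an int meets a float. This is a minimal sketch: the tuple encoding of trees and the function name check are hypothetical, only the types int, float and string are modeled, and every operator on strings is rejected (the slide leaves open whether string * int could be defined).

```python
def check(node, types):
    """Return (annotated_tree, type); raise TypeError for meaningless operations."""
    if isinstance(node, str):                 # leaf: look up the declared type
        return node, types[node]
    op, left, right = node
    left_tree, left_type = check(left, types)
    right_tree, right_type = check(right, types)
    if "string" in (left_type, right_type):
        raise TypeError(f"operator {op} is undefined for strings")
    if left_type == right_type:               # int op int, or float op float
        return (op, left_tree, right_tree), left_type
    # Mixed int/float: convert the int operand so a float instruction can be used
    if left_type == "int":
        left_tree = ("inttoreal", left_tree)
    else:
        right_tree = ("inttoreal", right_tree)
    return (op, left_tree, right_tree), "float"


types = {"a": "int", "b": "int", "c": "int", "d": "int",
         "f": "float", "s1": "string", "s2": "string"}

int_tree, int_type = check(("+", "b", ("*", "c", "d")), types)
mixed_tree, mixed_type = check(("+", "b", ("*", "f", "d")), types)
```

The three right-hand sides share one syntactic shape, yet the first passes unchanged, the second gains two inttoreal nodes, and b + s1 * s2 raises TypeError. A code generator could then pick ADD or ADDF from the type returned for each node.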
Semantic Analysis: Attributes
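[Figure: Semantic Analysis, guided by the semantic rules associated with grammar productions, performs semantic checks and abstraction, turning the Parse Tree (Concrete Syntax Tree) into a Syntax Tree (Abstract Syntax Tree)]
Parse Tree (Concrete Syntax Tree): s → id1 := e; e → id2 + t; t → id3 * 60
Syntax Tree (Abstract Syntax Tree): := (id1, + (id2, * (id3, 60)))
After semantic checks: := (id1, + (id2, * (id3, i2r(60))))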
Symbol Table Management
Symbols:
Variable names, procedure names, constant literals
(3.14159)
Symbol Table:
A record for each name describing its attributes
Managing Information about names
Variable attributes:
• Type, register/storage allocated, scope
Procedure names:
• Number and types of arguments
• Method of argument passing
– By value, address, reference
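A symbol table holding these attributes can be sketched as a stack of dictionaries, one per scope. The class and field names below (Symbol, SymbolTable, data_type, and so on) are hypothetical choices for this illustration, not a prescribed layout.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    name: str
    kind: str                 # "variable" or "procedure"
    data_type: str = ""       # variables: declared type
    storage: str = ""         # variables: register/storage allocated
    param_types: tuple = ()   # procedures: number and types of arguments
    passing: str = ""         # procedures: "value", "address" or "reference"


class SymbolTable:
    def __init__(self):
        self.scopes = [{}]               # innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, symbol):
        scope = self.scopes[-1]
        if symbol.name in scope:
            raise KeyError(f"{symbol.name} already declared in this scope")
        scope[symbol.name] = symbol

    def lookup(self, name):
        for scope in reversed(self.scopes):   # innermost declaration wins
            if name in scope:
                return scope[name]
        return None


table = SymbolTable()
table.declare(Symbol("rate", "variable", data_type="float", storage="R1"))
table.declare(Symbol("scale", "procedure", param_types=("float", "int"),
                     passing="value"))
table.enter_scope()
table.declare(Symbol("rate", "variable", data_type="int"))
inner_type = table.lookup("rate").data_type
table.exit_scope()
outer_type = table.lookup("rate").data_type
```

Inside the nested scope the inner declaration of rate hides the outer one; after exit_scope the outer record is visible again.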
[1] Lexical Analysis: Tokenization
Natural language: I saw the girls [I see the girls]
Programming language: final := initial + rate * 60 [f := i + r * 60]
Both look the same. So you want to represent them with the same normalized token string, and hide detailed features as additional attributes.
Output of Lexical Analysis:
I(+1p+sg) see (+ed) the girl (+s) [I(+1p+sg) see (+prs) the girl (+s)]
id1 := id2 + id3 * 60
Token table (natural language):
1 “I” “I” +1p+sg
2 “see” “saw” +ed
3 “the” “the”
4 “girl” “girls” +3p+pl +s
Symbol table (programming language):
1 id1 “final” float R2
2 id2 “initial” float R1
3 id3 “rate” float
4 const1 “60” const 60.0
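The normalization of identifiers into id1, id2, … can be sketched with Python's re module. The token classes and the names TOKEN_SPEC and tokenize are assumptions made for this illustration; a real scanner would also record types and storage, which are only known after declarations are processed.

```python
import re

TOKEN_SPEC = [
    ("num", r"\d+(?:\.\d+)?"),
    ("id", r"[A-Za-z_]\w*"),
    ("assign", r":="),
    ("op", r"[+*]"),
    ("skip", r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return (normalized token list, symbol table mapping lexeme -> idN)."""
    symtab = {}
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise ValueError(f"illegal character {source[pos]!r} at position {pos}")
        pos = match.end()
        kind, text = match.lastgroup, match.group()
        if kind == "skip":                     # whitespace carries no token
            continue
        if kind == "id":                       # same lexeme always maps to same idN
            tokens.append(symtab.setdefault(text, f"id{len(symtab) + 1}"))
        else:
            tokens.append(text)
    return tokens, symtab


tokens, symtab = tokenize("final := initial + rate * 60")
short_tokens, _ = tokenize("f := i + r * 60")
```

Both statements normalize to the same token string, id1 := id2 + id3 * 60, and the spelling of each name survives only as an attribute in the table.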
[2] Syntax Analysis: Structure
I see (+ed) the girl (+s)
id1 := id2 + id3 * 60
NP verb NP
Semantic Checking
[Figure: the sentence “I see (+ed) the girl (+s)” is analyzed as subject, verb, object; abstraction yields a tree with see (+ed) at the root, I as its subject and the girl (+s) as its object]
Semantic Constraints:
Agreement: (somewhat syntactic)
Subject-Verb: I have, she has/had, I do have, she does not
NP: Quantifier-noun: a book, two books
Selectional Constraint:
Kill Animate
Kiss Animate
[6] Code Generation
Optimization for Computer
Architectures (1)
Parallelism
Instruction level: multiple operations are executed simultaneously
Processor checks dependencies among sequential instructions and issues them in parallel
• Hardware scheduler: changes the order of instructions
Compilers: rearrange instructions to make instruction-level parallelism more effective
Instruction set support:
• Very long instruction word (VLIW): issues multiple operations in parallel
• Instructions that can operate on vector data at the same time
Compilers: generate code for such machines from sequential code
Processor level: different threads of the same application are run on different processors
Multiprocessors + multithreaded code
• Programmer: writes multithreaded code, vs.
• Compiler: generates parallel code automatically
Optimization for Computer
Architectures (2)
Memory Hierarchies
No storage that is both fast and large
Registers (tens to hundreds of bytes), caches (KB to MB), main/physical memory (MB to GB), secondary/virtual memory (hard disks) (GB to TB)
Using registers effectively is probably the single most important problem in optimizing a program
Cache management by hardware is not effective in scientific code that has large data structures (arrays)
Improve effectiveness of memory hierarchies:
• By changing the layout of data, or
• By changing the order of instructions accessing the data
Improve effectiveness of instruction cache:
• Change the layout of code
Structure of a Compiler
Front End: Source Dependent
Lexical Analysis
Syntax Analysis
Semantic Analysis
Intermediate Code Generation
(Code Optimization: machine independent)
Back End: Target Dependent
Code Optimization
Target Code Generation
Structure of a Compiler
[Figure: front ends for Fortran, Pascal and C all produce a common Intermediate Code]
History
1st Fortran compiler: 1950s
Efficient? (compared with assembly programs)
Not bad, and programs were much easier to write
Showed that high-level languages are feasible
18 man-years, ad hoc structure
Today, we can build a simple compiler in a few months.
Crafting an efficient and reliable compiler is still challenging.
Cousins of the Compiler
Preprocessors: macro definition/expansion
Interpreters
Compiler vs. interpreter vs. just-in-time compilation
Assemblers: 1-pass / 2-pass
Linkers: link compiled object code with library functions
Loaders: load executables into memory
Editors: editing sources (with/without syntax prediction)
Debuggers: symbolically providing stepwise trace
Profilers: gprof (call graph and time analysis)
Project managers: IDE (Integrated Development Environment)
Disassemblers, Decompilers: low-level to high-level language conversion
Applications of Compilation
Techniques
Virtually any kind of programming language or specification language with a regular and well-defined grammatical structure will need some kind of compiler (or a variant, or a part of one) to analyze and then process it.
Applications of Lexical Analysis
Text/Pattern Processing:
grep: get lines with specified pattern
• Ex: grep '^From ' /var/spool/mail/andy
sed: stream editor, editing specified patterns
• Ex: ls *.JPG | sed 's/JPG/jpg/'
tr: simple translation between patterns (e.g., lowercase to uppercase)
• Ex: tr 'a-z' 'A-Z' < mytext > mytext.uc
AWK: pattern-action rule processing
pattern processing based on regular expressions
• Ex: awk '$1=="John"{count++}END{print count}' < Students.txt
Applications of Lexical Analysis
Search Engines/Information Retrieval
full text search, keyword matching, fuzzy
match
Database Machine
fast matching over large database
database filter
Fast & Multiple Matching Algorithms
Optimized/specialized lexical analyzers (FSA)
Examples: KMP, Boyer-Moore (BM), …
Applications of Syntax Analysis
Structured Editor/Word Processor
Integrated Development Environment (IDE)
automatic formatting, keyword insertion
Incremental Parser vs. Full-blown Parsing
incremental: patch the existing analysis after incremental changes, instead of re-parsing or re-compiling
Pretty Printer: beautify nested structures
cb (C-beautifier)
indent (an even more versatile C-beautifier)
Applications of Syntax Analysis
Static Checker/Debugger: lint
check for errors without actually running the program, e.g.,
statement not reachable
used before defined
Application of Optimization
Techniques
Data flow analysis
Software testing:
Locating errors before running (static checking)
Locate errors along all possible execution paths
• not only on test data set
Type Checking
Dereferencing null or freed pointers
“Dangerous” user supplied strings
Bound Checking
Security vulnerability: buffer over-run attack
Tracking values of pointers across procedures
Memory management
Garbage collection
Applications of Compilation
Techniques
Pre-processor: Macro definition/expansion
Active Webpages Processing
Script or programming languages embedded in
webpages for interactive transactions
Examples: JavaScript, JSP, ASP, PHP
Compiler Apps: expansion of embedded
statements, in addition to web page parsing
Database Query Language: SQL
Applications of Compilation
Techniques
Interpreter
no pre-compilation
executed on-the-fly
e.g., BASIC
Script Languages: C-shell, Perl
Function: for batch processing multiple
files/databases
mostly interpreted, some pre-compiled
Some are interpreted and save the compiled code
Applications of Compilation
Techniques
Text Formatter
troff, LaTeX, eqn, pic, tbl
VLSI Design: Silicon Compiler
Hardware Description Languages
variables => control signals / data
Circuit Synthesis
Preliminary Circuit Simulation by Software
Advanced Applications
Natural Language Processing
advanced search engines: retrieve relevant
documents
more than keyword matching
natural language query
information extraction:
acquire relevant information (into structured form)
text summarization:
extract the briefest and most relevant paragraphs
text/web mining:
mining information & rules from text/web
Advanced Applications
Machine Translation
Translating a natural language into another
Models:
Direct translation
Transfer-Based Model
Inter-lingua Model
Transfer-Based Model:
Analysis-Transfer-Generation (or Synthesis) model
Tools for Compiler Construction
Tools: Automatic Generation of
Lexical Analyzers and Compilers
Lexical Analyzer Generator: LEX
Input: Token Pattern specification (in regular
expression)
Output: a lexical analyzer
Parser Generator: YACC
“compiler-compiler”
Input: Grammar Specification (in context-free
grammar)
Output: a syntax analyzer (aka “parser”)
Tools
Syntax Directed Translation engines
translations associated with nodes
translations defined in terms of translations of
children
Automatic code generation
translation rules
template matching
Data flow analyses
dependency of variables & constructs
Programming Languages
- Issues about Modern PLs
- Module programming & Parameter passing
- Nested modules & Scopes
- Static/dynamic allocation
Programming Language Basics
Static vs. Dynamic Issues or Policies
Static: determined at compile time
Dynamic: determined at run time
Scopes of declaration
Region in which a use of x refers to a declaration of x
Static Scope (aka lexical scope):
Possible to determine the scope of a declaration by looking at the program
C, Java (and most PLs)
• Delimited by block structures
Dynamic scope:
At run time, the same use of x could refer to any of several declarations of x.
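The two policies can be contrasted in a short sketch. Python itself is statically (lexically) scoped, so the dynamic-scope half is a simulation: the bindings stack and the lookup function are hypothetical helpers that mimic how a dynamically scoped language would resolve x through the chain of active calls.

```python
x = "global"


def show():
    return x                       # static scope: bound where show is defined


def caller_static():
    x = "local to caller"          # a different x; invisible to show()
    return show()


# Simulated dynamic scope: the most recent active binding of a name wins.
bindings = [{"x": "global"}]


def lookup(name):
    for frame in reversed(bindings):
        if name in frame:
            return frame[name]
    raise NameError(name)


def show_dynamic():
    return lookup("x")             # resolved at run time through the call chain


def caller_dynamic():
    bindings.append({"x": "local to caller"})
    try:
        return show_dynamic()
    finally:
        bindings.pop()             # binding disappears when the caller returns


static_result = caller_static()
dynamic_result = caller_dynamic()
```

With static scope the answer can be read off the program text: show always sees the global x. With the simulated dynamic scope, the same use of x yields the caller's binding while the caller is active and the global one otherwise.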
Programming Language Basics
Variable declaration
Static variables
Possible to determine the location in memory where the
declared variable can be found
• public static int x; // Java
• Only one copy of x, can be determined at compile time
• Global declarations and declared constants can also be made
static
Dynamic variables:
Local variables without the “static” keyword
• Each object of the class would have its own location where x
would be held.
• At run time, the same use of x in different objects could refer to
any of several different locations.
Programming Language Basics
Parameter Passing Mechanisms
call by value
pass a copy of the actual value
call by reference
pass the address of the actual object
call by name (Algol 60)
callee executed as if the actual parameter were substituted literally for the formal parameter in the code of the callee
• like macro expansion: the actual parameter is substituted for the formal parameter
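The three mechanisms can be imitated in Python, with the caveat that Python has its own single passing convention (object references passed by value), so each function below is a simulation. An explicit copy stands in for call by value, a shared mutable list for call by reference, and a zero-argument function (a thunk) for call by name; all function names are hypothetical.

```python
import copy


def increment_by_value(values):
    values = copy.deepcopy(values)   # callee works on its own copy
    values[0] += 1
    return values[0]


def increment_by_reference(values):
    values[0] += 1                   # callee changes the caller's object
    return values[0]


def twice_by_name(thunk):
    return thunk() + thunk()         # actual parameter re-evaluated at each use


data = [10]
value_result = increment_by_value(data)
data_after_value = list(data)        # caller's data is unchanged
reference_result = increment_by_reference(data)
data_after_reference = list(data)    # caller's data was modified

counter = {"n": 0}


def next_n():
    counter["n"] += 1
    return counter["n"]


name_result = twice_by_name(lambda: next_n())
```

The call-by-name case shows the characteristic effect of literal substitution: the argument expression next_n() is evaluated twice, once per use of the formal parameter, so the two uses see different values.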
Formal Languages
Languages, Grammars and
Recognition Machines
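[Figure: a Grammar (expression) defines/generates a Language (e.g., “I saw a girl in the park …”); a Parser (automaton) accepts the Language; the Parser is constructed from the Grammar]
Grammar: S → NP VP; NP → pron | det n
Parsing Table (dotted items): S → · NP VP; NP → · pron | · det n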
Languages
Alphabet - any finite set of symbols
{0, 1}: binary alphabet
String - a finite sequence of symbols from
an alphabet
1011: a string of length 4
ε: the empty string
Language - any set of strings on an alphabet
{00, 01, 10, 11}: the set of strings of length 2
∅: the empty set
Grammars
The sentences in a language may be defined
by a set of rules called a grammar
L: {00, 01, 10, 11}
(the set of binary strings of length 2)
G: (0|1)(0|1)
Languages of different degrees of regularity can be specified with grammars of different “expressive power”
Chomsky Hierarchy:
Regular Grammar < Context-Free Grammar < Context-Sensitive Grammar < Unrestricted Grammar
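The relation between the language L and its defining expression G can be checked mechanically. The sketch below enumerates L from the binary alphabet with itertools.product and tests membership with Python's re module; the helper names strings_of_length and generated_by_G are hypothetical.

```python
import itertools
import re

ALPHABET = "01"                       # the binary alphabet {0, 1}


def strings_of_length(n):
    """All strings of length n over ALPHABET, as a set (a finite language)."""
    return {"".join(symbols) for symbols in itertools.product(ALPHABET, repeat=n)}


L = strings_of_length(2)              # {00, 01, 10, 11}
G = re.compile(r"(0|1)(0|1)")         # the expression G from the slide


def generated_by_G(string):
    """True if the whole string matches G."""
    return G.fullmatch(string) is not None
```

Every member of L is matched by G and nothing else is: generated_by_G("1011") is false because fullmatch requires the entire string to fit the pattern. Note also that strings_of_length(0) is the set containing only the empty string, which is different from the empty set.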
Automata
An acceptor/recognizer of a language is an
automaton which determines if an input
string is a sentence in the language
A transducer of a language is an automaton
which determines if an input string is a
sentence in the language, and may produce
strings as output if it is in the language
Implementation: state transition functions
(parsing table)
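An acceptor implemented as a state transition table can be sketched as follows, here for the language of binary strings of length 2. The state names, the table layout and the per-symbol output mapping used for the transducer are assumptions made for this illustration.

```python
# State transition function as a table: (state, input symbol) -> next state
TRANSITIONS = {
    ("q0", "0"): "q1", ("q0", "1"): "q1",
    ("q1", "0"): "q2", ("q1", "1"): "q2",
}
START = "q0"
ACCEPTING = {"q2"}
OUTPUT = {"0": "a", "1": "b"}         # hypothetical output symbol per input symbol


def dfa_accepts(string):
    """Acceptor: True if the string is a sentence of the language."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:             # no transition defined: reject
            return False
    return state in ACCEPTING


def transduce(string):
    """Transducer: produce an output string only if the input is accepted."""
    if not dfa_accepts(string):
        return None
    return "".join(OUTPUT[symbol] for symbol in string)
```

The acceptor only answers yes or no, while the transducer additionally maps an accepted sentence to an output string, for example "10" to "ba". Strings that are too short, too long, or that use symbols outside the alphabet are rejected by both.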
Transducer
[Figure: an automaton accepts language L1 and translates it into language L2; grammar G1 defines/generates L1 and grammar G2 defines/generates L2; the automaton is constructed from the grammars]
Meta-languages
Meta-language: a language used to define
another language
Definition of Programming
Languages
Lexical tokens: regular expressions
Syntax: context free grammars
Semantics: attribute grammars
Intermediate code generation:
attribute grammars
Code generation: tree grammars
Implementation of
Programming Languages
Regular expressions:
finite automata, lexical analyzer
Context free grammars:
pushdown automata, parser
Attribute grammars:
attribute evaluators, type checker and
intermediate code generator
Tree grammars:
finite tree automata, code generator
Appendix: Machine Translation
Machine Translation (Transfer Approach)
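[Figure: the source language (SL) and the target language (TL) each have their own dictionaries & grammar; SL-TL dictionaries and transfer rules connect the two sides, with the inter-lingua as the alternative route]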
n. Miss
n. Smith
v. put (+ed)
q. two
n. book (+s)
p. on
d. this
n. dining table.
Example:
Miss Smith put two books on this dining table.
Syntax Analysis
[Figure: parse tree with S → NP VP and VP → V NP PP]
Example:
Miss Smith put two books on this dining table.
Chinese translation: 史密斯小姐把兩本書放在這張餐桌上面
Phases of a compiler [Aho 86]:
source program
→ lexical analyzer
→ syntax analyzer
→ semantic analyzer
→ intermediate code generator
→ code optimizer
→ code generator
(the symbol-table manager and the error handler interact with the phases)
Translation of a statement [Aho 86]:
lexical analyzer: id1 := id2 + id3 * 60
SYMBOL TABLE:
1 position …
2 initial …
3 rate …
4
syntax analyzer: := (id1, + (id2, * (id3, 60)))
semantic analyzer: := (id1, + (id2, * (id3, inttoreal(60))))
code optimizer:
temp1 := id3 * 60.0
id1 := id2 + temp1
code generator: Binary Code
Detailed Steps (1): Analysis
Text Pre-processing (separating texts from tags)
Clean up garbage patterns (usually introduced during file
conversion)
Recover sentences and words (e.g., <B>C</B> omputer)
Separate Processing-Regions from Non-Processing-Regions (e.g.,
File-Header-Sections, Equations, etc.)
Extract and mark strings that need special treatment (e.g., Topics,
Keywords, etc.)
Identify and convert markup tags into internal tags (de-markup;
however, markup tags also provide information)
Tokenization
English: mainly identify split idioms (e.g., turn NP on) and compounds
Chinese: Word Segmentation (e.g., [ 土地 ] [ 公有 ] [ 政策 ])
Regular Expression: numerical strings/expressions (e.g., twenty million), dates, … (each being associated with a specific type)
Tagging
Assign Part-of-Speech (e.g., n, v, adj, adv, etc.)
Associated forms are basically independent of languages starting from
this step
Detailed Steps (3): Analysis (Cont.)
Parsing
Decide suitable syntactic relationship (e.g., PP-Attachment)
Decide Word-Sense
Decide appropriate lexicon-sense (e.g., River-Bank, Money-Bank,
etc.)
Assign Case-Label
Decide suitable semantic relationship (e.g., Patient, Agent, etc.)
Detailed Steps (4): Analysis (Cont.)
Decide Discourse Structure
Decide suitable discourse segments relationship (e.g., Evidence,
Concession, Justification, etc. [Marcu 2000].)
Detailed Steps (5): Transfer
Decide suitable Target Discourse Structure
For example: Evidence, Concession, Justification, etc. [Marcu 2000].
Text Post-processing
Final string substitution (replace those markers of special strings)
Extract and export associated information (e.g., Glossary, Index,
etc.)
Restore the customer’s markup tags (re-markup) to save typesetting work