
CH-05 Semantic Analysis

The document outlines the principles and techniques of compiler construction, including the definition, applications, and the process of writing compilers by hand and with tools. It covers various phases of compilation such as lexical analysis, syntax analysis, semantic processing, code generation, and optimization. Additionally, it discusses different types of compilers and interpreters, including hybrid models and just-in-time compilation.

Uploaded by Arif Kamal

Compilers: Principles,

Techniques, and Tools

Jing-Shin Chang
Department of Computer Science &
Information Engineering
National Chi-Nan University

1
Goals
 What is a Compiler? Why? Applications?
 How to Write a Compiler by Hand?
 Theories and Principles behind compiler
construction - Parsing, Translation &
Compiling
 Techniques for Efficient Parsing
 How to Write a Compiler with Tools

2
Table of Contents
1. Introduction: What, Why & Apps
2. How: A Simple Compiler
- What is A Better & Typical Compiler
3. Lexical Analysis:
- Regular Expression and Scanner
4. Syntax Analysis:
- Grammars and Parsing
5. Top-Down Parsing: LL(1)
6. Bottom-Up Parsing: LR(1)
3
Table of Contents
7. Syntax-Directed Translation
8. Semantic Processing
9. Symbol Tables
10. Run-time Storage Organization

4
Table of Contents

11. Translation of Special Structures


*. Modular Program Structures
*. Declarations
*. Expressions and Data Structure
References
*. Control Structures
*. Procedures and Functions
12. General Translation Scheme:
- Attribute Grammars
5
Table of Contents

13. Code Generation


14. Global Optimization
15. Tools: Compiler Compiler

6
What is A Compiler?

- Functional blocks
- Forms of compilers

7
The Compiler
 What is a compiler?
 A program for translating programming
languages into machine languages
 source language => target language
 Why compilers?
 Filling the gaps between a programmer and the
computer hardware

8
Compiler: A Bridge Between
PL and Hardware

[Figure: layered stack – Applications (high-level language) sit on top of
the Compiler, the Operating System, and the Hardware (low-level language)]

High-level source:  A := B + C * D

Compiled assembly codes (register-based or stack-based machines):
  MOV A, C
  MUL A, D
  ADD A, B
  MOV va, A
Typical Machine Instructions –
Register-based Machines

[Figure: the register set of an Intel 8085 processor – A; B C; D E; H L]

 Data Transfer
 MOV A, B
 MOV A, [mem]
 More: IN/OUT, Push, Pop, ...
 Arithmetic Operation
 ADD A, C // A := A + C
 MUL A, D // A := A * D
 More: ADC, SUB, SBB, INC, ...
 Logical Operation
 AND A, 00001111B // A := A & 00001111B
 More: OR, NOT, XOR, Shift, Rotate
 Program Control
 JMP, JZ, JNZ, Call, ...
 Low Level Instruction Features:
 Mostly simple binary operators (using source & target operands)
Typical Machine Instructions –
Stack-based Machines

[Figure: a stack in memory, with the stack pointer SP pointing at the
topmost element *SP]

 Data Transfer
 Push A // SP++; *(SP) := A
 Push [mem] // SP++; *(SP) := [mem]
 Dup // *(SP+1) := *(SP); SP++
 Pop [mem] // [mem] := *(SP); SP--
 Arithmetic Operation
 ADD // *(SP-1) := *(SP) + *(SP-1); SP--
 MUL // *(SP-1) := *(SP) x *(SP-1); SP--
 Logical Operation ...
 Program Control ...
 Low Level Instruction Features:
 Mostly simple binary operators
 Operations are applied to the topmost 2 source operands
 and return results to the new stack top (destination operand)
 Almost no general purpose registers
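The stack-machine instructions above can be sketched in a few lines. This is a minimal illustrative virtual machine (not any real instruction set): Push, ADD, MUL, and Pop manipulate an explicit stack exactly as the comments in the slide describe.

```python
# A minimal sketch of the stack-machine instructions above; the opcode
# names and the (opcode, operand) program format are illustrative.

def run(program, memory):
    """Execute a list of (opcode, operand) pairs against a stack."""
    stack = []
    for op, arg in program:
        if op == "PUSH":                  # SP++; *(SP) := [mem]
            stack.append(memory[arg])
        elif op == "ADD":                 # *(SP-1) := *(SP) + *(SP-1); SP--
            stack.append(stack.pop() + stack.pop())
        elif op == "MUL":                 # *(SP-1) := *(SP) x *(SP-1); SP--
            stack.append(stack.pop() * stack.pop())
        elif op == "POP":                 # [mem] := *(SP); SP--
            memory[arg] = stack.pop()
    return memory

# A := B + C * D, with B=2, C=3, D=4
mem = {"B": 2, "C": 3, "D": 4}
code = [("PUSH", "B"), ("PUSH", "C"), ("PUSH", "D"),
        ("MUL", None), ("ADD", None), ("POP", "A")]
print(run(code, mem)["A"])   # 2 + 3 * 4 = 14
```

Note how the operand order in the program (push B, then C, then D, then MUL, then ADD) mirrors a postfix traversal of the expression tree, which is why stack code is easy to generate from a syntax tree.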
Compiler (1) - Compilation

[Figure: Source Program/Code (P.L., formal spec.) → Compiler → Target
Program/Code (P.L., assembly, machine code); the compiler also emits
error messages]

Example: A := B + C * D is compiled into
  MOV A, C
  MUL A, D
  ADD A, B
  MOV va, A
Compiler (2a) – Execution
Running the compiled codes

[Figure: the compiled target code is loaded by a Loader into a real
machine; the target code then reads input and produces output]
Compiler (2b) – Compile & Go
Two working phases in two passes

[Figure: Source Program → Compiler → Target Code (+ error messages);
the target code then runs in a real machine, reading input and
producing output]

Compiler: two independent phases to complete the work
- (1) Compilation Phase: source-to-target compilation
- (2) Execution Phase: run compiled codes & respond to input
& produce output
Compiler (2c) – Compile & Go
Two working phases in two passes

[Figure: Source program → Compiler (+ Loader) → executable target code,
loaded into a real machine, reading input and producing output]

Compiler: two independent phases to complete the work
- (1) Compilation Phase: source-to-target compilation
- (2) Execution Phase: run compiled codes & respond to input
& produce output
Interpreter (1)

[Figure: Source program + Input → Interpreter → Output (+ error
messages)]

Interpreter: one single pass to complete the two-phase work
- Each source statement is compiled and then executed immediately
- The next statement is then handled in the same way
Interpreter (2)
 Compiles and then executes each incoming statement
 Does not save compiled codes in executable files
 => saves storage
 Re-compiles the same statements when the program loops back
 => slower
 Detects (compilation & runtime) errors as they occur during
execution time
 (cf. Compiler: detects syntax/semantic errors ("compilation
errors") at compilation time)
Hybrid: Compiler + Interpreter?

[Figure: Source program → Compiler (+ error messages) → Intermediate
program → Interpreter (with/without JIT), which reads input and
produces output]
Hybrid: Compiler + Interpreter?

[Figure: the same pipeline, annotated:]
Intermediate program:
- without syntax/semantic errors
- machine independent
Interpreter:
- does not interpret the high-level source
- but the compiled low-level code
- easy to interpret + efficient
Hybrid Method & Virtual Machine

[Figure: Source program → Translator (Compiler) → Intermediate program
→ Virtual Machine (VM) (an interpreter, with/without JIT), which reads
input and produces output]
Example: Java Compiler & Java VM

[Figure: Java program (app.java) → Java Compiler (javac) → Java
Bytecodes (app.class) → Java Virtual Machine (an interpreter,
with/without JIT), which reads input and produces output]
Hybrid Method & Virtual Machine
 Compile source program into a platform
independent code
 E.g., Java => Bytecodes (stack-based
instructions)
 Execute the code with a virtual machine
 High portability: The platform-independent
code can be distributed on the web,
downloaded, and executed on any platform that
has a VM pre-installed
 Good for cross-platform applications

25
Just-in-time (JIT) Compilation
 Compile a new statement (only once) as it comes
for the first time
 And save the compiled codes
 Executed by virtual/real machine
 Does not re-compile when the program loops back
 Example:
 Java VM (simple Interpreter version, without JIT): high
penalty in performance due to interpretation
 Java VM + JIT: performance improved by roughly a factor of 10
 JIT: translate bytecodes during run time to the native target
machine instruction set
26
Comparison of Different
Compilation-and-Go Schemes
 Normal Compilers
 Will generate codes for all statements whether they will be
executed or not
 Separate the compilation phase and execution phase into two
different phases
 Syntax & semantic errors are detected at compilation time
 Interpreters and JIT Compilers
 Can generate codes only for statements that are really executed
 Will depend on your input – different execution flows mean different
sets of executed codes
 Interpreter: Syntax & semantic errors are detected at run/execution
time
 JIT vs. Simple Interpreter
 JIT: save the target machine codes
• Can be re-used, and compiled at most once
 Interpreter: do not save target machine codes
• Compiled more than once

27
Register-Based Virtual Machine
for Android Phone – Dalvik VM

[Figure: Java Program → Java Compiler → Java Bytecodes (stack-based) →
Java Virtual Machine]

 Java VM (JVM) – Stack-based Instruction Set
 Normally less efficient than RISC or CISC instructions
 Limited memory organization
 Requires too many swap and copy operations
Register-Based Virtual Machine
for Android Phone – Dalvik VM

[Figure: Java Program → Java Compiler → Java Bytecodes (stack-based) →
dx (+compression) → Dalvik Bytecodes (register-based) → Dalvik Virtual
Machine]

 Dalvik VM (for Android OS) – Register-based Instruction Set
 Smaller size
 Better memory efficiency
 Good for phones and other embedded systems
 Generation and Execution of Dalvik bytecodes
 Compiled/translated from Java bytecode into a new bytecode:
 app.java (Java source)
 =|| javac (Java Compiler) ||=> app.class (executable by JVM)
 =|| dx (in the Android SDK tools) ||=> app.dex (Dalvik Executable)
 =|| compression ||=> app.apk (Android Application Package)
 =|| Dalvik VM ||=> (execution)
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers

30
A Language-Processing System

Source Program
→ Preprocessor
→ Modified Source Program
→ Compiler
→ Target Assembly Program
→ Assembler
→ Relocatable Machine Code
→ Linker/Loader (together with library files and/or relocatable object files)
→ Target Machine Code
Programming Languages vs.
Natural Languages
 Natural languages: for communication between
native speakers of the same or different languages
 Chinese, English, French, Japanese
 Programming languages: for communication
between programmers and computers
 Generic High-Level Programming Languages:
 Basic, Fortran, COBOL, Pascal, C/C++, Java
 Typesetting Languages:
 TROFF (+TBL, EQN, PIC), LaTeX, PostScript
 Markup Language -- Structured Documents:
 SGML, HTML, XML, ...
 Script Languages:
 Csh, bsh, awk, perl, python, javascript, asp, jsp, php
32
Machine Independent Intermediate
Instructions
 Low Level Instructions Features:
 Mostly Simple Binary Operators
 Result is often saved to the Accumulator (A register)
 Not intuitive to programmers
 Intermediate instructions:
 3 address codes: (for register-based machines)
 A := B + C
 2 source operands, one destination operand
 Easy to map to machine instructions (share one source &
destination operand)
• A := A + B
 Stack machine codes: (for stack-based machines)
33
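The three-address form above is easy to produce from an expression tree: each interior node gets a fresh temporary. The following is an illustrative sketch (the helper names `gen` and `newtemp` are not from any particular compiler) that turns A := B + C * D into the T1/T2 sequence shown earlier.

```python
# A small sketch of three-address code generation from an expression
# tree: A := B + C * D becomes T1 := C * D; T2 := B + T1; A := T2.

counter = 0
def newtemp():
    """Return a fresh temporary name T1, T2, ..."""
    global counter
    counter += 1
    return f"T{counter}"

def gen(node, code):
    """Return the name holding node's value, appending code as needed."""
    if isinstance(node, str):          # a leaf variable: no code needed
        return node
    op, left, right = node             # interior node: (operator, l, r)
    l = gen(left, code)
    r = gen(right, code)
    t = newtemp()
    code.append(f"{t} := {l} {op} {r}")
    return t

# A := B + C * D
code = []
result = gen(("+", "B", ("*", "C", "D")), code)
code.append(f"A := {result}")
print("\n".join(code))
```

Each generated instruction has 2 source operands and one destination operand, which is what makes the later mapping onto two-operand machine instructions straightforward.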
Compiler: A Bridge Between
PL and Hardware

[Figure: layered stack – Applications (high-level language), Compiler,
Operating System, Hardware (low-level language)]

High-level source:   A := B + C * D

Intermediate codes:  T1 := C * D
                     T2 := B + T1
                     A := T2

Assembly codes (register-based or stack-based machines):
                     MOV A, C
                     MUL A, D
                     ADD A, B
                     MOV va, A
Compiler: with Intermediate
Codes

[Figure: Source Program/Code (P.L., formal spec.) → Compiler → Target
Program/Code (P.L., assembly, machine code); the compiler also emits
error messages]

Example: A := B + C * D
  → intermediate codes:  T1 := C * D
                         T2 := B + T1
                         A := T2
  → target codes:        MOV A, C
                         MUL A, D
                         ADD A, B
                         MOV va, A
Typical Phases of a Compiler

Source:  float position, initial, rate
         position := initial + rate * 60

lexical analyzer
→ Tokens: id1 := id2 + id3 * 60

syntax analyzer
→ Parse Tree or Syntax Tree: := (id1, + (id2, * (id3, 60)))

semantic analyzer
→ Syntax Tree or Annotated Syntax Tree (with a type conversion
  inserted): := (id1, + (id2, * (id3, inttoreal(60))))

intermediate code generator
→ 3-address codes, or stack machine codes:
      temp1 := inttoreal (60)
      temp2 := id3 * temp1
      temp3 := id2 + temp2
      id1 := temp3

code optimizer
→ Optimized codes:
      temp1 := id3 * 60.0
      id1 := id2 + temp1

code generator
→ Assembly (or machine) codes:
      MOVF id3, R2
      MULF #60.0, R2
      MOVF id2, R1
      ADDF R2, R1
      MOVF R1, id1
Analysis-Synthesis Model of a
Compiler
 Analysis : Program => Constituents => I.R.
 Lexical Analysis: linear => token
 Syntax Analysis: hierarchical, nested => tree
 Identify relations/actions among tokens: e.g., add(b, mult(c,d))
 Semantic Analysis: check legal constraints / meanings
 By examining attributes associated with tokens & relations
 Synthesis: I.R. => I.R.* => Target Language
 Intermediate Code Generation
 generate intermediate representation (I.R.) from syntax
 Code Optimization: generate better equivalent IR
 machine independent + machine dependent
 Code Generation
37
Typical Modules of a Compiler

[Figure: pipeline –
Source Code → Lexical Analyzer → Tokens → Syntax Analyzer → Syntax
Tree → Semantic Analyzer → Annotated Syntax Tree → Intermediate Code
Generator → IR → Code Optimizer → IR → Code Generator → Target Code;
all modules share the Literal Table, the Symbol Table, and the Error
Handler]
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers

40
Syntax Analysis: Structure
 Syntax Analysis (Parsing): match input tokens against a grammar
of the language
 To ensure that the input tokens form a legal sentence
(statement)
 To build the structure representation of the input tokens
 So the structure can be used for translation (or code
generation)
 Knowledge source:
 Grammar in CFG (Context-Free Grammar) form
 Additional semantic rules for semantic checks and
translation (in later phases)

Example – input: id1 := id2 + id3 * 60
Grammar:
  S → id := e    S → …
  e → id + t     e → …
  t → id * n     t → …
Parse tree (concrete syntax tree), shown as a derivation:
  s ⇒ id1 := e ⇒ id1 := id2 + t ⇒ id1 := id2 + id3 * 60
Grammar: Context Free Grammar

42
Context Free Grammar (CFG):
Specification for Structures & Constituency

 Parse Tree: graphical representation of structure


 root node (S): a sentential level structure
 internal nodes: constituents of the sentence
 arcs: relationship between parent nodes and their children (constituents)
 terminal nodes: surface forms of the input symbols (e.g., words)
 alternative representation: bracketed notation:
 e.g., [I saw [the [girl [in [the park]]]]]
 Example: partial parse tree for "the girl in the park"
[Figure: an NP dominating a smaller NP ("the girl") and a PP ("in the
park"), which in turn contains an NP ("the park")]


Parse Tree: “I saw the girl in the park”

In bracketed notation, the parse tree is:
  [S [NP [pron I]]
     [VP [v saw]
         [NP [NP [det the] [n girl]]
             [PP [p in] [NP [det the] [n park]]]]]]
CFG: Components

 CFG: formal specification of parse trees


 G = {Σ, N, P, S}
 Σ: terminal symbols
 N: non-terminal symbols
 P: production rules
 S: start symbol
 Σ: terminal symbols
 the input symbols of the language
 programming language: tokens (reserved words, variables, operators, …)
 natural languages: words or parts of speech
 pre-terminal: parts of speech (when words are regarded as terminals)
 N: non-terminal symbols
 groups of terminals and/or other non-terminals
 S: start symbol: the largest constituent of a parse tree
 P: production (re-writing) rules
 form: α → β (α: a non-terminal; β: a string of terminals and non-terminals)
 meaning: α re-writes to (“consists of”, “is derived into”) β, or β is reduced to α
 start with “S-productions” (S → β)

45
CFG: Example Grammar

 Grammar Rules
 S → NP VP
 NP → Pron | Proper-Noun | Det Norm
 Norm → Noun Norm | Noun
 VP → Verb | Verb NP | Verb NP PP | Verb PP
 PP → Prep NP
 S: sentence, NP: noun phrase, VP: verb phrase
 Pron: pronoun
 Det: determiner, Norm: Nominal
 PP: prepositional phrase, Prep: preposition
 Lexicon (in CFG form)
 Noun → girl | park | desk
 Verb → like | want | is | saw | walk
 Prep → by | in | with | for
 Det → the | a | this | these
 Pron → I | you | he | she | him
 Proper-Noun → IBM | Microsoft | Berkeley

46
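The grammar and lexicon above can be exercised directly. Below is a toy recognizer (a naive hand-written sketch, not a real parser generator) for a fragment of the rules: S → NP VP, NP → Pron | Det Noun, VP → Verb (NP (PP)), PP → Prep NP, with the slide's lexicon.

```python
# An illustrative recognizer for a fragment of the example grammar.
# The function names (cat, np, vp, pp, sentence) are ad hoc.

LEXICON = {
    "Noun": {"girl", "park", "desk"},
    "Verb": {"like", "want", "is", "saw", "walk"},
    "Prep": {"by", "in", "with", "for"},
    "Det":  {"the", "a", "this", "these"},
    "Pron": {"I", "you", "he", "she", "him"},
}

def cat(c, toks, i):
    """Match one terminal of part-of-speech c; return new position or None."""
    return i + 1 if i < len(toks) and toks[i] in LEXICON[c] else None

def np(toks, i):                       # NP -> Pron | Det Noun
    j = cat("Pron", toks, i)
    if j is not None:
        return j
    j = cat("Det", toks, i)
    return cat("Noun", toks, j) if j is not None else None

def pp(toks, i):                       # PP -> Prep NP
    j = cat("Prep", toks, i)
    return np(toks, j) if j is not None else None

def vp(toks, i):                       # VP -> Verb | Verb NP | Verb NP PP
    j = cat("Verb", toks, i)
    if j is None:
        return None
    k = np(toks, j)
    if k is not None:
        m = pp(toks, k)                # take the PP too, if present
        return m if m is not None else k
    return j

def sentence(toks):                    # S -> NP VP, consuming all tokens
    j = np(toks, 0)
    return j is not None and vp(toks, j) == len(toks)

print(sentence("I saw the girl in the park".split()))   # True
print(sentence("saw the girl".split()))                 # False
```

Note that this recognizer only accepts or rejects; a real parser would additionally build the parse tree, and a table-driven parser (LL/LR, covered later) would handle the full grammar without hand-coding one function per non-terminal.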
Syntax vs. Semantic Analyses
 Syntax:
 What do the input tokens look like? Do they form a legal
structure?
 Analysis of relationship between elements
 e.g., operator-operands relationship
 Semantic:
 What do they mean? And, thus, how do they act?
 Analysis of detailed attributes of elements and check
constraints over them under the given syntax
 Not all knowledge between elements can be conveniently
represented by a simple syntactic structure. Various kinds of
attributes are associated with sub-structures in the given syntax

47
Syntax vs. Semantic Analyses
 Examples:
 int a, b, c, d; float f; char s1[], s2[];
 a = b + c * d;
 a = b + f * d;   // OK, but not strictly right
 a = b + s1 * s2; // BAD: * is undefined for strings
 a = b + s1 * 3;  // OK? if properly defined
 All the above statements have the same look
 Convenient to represent them with the same syntactic structure
(grammar/production rules), e.g., the tree := (id1, + (id2, * (id3, id4)))
 But semantically …
 Not all of them are meaningful (?? string * string ??)
• You have to check their other attributes for meanings
 Not all meaningful statements will mean/act the same and have
the same codes (*: int * int ≠ int * float ≠ string * int)
• You have to generate different codes according to other
attributes of the tokens, since instructions are limited
• E.g., INT and FLOAT additions may use different machine
instructions, like ADD and ADDF respectively
[Figure: the semantic analyzer may rewrite the tree, e.g., inserting an
inttoreal conversion around an integer operand]
Semantic Analysis: Attributes

[Figure: the parse tree (concrete syntax tree) for id1 := id2 + id3 * 60
(s → id1 := e; e → id2 + t; t → id3 * 60) is processed by semantic
analysis, using semantic rules associated with the grammar productions,
performing semantic checks and abstraction; the result is the syntax
tree (abstract syntax tree) := (id1, + (id2, * (id3, i2r(60))))]
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers

50
Symbol Table Management
 Symbols:
 Variable names, procedure names, constant literals
(3.14159)
 Symbol Table:
 A record for each name describing its attributes
 Managing Information about names
 Variable attributes:
• Type, register/storage allocated, scope
 Procedure names:
• Number and types of arguments
• Method of argument passing
– By value, address, reference
51
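The record-per-name idea above fits in a few lines. This is a minimal illustrative symbol table (the class and field names are ad hoc, not from a real compiler): one record per name, holding whatever attributes the later phases need.

```python
# A minimal symbol-table sketch: one record per name, holding
# attributes such as type, allocated register/storage, or, for
# procedures, the number of arguments and the passing method.

class SymbolTable:
    def __init__(self):
        self.records = {}

    def declare(self, name, **attrs):
        """Enter a new name with its attributes; reject redeclarations."""
        if name in self.records:
            raise KeyError(f"redeclaration of {name}")
        self.records[name] = dict(attrs)

    def lookup(self, name):
        """Return the attribute record for a declared name."""
        return self.records[name]

syms = SymbolTable()
syms.declare("rate", kind="variable", type="float", register=None)
syms.declare("main", kind="procedure", n_args=2,
             arg_passing="by value")          # method of argument passing
syms.declare("pi", kind="constant", type="float", value=3.14159)

print(syms.lookup("rate")["type"])   # float
```

A production symbol table would additionally handle nested scopes (a stack of such tables) and much faster lookup structures, but the attribute-record idea is the same.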
[1] Lexical Analysis: Tokenization

Natural language:                  Programming language:
  I saw the girls                    final := initial + rate * 60
  [I see the girls]                  [f := i + r * 60]

Both look the same, so you want to represent them with the same
normalized token string, and hide detailed features as additional
attributes.

      Lexical Analysis ↓

  I(+1p+sg) see(+ed) the girl(+s)    id1 := id2 + id3 * 60
  [I(+1p+sg) see(+prs) the girl(+s)]

Symbol tables:
  1 “I”    “I”     +1p+sg          1 id1    “final”   float  R2
  2 “see”  “saw”   +ed             2 id2    “initial” float  R1
  3 “the”  “the”                   3 id3    “rate”    float
  4 “girl” “girls” +3p+pl +s       4 const1 “60”      const  60.0
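The tokenization step above can be sketched directly: identifiers are normalized to id1, id2, ... and their original spellings are kept as symbol-table attributes. This is an illustrative sketch using a regular expression for the token patterns; the token classes handled are only those of the example.

```python
# A sketch of tokenization with normalization: each distinct identifier
# becomes id1, id2, ... and its spelling is recorded in a symbol table.

import re

def tokenize(source):
    table, tokens = {}, []
    # token patterns: identifier | number | := | + | *
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|:=|[+*]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":   # an identifier
            if lexeme not in table:
                table[lexeme] = f"id{len(table) + 1}"
            tokens.append(table[lexeme])
        else:                                         # operator or number
            tokens.append(lexeme)
    return tokens, table

tokens, table = tokenize("final := initial + rate * 60")
print(" ".join(tokens))   # id1 := id2 + id3 * 60
print(table)              # {'final': 'id1', 'initial': 'id2', 'rate': 'id3'}
```

A real scanner (e.g., one generated by LEX from regular expressions) does the same job, but with a finite-state automaton rather than a single library call.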
[2] Syntax Analysis: Structure

Normalized tokens:
  I see(+ed) the girl(+s)            id1 := id2 + id3 * 60

      Syntax Analysis (against a grammar) ↓

Parse trees (concrete syntax trees):
  Sentence → NP verb NP              s → id1 := e
  (I) (see(+ed)) (the girl(+s))      e → id2 + t
                                     t → id3 * 60

Normalized tokens have the same parse/syntax tree whether they were
“see”/“saw” and “girl”/“girls”.
[3] Semantic Analysis: Attributes

[Figure: both parse trees (concrete syntax trees) – the natural-language
tree (Sentence → NP verb NP for “I see(+ed) the girl(+s)”) and the
programming-language tree (s → id1 := e; e → id2 + t; t → id3 * 60) –
are processed by semantic analysis, using semantic rules associated with
the grammar productions, performing semantic checks and abstraction;
the results are syntax trees (abstract syntax trees):
  NP.subject verb NP.object          := (id1, + (id2, * (id3, i2r(60))))]
Semantic Checking

[Figure: the sentence “I see(+ed) the girl(+s)” (subject – verb –
object) is abstracted into a predicate structure: see(+ed) with
subject “I” and object “the girl(+s)”]

 Semantic Constraints:
 Agreement: (somewhat syntactic)
 Subject-Verb: I have, she has/had, I do have, she does not
 NP: Quantifier-noun: a book, two books
 Selectional Constraint:
 Kill → Animate
 Kiss → Animate
Semantic Checking

[Figure: the sentence “I see(+ed) the girl(+s)” (subject – verb –
object) passes semantic checking]

 See[+ed](I, the girl[+s])
(semantically meaningful)
 Kill/Kiss (John, the Stone)
(semantically meaningless,
unless the Stone refers to an animate entity)
 Semantic Constraints:
 Agreement: (somewhat syntactic)
 Subject-Verb: I have, she has/had, I do have, she does not
 NP: Quantifier-noun: a book, two books
 Selectional Constraint:
 Kill → Animate
 Kiss → Animate
Parse Tree vs. Syntax Tree
 Parse Tree: (aka concrete syntax tree)
 A concrete tree representation drawn according to a grammar
 For validating correctness of syntax of input
 For easy parsing (or fitting constraints of parsing algorithm)
 Normally constructed incrementally during parsing
 Syntax Tree: (aka abstract syntax tree)
 A logical tree representation that characterizes the abstract
relationships between constituents
 For representing semantic relationships & semantic checking
 Normalizing various parse trees of the same “ meaning” (semantics)
 May ignore non-essential syntactic details
 Not always the same as parse tree
 May be constructed in parallel with the parse tree during parsing
 Or converted from parse tree after syntactic parsing
 Annotated Syntax Tree (AST)
 Syntax Tree with annotated attributes
62
Parse Tree vs. Syntax Tree
 Parse Tree: (depends on the grammar)
 Input: T + T + T
 G1: T ((+ T) (+ T) …)
 E → T R’
 R’ → + T R’
 R’ → <null>
 G2: ((T) + T) + T
 E → E + T
 E → T
 [Figure: the parse trees for G1 and G2 differ in shape]
 Syntax Tree:
 Abstract representation for the syntax defined by G1/G2
 Uses operations as parent nodes and operands as children nodes
 Operation-operand relationship: easy for instruction selection
in code generation (e.g., ADD R1, R2)
 [Figure: one syntax tree, independent of G1 or G2]
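The grammar-independence point above can be demonstrated: even when parsing with the right-recursive grammar G1 (E → T R'; R' → + T R' | null), we can fold each "+ T" leftward and obtain the same left-associative syntax tree that G2 produces directly. This is an illustrative sketch; the tuple AST format is an assumption.

```python
# Build the syntax tree for "T + T + T" while scanning tokens left to
# right, G1-style, folding each "+ T" leftward so the resulting AST
# is left-associative, exactly as the left-recursive grammar G2 gives.

def syntax_tree(tokens):
    """E -> T R'; R' -> + T R' | <null>, building a left-leaning AST."""
    it = iter(tokens)
    tree = next(it)                   # the first T
    for tok in it:
        if tok == "+":
            continue                  # operator: wait for its operand
        tree = ("+", tree, tok)       # fold each "+ T" onto the left
    return tree

print(syntax_tree(["T1", "+", "T2", "+", "T3"]))
# ('+', ('+', 'T1', 'T2'), 'T3')
```

The parse trees for G1 and G2 differ, but the syntax tree is the one structure the code generator cares about: each `+` node maps to one ADD instruction regardless of which grammar was parsed.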
[4] Intermediate Code Generation

[Figure: attribute evaluation over the abstract trees – the NL tree
see(+ed) with subject I(+anim) and object the girl(+s)(+anim), checked
as Action(+anim, +anim), and the PL tree
:= (id1, + (id2, * (id3, i2r(60)))); assembly codes are attributes
for code generation]

      Intermediate Code Generation ↓

  logic form:                      3-address codes:
  See[+ed](I, the girl[+s])          temp1 := i2r ( 60 )
                                     temp2 := id3 * temp1
                                     temp3 := id2 + temp2
                                     id1 := temp3
Syntax-Directed Translation
(1)
 Translation from input to target can be regarded as
attribute evaluation.
 Evaluate attributes of each node, in a well defined order,
based on the particular piece of sub-tree structure
(syntax) wherein the attributes are to be evaluated.
 Attributes: the particular properties associated with
a tree node (a node may have many attributes)
 Abstract representation of the sub-tree rooted at that node
 The attributes of the root node represent the particular
properties of the whole input statement or sentence.
 E.g., value associated with a mathematical sub-expression
 E.g., machine codes associated with a sub-expression
 E.g., language translation associated with a sub-sentence
66
Syntax-Directed Translation
(2)
 Synthesized Attributes:
 Attributes that can be evaluated based on the attributes of
children nodes
 E.g., value of math. expression can be acquired from the values
of sub-expressions (and the operators being applied)
 a := b + c * d
• ( a.val = b.val + tmp.val where tmp.val = c.val * d.val)
 girls = girl + s
• ( tr.girls = tr.girl + tr.s = 女孩 + 們 → 女孩們 )
 Inherited Attributes:
 Attributes that can be evaluated from parent and/or sibling nodes
 E.g., the data type of a variable can be acquired from its left-hand
side type declaration or from the type of its left-hand sibling
 int a, b, c; ( a.type = INT & b.type = a.type & …)
67
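Both attribute kinds above are easy to sketch. The first function computes a synthesized attribute (`val`) bottom-up from the children's attributes; the second shows an inherited attribute, where each declarator in `int a, b, c` gets its type from the declaration's left context. The tuple AST format and the function names are illustrative assumptions.

```python
# Synthesized attribute: the val of a node is computed from the val
# of its children, as in a.val = b.val + tmp.val, tmp.val = c.val * d.val.

def val(node, env):
    if isinstance(node, str):
        return env[node]                  # leaf: look the variable up
    op, left, right = node
    l, r = val(left, env), val(right, env)
    return l + r if op == "+" else l * r

tree = ("+", "b", ("*", "c", "d"))        # b + c * d
print(val(tree, {"b": 2, "c": 3, "d": 4}))   # 14

# Inherited attribute: in "int a, b, c", each declarator's type flows
# down/right from the declaration: a.type = INT, b.type = a.type, ...

def declare(base_type, names):
    return {name: base_type for name in names}

print(declare("INT", ["a", "b", "c"]))
# {'a': 'INT', 'b': 'INT', 'c': 'INT'}
```

In a real syntax-directed translator these computations are attached to the grammar productions themselves rather than written as standalone tree walks, but the direction of information flow (up for synthesized, down/sideways for inherited) is the same.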
Syntax-Directed Translation
(3)
 Attribute evaluation order:
 Any order that evaluates an attribute AFTER all the
attributes it depends on are evaluated will result in
correct evaluation.
 General: topological order
 Analyze the dependency between attributes and
construct an attribute tree or forest
 Evaluate the attribute of any leaf node, mark it as
“evaluated”, and thus logically remove it from the
attribute tree or forest
 Repeat for any leaf nodes that have not been marked,
until no unmarked node remains
[5] Code Optimization [Normalization]

  logic forms:                     3-address codes:
  Was_Kill[+ed](Bill, John)          temp1 := i2r ( 60 )
  See[+ed](I, the girl[+s])          temp2 := id3 * temp1
                                     temp3 := id2 + temp2
                                     id1 := temp3

      Code Optimization ↓
  (normalization into a better equivalent form; optional –
   e.g., unify passive/active voices)

  Kill[+ed](John, Bill)              temp1 := id3 * 60.0
  See[+ed](I, the girl[+s])          id1 := id2 + temp1
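The optimization step above can be sketched as two classic transformations on the 3-address code: constant folding (i2r(60) becomes 60.0 at compile time) and substitution of single-use temporaries. This toy pass substitutes temporaries all the way through, going slightly further than the two-instruction result shown; the string-based instruction format is an illustrative assumption.

```python
# A toy optimizer over the 3-address code above: fold i2r(constant)
# and substitute away temporaries, shrinking four instructions to one.

def optimize(code):
    defs = {}                 # temp name -> its (optimized) expression
    out = []
    for line in code:
        dest, expr = [s.strip() for s in line.split(":=")]
        # constant folding: i2r(60) -> 60.0
        if expr.startswith("i2r(") and expr[4:-1].isdigit():
            expr = f"{float(expr[4:-1])}"
        # substitution: replace known temporaries by their definitions
        expr = " ".join(defs.get(w, w) for w in expr.split())
        if dest.startswith("temp"):
            defs[dest] = expr            # remember it, emit nothing yet
        else:
            out.append(f"{dest} := {expr}")
    return out

code = ["temp1 := i2r(60)",
        "temp2 := id3 * temp1",
        "temp3 := id2 + temp2",
        "id1 := temp3"]
for line in optimize(code):
    print(line)               # id1 := id2 + id3 * 60.0
```

A real optimizer works on a proper IR with data-flow information rather than strings, and must prove a temporary is used only once before substituting it; this sketch only shows the shape of the transformation.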
[6] Code Generation

  logic form:                      optimized 3-address codes:
  See[+ed](I, the girl[+s])          temp1 := id3 * 60.0
                                     id1 := id2 + temp1

      Code Generation ↓
  (NL: selection of target words & order of phrases;
   PL: selection of usable codes, order of codes, and allocation of
   available registers)

  Lexical: 看到[了] (我, 女孩[們])    movf id3, r2
  Structural: 我 看到 女孩[們] [了]   mulf #60.0, r2
                                     movf id2, r1
                                     addf r2, r1
                                     movf r1, id1
Objectives of Optimizing Compilers
 Correct codes: preserve meaning
 Better performance
 Maximum Execution Efficiency
 Minimum Code Size
 Embedded systems
 Minimizing Power Consumption
 Mobile devices
 Typically, faster execution also implies lower power
 Reasonable compilation time
 Manageable engineering and maintenance efforts

71
Optimization for Computer
Architectures (1)
 Parallelism
 Instruction level: multiple operations are executed simultaneously
 Processors check dependencies in sequential instructions and issue
them in parallel
• Hardware scheduler: changes the order of instructions
 Compilers: rearrange instructions to make instruction level
parallelism more effective
 Instruction set supports:
• Very Long Instruction Word (VLIW): issues multiple operations in parallel
• Instructions that can operate on Vector data at the same time
 Compilers: generate codes for such machine from sequential codes
 Processor level: different threads of the same application are run
on different processors
 Multiprocessors + multithreaded codes
• Programmer: write multithreaded codes, vs
• Compiler: generate parallel codes automatically

72
Optimization for Computer
Architectures (2)
 Memory Hierarchies
 No storage that is both fast and large
 Registers (tens ~ hundreds bytes), caches (K~MB),
main/physical memory (M~GB), secondary/virtual memory
(hard disks) (G~TB)
 Using registers effectively is probably the single most
important problem in optimizing a program
 Cache-management by hardware is not effective in
scientific code that has large data structures (arrays)
 Improve effectiveness of memory hierarchies:
• By changing layout of data, or
• Changing the order of instructions accessing the data
 Improve effectiveness of instruction cache:
• Change the layout of codes

73
How To Construct A Compiler
- Language Processing Systems
- High-Level and Intermediate Languages
- Processing Phases
- Quick Review on Syntax & Semantics
- Processing Phases in Detail
- Structure of Compilers

74
Structure of a Compiler
 Front End: Source Dependent
 Lexical Analysis
 Syntax Analysis
 Semantic Analysis
 Intermediate Code Generation
 (Code Optimization: machine independent)
 Back End: Target Dependent
 Code Optimization
 Target Code Generation
75
Structure of a Compiler

[Figure: front ends for Fortran, Pascal, and C all produce a common
Intermediate Code; back ends then generate code from it for MIPS,
SPARC, and Pentium]
History
 1st Fortran compiler: 1950s
 efficient? (compared with assembly program)
 not bad, but much easier to write programs
 high-level languages are feasible.
 18 man-year, ad hoc structure
 Today, we can build a simple compiler in a few
months.
 Crafting an efficient and reliable compiler is still
challenging.
77
Cousins of the Compiler
 Preprocessors: macro definition/expansion
 Interpreters
 Compiler vs. interpreter vs. just-in-time compilation
 Assemblers: 1-pass / 2-pass
 Linkers: link source with library functions
 Loaders: load executables into memory
 Editors: editing sources (with/without syntax prediction)
 Debuggers: symbolically providing stepwise trace
 Profilers: gprof (call graph and time analysis)
 Project managers: IDE
 Integrated Development Environment
 Disassemblers, Decompilers: low-level to high-level
language conversion
78
Applications of Compilation
Techniques

79
Applications of Compilation
Techniques
 Virtually any kind of programming language or
specification language with a regular and
well-defined grammatical structure will need a
kind of compiler (or a variant, or a part of one)
to analyze and then process it.

80
Applications of Lexical Analysis
 Text/Pattern Processing:
 grep: get lines with specified pattern
• Ex: grep ‘^From ‘ /var/spool/mail/andy
 sed: stream editor, editing specified patterns
• Ex: ls *.JPG | sed ‘s/JPG/jpg/’
 tr: simple translation between patterns (e.g., uppercases
to lowercases)
• Ex: tr ‘a-z’ ‘A-Z’ < mytext > mytext.uc
 AWK: pattern-action rule processing
 pattern processing based on regular expression
• Ex: awk '$1==“John"{count++}END{print count} ' <
Students.txt 81
Applications of Lexical Analysis
 Search Engines/Information Retrieval
 full text search, keyword matching, fuzzy
match
 Database Machine
 fast matching over large database
 database filter
 Fast & Multiple Matching Algorithms
 Optimized/specialized lexical analyzers (FSA)
 Examples: KMP, Boyer-Moore (BM), …
82
Applications of Syntax Analysis
 Structured Editor/Word Processor
 Integrated Development Environment (IDE)
 automatic formatting, keyword insertion
 Incremental Parser vs. Full-blown Parsing
 incremental: patching analysis made by incremental
changes, instead of re-parsing or re-compiling
 Pretty Printer: beautify nested structures
 cb (C-beautifier)
 indent (an even more versatile C-beautifier)
83
Applications of Syntax Analysis
 Static Checker/Debugger: lint
 check errors without really running, e.g.,
 statement not reachable
 used before defined

84
Application of Optimization
Techniques
 Data flow analysis
 Software testing:
 Locating errors before running (static checking)
 Locate errors along all possible execution paths
• not only on test data set
 Type Checking
 Dereferencing null or freed pointers
 “Dangerous” user supplied strings
 Bound Checking
 Security vulnerability: buffer over-run attack
 Tracking values of pointers across procedures
 Memory management
 Garbage collection
85
Applications of Compilation
Techniques
 Pre-processor: Macro definition/expansion
 Active Webpages Processing
 Script or programming languages embedded in
webpages for interactive transactions
 Examples: JavaScript, JSP, ASP, PHP
 Compiler Apps: expansion of embedded
statements, in addition to web page parsing
 Database Query Language: SQL
86
Applications of Compilation
Techniques
 Interpreter
 no pre-compilation
 executed on-the-fly
 e.g., BASIC
 Script Languages: C-shell, Perl
 Function: for batch processing multiple
files/databases
 mostly interpreted, some pre-compiled
 Some interpreted and save compiled codes
87
Applications of Compilation
Techniques
 Text Formatter
 Troff, LaTex, Eqn, Pic, Tbl
 VLSI Design: Silicon Compiler
 Hardware Description Languages
 variables => control signals / data
 Circuit Synthesis
 Preliminary Circuit Simulation by Software

88
Applications of Compilation
Techniques
 VLSI Design

89
Advanced Applications
 Natural Language Processing
 advanced search engines: retrieve relevant
documents
 more than keyword matching
 natural language query

 information extraction:
 acquire relevant information (into structured form)
 text summarization:
 get most brief & relevant paragraphs
 text/web mining:
 mining information & rules from text/web
90
Advanced Applications
 Machine Translation
 Translating a natural language into another
 Models:
 Direct translation
 Transfer-Based Model

 Inter-lingua Model

 Transfer-Based Model:
 Analysis-Transfer-Generation (or Synthesis) model

91
Tools for Compiler Construction

92
Tools: Automatic Generation of
Lexical Analyzers and Compilers
 Lexical Analyzer Generator: LEX
 Input: Token Pattern specification (in regular
expression)
 Output: a lexical analyzer
 Parser Generator: YACC
 “compiler-compiler”
 Input: Grammar Specification (in context-free
grammar)
 Output: a syntax analyzer (aka “parser”)
93
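What LEX generates from a pattern specification is, in essence, a matching loop driven by the token regular expressions; a minimal sketch in Python (the token set here is an illustrative assumption, not part of any LEX specification):

```python
import re

TOKEN_SPEC = [                       # (token name, regular expression)
    ("NUM",  r"\d+"),
    ("ID",   r"[A-Za-z_]\w*"),
    ("OP",   r"[+\-*/=]"),
    ("SKIP", r"\s+"),                # whitespace: matched but discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Yield (token_name, lexeme) pairs, skipping whitespace."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

tokens = list(tokenize("rate = rate * 60"))
```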
Tools
 Syntax Directed Translation engines
 translations associated with nodes
 translations defined in terms of translations of
children
 Automatic code generation
 translation rules
 template matching
 Data flow analyses
 dependency of variables & constructs

94
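The idea of a syntax-directed translation engine, where a node's translation is defined in terms of its children's translations, can be sketched as a bottom-up walk of an expression tree (the tuple node shape is an assumption for illustration):

```python
def translate(node):
    """node: ('num', value) for a leaf, or (op, left, right) for an
    interior node whose translation combines its children's translations."""
    if node[0] == "num":
        return node[1]
    op, left, right = node
    l, r = translate(left), translate(right)   # children first (bottom-up)
    return {"+": l + r, "*": l * r}[op]

# initial + rate * 60, with initial = 10 and rate = 2
tree = ("+", ("num", 10), ("*", ("num", 2), ("num", 60)))
result = translate(tree)
```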
Programming Languages
-Issues about Modern PL’s
- Module programming & Parameter passing
- Nested modules & Scopes
- Static vs. dynamic allocation

95
Programming Language Basics
 Static vs. Dynamic Issues or Policies
 Static: determined at compile time
 Dynamic: determined at run time
 Scopes of declaration
 Region in which the use of x refer to a declaration of x
 Static Scope (aka lexical scope):
 Possible to determine the scope of declaration by looking at
the program
 C, Java (and most PL)
• Delimited by block structures
 Dynamic scope:
 At run time, the same use of x could refer to any of several
declarations of x.
96
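Static (lexical) scoping, where the declaration a use of x refers to can be read off the program text, is illustrated by Python's nested functions (a sketch of the concept only, not of any compiler internals):

```python
x = "global"

def outer():
    x = "outer"          # this declaration lexically encloses inner()
    def inner():
        return x         # resolved from the program text, not the call path
    return inner

f = outer()
# f() returns "outer": the use of x binds to the textually enclosing
# declaration, even though f is called where the global x is visible.
```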
Programming Language Basics
 Variable declaration
 Static variables
 Possible to determine the location in memory where the
declared variable can be found
• static int x; // C/C++: one statically allocated copy
• Only one copy of x, can be determined at compile time
• Global declarations and declared constants can also be made
static
 Dynamic variables:
 Local variables without the “static” keyword
• Each object of the class would have its own location where x
would be held.
• At run time, the same use of x in different objects could refer to
any of several different locations.

97
Programming Language Basics
 Parameter Passing Mechanisms
 called by value
 make a copy of physical value
 called by reference
 make a copy of the address of a physical object
 call by name (Algol 60)
 callee executed as if the actual parameter were
substituted literally for the formal parameter in the
code of the callee
• macro expansion of formal parameter into actual
parameter

98
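The difference between call-by-value and call-by-reference can be simulated in Python, where rebinding a parameter never affects the caller but mutating a shared object does (a sketch; the function names are illustrative):

```python
def by_value(n):
    n = n + 1            # rebinds the local copy; caller is untouched

def by_reference(cell):
    cell[0] += 1         # mutates the shared object; caller sees the change

count = 5
by_value(count)          # count is still 5

box = [5]
by_reference(box)        # box is now [6]
```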
Formal Languages

99
Languages, Grammars and
Recognition Machines
I saw a girl in the
park …
Language

define accept
generate

Grammar Parser
(expression) (automaton)
construct

S NP VP S · NP VP Parsing Table
NP pron | det n NP · pron | · det n 100
Languages
 Alphabet - any finite set of symbols
{0, 1}: binary alphabet
 String - a finite sequence of symbols from
an alphabet
1011: a string of length 4
ε : the empty string
 Language - any set of strings on an alphabet
{00, 01, 10, 11}: the set of strings of length 2
∅ : the empty set
101
Grammars
 The sentences in a language may be defined
by a set of rules called a grammar
L: {00, 01, 10, 11}
(the set of binary digits of length 2)
G: (0|1)(0|1)
 Languages of different degrees of regularity can be
specified with grammars of different “expressive
power”
 Chomsky Hierarchy:
 Regular Grammar < Context-Free Grammar < Context-
Sensitive Grammar < Unrestricted
105
Automata
 An acceptor/recognizer of a language is an
automaton which determines if an input
string is a sentence in the language
 A transducer of a language is an automaton
which determines if an input string is a
sentence in the language, and may produce
strings as output if it is in the language
 Implementation: state transition functions
(parsing table)
106
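An acceptor can be implemented directly as a state-transition function (a parsing table in miniature); a minimal DFA sketch that accepts exactly the binary strings of length 2, i.e. the language {00, 01, 10, 11} from the earlier slide:

```python
def accepts(s: str) -> bool:
    """DFA over alphabet {0, 1}: states 0, 1, 2 count symbols read;
    anything longer than 2, or any non-alphabet symbol, rejects."""
    table = {0: 1, 1: 2}          # transition on any valid symbol
    state = 0
    for ch in s:
        if ch not in "01" or state not in table:
            return False          # bad symbol, or string too long
        state = table[state]
    return state == 2             # accepting state: exactly 2 symbols read
```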
Transducer

language L1 language L2
accept translation
Define automaton Define
/ Generate / Generate

grammar G1 grammar G2
construct

107
Meta-languages
 Meta-language: a language used to define
another language

 Different meta-languages will be used to
define the various components of a
programming language so that these
components can be analyzed automatically

108
Definition of Programming
Languages
 Lexical tokens: regular expressions
 Syntax: context free grammars
 Semantics: attribute grammars
 Intermediate code generation:
attribute grammars
 Code generation: tree grammars

109
Implementation of
Programming Languages
 Regular expressions:
finite automata, lexical analyzer
 Context free grammars:
pushdown automata, parser
 Attribute grammars:
attribute evaluators, type checker and
intermediate code generator
 Tree grammars:
finite tree automata, code generator 110
Appendix: Machine Translation

111
Machine Translation (Transfer Approach)

SL Analysis SL Transfer TL Synthesis TL


Text IR IR Text

SL TL
Dictionaries SL-TL Dictionaries
& Grammar Dictionaries & Grammar
Transfer
Inter-lingua Rules

IR: Intermediate Representation

 Analysis is target independent, and
 Generation (Synthesis) is source independent
SL TL 112
Example:
Miss Smith put two books on this dining table.
 Analysis
 Morphological and Lexical Analysis
 Part-of-speech (POS) Tagging

n. Miss
n. Smith
v. put (+ed)
q. two
n. book (+s)
p. on
d. this
n. dining table.
113
Example:
Miss Smith put two books on this dining table.
 Syntax Analysis
S

NP VP

V NP PP

Miss Smith put(+ed) two book(s) on this dining table


114
Example:
Miss Smith put two books on this dining table.
 Transfer

 (1) Lexical Transfer


Miss 小姐
Smith 史密斯
put (+ed) 放
two 兩
book (+s) 書
on 在…上面
this 這
dining table 餐桌
115
Example:
Miss Smith put two books on this dining table.
 Transfer

 (2) Phrasal/Structural Transfer


小姐史密斯放兩書在上面這餐桌
史密斯小姐放兩書在這餐桌上面

116
Example:
Miss Smith put two books on this dining table.

 Generation: Morphological & Structural


史密斯小姐放兩書在這餐桌上面
史密斯小姐放兩 ( 本 ) 書在這 ( 張 ) 餐桌上面

史密斯小姐 ( 把 ) 兩 ( 本 ) 書放在這 ( 張 ) 餐桌上面

Chinese translation: 史密斯小姐把兩本書放在這張餐桌上面
117
source program

lexical
[Aho 86] analyzer

syntax
analyzer

semantic
analyzer
symbol-table error
manager handler
intermediate code
generator

code
optimizer

code
generator

target program 118


position : = initial + rate * 60

lexical analyzer
[Aho 86]
id1 : = id2 + id3 * 60

syntax analyzer
:=
SYMBOL TABLE id1 +
id2 *
1 position … id3 60
2 initial …
semantic analyzer
3 rate … :=
4 id1 +
id2 *
id3 inttoreal
60
119

intermediate code generator


[Aho 86]
temp1 := inttoreal (60)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

code optimizer
temp1 := id3 * 60.0
id1 := id2 + temp1

code generator

Binary Code

120
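The first optimization in the figure, folding inttoreal(60) into the constant 60.0 and substituting it into later instructions, can be sketched as one pass over the three-address code (a toy sketch; the tuple instruction format is an assumption for illustration):

```python
def fold_inttoreal(code):
    """code: list of (target, op, args) three-address instructions.
    Fold inttoreal of an integer constant at compile time and substitute
    the resulting real constant into later uses of the temporary."""
    consts, out = {}, []
    for target, op, args in code:
        if op == "inttoreal" and isinstance(args[0], int):
            consts[target] = float(args[0])   # evaluated at compile time
            continue                          # instruction eliminated
        out.append((target, op, tuple(consts.get(a, a) for a in args)))
    return out

code = [
    ("temp1", "inttoreal", (60,)),
    ("temp2", "*",    ("id3", "temp1")),
    ("temp3", "+",    ("id2", "temp2")),
    ("id1",   "copy", ("temp3",)),
]
optimized = fold_inttoreal(code)
```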
Detailed Steps (1): Analysis
 Text Pre-processing (separating texts from tags)
 Clean up garbage patterns (usually introduced during file
conversion)
 Recover sentences and words (e.g., <B>C</B> omputer)
 Separate Processing-Regions from Non-Processing-Regions (e.g.,
File-Header-Sections, Equations, etc.)
 Extract and mark strings that need special treatment (e.g., Topics,
Keywords, etc.)
 Identify and convert markup tags into internal tags (de-markup;
however, markup tags also provide information)

 Discourse and Sentence Segmentation


 Divide text into various primary processing units (e.g., sentences)
 Discourse: Cue Phrases
 Sentence: mainly classify the type of “Period” and “Carriage Return”
in English (“sentence stops” vs. “abbreviations/titles”)
121
Detailed Steps (2): Analysis (Cont.)
 Stemming
 English: perform morphological analysis (e.g., -ed, -ing, -s, -ly, re-,
pre-, etc.) and Identify root form (e.g., got <get>, lay <lie/lay>, etc.)
 Chinese: mainly detect suffix lexemes (e.g., 孩子們 , 學生們 , etc.)
 Text normalization: Capitalization, Hyphenation, …

 Tokenization
 English: mainly identify split-idiom (e.g., turn NP on) and compound
 Chinese: Word Segmentation (e.g., [ 土地 ] [ 公有 ] [ 政策 ])
 Regular Expression: numerical strings/expressions (e.g., twenty
millions), date, … (each being associated with a specific type)

 Tagging
 Assign Part-of-Speech (e.g., n, v, adj, adv, etc.)
 Associated forms are basically independent of languages starting from
this step
122
Detailed Steps (3): Analysis (Cont.)
 Parsing
 Decide suitable syntactic relationship (e.g., PP-Attachment)

 Decide Word-Sense
 Decide appropriate lexicon-sense (e.g., River-Bank, Money-Bank,
etc.)

 Assign Case-Label
 Decide suitable semantic relationship (e.g., Patient, Agent, etc.)

 Anaphora and Antecedent Resolution


 Pronoun reference (e.g., “he” refers to “the president”)

123
Detailed Steps (4): Analysis (Cont.)
 Decide Discourse Structure
 Decide suitable discourse segments relationship (e.g., Evidence,
Concession, Justification, etc. [Marcu 2000].)

 Convert into Logical Form (Optional)


 Co-reference resolution (e.g., “president” refers to “Bill Clinton”),
scope resolution (e.g., negation), Temporal Resolution (e.g., today,
last Friday), Spatial Resolution (e.g., here, next), etc.
 Identify roles of Named-Entities (Person, Location, Organization),
and determine IS-A (also Part-of) relationship, etc.
 Mainly used in inference related applications (e.g., Q&A, etc.)

124
Detailed Steps (5): Transfer
 Decide suitable Target Discourse Structure
 For example: Evidence, Concession, Justification, etc. [Marcu 2000].

 Decide suitable Target Lexicon Senses


 Sense Mapping may not be one-to-one (sense resolution might be different
in different languages, e.g. “snow” has more senses in Eskimo)
 Sense-Token Mapping may not be one-to-one (lexicon representation
power might be different in different languages, e.g., “DINK”, “ 睨” , etc).
It could be 2-1, 1-2, etc.

 Decide suitable Target Sentence Structure


 For example: verb nominalization, constitute promotion and demotion
(usually occurs when Sense-Token-Mapping is not 1-1)

 Decide appropriate Target Case


 Case Label might change after the structure has been modified
 (Example) verb nominalization: “… that you (AGENT) invite me”  “…
your (POSS) invitation”
125
Detailed Steps (6): Generation
 Adopt suitable Sentence Syntactic Pattern
 Depend on Style (which is the distributions of lexicon selection
and syntactic patterns adopted)

 Adopt suitable Target Lexicon


 Select from Synonym Set (depend on style)

 Add “de” (Chinese), comma, tense, measure (Chinese), etc.


 Morphological generation is required for target-specific tokens

 Text Post-processing
 Final string substitution (replace those markers of special strings)
 Extract and export associated information (e.g., Glossary, Index,
etc.)
 Restore customer’s markup tags (re-markup) for saving
typesetting work
126
