CC Summary (Slides)
Chapter 1
Compilers viewed from many perspectives:
o Construction:
Single pass.
Multi pass.
Load & go.
o Functional:
Debugging.
Optimizing.
Compilers have two fundamental parts:
o Analysis: decompose source code into intermediate representation.
o Synthesis: generate target code from the intermediate representation.
Important software tools in analysis:
o Structure / syntax directed editors: force "syntactically" correct
code to be entered.
o Pretty printers: standardized version for program structure.
o Static checkers: a quick compilation to detect rudimentary errors.
o Interpreters: real time execution of code (line at a time).
o Text formatters: like LaTeX and TROFF.
o Silicon compilers: take input and generate circuit design.
Analysis task for compilation:
o Lexical Analysis:
Left-to-right scan to identify tokens.
Tokens: sequence of chars that have collective meaning.
Linear action (not recursive)
Identify only individual "words" that are the tokens of the
language.
o Hierarchical (Syntax) Analysis:
Grouping of tokens into meaningful collection
Verify that the "words" are correctly assembled into
"sentences"
Recursion is required to identify structure of an expression.
o Semantic Analysis:
Checking to ensure correctness of components.
Determine whether the sentences have a single, unambiguous
interpretation.
Perform type checking (legality of operands).
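The left-to-right token scan described above can be sketched in Python with the re module; the token names and patterns here are illustrative assumptions, not from the slides:

```python
import re

# Illustrative token classes and patterns (hypothetical, for demonstration).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("OP",     r"[+\-*/=]"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(source):
    """Left-to-right scan: return (token, lexeme) pairs, filtering whitespace."""
    tokens = []
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
    return tokens
```

Note the linear (non-recursive) action: one pass over the characters, no nesting tracked, which is exactly why recursion is deferred to the syntax-analysis phase.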
Supporting phases for analysis phase:
o Symbol table creation:
A data structure that contains info about tokens created by
the lexical analyzer.
Updated during analysis phase, and used during synthesis
phases.
o Error Handling:
Detection of different errors which correspond to all phases.
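One way to sketch the symbol table described above is a dictionary keyed by lexeme; the attribute names used here are assumptions for illustration:

```python
class SymbolTable:
    """Maps each lexeme to its attributes. Entries are created during the
    analysis phase and read/updated during the synthesis phases."""

    def __init__(self):
        self.entries = {}

    def insert(self, lexeme, **attrs):
        # Create the entry on first sight; later phases update it in place.
        self.entries.setdefault(lexeme, {}).update(attrs)

    def lookup(self, lexeme):
        # Returns None when the lexeme has never been entered.
        return self.entries.get(lexeme)
```
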
Synthesis task for compilation:
o Intermediate code generation:
Abstract machine version of code (independent of
architecture).
o Code optimization:
Find more efficient ways to execute code.
Replace code with more optimal statements.
Has two approaches: Peephole and High-level language.
o Final code generation:
Generate relocatable machine dependent code.
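The slides do not fix a specific intermediate representation; three-address code is one common machine-independent choice, sketched here (the temporary-naming scheme is an assumption):

```python
temp_count = 0
code = []

def new_temp():
    """Allocate a fresh compiler-generated temporary name (t1, t2, ...)."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def emit(op, arg1, arg2):
    """Emit one three-address instruction; return the temporary holding it."""
    t = new_temp()
    code.append(f"{t} = {arg1} {op} {arg2}")
    return t

# (a + b) * c  becomes:  t1 = a + b ; t2 = t1 * c
t1 = emit("+", "a", "b")
t2 = emit("*", t1, "c")
```

Each instruction has at most one operator on the right, which is what makes the form easy to optimize and to map onto real machine instructions later.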
Grammar: set of rules which govern the interdependencies & structure
among the tokens.
Assembly code: mnemonic names are used for instructions, and symbolic
names are used for memory addresses.
Loader: takes relocatable machine code, alters the addresses, and
places the altered instructions into memory.
Link-editor: takes many (relocatable) machine code programs (with cross-
references) and produces a single file.
o Need to keep track of correspondence between variable names and
corresponding addresses in each piece of code.
Compiler construction tools:
o Parser Generators : Produce Syntax Analyzers
o Scanner Generators (LEX) : Produce Lexical Analyzers
o Syntax-directed Translation Engines (YACC): Generate
Intermediate Code
o Automatic Code Generators : Generate Actual Code
o Data-Flow Engines : Support Optimization
Chapter 2
A Context-free Grammar is utilized to describe the syntactic structure of the
language.
A CFG is characterized by:
o A set of Tokens or Terminal symbols.
o A set of Non-Terminals.
o A set of Production rules.
o A Non-Terminal designated as the start symbol.
A parse tree for a CFG has the following properties:
o Root is labeled with the start symbol.
o Leaf node is a token or epsilon.
o Interior node is a Non-Terminal.
An ambiguous grammar does not enforce associativity.
o A non-ambiguous grammar enforcing left associativity has a parse
tree that grows to the left.
o A non-ambiguous grammar enforcing right associativity has a parse
tree that grows to the right.
Syntax-Directed Translation:
o Associate attributes with grammar rules & constructs and translate as
parsing occurs.
o Each production has a set of semantic rules.
o Each grammar symbol has a set of attributes.
The type of tree traversal that is being performed during semantic rules is
postorder depth-first traversal.
Semantic actions are added into the right sides of the productions.
o Example: expr → result | digit { print("action"); }
Parse tree / derivation of a token string occurs in a top down fashion.
o Uses a grammar to check structure of tokens.
o Can be recursive descent or predictive parsing.
o Parser operates by attempting to match tokens in the input stream.
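The recursive-descent parsing with embedded semantic actions described above can be sketched for a small digit/+/− grammar (the grammar choice is an assumption), translating infix input to postfix:

```python
def parse(source):
    """Recursive descent for:
       expr -> term rest
       rest -> '+' term {print '+'} rest | '-' term {print '-'} rest | ε
       term -> digit {print digit}
    Returns the postfix translation of `source`."""
    out = []
    pos = 0

    def lookahead():
        return source[pos] if pos < len(source) else None

    def term():
        nonlocal pos
        ch = lookahead()
        if ch is None or not ch.isdigit():
            raise SyntaxError(f"digit expected at position {pos}")
        out.append(ch)          # semantic action: emit the digit
        pos += 1

    def rest():
        nonlocal pos
        while lookahead() in ("+", "-"):   # tail recursion unrolled to a loop
            op = lookahead()
            pos += 1                        # match the operator token
            term()
            out.append(op)                  # semantic action: emit the operator

    term()
    rest()
    return "".join(out)
```

The actions fire after both operands are matched, which is the postorder depth-first traversal the slides mention.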
Lexical Analysis process functional responsibilities:
o Input token string is broken down.
o White spaces and comments are filtered out.
o Individual tokens with associated values are identified.
o Symbol table is initialized and entries are constructed for each
"appropriate" token.
Reserved words are placed into the symbol table for easy lookup.
Consider A → α:
o FIRST(α) = set of leftmost tokens that appear in α or in strings
generated by α.
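The FIRST computation above can be sketched as an iteration to a fixed point; the grammar encoding (a dict of tuple productions, with ε as the empty tuple) is an assumption:

```python
EPS = "ε"

def first_sets(grammar):
    """Iteratively compute FIRST for every non-terminal.
    grammar: dict mapping non-terminal -> list of productions (tuples of
    symbols). Symbols not in the dict are treated as terminals."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, prods in grammar.items():
            for prod in prods:
                before = len(first[nt])
                if prod == ():                    # A -> ε
                    first[nt].add(EPS)
                for sym in prod:
                    if sym not in grammar:        # terminal starts the string
                        first[nt].add(sym)
                        break
                    first[nt] |= first[sym] - {EPS}
                    if EPS not in first[sym]:     # sym cannot vanish; stop
                        break
                else:
                    if prod:                      # every symbol was nullable
                        first[nt].add(EPS)
                if len(first[nt]) != before:
                    changed = True
    return first
```
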
Chapter 3
Separation of Lexical analysis from parsing presents a simpler conceptual
model as it emphasizes:
o High cohesion and low coupling
o Well suited to parallel implementation.
o Increase in compiler efficiency (I/O techniques to enhance lexical
analysis).
o Promoting portability.
Major terms in Lexical Analysis:
o Token:
A classification for a common set of strings.
Examples: <Identifier>, <number>.
o Pattern:
The rules which characterize the set of strings for a token.
Examples: recall files and OS wildcards ([A-Z]*.*).
o Lexeme:
Actual sequence of characters that matches pattern and is
classified by a token.
Examples: Identifiers: x, count, name, etc..
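The token/pattern/lexeme distinction can be seen in one small Python example (the identifier rule here is an illustrative C-style assumption):

```python
import re

# Pattern: the rule characterizing the set of strings for the <identifier> token.
ID_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Each match below is a lexeme: an actual character sequence matching the
# pattern and therefore classified by the <identifier> token.
lexemes = ID_PATTERN.findall("count = old_count + 1")
```
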
Error handling in lexical analysis is very localized, with respect to input
source.
o Errors occur when prefix of remaining input doesn't match any defined
token.
o Possible error recovery actions:
Deleting or inserting input characters.
Replacing or transposing characters.
Lexical Analyzer construction techniques:
o Lexical analyzer generator.
o Hand-code / High-level Language (I/O facilitated by the language).
o Hand-code / Assembly Language (Explicitly manage I/O)
Language: any set of strings over a fixed alphabet.
Regular Expression: a set of rules/techniques used for constructing
sequences of symbols (strings) from an alphabet.
o For a fixed alphabet Σ:
ε is a regular expression denoting {ε}.
If a is in Σ, then a is a regular expression denoting {a}.
All operators are left-associative. Parentheses are dropped as allowed by
precedence rules.
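The precedence rules can be checked concretely with Python's re module (assuming its semantics for these three operators mirror the formal definition, which they do for *, concatenation, and |):

```python
import re

# Closure (*) binds tighter than concatenation, which binds tighter than
# union (|), so the pattern a|bc* groups as  a | (b(c*)).
pattern = re.compile(r"a|bc*")

matches_a = pattern.fullmatch("a") is not None       # left alternative
matches_bccc = pattern.fullmatch("bccc") is not None # b followed by c-closure
matches_ac = pattern.fullmatch("ac") is not None     # not (a|b)c* -- rejected
```
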
Transition Diagrams (TD): used to represent the tokens.
o Attempts to match lexeme to a pattern.
o Each TD has:
States: represented by circles.
Actions: represented by arrows between states.
Start state: beginning of a pattern (arrowhead)
Final state(s): end of pattern (concentric circles)
o Each TD is Deterministic.
Lexical Analyzer matches all keywords/reserved words as ids
o After the match, the symbol table or a special keyword table is
consulted
o Keyword table contains string versions of all keywords and associated
token values
o When a match is found, the token is returned, along with its symbolic
value
o If a match is not found, then it is assumed that an id has been
discovered.
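The match-as-id-then-consult-the-table technique above is short enough to sketch directly; the keyword set and token names are illustrative assumptions:

```python
# Keyword table: string versions of the reserved words and their token values
# (a small, hypothetical selection).
KEYWORDS = {"if": "IF", "else": "ELSE", "while": "WHILE"}

def classify(lexeme):
    """Every identifier-shaped lexeme is first matched as an id; the keyword
    table then decides whether it is really a reserved word."""
    return KEYWORDS.get(lexeme, "ID")
```
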
Finite Automata: a recognizer that takes an input string & determines
whether it's a valid sentence of the language.
o Deterministic: has at most one action for a given input symbol.
Complex but more precise.
o Non-Deterministic: has more than one alternative action for the same
input symbol.
Easy but less precise.
o Both types are used to recognize regular expressions
Each NFA consists of:
o S, a set of states.
o Σ, the symbols of the input alphabet.
o δ, a transition function.
o s0, the start state.
o F, a set of final or accepting states.
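The five components above map directly onto a small simulation; the example NFA (strings over {a, b} ending in "ab") is an illustrative assumption:

```python
# δ maps (state, symbol) to a SET of next states -- the multiple moves on
# (0, "a") are exactly what makes this automaton non-deterministic.
DELTA = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}
START, ACCEPT = {0}, {2}   # s0 and F; S and Σ are implicit in DELTA

def nfa_accepts(s):
    """Track the SET of states reachable so far; accept if any final state
    is reachable after the whole input is consumed."""
    states = set(START)
    for ch in s:
        states = set().union(*(DELTA.get((q, ch), set()) for q in states))
    return bool(states & ACCEPT)
```
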
Problems in NFA:
o Valid input might not be accepted.
o NFA may behave differently on the same input.
Relationship of NFAs to Compilation:
o Regular Expressions are "Recognized" by NFA
o Regular Expressions are "Patterns" for "Tokens"
o Tokens are building blocks for lexical analysis.
o Lexical analyzer can be described by a collection of NFAs. Each
NFA is for a language token.
Transition diagrams are the states (circles), arcs, and final states.
Transition tables are more suitable to representation within a computer.
Each state in DFA corresponds to a SET of states of the NFA. (same input can
have multiple paths in NFA)
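The set-of-NFA-states idea is the subset construction; a minimal sketch follows (ε-moves are omitted for brevity, so this only covers NFAs without them, such as the ends-in-"ab" example):

```python
def nfa_to_dfa(delta, start, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states.
    delta maps (state, symbol) -> set of next states."""
    start_set = frozenset(start)
    dfa = {}                 # frozenset -> {symbol: frozenset}
    worklist = [start_set]
    while worklist:
        S = worklist.pop()
        if S in dfa:
            continue
        dfa[S] = {}
        for ch in alphabet:
            # Union of all NFA moves from any state in S on ch.
            T = frozenset().union(*(delta.get((q, ch), set()) for q in S))
            dfa[S][ch] = T
            if T not in dfa:
                worklist.append(T)
    return dfa
```

Each resulting DFA state answers "which NFA states could we be in?", which is why the DFA never needs to guess.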
The syntax of a regular expression is the determining factor for NFA
construction and structure.
Let r be a regular expression with NFA N(r); then:
o N(r) has at most 2 × (#symbols + #operators) states for r.
o N(r) has exactly one start state and one accepting state.
o Each state of N(r) has at most one outgoing edge on some a ∈ Σ, or at
most two outgoing ε-edges.
o Each state must have a unique name.