CSC Slides Intro N Lex
Reading Material
• A Practical Approach to Compiler Construction, Des Watson (2017), Springer.
• Compilers: Principles, Techniques, & Tools, A.V. Aho, R. Sethi & J.D. Ullman, Pearson Education.
• Principles of Compiler Design, A.V. Aho and J.D. Ullman, Addison-Wesley.
2/
Course Outline
• Introduction to Compilers/Compiling
• Lexical Analysis
• Syntax Analysis: top-down & bottom-up parsing
• Type Checking, Run-Time Environments, etc.
• Intermediate Code Generation
• Code Generation and Optimization
3/
Introduction to Compilers/Compiling
4/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Abstract Compilation Process
5/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Preprocessor
• A preprocessor produces input to compilers.
• It may perform the following functions (illustrated in the sketch after this list):
• Macro processing: A preprocessor may allow a user to define macros that are shorthands for longer constructs.
• File inclusion: A preprocessor may include header files into the program text.
• Rational preprocessing: These preprocessors augment older languages with more modern flow-of-control and data-structuring facilities.
• Language extensions: These preprocessors attempt to add capabilities to the language by what amounts to built-in macros.
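As a concrete illustration (a minimal C sketch, not taken from the slides), macro processing and file inclusion are exactly what the C preprocessor does before the compiler proper sees the code:

#include <stdio.h>              /* file inclusion: the header text is pasted in here */
#define SQUARE(x) ((x) * (x))   /* macro processing: SQUARE is shorthand for a longer construct */

int main(void) {
    /* after preprocessing, the line below reads: printf("%d\n", ((5) * (5))); */
    printf("%d\n", SQUARE(5));
    return 0;
}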
6/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Compiler
• A compiler is a translator program that translates a source program written in a high-level language (HLL) into an equivalent target program in machine-level language (MLL).
• It may also translate from one HLL to another HLL, e.g. C++ to Java, C to C++, etc.
• Error Detection: An important part of a compiler is the error messaging system.
Structure of a Compiler
7 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Compiler
• Executing a program written in a HLL is basically a two-part process:
• the source program must first be compiled/translated into an object (target) program.
• Then the resulting object program is loaded into memory and executed.
8/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
9/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Assembler
• Assembly language was at one point the main mode of writing computer programs.
• Mnemonics (symbols) were used for each machine instruction and were subsequently translated into machine language.
• Because of the difficulty of writing or reading programs in machine language:
• Programs known as assemblers were written to automate the translation of assembly language into machine language.
• The input to an assembler is called the source program; the output is a machine-language translation (the object program).
10 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Loader and Link-Editor
• Object programs produced by assemblers are normally placed into memory and executed.
• Core (memory) is wasted when the assembler is left in memory while the program is being executed.
• Also, the programmer would have to retranslate the program with each execution, thus wasting translation time.
• To overcome this problem of wasted translation time and memory, system programmers developed another component called the loader.
11 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Loader and Link-Editor
• A loader is a program that places programs into memory and prepares them
for execution.
• The task of adjusting programs so they may be placed in arbitrary core
locations is called relocation.
• Link-Editor: Large programs are often compiled in pieces,
• so the relocatable machine code may have to be linked together with other relocatable
object files and library files into the code that actually runs on the machine.
• The linker resolves external memory addresses, where the code in one file
may refer to a location in another file.
• The loader then puts together all of the executable object files into memory
for execution.
12 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
13 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Interpreter
• An interpreter is a program that appears to execute a source program as if it were machine language.
• Languages such as BASIC, SNOBOL, and LISP can be translated using interpreters.
• Java also uses an interpreter.
• The process of interpretation can be carried out in the following phases:
• Lexical analysis
• Syntax analysis
• Semantic analysis
• Direct execution
14 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Interpreter
• Advantages:
• Modifications to the user program can easily be made and applied as execution proceeds.
• The type of object that a variable denotes may change dynamically.
• Debugging a program and finding errors is simplified for an interpreted program.
• The interpreter for the language makes the program machine independent.
• Disadvantages:
• The execution of the program is slower.
• Memory consumption is higher.
15 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• A compiler operates in phases.
• A phase is a logically interrelated operation that takes the source program in one representation and produces output in another representation.
• There are two main phases of compilation:
• Analysis (machine independent / language dependent)
• Synthesis (machine dependent / language independent)
16 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• The analysis part breaks up the source program into constituent
pieces and imposes a grammatical structure on them.
• It then uses this structure to create an intermediate representation
of the source program.
• Both the syntactic correctness and the semantic soundness of the source code are checked.
• Information about the source program is collected and stored in a
data structure called a symbol table, which is passed along with the
intermediate representation to the synthesis part.
17 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol
table.
• The analysis part is often called the front end of the compiler and the synthesis part the back end.
18 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
19 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Lexical Analysis
• The lexical analyzer reads the stream of characters making up the source
program and groups the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form:
<token-name, attribute-value>
• For example: position = initial + rate * 60
20 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Lexical Analysis
• position is a lexeme that is mapped into the token <id,1>,
• where id is an abstract symbol for identifier and 1 points to the symbol-table entry for position.
• The symbol-table entry for an identifier holds information about the identifier, such as its name and type.
• = is mapped into the token <=>.
• initial is mapped to <id,2>, + to <+>, rate to <id,3>, * to <*>, and 60 to <60>.
• The full token stream is therefore: <id,1> <=> <id,2> <+> <id,3> <*> <60> (see the sketch below).
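As a rough sketch (illustrative names only, not the compiler's actual data structures), the token stream above could be written down in C as pairs of a token name and an attribute, where the attribute is a symbol-table index when one applies:

/* Hypothetical token representation for: position = initial + rate * 60 */
enum token_name { ID, ASSIGN, PLUS, TIMES, NUM };

struct token {
    enum token_name name;   /* the token name passed to the parser              */
    int attribute;          /* e.g. a symbol-table index, or -1 when not needed */
};

struct token stream[] = {
    { ID, 1 }, { ASSIGN, -1 }, { ID, 2 }, { PLUS, -1 },
    { ID, 3 }, { TIMES, -1 },  { NUM, 60 }
};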
21 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Syntax Analysis
• The second phase of the compiler is syntax analysis or parsing.
• Creates a tree-like intermediate representation that depicts the grammatical
structure of the token stream.
• A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the arguments
of the operation.
22 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
23 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
24 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Semantic Analysis
• The semantic analyzer uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency with the
language definition.
• It also gathers type information and saves it in either the syntax tree or the
symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands.
25 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Intermediate Code Generation
• In translating a source program into target code, one or more intermediate
representations may be constructed.
• E.g. Syntax trees, commonly used during syntax and semantic analysis.
• After syntax and semantic analysis, an explicit low-level or machine-like intermediate representation may be generated (a program for an abstract machine).
• This intermediate representation should have two important properties:
• it should be easy to produce and it should be easy to translate into the target machine (see the sketch below).
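A common choice is three-address code, in which each instruction has at most one operator on its right-hand side. A sketch (written here as ordinary C, with illustrative temporaries t1–t3) of what the running example position = initial + rate * 60 might be lowered to:

/* Each temporary holds the result of exactly one operator. */
void lower_example(float initial, float rate, float *position) {
    float t1, t2, t3;
    t1 = (float)60;        /* t1 = inttofloat(60) */
    t2 = rate * t1;        /* t2 = id3 * t1       */
    t3 = initial + t2;     /* t3 = id2 + t2       */
    *position = t3;        /* id1 = t3            */
}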
26 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Code Optimization
• The machine-independent code-optimization phase attempts to
improve the intermediate code so that better target code will result.
• Usually better means faster, but other objectives may be desired,
such as shorter code, or target code that consumes less power.
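For instance (an illustrative sketch, not a required transformation), an optimizer might notice that 60 can be converted to a floating-point constant once at compile time and that t3 exists only to be copied into position, shortening the three-address sketch above to:

void lower_example_optimized(float initial, float rate, float *position) {
    float t1;
    t1 = rate * 60.0f;          /* conversion of 60 done at compile time */
    *position = initial + t1;   /* copy through t3 eliminated            */
}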
27 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Code Generation
• The code generator takes as input an intermediate representation of the
source program and maps it into the target language.
• If the target language is machine code, registers or memory locations are
selected for each of the variables used by the program.
• Then, the intermediate instructions are translated into sequences of
machine instructions that perform the same task.
• A crucial aspect of code generation is the judicious assignment of registers to
hold variables.
28 /
2.0 Lexical Analysis
Overview
• The LA (lexical analyzer) is the first phase of a compiler.
• Its main task is to read the input characters from the source code and produce as output a sequence of tokens that the parser uses for syntax analysis.
30 /
2.0 Lexical Analysis
Overview
• To identify the tokens, we need some method of describing the possible tokens
that can appear in the input stream.
• For this purpose, regular expressions are often used,
• a notation that can be used to describe essentially all the tokens of a programming language.
• Secondly, having decided what the tokens are, we need some mechanism to
recognize these in the input stream.
• This is done by the token recognizers, which are designed using transition diagrams and finite
automata.
31 /
2.0 Lexical Analysis
Role of Lexical Analyzer
• Its job is to turn a character input stream from a source file into a token stream by
breaking it into pieces and skipping over irrelevant details.
• Stripping out comments and whitespace (blanks, newlines, tabs, and other characters that are used to separate tokens in the input).
• It enters lexemes such as identifiers into the symbol table and also reads from it to determine the proper token it must pass to the parser.
• It performs other tasks such as:
• correlating error messages generated by the compiler with the source program.
32 /
2.0 Lexical Analysis
Why Separate Lexical Analysis & Syntactic Analysis (Parsing)?
• The primary benefit of doing so is a significantly simplified job for the subsequent syntactic analysis,
• which would otherwise have to expect whitespace and comments all over the place.
• It also classifies input tokens into types like INTEGER, IDENTIFIER, WHILE-keyword or OPENING-BRACKET.
• Another benefit of the lexing phase is that it greatly compresses the input, by about 80%.
• A lexer essentially takes care of the first layer of a regular-language view on the input language.
33 /
2.0 Lexical Analysis
Tokens, Patterns, and Lexemes
• Three related but distinct terms are often used:
• Token: A token is a group of characters having collective meaning:
• typically a word or punctuation mark, separated out by the lexical analyzer and passed to the parser.
• The token names are the input symbols that the parser processes.
• Pattern: A rule that describes the set of strings associated with a token,
• often expressed as a regular expression describing how a particular token can be formed.
• Lexeme: A sequence of characters in the source program that matches the pattern for a token and is identified by the lexical analyzer as an instance of that token.
34 /
2.0 Lexical Analysis
Tokens, Patterns, and Lexemes
36 /
2.0 Lexical Analysis
Tokens, Patterns, and Lexemes
• Example: The token names and associated attribute values for the
Fortran statement are written as a sequence of pairs
37 /
2.0 Lexical Analysis
Lexical Errors
• Lexical analyzers mostly cannot detect errors without the aid of other
components.
• For instance, if the string fi is encountered for the first time in a C
program in the context: fi ( a == f(x)) …
• It cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
• In general, when an error is found, the lexical analyzer stops (but other
actions are also possible).
• The simplest recovery strategy is "panic mode" recovery.
38 /
2.0 Lexical Analysis
Assignment
• Divide the following C++ program into appropriate lexemes.
• Which lexemes should get associated lexical values?
• What should those values be?
39 /
2.0 Lexical Analysis
Stages of a lexical analyzer
• Scanner
• Based on a finite state machine.
• If it lands on an accepting state, it takes note of the type and position of the
acceptance, and continues.
• When the lexical analyzer lands on a dead state, it is done.
• The last accepting state reached is the one that represents the type and length of the longest valid lexeme.
• The "extra" invalid character is "returned" to the input buffer.
40 /
2.0 Lexical Analysis
Stages of a lexical analyzer
• Evaluator
• Goes over the characters of the lexeme to produce a value.
• The lexeme’s type combined with its value is what properly constitutes a
token, which can be given to a parser.
• Some tokens such as parentheses do not really have values, and so the
evaluator function for these can return nothing.
• The evaluators for integers, identifiers, and strings can be considerably more
complex.
• Sometimes evaluators can suppress a lexeme entirely, concealing it from
the parser, which is useful for whitespace and comments.
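A minimal hand-written sketch of the two stages in C, for a toy language with only identifiers and integer constants (get_token, the token layout, and the 64-character limit are all illustrative assumptions):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

enum kind { TOK_ID, TOK_NUM, TOK_EOF };
struct token { enum kind kind; long num; char text[64]; };

static struct token get_token(FILE *in) {
    struct token t = { TOK_EOF, 0, "" };
    int c = fgetc(in);
    while (c == ' ' || c == '\t' || c == '\n') c = fgetc(in);  /* suppressed, never reaches the parser */
    if (c == EOF) return t;

    size_t n = 0;
    if (isdigit(c)) {                              /* scanner: longest run of digits */
        t.kind = TOK_NUM;
        while (isdigit(c) && n < sizeof t.text - 1) { t.text[n++] = (char)c; c = fgetc(in); }
    } else if (isalpha(c) || c == '_') {           /* scanner: longest identifier    */
        t.kind = TOK_ID;
        while ((isalnum(c) || c == '_') && n < sizeof t.text - 1) { t.text[n++] = (char)c; c = fgetc(in); }
    } else {
        fprintf(stderr, "unexpected character '%c'\n", c);
        exit(1);
    }
    t.text[n] = '\0';
    if (c != EOF) ungetc(c, in);                   /* "extra" character returned to the input */
    if (t.kind == TOK_NUM) t.num = strtol(t.text, NULL, 10);   /* evaluator: compute the value */
    return t;
}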
41 /
2.0 Lexical Analysis
Stages of a lexical analyzer
• Example
42 /
2.0 Lexical Analysis
Input Buffering
• The LA scans the characters of the source program one at a time to
discover tokens.
• Often, one or more characters beyond the next token may have to be examined before the token itself can be determined.
• For instance, we cannot be sure we've seen the end of an identifier
until we see a character that is not a letter or digit or underscore.
• Also, in C, single-character operators like - , = , or < could also be the
beginning of a two-character operator like -> , == , or <= .
43 /
2.0 Lexical Analysis
Input Buffering
• For these reasons, it is desirable for the lexical analyzer to read its input from an input buffer.
• Because a large amount of time can be consumed scanning characters,
• specialized buffering techniques have been developed to reduce the amount of overhead required to process an input character.
• Buffering techniques:
• Buffer pairs
• Sentinels
44 /
2.0 Lexical Analysis
Input Buffering: Buffer pairs
• Each buffer is of the same size N,
• N is usually the size of a disk block, e.g., 4096 bytes.
45 /
2.0 Lexical Analysis
Input Buffering: Buffer pairs
• Once the next lexeme is determined, forward is set to the character at its right
end.
• After recording the lexeme, lexemeBegin is set to the character immediately after the lexeme just found.
46 /
2.0 Lexical Analysis
Input Buffering: Sentinels
• With the buffer-pair scheme, a check is made each time we advance forward to ensure that we have not moved off one of the buffers; if so, the other buffer must be reloaded.
• The sentinel technique combines the buffer-end test with the test for the current character by placing a special sentinel character (typically eof), which cannot be part of the source program, at the end of each buffer (a sketch follows).
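A rough C sketch of the sentinel check (the buffer size, the names, and the fill_buffer helper are assumptions made for illustration):

#define N 4096                      /* buffer size, e.g. one disk block        */
#define SENTINEL '\0'               /* stands in for the eof sentinel here     */

static char buf[2][N + 1];          /* buffer pair, one extra slot per buffer  */
static char *forward = buf[0];      /* scanning pointer                        */

/* Advance forward by one character.  In the common case only the sentinel
   test is needed; a buffer is reloaded only when its end is reached. */
static char advance(void) {
    char c = *forward++;
    if (c == SENTINEL) {
        if (forward == buf[0] + N + 1) {        /* end of first buffer   */
            /* fill_buffer(buf[1]);  (assumed helper that reads N bytes) */
            forward = buf[1];
        } else if (forward == buf[1] + N + 1) { /* end of second buffer  */
            /* fill_buffer(buf[0]); */
            forward = buf[0];
        } else {
            return SENTINEL;                    /* real end of the input */
        }
        c = *forward++;
    }
    return c;
}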
47 /
2.0 Lexical Analysis
Regular Expressions (REGEX)
• REGEX provide a concise and flexible means of describing patterns in text, which
is essential for tasks like lexical analysis in compiler design.
• They are sequences of characters that define search patterns, primarily used for
string matching within texts.
• These patterns can be as simple as matching a single character,
• or as complex as combinations of various characters, special symbols, and operators.
50 /
2.0 Lexical Analysis
Regular Expressions
• Here are the rules that define the regular expressions over an alphabet Σ:
• ϵ is a regular expression denoting {ϵ}, that is, the language containing only the empty string.
• For each ‘a’ in Σ, a is a regular expression denoting {a}, the language with only one string, consisting of the single symbol ‘a’.
• If R and S are regular expressions denoting the languages L(R) and L(S), then:
• (R) | (S) denotes L(R) ∪ L(S)
• R.S denotes L(R).L(S)
• R* denotes (L(R))*
51 /
2.0 Lexical Analysis
Regular Definitions
• For notational convenience, names may be given to certain regular expressions
and used in subsequent expressions, as if the names were themselves symbols.
• E.g. C identifiers are strings of letters, digits, and underscores.
• Here is a regular definition for the language of C identifiers.
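The definition itself is not reproduced on the slide; the standard one (following Aho et al.) reads roughly:

letter_  →  A | B | ... | Z | a | b | ... | z | _
digit    →  0 | 1 | ... | 9
id       →  letter_ ( letter_ | digit )*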
52 /
2.0 Lexical Analysis
Recognition of Tokens
• We will consider how to take the patterns for all the needed tokens and
• build a piece of code that examines the input string and finds a prefix that is a lexeme matching
one of the patterns.
53 /
2.0 Lexical Analysis
Recognition of Tokens
• For relop, comparison operators like =, <>, >=, <= are considered.
• The terminals of the grammar, which are if, then, else, relop, id, and number, are the names of tokens as far as the lexical analyzer is concerned.
• The patterns for these tokens are described using regular definitions as below:
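Those regular definitions are not reproduced here; the usual ones (again roughly following Aho et al.) look like:

digit   →  [0-9]
digits  →  digit+
number  →  digits (. digits)? (E [+-]? digits)?
letter  →  [A-Za-z]
id      →  letter ( letter | digit )*
if      →  if
then    →  then
else    →  else
relop   →  < | > | <= | >= | = | <>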
54 /
2.0 Lexical Analysis
Recognition of Tokens
• For this language, the lexical analyzer will recognize the keywords if, then, and else,
• as well as lexemes that match the patterns for relop, id, and number.
56 /
2.0 Lexical Analysis
Transition Diagram
• A transition diagram has a collection of nodes or circles, called states.
• Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns.
• Edges are directed from one state of the transition diagram to another.
• Each edge is labeled by a symbol or set of symbols.
• If we are in some state s and the next input symbol is a,
• we look for an edge out of state s labeled by a.
• If we find such an edge, we advance the forward pointer and enter the state of the transition diagram to which that edge leads.
57 /
2.0 Lexical Analysis
Transition Diagram: Some Important Conventions
i. One state is designated the start state, or initial state;
• It is indicated by an incoming edge labeled "start".
• The transition diagram always begins in the start state before any input symbols are read.
58 /
2.0 Lexical Analysis
Transition Diagram: Example: Transition diagram for relop
• Lexemes that match the pattern for relop: <, <>, <=, >, >=, =.
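The diagram itself is not reproduced here, but a hand-coded C sketch of the same recognition logic (the attribute names LT, LE, etc. are illustrative assumptions) would be:

enum relop_attr { LT, LE, EQ, NE, GT, GE };

/* Returns the relop attribute for the operator starting at p, or -1 if the
   text is not a relational operator; *len receives the lexeme length. */
static int scan_relop(const char *p, int *len) {
    switch (p[0]) {
    case '<':
        if (p[1] == '=') { *len = 2; return LE; }
        if (p[1] == '>') { *len = 2; return NE; }
        *len = 1; return LT;        /* '<' followed by any other character */
    case '=':
        *len = 1; return EQ;
    case '>':
        if (p[1] == '=') { *len = 2; return GE; }
        *len = 1; return GT;
    default:
        return -1;
    }
}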
59 /
2.0 Lexical Analysis
Transition Diagram: Recognition of Reserved Words and Identifiers
• Recognizing keywords and identifiers presents a problem.
• Usually, keywords like if or then are reserved so they are not identifiers even
though they look like them.
• The identifier transition diagram will also recognize the keywords if, then, and else of our running example.
60 /
2.0 Lexical Analysis
Transition Diagram: Recognition of Reserved Words and Identifiers
• There are two ways that we can handle reserved words that look like identifiers (a sketch of the first follows this list):
• Install the reserved words in the symbol table initially.
• A field of the symbol-table entry indicates that these strings are never ordinary identifiers and tells which token they represent.
• Create separate transition diagrams for each keyword;
• an example for the keyword then is shown below.
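A minimal sketch of the first approach in C (the table layout, token codes, and lookup_or_install are hypothetical names used only for illustration):

#include <string.h>

enum { IF = 256, THEN, ELSE, ID };

struct entry { const char *lexeme; int token; };

/* Reserved words installed in the table before lexing begins. */
static struct entry table[512] = { { "if", IF }, { "then", THEN }, { "else", ELSE } };
static int n_entries = 3;

/* Keyword token if the lexeme was pre-installed; otherwise install it and return ID. */
static int lookup_or_install(const char *lexeme) {
    for (int i = 0; i < n_entries; i++)
        if (strcmp(table[i].lexeme, lexeme) == 0)
            return table[i].token;
    table[n_entries].lexeme = lexeme;   /* assumes the caller keeps the string alive */
    table[n_entries].token = ID;
    n_entries++;
    return ID;
}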
61 /
2.0 Lexical Analysis
Transition Diagram: Recognition of unsigned Numbers
• While in state 13:
• if anything but a digit, dot, or E is seen, then an integer, e.g. 123, has been found – enter state 20
• If a dot is seen, then we have an "optional fraction" – enter state 14
• If an E is seen, then we have an "optional exponent" – enter state 16
62 /
2.0 Lexical Analysis
Transition Diagram: Recognition of Whitespace
• Here we look for one or more “whitespace" characters, represented by delim;
• Typically, blank, tab, newline, and perhaps other characters that are not considered by the language
design to be part of any token.
63 /
2.0 Lexical Analysis
Finite Automaton (FA)
• At the heart of transition diagrams is the formalism known as finite automata.
• Finite automata are recognizers;
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence
of that language, and “no” otherwise.
64 /
2.0 Lexical Analysis
FA: Nondeterministic Finite Automata (NFA)
• An NFA is a mathematical model that consists of:
1. A finite set of states, S.
2. A set of input symbols Σ, the input alphabet .
• We assume that ϵ, the empty string, is never a member of Σ.
3. A transition function that gives, for each state and for each symbol in Σ ∪ {ϵ}, a set of next states.
4. A state s0 from S that is distinguished as the start state or initial state.
5. A set of states F, a subset of S, that are distinguished as the accepting
states or final states.
65 /
2.0 Lexical Analysis
FA: Nondeterministic Finite Automata (NFA)
• An NFA or a DFA can be represented by a transition graph,
• where the nodes are states and the labelled edges represent the transition function.
• An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.
66 /
2.0 Lexical Analysis
FA: Nondeterministic Finite Automata (NFA) – E.g.
• The transition graph for an NFA recognizing the language of the regular expression (a|b)*abb – that is, strings of a's and b's ending in abb.
• Thus, the only strings reaching the accepting state are those that end in abb.
67 /
2.0 Lexical Analysis
Nondeterministic Finite Automata (NFA) – Transition Tables
• An NFA can also be represented by a transition table,
• whose rows correspond to states, and
• columns to the input symbols and ϵ.
• Transition table for (a|b)*abb – strings of a's and b's ending in abb (shown below).
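The table is not reproduced on the slide; for the NFA above (states 0–3, with 3 accepting) it works out as:

State     a         b        ϵ
0         {0, 1}    {0}      ∅
1         ∅         {2}      ∅
2         ∅         {3}      ∅
3         ∅         ∅        ∅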
68 /
2.0 Lexical Analysis
Nondeterministic Finite Automata (NFA) – Transition Tables
• The transition table has the advantage that we can easily find the transitions
on a given state and input.
• Its disadvantage is that it takes a lot of space when the input alphabet is large, yet most states do not have any moves on most of the input symbols.
69 /
2.0 Lexical Analysis
Deterministic Finite Automata (DFA)
• A deterministic finite automaton (DFA) is a special case of an NFA where:
• No state has an ϵ-transition, and
• For each symbol a and state s, there is at most one edge labeled a leaving s.
• i.e. the transition function maps a state-symbol pair to a single state (not to a set of states).
70 /
2.0 Lexical Analysis
Deterministic Finite Automata (DFA) – E.g.
• The transition graph of a DFA accepting (a|b)*abb
• For the input string ababb,
• this DFA enters the sequence of states 0, 1, 2, 1, 2, 3 and returns "yes" (a table-driven C sketch follows).
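A small, self-contained sketch of this DFA as a table-driven recognizer in C (the state numbering follows the 0–3 scheme above; everything else is illustrative):

#include <stdio.h>

/* move[state][symbol]: symbol 0 = 'a', symbol 1 = 'b' */
static const int move[4][2] = {
    /* state 0 */ { 1, 0 },
    /* state 1 */ { 1, 2 },
    /* state 2 */ { 1, 3 },
    /* state 3 */ { 1, 0 },
};

/* Returns 1 if s, a string of a's and b's, matches (a|b)*abb; otherwise 0. */
static int matches_abb(const char *s) {
    int state = 0;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b') return 0;   /* symbol outside the alphabet   */
        state = move[state][*s == 'b'];
    }
    return state == 3;                          /* state 3 is the accepting state */
}

int main(void) {
    printf("%d\n", matches_abb("ababb"));   /* prints 1 */
    printf("%d\n", matches_abb("abab"));    /* prints 0 */
    return 0;
}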
71 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• There are two broad approaches to lexical analyser construction.
• The first approach tackles the problem directly by coding the analyser in an
appropriate implementation language – Direct Implementation
• Lexical analysers in many compilers are written this way.
72 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Direct Implementation
• White Space, Single Character Tokens, Other Short Tokens, Identifiers and
Reserved Words, Integer Constants, Comments.
73 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Tool-Based Implementation
• The Unix-based program lex and its descendants (flex and JLex) are some of the most
popular lexical analyser generating tools.
• Lex takes as input a set of regular expressions defining the tokens, each being
associated with some C code indicating the action to be performed when each token is
recognised.
• The output of the lex program is itself a C program that recognises instances of the
regular expressions and acts according to the corresponding action specifications.
• flex, like lex, is used to generate analysers in C, while JLex generates analysers in Java.
• Most of these tools largely share a common format for the specification of the regular expressions.
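A minimal flex specification in this spirit (the token classes and the printf actions are purely illustrative; a real specification would return token codes to the parser instead of printing):

%option noyywrap
%%
[ \t\n]+                    { /* skip whitespace */ }
if|then|else                { printf("KEYWORD %s\n", yytext); }
[A-Za-z_][A-Za-z0-9_]*      { printf("ID %s\n", yytext); }
[0-9]+                      { printf("NUM %s\n", yytext); }
"<="|"<>"|">="|"<"|">"|"="  { printf("RELOP %s\n", yytext); }
.                           { printf("UNKNOWN %s\n", yytext); }
%%
int main(void) { return yylex(); }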
74 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Advantages of Direct Implementation
• There may be no compatible lexical analyser generating tools available for the programming
language being used to implement the compiler – Programming by hand may be the only
option.
• The process of compiler construction is less dependent on the availability of other software
tools, and there is no need to learn the language of a new software tool.
• E.g., a newcomer may find specifying complex regular expressions in flex (and other, similar
tools) rather difficult.
• There are no real constraints on the techniques used for recognising tokens because it is easy to add ad-hoc code to
deal with awkward analysis tasks.
• The memory requirements of the lexical analyser can be modest.
• A lexical analyser generated automatically may be more memory-hungry, despite the effect of techniques for internal table compression.
• Performance of the lexical analyser produced in this way can be very good.
75 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Advantages of Tool-Based Implementation
• The code is more likely to be correct, particularly for lexical tokens with a complex structure.
• E.g., writing code by hand for recognising a floating-point constant as found in traditional
programming languages is a daunting task.
• But once a regular expression has been constructed, the software tool should generate
accurate code without a problem.
• Regular expression specifications for the lexical tokens may be available in the language
specification or documentation.
76 /
77 /