
CSC 401: Compiler Construction

2024/2025 Academic Year


1/
Mode of Assessment
• Assignments – 10 Marks
• Quizzes – 10 Marks
• Mid-Semester Exams – 20 Marks
• Main Exams – 60 Marks

Reading Material
• A Practical Approach to Compiler Construction, Des Watson (2017), Springer.
• Compilers: Principles, Techniques, & Tools, A.V. Aho, R. Sethi & J.D. Ullman, Pearson
Education
• Principles of Compiler Design, A.V. Aho and J.D. Ullman, Addison-Wesley

2/
Course Outline
• Introduction to Compilers/Compiling
• Lexical analysis, top-down & bottom-up parsing,
• Syntax Analysis,
• Type Checking, Run-Time Environments, etc.
• Intermediate Code Generation,
• Code generation and Optimization

3/
Introduction to Compilers/Compiling

4/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM

Abstract Compilation
Process

5/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Preprocessor
• A preprocessor produces input to compilers.
• It may perform the following functions:
• Macro processing: a preprocessor may allow a user to define macros
that are shorthands for longer constructs.
• File inclusion: a preprocessor may include header files into the
program text.
• Rational preprocessing: these preprocessors augment older
languages with more modern flow-of-control and data-structuring
facilities.
• Language extensions: these preprocessors attempt to add
capabilities to the language in the form of built-in macros.
6/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Compiler
• A compiler is a translator program that translates a program written in
a source language (HLL) into an equivalent program in a machine-level
language (MLL), the target program,
• or from one HLL to another HLL, e.g. C++ to Java, C to C++, etc.
• Error Detection: An important part of a compiler is the error
messaging system.

Structure of a Compiler 7/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Compiler
• Executing a program written in a HLL programming language basically
consists of two parts:
• the source program must first be compiled/translated into an object (target) program.
• Then the resulting object program is loaded into memory and executed.

Execution process of source program in Compiler

8/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM

• Compiled code runs either on the target hardware or on a virtual machine.


• Advantages of compiling for virtual machines
● The design of the code generated by the compiler is not constrained by
the architecture of the target machine.
● Portability is enhanced: the virtual machine and its interpreter can run on
machines with different architectures.
● Runtime debugging and monitoring features can be incorporated in the
virtual machine interpreter, allowing improved safety and security in
program execution.

9/
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Assembler
• Assembly language was at one point the main mode of writing computer
programs.
• Mnemonics (symbols) were used for each machine instruction and were
subsequently translated into machine language.
• Because of the difficulty in writing or reading programs in machine
language:
• Programs known as assemblers were written to automate the
translation of assembly language into machine language.
• The input to an assembler program is called the source program; the
output is a machine-language translation (the object program).
10 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Loader and Link-Editor
• Object programs produced by assemblers are normally placed into
memory and executed.
• There is a waste of core when the assembler is left in memory while the
program is being executed.
• Also, the programmer would have to retranslate the program with each
execution, thus wasting translation time.
• To overcome this problem of wasted translation time and memory,
system programmers developed another component called the loader.

11 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Loader and Link-Editor
• A loader is a program that places programs into memory and prepares them
for execution.
• The task of adjusting programs so they may be placed in arbitrary core
locations is called relocation.
• Link-Editor: Large programs are often compiled in pieces,
• so the relocatable machine code may have to be linked together with other relocatable
object files and library files into the code that actually runs on the machine.
• The linker resolves external memory addresses, where the code in one file
may refer to a location in another file.
• The loader then puts together all of the executable object files into memory
for execution.
12 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM

● LIST OF COMPILERS
– Ada compilers
– ALGOL compilers
– BASIC compilers
– C# compilers
– C compilers
– C++ compilers
– COBOL compilers
– Common Lisp compilers
– ECMAScript interpreters
– Fortran compilers
– Java compilers
– Pascal compilers
– PL/I compilers
– Python compilers
– Smalltalk compilers

13 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Interpreter
• An interpreter is a program that appears to execute a source program as
if it were machine language.
• Languages such as BASIC, SNOBOL, LISP can be translated using interpreters.
• Java also uses an interpreter.
• The process of interpretation can be carried out in the following phases:
• Lexical analysis
• Syntax analysis
• Semantic analysis
• Direct Execution

14 /
1.0 INTRODUCTION OF LANGUAGE PROCESSING SYSTEM
Interpreter
• Advantages:
• Modifications to the user program can easily be made and applied as execution
proceeds.
• The type of object that a variable denotes may change dynamically.
• Debugging a program and finding errors is simplified for an interpreted
program.
• The interpreter for the language makes it machine independent.

• Disadvantages:
• The execution of the program is slower.
• Memory consumption is higher.

15 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• A compiler operates in phases.
• A phase is a logically interrelated operation that takes the source
program in one representation and produces output in another
representation.
• There are two main phases of compilation:
• Analysis (Machine Independent/Language Dependent)
• Synthesis (Machine Dependent/Language Independent)

• Each of these two main phases is further partitioned into subphases.

16 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• The analysis part breaks up the source program into constituent
pieces and imposes a grammatical structure on them.
• It then uses this structure to create an intermediate representation
of the source program.
• Both the syntactic correctness and the semantic soundness of the source code
are checked.
• Information about the source program is collected and stored in a
data structure called a symbol table, which is passed along with the
intermediate representation to the synthesis part.

17 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• The synthesis part constructs the desired target program from the
intermediate representation and the information in the symbol
table.
• The analysis part is often called the front end of the compiler and
the synthesis part the back end.

18 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

19 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Lexical Analysis
• The lexical analyzer reads the stream of characters making up the source
program and groups the characters into meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces as output a token of the form:

<token-name, attribute-value>
• For example: position = initial + rate * 60

20 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Lexical Analysis
• position is a lexeme that is mapped into the token <id, 1>,
• id is an abstract symbol for identifier and 1 points to the symbol-table entry for
position.
• The symbol-table entry for an identifier holds information about the identifier,
such as its name and type.
• = is mapped into the token <=>
• initial to <id, 2>, + to <+>, rate to <id, 3>, * to <*>, and 60 to <60>
• Thus the statement becomes: <id, 1> <=> <id, 2> <+> <id, 3> <*> <60>
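A minimal sketch of this tokenization in Python (not part of the slides; the token names, the regex patterns, and the tokenize helper are illustrative assumptions):

import re

# Hypothetical token specification: (token-name, pattern).
TOKEN_SPEC = [
    ("id",     r"[A-Za-z_][A-Za-z_0-9]*"),
    ("number", r"[0-9]+"),
    ("assign", r"="),
    ("plus",   r"\+"),
    ("times",  r"\*"),
    ("ws",     r"\s+"),   # whitespace is stripped, never passed to the parser
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    symtab, tokens = [], []
    for m in MASTER.finditer(source):
        name, lexeme = m.lastgroup, m.group()
        if name == "ws":
            continue
        if name == "id":
            if lexeme not in symtab:
                symtab.append(lexeme)                            # enter the lexeme into the symbol table
            tokens.append((name, symtab.index(lexeme) + 1))      # attribute: symbol-table entry
        elif name == "number":
            tokens.append((name, int(lexeme)))
        else:
            tokens.append((name, None))                          # operators carry no attribute
    return tokens, symtab

print(tokenize("position = initial + rate * 60"))
# ([('id', 1), ('assign', None), ('id', 2), ('plus', None),
#   ('id', 3), ('times', None), ('number', 60)], ['position', 'initial', 'rate'])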

21 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Syntax Analysis
• The second phase of the compiler is syntax analysis or parsing.
• Creates a tree-like intermediate representation that depicts the grammatical
structure of the token stream.
• A typical representation is a syntax tree in which each interior node
represents an operation and the children of the node represent the arguments
of the operation.
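As a rough illustration (an assumption, not the slides' own figure), the token stream above could be parsed into a nested-tuple syntax tree in which each interior node is an operator and its children are its operands:

# Hypothetical nested-tuple syntax tree for: position = initial + rate * 60
syntax_tree = ("=",
               ("id", "position"),
               ("+",
                ("id", "initial"),
                ("*",
                 ("id", "rate"),
                 ("number", 60))))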

22 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

23 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler

24 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Semantic Analysis
• The semantic analyzer uses the syntax tree and the information in the
symbol table to check the source program for semantic consistency with the
language definition.
• It also gathers type information and saves it in either the syntax tree or the
symbol table, for subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the
compiler checks that each operator has matching operands.

25 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Intermediate Code Generation
• In translating a source program into target code, one or more intermediate
representations may be constructed.
• E.g. Syntax trees, commonly used during syntax and semantic analysis.
• After syntax and semantic analysis, explicit low-level or machine-like
intermediate representation may be generated (a program for an abstract
machine).
• This intermediate representation should have two important properties:
• it should be easy to produce, and it should be easy to translate into the
target machine language.
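A hedged sketch of such a machine-like representation for the running example position = initial + rate * 60, assuming the identifiers are floating point and using hypothetical temporaries t1 to t3:

# Three-address code: each instruction has at most one operator on the right-hand side.
three_address_code = [
    "t1 = inttofloat(60)",   # convert the integer constant to match the float operands
    "t2 = id3 * t1",         # id3 stands for rate
    "t3 = id2 + t2",         # id2 stands for initial
    "id1 = t3",              # id1 stands for position
]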

26 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Code Optimization
• The machine-independent code-optimization phase attempts to
improve the intermediate code so that better target code will result.
• Usually better means faster, but other objectives may be desired,
such as shorter code, or target code that consumes less power.
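Continuing the sketch above (same assumptions), an optimizer might fold the int-to-float conversion into the constant and eliminate a copy, producing shorter and faster intermediate code:

# Possible result of machine-independent optimization of the code above.
optimized_code = [
    "t1 = id3 * 60.0",   # inttofloat(60) folded into the constant 60.0
    "id1 = id2 + t1",    # the copy through t3 has been eliminated
]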

27 /
1.3 STRUCTURE OF THE COMPILER DESIGN
Phases of a compiler
• Code Generation
• The code generator takes as input an intermediate representation of the
source program and maps it into the target language.
• If the target language is machine code, registers or memory locations are
selected for each of the variables used by the program.
• Then, the intermediate instructions are translated into sequences of
machine instructions that perform the same task.
• A crucial aspect of code generation is the judicious assignment of registers to
hold variables.
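A possible target-code sketch for the optimized intermediate code above, using hypothetical register-machine instructions (R1 and R2 are registers; id1, id2, and id3 stand for the memory locations of position, initial, and rate):

# Illustrative target code; the mnemonics are assumptions, not a real instruction set.
target_code = [
    "LDF  R2, id3",         # load rate into register R2
    "MULF R2, R2, #60.0",   # multiply by the floating-point constant
    "LDF  R1, id2",         # load initial into R1
    "ADDF R1, R1, R2",      # add the product
    "STF  id1, R1",         # store the result into position
]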

28 /
2.0 Lexical Analysis
Overview
• The LA is the first phase of a compiler.
• Its main task is to read the input characters from the source code and
produce as output a sequence of tokens that the parser uses for syntax
analysis.

30 /
2.0 Lexical Analysis
Overview
• To identify the tokens, we need some method of describing the possible tokens
that can appear in the input stream.
• For this purpose, regular expressions are often used,
• a notation that can be used to describe essentially all the tokens of a programming language.

• Secondly, having decided what the tokens are, we need some mechanism to
recognize these in the input stream.
• This is done by the token recognizers, which are designed using transition diagrams and finite
automata.

31 /
2.0 Lexical Analysis
Role of Lexical Analyzer
• Its job is to turn a character input stream from a source file into a token stream by
breaking it into pieces and skipping over irrelevant details.
• Stripping out comments and whitespace (blank, newline, tab, and other characters that are used
to separate tokens in the input).
• It enters lexemes such as identifiers into the symbol table and also reads from it to
determine the proper token it must pass to the parser.
• It performs other tasks such as:
• correlating error messages generated by the compiler with the source program.

• Thus, it may be divided into a cascade of two processes:


• Scanning: consists of the simple processes that do not require tokenization of the input, such as
deletion of comments and compaction of consecutive whitespace characters into one.
• Lexical analysis proper is the more complex portion, which produces tokens from the output of
the scanner.

32 /
2.0 Lexical Analysis
Why Separate Lexical Analysis & Syntactic Analysis (Parsing)?
• The primary benefit of doing so is a significantly simplified job for the
subsequent syntactic analysis,
• which would otherwise have to expect whitespace and comments all over
the place.
• It also classifies input tokens into types like INTEGER, IDENTIFIER, WHILE-
keyword, or OPENINGBRACKET.
• Another benefit of the lexing phase is that it greatly compresses the input by
about 80%.
• A lexer essentially takes care of the first layer of the input language, the part
that can be described by a regular language.

33 /
2.0 Lexical Analysis
Tokens, Patterns, and Lexemes
• Three related but distinct terms are often used:
• Token: A token is a group of characters having collective meaning:
• Typically, a word or punctuation mark, separated by a lexical analyzer and passed
to a parser.
• The token names are the input symbols that the parser processes.
• Pattern: A rule that describes the set of strings associated with a token.
• Often expressed as a regular expression describing how a particular token
can be formed.
• A lexeme is a sequence of characters in the source program that matches the pattern
for a token and is identified by the lexical analyzer as an instance of that token.

34 /
2.0 Lexical Analysis
Tokens, Patterns, and Lexemes

• Note: In the Pascal statement:


const pi = 3.1416
the substring pi is a lexeme for the token “identifier”
35 /
2.0 Lexical Analysis
Tokens, Patterns, and Lexemes
• When more than one pattern matches a lexeme, the lexical analyzer
must provide additional information about the particular lexeme.
• It is essential for the code generator to know what string was actually matched.

• The lexical analyzer collects information about tokens into their associated attributes.
• In practice: A token has usually only a single attribute - a pointer to the
symbol-table entry in which the information about the token is kept
such as:
• the lexeme, the line number on which it was first seen, etc.

36 /
2.0 Lexical Analysis
Tokens, Patterns, and Lexemes
• Example: The token names and associated attribute values for the
Fortran statement are written as a sequence of pairs

37 /
2.0 Lexical Analysis
Lexical Errors
• Lexical analyzers mostly cannot detect errors without the aid of other
components.
• For instance, if the string fi is encountered for the first time in a C
program in the context: fi ( a == f(x)) …
• It cannot tell whether fi is a misspelling of the keyword if or an
undeclared function identifier.
• In general, when an error is found, the lexical analyzer stops (but other
actions are also possible).
• The simplest recovery strategy is “panic mode” recovery: successive characters are
deleted from the remaining input until a well-formed token can be found.
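A minimal sketch of panic-mode recovery (illustrative only; match_token is a hypothetical helper that reports the length of a valid lexeme starting at a given position, or 0 if none matches):

def panic_mode_skip(source, i, match_token):
    # Delete successive characters from the remaining input until a
    # well-formed token can start, and report what was discarded.
    skipped = ""
    while i < len(source) and match_token(source, i) == 0:
        skipped += source[i]
        i += 1
    return i, skipped   # resume scanning at i; skipped goes into the error message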

38 /
2.0 Lexical Analysis
Assignment
• Divide the following C++ program into appropriate lexemes.
• Which lexemes should get associated lexical values?
• What should those values be?

39 /
2.0 Lexical Analysis
Stages of a lexical analyzer
• Scanner
• Based on a finite state machine.
• If it lands on an accepting state, it takes note of the type and position of the
acceptance, and continues.
• When the lexical analyzer lands on a dead state, it is done.
• The last accepting state is the one that represents the type and length of the
longest valid lexeme.
• The “extra” invalid character is “returned” to the input buffer.

40 /
2.0 Lexical Analysis
Stages of a lexical analyzer
• Evaluator
• Goes over the characters of the lexeme to produce a value.
• The lexeme’s type combined with its value is what properly constitutes a
token, which can be given to a parser.
• Some tokens such as parentheses do not really have values, and so the
evaluator function for these can return nothing.
• The evaluators for integers, identifiers, and strings can be considerably more
complex.
• Sometimes evaluators can suppress a lexeme entirely, concealing it from
the parser, which is useful for whitespace and comments.
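A small sketch of such an evaluator (the token type names are illustrative assumptions):

def evaluate(token_type, lexeme):
    # Turn a (type, lexeme) pair from the scanner into a token for the parser,
    # or None when the lexeme should be suppressed entirely.
    if token_type in ("ws", "comment"):
        return None                        # concealed from the parser
    if token_type == "number":
        return (token_type, int(lexeme))   # compute the numeric value
    if token_type == "id":
        return (token_type, lexeme)        # value is the identifier's spelling
    return (token_type, None)              # parentheses etc. carry no value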

41 /
2.0 Lexical Analysis
Stages of a lexical analyzer
• Example

42 /
2.0 Lexical Analysis
Input Buffering
• The LA scans the characters of the source program one at a time to
discover tokens.
• Often, many characters beyond a token may have to be examined
before the token itself can be determined.
• For instance, we cannot be sure we've seen the end of an identifier
until we see a character that is not a letter or digit or underscore.
• Also, in C, single-character operators like - , = , or < could also be the
beginning of a two-character operator like -> , == , or <= .

43 /
2.0 Lexical Analysis
Input Buffering
• For these reasons, it is desirable for the lexical analyzer to read its input
from an input buffer.
• Because a large amount of time can be consumed scanning characters,
• specialized buffering techniques have been developed to reduce the amount of
overhead required to process an input character.

• Buffering techniques:
• Buffer pairs
• Sentinels

44 /
2.0 Lexical Analysis
Input Buffering: Buffer pairs
• Each buffer is of the same size N,
• N is usually the size of a disk block, e.g., 4096 bytes.

• Using one system read command, N characters can be read into a
buffer, rather than using one system call per character.
• If fewer than N characters remain in the input file,
• then a special character, represented by eof, marks the end of the source.

45 /
2.0 Lexical Analysis
Input Buffering: Buffer pairs

• Two pointers to the input are maintained:


• Pointer lexemeBegin marks the beginning of the current lexeme.

• Pointer forward scans ahead until a pattern match is found.

• Once the next lexeme is determined, forward is set to the character at its right
end.
• After recording the lexeme, lexemeBegin is set to the character immediately
after the lexeme just found.
46 /
2.0 Lexical Analysis
Input Buffering: Sentinels
• Checks are made each time we advance forward, to ensure that we have not
moved off one of the buffers;
• if so, then the other buffer must be reloaded.
• Without a sentinel, two tests are made for each character read:
• one for the end of the buffer,
• and one to determine what character was read.
• A sentinel, a special character that cannot be part of the source program, is placed
at the end of each buffer so that the two tests can be combined into one.
• eof can be used as the sentinel, since it also marks the end of the input file.
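A rough Python simulation of the idea (illustrative only; it collapses the buffer pair into a single reloaded buffer and assumes the NUL character cannot occur in the source):

N = 4096          # buffer size, typically one disk block
EOF = "\0"        # sentinel character, assumed never to appear in the source

def buffered_chars(f):
    while True:
        block = f.read(N)        # one system read fills a whole buffer
        buf = block + EOF        # sentinel marks the end of the buffer
        forward = 0
        while True:
            c = buf[forward]     # single test per character: is it the sentinel?
            forward += 1
            if c == EOF:
                break            # end of this buffer (or of the input)
            yield c
        if len(block) < N:       # fewer than N characters read: real end of file
            return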

47 /
2.0 Lexical Analysis
Regular Expressions (REGEX)
• REGEX provide a concise and flexible means of describing patterns in text, which
is essential for tasks like lexical analysis in compiler design.
• They are sequences of characters that define search patterns, primarily used for
string matching within texts.
• These patterns can be as simple as:
• matching a single character,
• or complex, involving combinations of various characters, special symbols, and
operators.

• The grammar defined by regular expressions is known as a regular grammar.
• The language defined by a regular grammar is known as a regular language.
48 /
2.0 Lexical Analysis
Regular Expressions (REGEX): Basic Syntax
• Literals: The simplest form of a regular expression is a literal, which matches the
exact character. For example, the regex a will match the character 'a' in a text.
• Concatenation: Two regular expressions can be concatenated, meaning they must
appear in sequence. For example, ab will match the sequence 'ab’.
• Alternation: The alternation operator | allows for matching one of several patterns.
For example, a|b matches either 'a' or ‘b’ and (a|b)c matches either 'ac' or 'bc'.
• Repetition Operators: * (Kleene Star) matches zero or more occurrences of the
preceding element. For example, a*b matches any number of 'a's followed by a
'b' (e.g., b, ab, aaab).
• + matches one or more occurrences.
• ? matches zero or one occurrence.
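These operators can be checked with Python's re module (a sketch; the patterns below simply restate the examples above):

import re

assert re.fullmatch(r"a", "a")           # literal
assert re.fullmatch(r"ab", "ab")         # concatenation
assert re.fullmatch(r"(a|b)c", "bc")     # alternation: 'ac' or 'bc'
assert re.fullmatch(r"a*b", "aaab")      # Kleene star: zero or more 'a's, then 'b'
assert re.fullmatch(r"a*b", "b")         # zero occurrences also match
assert re.fullmatch(r"a+b", "ab")        # +: one or more occurrences
assert not re.fullmatch(r"a+b", "b")     # + requires at least one 'a'
assert re.fullmatch(r"a?b", "b")         # ?: zero or one occurrence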
49 /
2.0 Lexical Analysis
Regular Expressions
• Components of regular expression..

50 /
2.0 Lexical Analysis
Regular Expressions
• Here are the rules that define the regular expressions over an alphabet Σ:
• ϵ is a regular expression denoting {ϵ}, that is, the language containing only
the empty string.
• For each ‘a’ in Σ, a is a regular expression denoting { a }, the language with
only one string, consisting of the single symbol ‘a’.
• If R and S are regular expressions denoting the languages L(R) and L(S), then:
• (R) | (S) denotes L(R) ∪ L(S)
• R.S denotes L(R).L(S)
• R* denotes (L(R))*

51 /
2.0 Lexical Analysis
Regular Definitions
• For notational convenience, names may be given to certain regular expressions
and used in subsequent expressions, as if the names were themselves symbols.
• E.g. C identifiers are strings of letters, digits, and underscores.
• Here is a regular definition for the language of C identifiers.
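The definition itself is not reproduced on this slide; a commonly used form, sketched here with Python's re module, names the sub-expressions letter_ and digit and builds id from them:

import re

letter_ = r"[A-Za-z_]"                                    # letter_ -> A | B | ... | z | _
digit   = r"[0-9]"                                        # digit   -> 0 | 1 | ... | 9
id_re   = re.compile(f"{letter_}({letter_}|{digit})*")    # id -> letter_ (letter_ | digit)*

assert id_re.fullmatch("count_1")
assert id_re.fullmatch("_tmp")
assert not id_re.fullmatch("1count")   # an identifier may not start with a digit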

52 /
2.0 Lexical Analysis
Recognition of Tokens
• We will consider how to take the patterns for all the needed tokens and
• build a piece of code that examines the input string and finds a prefix that is a lexeme matching
one of the patterns.

• The following example of a grammar for a branching statement and conditional
expressions will be used.

53 /
2.0 Lexical Analysis
Recognition of Tokens
• For relop, comparison operators like =, <>, >=, <= are considered.
• The terminals of the grammar, which are if, then, else, relop, id, and number,
are the names of tokens as far as the lexical analyzer is concerned.
• The patterns for these tokens are described using regular definitions as below:

54 /
2.0 Lexical Analysis
Recognition of Tokens
• For this language, the lexical analyzer will recognize the keywords if, then, and
else ,
• as well as lexemes that match the patterns for relop, id, and number.

• The lexical analyzer is assigned the job of stripping out whitespace, by
recognizing the “token” ws defined by:
ws → ( blank | tab | newline )+
where blank, tab, and newline are abstract symbols.


• Token ws is not returned to the parser when recognised; rather, lexical analysis
restarts from the character that follows.
• It is the token that follows ws that gets returned to the parser.
55 /
2.0 Lexical Analysis
Recognition of Tokens
• The goal for the lexical analyzer is summarized below.
• The table shows, for each lexeme or family of lexemes, the token name returned
to the parser and what attribute value is returned.

56 /
2.0 Lexical Analysis
Transition Diagram
• A transition diagram has a collection of nodes or circles, called states.
• Each state represents a condition that could occur during the process of
scanning the input looking for a lexeme that matches one of several patterns.
• Edges are directed from one state of the transition diagram to another.
• Each edge is labeled by a symbol or set of symbols.
• If we are in some state s, and the next input symbol is a,
• we look for an edge out of state s labeled by a.
• If we find such an edge, we advance the forward pointer and enter the state of the transition
diagram to which that edge leads.

57 /
2.0 Lexical Analysis
Transition Diagram: Some Important Conventions
i. One state is designated the start state, or initial state;
• It is indicated by an edge labeled “start”.
• The transition diagram always begins in the start state before any input symbols are read.

ii. Certain states are said to be accepting or final.
• These states indicate that a lexeme has been found, although the actual lexeme may not
consist of all positions between the lexemeBegin and forward pointers.
• We always indicate an accepting state by a double circle.

iii. If it is necessary to retract the forward pointer one position,
• i.e., the lexeme does not include the symbol that got us to the accepting state, then we shall
additionally place a * near that accepting state.
• A number of *s can be placed depending on the number of retractions.

58 /
2.0 Lexical Analysis
Transition Diagram: Example: Transition diagram for relop
• Lexemes that match the pattern for relop: <, <>, <=, >, >=, =. A direct coding of this
diagram is sketched below.
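A sketch of the relop diagram coded directly (the attribute names LT, LE, EQ, NE, GT, GE and the retract flag are illustrative assumptions):

def relop(chars):
    # Returns (token, attribute, retract); retract = 1 means the last character
    # read is not part of the lexeme and must be pushed back (the *-states).
    c = next(chars)
    if c == "<":
        c = next(chars, "")
        if c == "=": return ("relop", "LE", 0)
        if c == ">": return ("relop", "NE", 0)
        return ("relop", "LT", 1)
    if c == "=":
        return ("relop", "EQ", 0)
    if c == ">":
        c = next(chars, "")
        if c == "=": return ("relop", "GE", 0)
        return ("relop", "GT", 1)
    raise ValueError("not a relational operator")

print(relop(iter("<=")))   # ('relop', 'LE', 0)
print(relop(iter("<5")))   # ('relop', 'LT', 1)  -- retract the '5'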

59 /
2.0 Lexical Analysis
Transition Diagram: Recognition of Reserved Words and Identifiers
• Recognizing keywords and identifiers presents a problem.
• Usually, keywords like if or then are reserved so they are not identifiers even
though they look like them.

• This diagram will also recognize the keywords if, then, and else of our running
example.
60 /
2.0 Lexical Analysis
Transition Diagram: Recognition of Reserved Words and Identifiers
• There are two ways that we can handle reserved words that look like identifiers:
• Install the reserved words in the symbol table initially.
• A field of the symbol-table entry indicates that these strings are never ordinary
identifiers, and tells which token they represent.
• Create separate transition diagrams for each keyword;
• an example for the keyword then is shown below.

61 /
2.0 Lexical Analysis
Transition Diagram: Recognition of unsigned Numbers
• While in state 13:
• if anything but a digit, dot, or E is seen, then an integer (e.g. 123) has been found – enter state 20.
• If a dot, then we have an “optional fraction” – enter state 14.
• If an E, then we have an “optional exponent” – enter state 16.
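The same diagram can be summarized by a regular expression; a sketch of the usual digits, optional fraction, optional exponent pattern:

import re

number = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

for s in ("123", "3.1416", "6.02E23", "1E-4"):
    assert number.fullmatch(s)
assert not number.fullmatch(".5")   # the diagram requires digits before the dot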

62 /
2.0 Lexical Analysis
Transition Diagram: Recognition of Whitespace
• Here we look for one or more “whitespace" characters, represented by delim;
• Typically, blank, tab, newline, and perhaps other characters that are not considered by the language
design to be part of any token.

63 /
2.0 Lexical Analysis
Finite Automaton (FA)
• At the heart of the transition diagram is the formalism known as finite automata.
• Finite automata are recognizers;
• A recognizer for a language is a program that takes a string x, and answers “yes” if x is a sentence
of that language, and “no” otherwise.

• A FA can be: deterministic (DFA) or non-deterministic (NFA)


• This means that we may use DFA or NFA as a lexical analyzer.
• They both recognize regular sets.

• Which one to use?


• Deterministic – faster recognizer than non-deterministic, but may take more space.
• Deterministic automata are widely used as lexical analyzers.

64 /
2.0 Lexical Analysis
FA: Nondeterministic Finite Automata (NFA)
• An NFA is a mathematical model that consists of:
1. A finite set of states, S.
2. A set of input symbols Σ, the input alphabet.
• We assume that ϵ, the empty string, is never a member of Σ.
3. A transition function that gives, for each state, and for each symbol in Σ ∪ {ϵ},
a set of next states.
4. A state s0 from S that is distinguished as the start state or initial state.
5. A set of states F, a subset of S, that are distinguished as the accepting
states or final states.

65 /
2.0 Lexical Analysis
FA: Nondeterministic Finite Automata (NFA)
• NFA or DFA can be represented by a transition graph,
• where the nodes are states and the labelled edges represent the transition function.

• This graph is very much like a transition diagram, except:


• The same symbol can label edges from one state to several different states, and
• An edge may be labelled by ϵ, instead of, or in addition to, symbols from the input
alphabet.

• NFA accepts a string x, if and only if there is a path from the starting
state to one of the accepting states such that edge labels along this path
spell out x.

66 /
2.0 Lexical Analysis
FA: Nondeterministic Finite Automata (NFA) – E.g.
• The transition graph for an NFA recognizing the language of the regular
expression (a|b)*abb – i.e., strings of a's and b's ending in abb.
• Thus, the only strings getting to the accepting state are those that end in abb.

67 /
2.0 Lexical Analysis
Nondeterministic Finite Automata (NFA) – Transition Tables
• NFA can also be represented by a transition table,
• whose rows correspond to states, and
• columns to the input symbols and ϵ.

• Transition table for (a|b)*abb – strings of a's and b's ending in abb. A possible
reconstruction of this table is sketched below.
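A possible reconstruction of the table as a Python dictionary, together with a simple set-of-states simulation (the state numbering 0 to 3 is an assumption):

# NFA for (a|b)*abb: states 0-3, start state 0, accepting state 3.
# Each entry maps (state, symbol) to the SET of possible next states;
# this particular NFA needs no ϵ-transitions, so that column would be empty.
nfa = {
    (0, "a"): {0, 1},   # loop in 0, or guess that the final "abb" starts here
    (0, "b"): {0},
    (1, "b"): {2},
    (2, "b"): {3},
}

def nfa_accepts(word):
    states = {0}
    for sym in word:
        states = set().union(*(nfa.get((s, sym), set()) for s in states))
    return 3 in states

assert nfa_accepts("ababb")
assert not nfa_accepts("abab")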

68 /
2.0 Lexical Analysis
Nondeterministic Finite Automata (NFA) – Transition Tables
• The transition table has the advantage that we can easily find the transitions
on a given state and input.
• Its disadvantage is that it takes a lot of space, when the input alphabet is
large, yet most states do not have any moves on most of the input symbols.

69 /
2.0 Lexical Analysis
Deterministic Finite Automata (DFA)
• A deterministic finite automaton (DFA) is a special case of an NFA where:
• No state has ϵ- transition, and
• For each symbol a and state s, there is at most one edge labelled a leaving s.
• i.e. the transition function maps each state-symbol pair to a single state (not a set of states).

70 /
2.0 Lexical Analysis
Deterministic Finite Automata (DFA) – E.g.
• The transition graph of a DFA accepting (a|b)*abb
• For the input string ababb,
• this DFA enters the sequence of states 0, 1, 2, 1, 2, 3 and returns “yes”.
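A sketch of this DFA as a Python transition table, tracing the state sequence for ababb (state numbering follows the slide):

# DFA accepting (a|b)*abb: start state 0, accepting state 3.
dfa = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 1, "b": 3},
    3: {"a": 1, "b": 0},
}

def dfa_run(word):
    state, trace = 0, [0]
    for sym in word:
        state = dfa[state][sym]
        trace.append(state)
    return state == 3, trace

print(dfa_run("ababb"))   # (True, [0, 1, 2, 1, 2, 3])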

71 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• There are two broad approaches to lexical analyser construction.
• The first approach tackles the problem directly by coding the analyser in an
appropriate implementation language – Direct Implementation
• Lexical analysers in many compilers are written this way.

• The second takes a more formal approach – Tool Based.


• It works directly with the syntactic definition of tokens expressed as regular
expressions.
• Lexical analysers are generated by transforming these regular expressions to code using
software.
• Many lexical analyser generator packages exist,
• enabling complete lexical analysers to be generated from a formal specification
of the tokens.

72 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Direct Implementation

• How will individual tokens be recognised?

• Define the types of tokens to be recognised

• Define the subroutines for this purpose.

• In most cases the following categories of tokens must be considered:

• White Space, Single Character Tokens, Other Short Tokens, Identifiers and
Reserved Words, Integer Constants, Comments.
73 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Tool-Based Implementation

• The Unix-based program lex and its descendants (flex and JLex) are some of the most
popular lexical analyser generating tools.
• Lex takes as input a set of regular expressions defining the tokens, each being
associated with some C code indicating the action to be performed when each token is
recognised.
• The output of the lex program is itself a C program that recognises instances of the
regular expressions and acts according to the corresponding action specifications.
• flex, like lex, is used to generate analysers in C, while JLex generates analysers in Java.

• Most of these tools largely share a common format for the specification of the regular
expressions.

74 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Advantages of Direct Implementation
• There may be no compatible lexical analyser generating tools available for the programming
language being used to implement the compiler – Programming by hand may be the only
option.
• The process of compiler construction is less dependent on the availability of other software
tools, and there is no need to learn the language of a new software tool.
• E.g., a newcomer may find specifying complex regular expressions in flex (and other, similar
tools) rather difficult.
• There are no real constraints on the techniques used for recognising tokens because it is easy to add ad-hoc code to
deal with awkward analysis tasks.
• The memory requirements of the lexical analyser can be modest.

• A lexical analyser generated automatically may be more memory hungry despite the effect of
techniques for internal table compression.
• Performance of the lexical analyser produced in this way can be very good.

75 /
2.0 Lexical Analysis
Implementation of Lexical Analysers
• Advantages of Tool-Based Implementation
• The code is more likely to be correct, particularly for lexical tokens with a complex structure.
• E.g., writing code by hand for recognising a floating-point constant as found in traditional
programming languages is a daunting task.
• But once a regular expression has been constructed, the software tool should generate
accurate code without a problem.
• Regular expression specifications for the lexical tokens may be available in the language
specification or documentation.

• Transcribing them into a flex-acceptable form should be easy.


• The task of writing the lexical analyser is potentially much simpler because much less code has to be
written.
• The lexical analyser is easy to modify.
• Performance of the lexical analyser produced in this way can also be very good.

76 /
77 /
