
Chapter 2

The document discusses lexical analysis in compilers, focusing on the goal of partitioning input strings into tokens while removing comments and whitespace. It explains the concept of tokens as the smallest units of syntax in programming languages and outlines the design and implementation of a lexical analyzer. Additionally, it covers regular languages, finite automata, and the differences between deterministic and nondeterministic finite automata, along with examples and notations for regular expressions.

Compilers

Lexical Analysis
Lexical Analysis
• What is the goal?
if (i ==0)
z=0;
else
z=1;

• The input is just a string of characters:


• if (i==0)\n\tz=0;\nelse\n\tz=1;
• Goal: partition the input string into substrings, removing comments and
whitespace
• where the substrings are tokens (lexemes)
Token
• A token is the smallest meaningful unit above individual characters, like a word in a sentence.
• It is the minimal syntactic category.
• English: noun, verb, adjective …
• Programming language: Identifier, integer, keyword, whitespace, …
• Tokens correspond to sets of strings
• Identifier: strings of letters or digits, starting with a letter
• Integer: a non-empty string of digits
• Keyword: “else” or “if” …
• Whitespace: a non-empty sequence of blanks, newlines and tabs.
Contd…
• Tokens classify program substrings according to their role.
• The output of lexical analysis is a stream of tokens.
• The parser relies on token distinctions.
• An identifier is treated differently from a keyword.
Designing a lexical analyser
• Define a finite set of tokens
• Tokens describe all items of interest
• Choice of tokens depends on language, design of parser …
• Recall
• \tif (i == j)\n\t\tz = 0;\n\telse\n\t\tz = 1;
• Useful tokens for this expression:
• Integer, Keyword, Relation, Identifier, Whitespace, (, ), =, ;
• N.B., (, ), =, ; are tokens, not characters, here
• The next step is to describe which substrings belong to each token.
Implementation
• An implementation is responsible for two things.
• Recognize substrings corresponding to tokens accurately
• Return the value or lexeme (substring) of the token.
• First, it discards tokens that do not contribute to parsing
• Whitespace and comments.

if (i ==0) //if clause
z=0;
else /*else clause is located here*/
z=1;

becomes

if (i == 0)\n\tz=0;\nelse\n\tz=1;
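The two responsibilities above (recognize tokens, return their lexemes, and drop whitespace and comments) can be sketched with Python's `re` module. This is only an illustrative sketch, not the implementation the slides have in mind: the token names, the `TOKEN_SPEC` table and the `tokenize` function are assumptions made for this example.

```python
import re

# Hypothetical token table. Order matters: Python tries alternatives left to
# right, so KEYWORD comes before IDENTIFIER and "==" comes before "=".
TOKEN_SPEC = [
    ("COMMENT",    r"//[^\n]*|/\*.*?\*/"),
    ("WHITESPACE", r"[ \t\n]+"),
    ("KEYWORD",    r"\b(?:if|else)\b"),
    ("IDENTIFIER", r"[A-Za-z][A-Za-z0-9]*"),
    ("INTEGER",    r"[0-9]+"),
    ("RELATION",   r"=="),
    ("ASSIGN",     r"="),
    ("LPAREN",     r"\("),
    ("RPAREN",     r"\)"),
    ("SEMICOLON",  r";"),
]
MASTER = re.compile(
    "|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC),
    re.DOTALL,  # lets a /* ... */ comment span several lines
)
SKIPPED = {"COMMENT", "WHITESPACE"}


def tokenize(source):
    """Return a list of (token class, lexeme) pairs, skipping comments and whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup not in SKIPPED:
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

Calling `tokenize` on the commented program above yields pairs such as `("KEYWORD", "if")`, `("LPAREN", "(")`, `("IDENTIFIER", "i")`, `("RELATION", "==")`, with no entries for the comments or the whitespace.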
Some examples
• C++
• Most are easily done.
• Template syntax: Foo<Bar>
• Stream syntax: cin >> var;
• When templates are nested, there is a conflict: in Foo<Bar<Bazz>> the closing >> looks like the stream operator
• Is if two variables i and f?
• Is == two equal signs (= =) or a single operator (==)?
Solution
• Left-to-right scan
• lookahead sometimes required.
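A left-to-right scan with one character of lookahead can be sketched as below. The function name `scan` and the simplified token set are assumptions for this example; it only shows how peeking at the next character settles the `=` versus `==` question, and how consuming the longest possible word keeps `if` from being split into `i` and `f`.

```python
def scan(text):
    """Left-to-right scan with one character of lookahead (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == "=":
            # Peek at the next character before deciding which token this is.
            if i + 1 < len(text) and text[i + 1] == "=":
                tokens.append("==")
                i += 2
            else:
                tokens.append("=")
                i += 1
        elif ch.isalpha():
            # Keep consuming while the next character can extend the word.
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            tokens.append(text[start:i])
        else:
            # Any other character is returned as a one-character token.
            tokens.append(ch)
            i += 1
    return tokens
```

For example, `scan("if i == f")` returns `["if", "i", "==", "f"]`, while `scan("a = = b")` keeps the two `=` signs separate because whitespace sits between them.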
Regular languages
• Regular languages are one of several formalisms for specifying tokens.
• They are a simple and useful theory
• Easy to understand
• Efficient implementation
• Definition: Let Σ be a set of characters. A language over Σ is a set of
strings of characters drawn from Σ.
Examples of languages

• English
• Alphabet = characters
• Language = sentences
• Programming language
• Alphabet = ASCII
• Language = programs
Notations
• Languages are sets of strings.

• Need some notation for specifying which sets we want

• The standard notation for regular languages is regular expressions.


Regular expressions
• Single character: ‘c’ = {“c”}
• Epsilon: ε = {“”}
• Union: A + B = {s | s ∈ A or s ∈ B}
• Concatenation: AB = {ab | a ∈ A and b ∈ B}
• Iteration: A* = ⋃_{i ≥ 0} A^i, where A^i = AA…A (i times) and A^0 = {“”}
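Because each regular expression denotes a set of strings, the operations above can be tried out directly on Python sets. This is a small sketch under one stated assumption: A* is infinite, so the helper `star_up_to` (a name invented here) only builds the finite part A^0 ∪ … ∪ A^n.

```python
def union(a, b):
    """A + B: strings that are in A or in B."""
    return a | b


def concat(a, b):
    """AB: every string of A followed by every string of B."""
    return {s + t for s in a for t in b}


def power(a, i):
    """A^i: A concatenated with itself i times; A^0 is {""}."""
    result = {""}
    for _ in range(i):
        result = concat(result, a)
    return result


def star_up_to(a, n):
    """Finite approximation of A*: the union of A^0 through A^n."""
    result = set()
    for i in range(n + 1):
        result |= power(a, i)
    return result
```

For instance, `concat({"a", "b"}, {"0", "1"})` gives `{"a0", "a1", "b0", "b1"}`, and `star_up_to({"a"}, 3)` gives `{"", "a", "aa", "aaa"}`.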
Regular expressions
• Definition: The regular expressions over Σ are the smallest set of
expressions including
• ε
• ‘c’ where c ∈ Σ
• A+B where A, B are rexp over Σ
• AB where A, B are rexp over Σ
• A* where A is a rexp over Σ
• Abbreviations:
• A? = A + ε (zero or one instance of A)
• A⁺ = AA* (one or more instances of A)
Examples
• Keywords: “else” or “if” or …
• ‘else’ + ‘if’ + …
• ‘else’ abbreviates ‘e’ ‘l’ ‘s’ ‘e’
• Integer: a non-empty string of digits
• digit = ‘0’ + ‘1’ + ‘2’ + ‘3’ + ‘4’ + ‘5’ + ‘6’ + ‘7’ + ‘8’ + ‘9’
• integer = digit digit*
• Abbreviation: A⁺ = AA*
• Identifier: strings of letters or digits, starting with a letter
• letter = ‘A’ + … + ‘Z’ + ‘a’ + … + ‘z’
• identifier = letter (letter + digit)*
• Whitespace: a non-empty sequence of blanks, newlines, and tabs
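The token definitions above translate almost symbol for symbol into Python's `re` syntax, where union is written `|` and ranges such as `[0-9]` abbreviate long unions. The pattern names and the helper `belongs_to` are assumptions for this sketch; the whitespace pattern is one plausible reading of "blanks, newlines, and tabs".

```python
import re

KEYWORD = re.compile(r"else|if")                  # 'else' + 'if'
INTEGER = re.compile(r"[0-9][0-9]*")              # digit digit*
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")  # letter (letter + digit)*
WHITESPACE = re.compile(r"[ \n\t]+")              # one or more blanks, newlines, tabs


def belongs_to(pattern, text):
    """True if the whole string is in the language denoted by the pattern."""
    return pattern.fullmatch(text) is not None
```

Note that `belongs_to(IDENTIFIER, "if")` is also true: the string "if" is in both sets, which is why a lexer needs a priority rule that prefers the keyword.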
Examples
• Phone Number
• +251-911-00 00 00
• Σ = digits ∪ {-, +, ‘ ’}
• Email Address
[email protected]

• There are regular expressions everywhere.


• Everything discussed so far is syntax, not semantics (meaning).
Last lecture
• What is Lexical Analysis?
• What are Tokens?
• Why did we need to have regular languages?
• Write a regular expression for your ID Numbers.
Finite Automata
• A finite automaton (FA) is a simple idealized machine used to recognize patterns within input taken
from some character set.
• It can be represented as a transition table.
• The job of an FA is to accept or reject an input depending on whether the
input matches the pattern defined by the FA.
• Consists of
• An input alphabet Σ
• A set of states S
• A start state
• A set of accepting states F ⊆ S
• A set of transitions
Finite Automata
• Transition: S1 -a-> S2
• In state S1 on input “a” go to state S2
• If end of input and in accepting state => accept
• Otherwise reject
Finite state graphs
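• A state: drawn as a circle
• The start state: a circle with an incoming arrow that has no source
• An accepting state: a double circle
• A transition: an arrow between two states, labeled with an input symbol such as a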
Example
• A finite automaton accepting any number of 1’s followed by a
single 0
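The automaton for this example can be written down as a transition table and run over an input string. The state names "A" and "B" and the function `accepts` are assumptions for this sketch; missing table entries stand for an implicit dead state, so the input is rejected as soon as no transition applies.

```python
START = "A"
ACCEPTING = {"B"}

# (current state, input symbol) -> next state
TRANSITIONS = {
    ("A", "1"): "A",  # stay in A while reading 1's
    ("A", "0"): "B",  # the single 0 moves to the accepting state
}


def accepts(text):
    """Accept iff the input is consumed and the machine ends in an accepting state."""
    state = START
    for symbol in text:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False  # no edge for this symbol: reject
        state = TRANSITIONS[key]
    return state in ACCEPTING
```

With this table, `accepts("110")` and `accepts("0")` are true, while `accepts("1")`, `accepts("100")` and `accepts("")` are false.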
Epsilon moves
• Another kind of transition: ε-moves

• Machine can move from state A to B without reading an input.


NFA and DFA
• There are two types
• Deterministic finite automata (DFA)
• For every state, have exactly one outgoing edge for each input symbol.
• i.e., one transition per input per state, and no ε-moves
• Their behavior is completely determined by the input
• Nondeterministic finite automata (NFA)
• Have no restrictions on the labels of their edges.
• A symbol can label several edges out of the same state, and ε-moves are possible
• The machine can choose whether to make an ε-move and which of multiple transitions
on a single input to take.
NFA and DFA (Cont.)

[Figure: an example NFA and an example DFA, side by side]
Reg to FA
• Some additional regular expression notation
• Union: A + B = A|B
• Option (zero or one): A + ε = A?
• Range: ‘a’ + ‘b’ + … + ‘z’ = [a-z]
• Excluded range: complement of [a-z] = [^a-z]
• Two ways of implementing.
• Regular expression => NFA => DFA => table-driven implementation
• Building the DFA directly, by intuition
First method
• For each kind of rexp, define an equivalent NFA
• For ε

• For input a

• For AB

• For A+B

• For A*
Example
• Convert the following regular expression to an NFA
• (1+0)*1
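One common way to carry out this conversion is a Thompson-style construction, where each operator glues smaller NFA fragments together with ε-moves. The sketch below is an assumption about how the "first method" could be coded, not a transcription of the lost slide figures; the class `NFA` and its method names are invented for the example.

```python
from itertools import count

EPS = None  # label used for ε-moves


class NFA:
    def __init__(self):
        self.edges = {}      # (state, label) -> set of target states
        self._ids = count()  # fresh state numbers 0, 1, 2, ...

    def new_state(self):
        return next(self._ids)

    def add_edge(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

    # Each builder returns a fragment as a pair (start state, accept state).
    def symbol(self, c):
        s, f = self.new_state(), self.new_state()
        self.add_edge(s, c, f)
        return s, f

    def concat(self, a, b):
        self.add_edge(a[1], EPS, b[0])
        return a[0], b[1]

    def union(self, a, b):
        s, f = self.new_state(), self.new_state()
        for frag in (a, b):
            self.add_edge(s, EPS, frag[0])
            self.add_edge(frag[1], EPS, f)
        return s, f

    def star(self, a):
        s, f = self.new_state(), self.new_state()
        for src in (s, a[1]):
            self.add_edge(src, EPS, a[0])  # enter or repeat the body
            self.add_edge(src, EPS, f)     # skip or leave the body
        return s, f

    def closure(self, states):
        """All states reachable from `states` using only ε-moves."""
        stack, seen = list(states), set(states)
        while stack:
            for nxt in self.edges.get((stack.pop(), EPS), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def accepts(self, fragment, text):
        start, accept = fragment
        current = self.closure({start})
        for c in text:
            moved = set()
            for state in current:
                moved |= self.edges.get((state, c), set())
            current = self.closure(moved)
        return accept in current


nfa = NFA()
# (1+0)*1
regex = nfa.concat(nfa.star(nfa.union(nfa.symbol("1"), nfa.symbol("0"))), nfa.symbol("1"))
```

Here `nfa.accepts(regex, "1101")` is true and `nfa.accepts(regex, "10")` is false, since the expression describes binary strings that end in 1. The simulation tracks a set of states because the NFA may be in several states at once.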
NFA to DFA
• Each state of the DFA is a non-empty subset of the states of the NFA
• Start state
• The set of NFA states reachable through ε-moves from the NFA start state
• Add a transition S -a-> S’ to the DFA iff
• S’ is the set of NFA states reachable from any state in S after seeing the input
a, considering ε-moves as well
• Note that NFA may be in many states at any time.
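The rules above are usually called the subset construction, and can be sketched as follows. The small NFA hard-coded here is an assumption chosen for illustration (it accepts binary strings ending in 1 and includes one ε-move); the names `closure`, `subset_construction` and `dfa_accepts` are likewise invented for the example.

```python
from collections import deque

EPS = None  # label used for ε-moves

# Hypothetical NFA: 0 -ε-> 1, 1 loops on 0 and 1, and 1 -1-> 2 (accepting).
NFA_EDGES = {
    (0, EPS): {1},
    (1, "0"): {1},
    (1, "1"): {1, 2},
}
NFA_START, NFA_ACCEPTING, ALPHABET = 0, {2}, ("0", "1")


def closure(states):
    """NFA states reachable from `states` through ε-moves only."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in NFA_EDGES.get((stack.pop(), EPS), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return frozenset(seen)


def subset_construction():
    start = closure({NFA_START})
    dfa_edges, accepting = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        current = queue.popleft()
        if current & NFA_ACCEPTING:
            accepting.add(current)
        for symbol in ALPHABET:
            moved = set()
            for state in current:
                moved |= NFA_EDGES.get((state, symbol), set())
            target = closure(moved)
            if not target:
                continue  # only non-empty subsets become DFA states
            dfa_edges[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return start, accepting, dfa_edges


def dfa_accepts(text):
    state, accepting, edges = subset_construction()
    for symbol in text:
        state = edges.get((state, symbol))
        if state is None:
            return False
    return state in accepting
```

For this NFA the construction produces three DFA states, {0, 1}, {1} and {1, 2}, of which only {1, 2} is accepting because it contains the NFA's accepting state.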
Example
Second method
• What does the following rexp represent?
• digit = 0|1|…|9
• digits = digit+
• digits(.digits)?(E[+-]?digits)?
• Construct the DFA empirically
Solution

[Figure: transition diagram with states 1 to 8; edges are labeled digit, ., E, +|- and other; one state is marked *]
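A DFA built by hand for this number pattern can be stored as a table and driven by a short loop. The sketch below is a simplification and an assumption, not a copy of the slide's diagram: it uses seven states, checks a whole string rather than stopping at the first "other" character, and the names `TABLE`, `classify` and `is_number` are invented here.

```python
# state -> {input class -> next state}; missing entries mean "reject"
TABLE = {
    1: {"digit": 2},                    # start: need at least one digit
    2: {"digit": 2, ".": 3, "E": 5},    # integer part
    3: {"digit": 4},                    # just read '.', need a digit
    4: {"digit": 4, "E": 5},            # fraction part
    5: {"sign": 6, "digit": 7},         # just read 'E'
    6: {"digit": 7},                    # just read the sign, need a digit
    7: {"digit": 7},                    # exponent digits
}
ACCEPTING = {2, 4, 7}


def classify(ch):
    """Map a character to the column of the table it belongs to."""
    if ch in "0123456789":
        return "digit"
    if ch in "+-":
        return "sign"
    return ch  # '.', 'E', or a character with no column at all


def is_number(text):
    state = 1
    for ch in text:
        state = TABLE[state].get(classify(ch))
        if state is None:
            return False
    return state in ACCEPTING
```

With this table, `is_number("6.02E+23")` and `is_number("42")` are true, while `is_number("1.")`, `is_number(".5")` and `is_number("1E")` are false because the machine stops in a non-accepting state or finds no transition.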
Reading assignment
• Error recovery
• Buffered I/O for token detection and Buffered I/O with Sentinels
• 2D table implementation of a DFA
