0% found this document useful (0 votes)
9 views11 pages

Lecture 3 (30-1-23)

The document discusses lexical analysis in compilers. Lexical analysis scans input programs to identify valid tokens by removing comments and whitespace. It breaks the input into tokens using patterns and passes them to the parser. Regular expressions and finite state automata are used to specify valid tokens and recognize patterns in lexical analysis.

Uploaded by

Tahsk
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
9 views11 pages

Lecture 3 (30-1-23)

The document discusses lexical analysis in compilers. Lexical analysis scans input programs to identify valid tokens by removing comments and whitespace. It breaks the input into tokens using patterns and passes them to the parser. Regular expressions and finite state automata are used to specify valid tokens and recognize patterns in lexical analysis.

Uploaded by

Tahsk
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 11

Lexical Analysis

• Scan input program to identify valid words, removes comments, extra


white space.
• How do we specify the valid words of a language?
• Regular expression.
• How do we check if sequence of character matches the valid
words of a language?
• Finite Automata.

• token (also called word) –> set of strings defining an atomic element
with a defined meaning
• pattern -> a rule describing a set of string (specified using regular
expression)
• lexeme -> a sequence of characters that match some pattern
• symbol -> the recognized token
• At the first occurrence of the symbol, entry is made in symbol
table
• Additional information (attributes) about the symbol may be
added by the parser
Examples

Token Pattern Sample


Lexeme
while while while

relation_op = | != | < | > <

integer (0-9)* 42

string Characters “hello”


between “ “
Tokens

• Keywords, operators, identifiers (names), constants, literal strings, punctuation symbols such as
parentheses, brackets, commas, semicolons, and colons, etc.
• A unique integer representing the token is passed by LA to the parser
• Attributes for tokens (apart from the integer representing the token)
• identifier: the lexeme of the token, or a pointer into the symbol table where the lexeme is
stored by the LA
• intnum: the value of the integer (similarly for floatnum, etc.)
• string: the string itself
• The exact set of attributes are dependent on the compiler designer
Challenges in lexical analysis
• Certain languages do not have any reserved words, e.g., while, do, if, else, etc., are reserved in ’C’, but not in
PL/1
Example of using do loop in FORTRAN
• In FORTRAN, some keywords are context-dependent

• In the statement, DO 10 I = 10.86


• DO10I is an identifier, and DO is not a keyword
• But in the statement, DO 10 I = 10, 86
• DO is a keyword
• Such features require substantial look ahead for resolution
• Example above -> we cannot be sure until we see the comma (after 10) that DO is a keyword
• Blanks are not significant in FORTRAN and can appear in the midst of identifiers, but not so in ’C’
• Lexical analysis cannot catch any significant errors except for simple errors such as, illegal symbols, etc.
• In such cases, lexical analysis skips characters in the input until a well-formed token is found
Languages
• Symbol: An abstract entity, not defined
• Examples: letters {a,b,c,…,z} and digits {0,1,..,9}
• String: A finite sequence of symbols
• abcb, caba are strings over the symbols {a,b,c}
• |w| is the length of the string w, and is the #symbols in it
• ∊ is the empty string and is of length 0
• Alphabet: A finite set of symbols (e.g., {a,b,c,…,z}, {0,1,..,9} )
• Language: A set of strings of symbols from some alphabet
• Φ (empty language) and {∊} (set with empty string) are languages
• The set of palindromes over {0,1} is an infinite language
• The set of strings, {01, 10, 111} over {0,1} is a finite language
• If Σ is an alphabet, Σ∗ is the set of all strings over Σ
• We need a ‘finite representation’ (encoded by finite string) for a language
• Regular language (or type-3) is represented by Regular expression
• Context-free language (or type-2) is represented by a Context-free grammar
• Context-sensitive language (or type-1) is represented by a Context-sensitive grammar
• type-0 language is represented by type-0 grammar
Regular Expressions (REs)
• Let Σ be an alphabet. The REs over Σ and the languages they denote (or generate) are defined as
below
• φ (empty language/set, not even empty string) is an RE. L(φ) = φ
• ∊ (empty string) is an RE. L(∊) = {∊}
• For each a ∈ Σ, a is an RE. L(a) = {a}
• E.g., Σ ={1,2,3,4} then each 1,2,3,4 are REs, L(1)={1}, L(2)={2},…
• If r and s are REs (not symbol) denoting the languages R and S, respectively
• Concatenation: (rs) is an RE, L(rs) = R.S = {xy | x ∈ R ∧ y ∈ S}
• Union: (r + s) (or (r|s)) is an RE, L(r + s) = R ∪ S
• Kleene closure/closure: (r∗) is an RE, L(r∗) = R∗ = ⋃ 𝑅𝑖

(L∗ is called the Kleene closure or closure of L)


Examples of Regular Expressions
• Given L = set of all strings of 0’s and 1’s, what is the RE?
• r = (0 + 1)* or (0|1)*
• How do we generate the string 101 ?
• (0 + 1) ∗ ⇒ (0 + 1)(0 + 1)(0 + 1) ⇒ 101
• Given L = set of all strings of 0’s and 1’s, with at least two consecutive 0’s, what is the RE?
• r = (0 + 1) ∗00(0 + 1) ∗
• Given L = {w ∈ {0, 1} ∗ | w has two or three occurrences of 1, the first and second of which are not
consecutive}, what is the RE?
• r = 0∗10∗010∗ (10∗+ ∊)

• Given r = (1 + 10)∗, what is the language?


• L = set of all strings of 0’s and 1’s, beginning with 1 and not having two consecutive 0’s
• Given r = (0 + 1)∗011, what is the language?
• L = set of all strings of 0’s and 1’s ending in 011
Examples of Regular Expressions
• Given r = c∗(a + bc∗)∗ , what is the language?
• L = set of all strings over {a,b,c} that do not have the substring ac
• Given L = {w | w ∈ {a, b}∗ ∧ w ends with a} , what is the RE?
• r = (a + b)∗a
• Given L = {if, then, else, while, do, begin, end}, what is the RE?
• r = if + then + else + while + do + begin + end
Automata
• Automata are machines (abstract machines) that accept languages
• Finite State Automata accept RLs (corresponding to REs)
• Pushdown Automata accept CFLs (corresponding to CFGs)
• Linear Bounded Automata accept CSLs (corresponding to CSGs)
• Turing Machines accept type-0 languages (corresponding to type-0 grammars)
• Applications of Automata
• Switching circuit design
• Lexical analyzer in a compiler
• String processing (grep, awk), etc.
• State charts used in object-oriented design
• Modelling control applications, e.g., elevator operation
• Parsers of all types
• Compilers
Finite State Automaton (FSA)
• An FSA is an acceptor or recognizer of regular languages
• An FSA is a 5-tuple, (Q, Σ, δ, q0, F), where
• Q is a finite set of states
• Σ is the input alphabet
• δ is the transition function, δ : Q × Σ → Q
• That is, δ(q, a) is a state for each state q and input symbol a
• q0 is the start state
• F is the set of final or accepting states
• In one move from some state q, an FSA reads an input symbol, changes the state based on δ, and
gets ready to read the next input symbol
• An FSA accepts its input string, if starting from q0, it consumes the entire input string, and
reaches a final state
• If the last state reached is not a final state, then the input string is rejected
FSA example
• Q = {q0, q1, q2, q3} -> finite set of states
• Σ = {a, b, c} -> the input alphabet
• q0 is the start state and F = {q0, q2} (F -> set
of final or accepting states)
• The transition function δ is defined by the
table below

• Language accepted by the FSA?


• is the set of all strings beginning with an
’a’ and ending with a ’c’ ( is also
accepted)

You might also like