Lecture 3 (30-1-23)

This lecture covers lexical analysis in compilers. The lexical analyzer scans the input program, discards comments and extra white space, groups the remaining characters into tokens according to patterns, and passes those tokens to the parser. Regular expressions are used to specify the valid tokens, and finite state automata are used to recognize them.

Lexical Analysis

• Scans the input program to identify valid words; removes comments and extra white space.
• How do we specify the valid words of a language?
• Regular expressions.
• How do we check whether a sequence of characters matches the valid words of a language?
• Finite automata.

• token (also called word) -> a set of strings defining an atomic element with a defined meaning
• pattern -> a rule describing a set of strings (specified using a regular expression)
• lexeme -> a sequence of characters that matches some pattern
• symbol -> the recognized token
• At the first occurrence of the symbol, an entry is made in the symbol table
• Additional information (attributes) about the symbol may be added by the parser
Examples

Token          Pattern                   Sample lexeme
while          while                     while
relation_op    = | != | < | >            <
integer        (0-9)(0-9)*               42
string         characters between “ ”    “hello”
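The token/pattern/lexeme relationship in the table can be sketched with Python's `re` module. This is only an illustration under assumptions: the upper-case token names, the `SKIP` class for white space, and the function `tokenize` are hypothetical, and straight double quotes stand in for the string delimiters.

```python
import re

# Hypothetical token specification: (token name, pattern) pairs, tried in order.
TOKEN_SPEC = [
    ("WHILE",       r"while\b"),
    ("RELATION_OP", r"!=|=|<|>"),
    ("INTEGER",     r"[0-9]+"),
    ("STRING",      r'"[^"]*"'),
    ("SKIP",        r"\s+"),      # white space is recognized, then discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (token, lexeme) pairs; raise ValueError on an illegal character."""
    pos, out = 0, []
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"illegal character {text[pos]!r} at position {pos}")
        if m.lastgroup != "SKIP":
            out.append((m.lastgroup, m.group()))
        pos = m.end()
    return out

print(tokenize('while 42 != "hello"'))
```

Each pair in the result names the token class first and the matched lexeme second, so `42` is reported as an `INTEGER` token with lexeme `42`.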
Tokens

• Keywords, operators, identifiers (names), constants, literal strings, punctuation symbols such as
parentheses, brackets, commas, semicolons, and colons, etc.
• A unique integer representing the token is passed by the lexical analyzer (LA) to the parser
• Attributes for tokens (apart from the integer representing the token)
• identifier: the lexeme of the token, or a pointer into the symbol table where the lexeme is
stored by the LA
• intnum: the value of the integer (similarly for floatnum, etc.)
• string: the string itself
• The exact set of attributes depends on the compiler designer
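A minimal sketch of how a token code and its attribute might travel together, assuming hypothetical integer codes (`IDENTIFIER`, `INTNUM`, `STRING`) and a hypothetical `SymbolTable` class; a real compiler would choose its own representation.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical integer codes for three token classes.
IDENTIFIER, INTNUM, STRING = 1, 2, 3

@dataclass
class Token:
    code: int                                # the integer passed to the parser
    attribute: Union[int, str, None] = None  # value, text, or symbol-table index

class SymbolTable:
    """Stores each identifier lexeme once and hands out its index."""
    def __init__(self):
        self._index = {}
        self.lexemes = []

    def intern(self, lexeme):
        # An entry is made only at the first occurrence of the lexeme.
        if lexeme not in self._index:
            self._index[lexeme] = len(self.lexemes)
            self.lexemes.append(lexeme)
        return self._index[lexeme]

def make_token(lexeme, table):
    """Attach the attribute appropriate to the token class."""
    if lexeme.isascii() and lexeme.isdigit():
        return Token(INTNUM, int(lexeme))        # intnum: the integer value
    if len(lexeme) >= 2 and lexeme[0] == '"' and lexeme[-1] == '"':
        return Token(STRING, lexeme[1:-1])       # string: the string itself
    return Token(IDENTIFIER, table.intern(lexeme))  # pointer into the symbol table

table = SymbolTable()
print([make_token(lx, table) for lx in ["count", "42", "count", '"hi"']])
```

Both occurrences of `count` receive the same symbol-table index, which is the behaviour the attribute list above describes for identifiers.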
Challenges in lexical analysis
• Certain languages do not have any reserved words; e.g., while, do, if, and else are reserved in C, but not in PL/1
Example of using the DO loop in FORTRAN
• In FORTRAN, some keywords are context-dependent
• In the statement DO 10 I = 10.86
• DO10I is an identifier, and DO is not a keyword
• But in the statement DO 10 I = 10, 86
• DO is a keyword
• Such features require substantial lookahead for resolution
• In the example above, we cannot be sure that DO is a keyword until we see the comma (after 10)
• Blanks are not significant in FORTRAN and can appear in the midst of identifiers, but not so in C
• Lexical analysis cannot catch any significant errors, only simple ones such as illegal symbols
• In such cases, lexical analysis skips characters in the input until a well-formed token is found
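The lookahead needed for the DO example can be sketched as follows. This is a deliberately simplified heuristic, not a FORTRAN scanner: the function name `do_is_keyword` is hypothetical, and the sketch ignores string literals, comments, and continuation lines.

```python
def do_is_keyword(statement):
    """Guess whether a statement starting with DO is a loop header.

    Simplified rule: DO is a keyword only if a comma appears after '='
    outside any parentheses. Otherwise the statement is an assignment
    to an identifier such as DO10I.
    """
    text = statement.replace(" ", "").upper()  # blanks are not significant
    if not text.startswith("DO") or "=" not in text:
        return False
    rhs = text.split("=", 1)[1]
    depth = 0
    for ch in rhs:  # scan ahead; the decision may come late in the line
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return True
    return False

print(do_is_keyword("DO 10 I = 10, 86"))  # loop header
print(do_is_keyword("DO 10 I = 10.86"))   # assignment to DO10I
```

The scan cannot stop at the `=` sign: only the comma (or its absence) on the right-hand side settles the question.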
Languages
• Symbol: An abstract entity, not defined
• Examples: letters {a,b,c,…,z} and digits {0,1,..,9}
• String: A finite sequence of symbols
• abcb, caba are strings over the symbols {a,b,c}
• |w| is the length of the string w, i.e., the number of symbols in it
• ∊ is the empty string and is of length 0
• Alphabet: A finite set of symbols (e.g., {a,b,c,…,z}, {0,1,..,9} )
• Language: A set of strings of symbols from some alphabet
• Φ (empty language) and {∊} (set with empty string) are languages
• The set of palindromes over {0,1} is an infinite language
• The set of strings, {01, 10, 111} over {0,1} is a finite language
• If Σ is an alphabet, Σ∗ is the set of all strings over Σ
• We need a ‘finite representation’ (encoded by a finite string) for a language
• A regular language (or type-3) is represented by a regular expression
• A context-free language (or type-2) is represented by a context-free grammar
• A context-sensitive language (or type-1) is represented by a context-sensitive grammar
• A type-0 language is represented by a type-0 grammar
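Because Σ∗ and the palindrome language are infinite, a program can only list a finite slice of them. The sketch below enumerates all strings up to a chosen length and filters out the palindromes; the helper name `strings_up_to` is hypothetical.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """All strings over `alphabet` of length <= max_len (a finite slice of Σ*)."""
    return ["".join(p)
            for n in range(max_len + 1)
            for p in product(alphabet, repeat=n)]

sigma = ["0", "1"]
slice_of_sigma_star = strings_up_to(sigma, 3)

# The empty string has length 0 and is the first element of the slice.
print(len(slice_of_sigma_star))

# Palindromes over {0,1}, restricted to the same finite slice.
palindromes = [w for w in slice_of_sigma_star if w == w[::-1]]
print(palindromes)
```

Raising `max_len` keeps producing new palindromes, which is why the full language needs a finite description (such as a grammar) rather than a list.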
Regular Expressions (REs)
• Let Σ be an alphabet. The REs over Σ and the languages they denote (or generate) are defined as
below
• φ (empty language/set, not even empty string) is an RE. L(φ) = φ
• ∊ (empty string) is an RE. L(∊) = {∊}
• For each a ∈ Σ, a is an RE. L(a) = {a}
• E.g., if Σ = {1,2,3,4}, then each of 1, 2, 3, 4 is an RE, with L(1)={1}, L(2)={2},…
• If r and s are REs denoting the languages R and S, respectively, then:
• Concatenation: (rs) is an RE, L(rs) = R.S = {xy | x ∈ R ∧ y ∈ S}
• Union: (r + s) (or (r|s)) is an RE, L(r + s) = R ∪ S
• Kleene closure/closure: (r∗) is an RE, L(r∗) = R∗ = ⋃ (i ≥ 0) R^i, where R^0 = {∊} and R^i = R.R^(i−1) for i ≥ 1

(R∗ is called the Kleene closure or closure of R)
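The three operations can be tried out on small finite languages represented as Python sets of strings. Since R∗ is usually infinite, the hypothetical `kleene` helper below takes a length bound (an assumption of this sketch) and returns only the strings of R∗ up to that length.

```python
def concat(R, S):
    """R.S = {xy | x in R and y in S}."""
    return {x + y for x in R for y in S}

def union(R, S):
    """R ∪ S."""
    return R | S

def kleene(R, max_len):
    """Strings of R* with length <= max_len.

    Starts from R^0 = {""} and keeps concatenating with R until no new
    string within the length bound appears.
    """
    result = {""}
    frontier = {""}
    while frontier:
        frontier = {w for w in concat(frontier, R) if len(w) <= max_len} - result
        result |= frontier
    return result

print(sorted(concat({"a"}, {"b", "c"})))
print(sorted(kleene({"1", "10"}, 3)))
```

Note that the closure of the empty language is `{""}`, matching the definition R^0 = {∊}.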

Examples of Regular Expressions
• Given L = set of all strings of 0’s and 1’s, what is the RE?
• r = (0 + 1)* or (0|1)*
• How do we generate the string 101 ?
• (0 + 1)∗ ⇒ (0 + 1)(0 + 1)(0 + 1) ⇒ 101
• Given L = set of all strings of 0’s and 1’s with at least two consecutive 0’s, what is the RE?
• r = (0 + 1)∗00(0 + 1)∗
• Given L = {w ∈ {0, 1}∗ | w has two or three occurrences of 1, the first and second of which are not consecutive}, what is the RE?
• r = 0∗10∗010∗(10∗ + ∊)
• Given r = (1 + 10)∗, what is the language?
• L = set of all strings of 0’s and 1’s that are either empty or begin with 1, and that do not have two consecutive 0’s
• Given r = (0 + 1)∗011, what is the language?
• L = set of all strings of 0’s and 1’s ending in 011
Examples of Regular Expressions
• Given r = c∗(a + bc∗)∗ , what is the language?
• L = set of all strings over {a,b,c} that do not have the substring ac
• Given L = {w | w ∈ {a, b}∗ ∧ w ends with a} , what is the RE?
• r = (a + b)∗a
• Given L = {if, then, else, while, do, begin, end}, what is the RE?
• r = if + then + else + while + do + begin + end
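The RE/language pairs in these examples can be checked by brute force on short strings. The sketch below translates each RE into Python `re` syntax (an assumption of this sketch: `+` becomes `|`, `∗` becomes `*`, and `(x + ∊)` becomes `(x)?`) and compares `fullmatch` against a direct test of the stated property. The names `CASES`, `words`, and `mismatches` are hypothetical.

```python
import re
from itertools import product

def words(alphabet, max_len):
    """Yield every string over `alphabet` of length <= max_len."""
    for n in range(max_len + 1):
        for p in product(alphabet, repeat=n):
            yield "".join(p)

def ones_ok(w):
    """Two or three 1's, the first and second of which are not adjacent."""
    pos = [i for i, ch in enumerate(w) if ch == "1"]
    return len(pos) in (2, 3) and pos[1] - pos[0] > 1

# (pattern in Python syntax, alphabet, property the language should satisfy)
CASES = [
    (r"(0|1)*00(0|1)*",  "01",  lambda w: "00" in w),
    (r"0*10*010*(10*)?", "01",  ones_ok),
    (r"(1|10)*",         "01",  lambda w: (w == "" or w[0] == "1") and "00" not in w),
    (r"(0|1)*011",       "01",  lambda w: w.endswith("011")),
    (r"c*(a|bc*)*",      "abc", lambda w: "ac" not in w),
    (r"(a|b)*a",         "ab",  lambda w: w.endswith("a")),
]

def mismatches(max_len=6):
    """Return (pattern, string) pairs where the RE and the property disagree."""
    bad = []
    for pattern, alphabet, prop in CASES:
        rx = re.compile(pattern)
        for w in words(alphabet, max_len):
            if bool(rx.fullmatch(w)) != prop(w):
                bad.append((pattern, w))
    return bad

print(mismatches())
```

An empty result means the RE and the described language agree on every string up to the length bound; it is evidence, not a proof, of equivalence.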
Automata
• Automata are machines (abstract machines) that accept languages
• Finite State Automata accept RLs (corresponding to REs)
• Pushdown Automata accept CFLs (corresponding to CFGs)
• Linear Bounded Automata accept CSLs (corresponding to CSGs)
• Turing Machines accept type-0 languages (corresponding to type-0 grammars)
• Applications of Automata
• Switching circuit design
• Lexical analyzer in a compiler
• String processing (grep, awk), etc.
• State charts used in object-oriented design
• Modelling control applications, e.g., elevator operation
• Parsers of all types
• Compilers
Finite State Automaton (FSA)
• An FSA is an acceptor or recognizer of regular languages
• An FSA is a 5-tuple, (Q, Σ, δ, q0, F), where
• Q is a finite set of states
• Σ is the input alphabet
• δ is the transition function, δ : Q × Σ → Q
• That is, δ(q, a) is a state for each state q and input symbol a
• q0 is the start state
• F is the set of final or accepting states
• In one move from some state q, an FSA reads an input symbol, changes the state based on δ, and
gets ready to read the next input symbol
• An FSA accepts its input string, if starting from q0, it consumes the entire input string, and
reaches a final state
• If the last state reached is not a final state, then the input string is rejected
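The acceptance procedure just described can be sketched as a short table-driven simulator. The machine used here is an illustrative one chosen for this sketch, not one from the lecture: a four-state FSA over {0, 1} for the language of (0 + 1)∗011, with hypothetical names `DELTA`, `START`, `FINALS`, and `accepts`.

```python
# States track how much of the suffix 011 has been seen so far.
# q0: nothing useful, q1: "0", q2: "01", q3: "011" (accepting)
DELTA = {
    ("q0", "0"): "q1", ("q0", "1"): "q0",
    ("q1", "0"): "q1", ("q1", "1"): "q2",
    ("q2", "0"): "q1", ("q2", "1"): "q3",
    ("q3", "0"): "q1", ("q3", "1"): "q0",
}
START = "q0"
FINALS = {"q3"}

def accepts(text, delta=DELTA, start=START, finals=FINALS):
    """Run the FSA on `text`; accept if the last state reached is final.

    A symbol outside the alphabet raises KeyError, because the transition
    function is defined only on Q x Σ.
    """
    state = start
    for symbol in text:          # one move per input symbol
        state = delta[(state, symbol)]
    return state in finals

for w in ["011", "1011", "0110", ""]:
    print(repr(w), accepts(w))
```

Because δ is a total function on Q × Σ, the simulator performs exactly one dictionary lookup per input symbol and never needs to backtrack.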
FSA example
• Q = {q0, q1, q2, q3} -> finite set of states
• Σ = {a, b, c} -> the input alphabet
• q0 is the start state and F = {q0, q2} (F -> set of final or accepting states)
• The transition function δ is defined by a transition table (figure not reproduced)
• Language accepted by the FSA?
• L is the set of all strings beginning with an ’a’ and ending with a ’c’ (∊ is also accepted)
