Compiler Construction: Chapter # 2 - Lexical Analysis Instructor: Ms. Raazia Sosan
Compiler – Front End
Analysis Phase
• Syntax
– Proper form of program
• Semantics
– What a program means; what each program does when it executes
The Role of the Lexical Analyzer
The Role of the Lexical Analyzer
• The main task of the lexical analyzer is to
– read the input characters of the source program and group them into lexemes, and
– produce as output a sequence of tokens, one for each lexeme in the source program.
• The stream of tokens is sent to the parser for syntax analysis.
• It is common for the lexical analyzer to interact with the symbol table as well: when it discovers a lexeme constituting an identifier, it needs to enter that lexeme into the symbol table.
• It may perform certain other tasks besides identification of lexemes. One such task is stripping out comments and whitespace.
• Another task is correlating error messages generated by the compiler with the source program.
Lexical Analyzer - Process
• Lexical analyzers are divided into a cascade of two processes:
– Scanning consists of the simple processes that do not require
tokenization of the input, such as deletion of comments and
compaction of consecutive whitespace characters into one.
– Lexical analysis proper is the more complex portion, where the
scanner produces the sequence of tokens as output.
Tokens
• A token is a pair consisting of a token name and an optional
attribute value.
• The token name is an abstract symbol representing a kind of
lexical unit, e.g., a particular keyword, or a sequence of input
characters denoting an identifier.
• The token names are the input symbols that the parser
processes.
• In what follows, we shall generally write the name of a token in
boldface. We will often refer to a token by its token name.
Patterns
• A pattern is a description of the form that the lexemes of a
token may take.
• In the case of a keyword as a token, the pattern is just the
sequence of characters that form the keyword. For identifiers
and some other tokens, the pattern is a more complex
structure that is matched by many strings.
Lexemes
• A lexeme is a sequence of characters in the source program
that matches the pattern for a token and is identified by the
lexical analyzer as an instance of that token.
Example 3.1
printf ( "Total = %d\n" , score ) ;
• both printf and score are lexemes matching the pattern for
token id, and "Total = %d\n" is a lexeme matching the pattern for token literal.
• In many programming languages, the following classes cover
most or all of the tokens:
Example 3.1 contd.
• One token for each keyword. The pattern for a keyword is the
same as the keyword itself.
• Tokens for the operators, either individually or in classes such
as the token comparison mentioned in Fig. 3.2.
• One token representing all identifiers.
• One or more tokens representing constants, such as numbers
and literal strings .
• Tokens for each punctuation symbol, such as left and right
parentheses, comma, and semicolon.
Example 3.2
• Write the token names and associated attribute values for the
Fortran statement
E = M * C ^ 2
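A worked sketch of the answer. Each token is written as a <token-name, attribute-value> pair, as the Dragon book does; the operator token names below are illustrative, and identifier attributes are pointers into the symbol table:

```text
<id, pointer to symbol-table entry for E>
<assign_op>
<id, pointer to symbol-table entry for M>
<mult_op>
<id, pointer to symbol-table entry for C>
<exp_op>
<number, integer value 2>
```

Note that single-lexeme tokens such as assign_op need no attribute value, while id and number carry one.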
Input Buffering
• Because of the amount of time taken to process characters and
the large number of characters that must be processed during
the compilation of a large source program, specialized
buffering techniques have been developed to reduce the
amount of overhead required to process a single input
character.
Input Buffering - Buffer Pairs
• Each buffer is of the same size N, and N is usually the size of a
disk block, e.g., 4096 bytes.
• Using one system read command we can read N characters
into a buffer, rather than using one system call per character.
• If fewer than N characters remain in the input file, then a
special character, represented by eof, marks the end of the
source file and is different from any possible character of the
source program.
• Two pointers to the input are maintained:
– Pointer lexemeBegin marks the beginning of the current lexeme, whose extent we are attempting to determine.
– Pointer forward scans ahead until a pattern match is found.
Input Buffering - Buffer Pairs
Input Buffering - Buffer Pairs
• Once the next lexeme is determined, forward is set to the
character at its right end. Then, after the lexeme is recorded as
an attribute value of a token returned to the parser,
lexemeBegin is set to the character immediately after the
lexeme just found.
• In Fig. 3.3, we see forward has passed the end of the next
lexeme and must be retracted one position to its left.
• Advancing forward requires that we first test whether we have
reached the end of one of the buffers, and if so, we must
reload the other buffer from the input and move forward to the beginning of the newly loaded buffer.
Input Buffering - Sentinels
• We must check, each time we advance forward, that we have not
moved off one of the buffers; if we do, then we must also reload
the other buffer.
• Thus, for each character read, we make two tests: one for the end
of the buffer, and one to determine what character is read (the
latter may be a multiway branch).
• We can combine the buffer-end test with the test for the current
character if we extend each buffer to hold a sentinel character at
the end.
• The sentinel is a special character that cannot be part of the source
program, and a natural choice is the character eof.
Input Buffering - Sentinels
Challenge Task
• Write a program for input buffer with the desired pointers.
Lexical Analyzer - Process
• Lexical Analyzer
– Specification of Tokens (regular expressions)
– Recognition of Tokens:
· Thompson Construction (RE -> NFA)
· Subset Construction (NFA -> DFA)
Specification of Tokens
• Regular expressions are an important notation for specifying
lexeme patterns.
Strings and Languages
• An alphabet is any finite set of symbols. Typical examples of
symbols are letters, digits, and punctuation. The set {0, 1} is
the binary alphabet. ASCII is an important example of an
alphabet; it is used in many software systems.
• A string over an alphabet is a finite sequence of symbols drawn
from that alphabet. In language theory, the terms "sentence"
and "word" are often used as synonyms for "string." The length
of a string s, usually written |s|, is the number of occurrences
of symbols in s. For example, banana is a string of length six.
The empty string, denoted ε, is the string of length zero.
Strings and Languages
Operations on Languages
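The slide's table is not reproduced here; the standard definitions of the operations on languages L and M are:

```latex
L \cup M = \{\, s \mid s \in L \text{ or } s \in M \,\} \qquad
L M = \{\, st \mid s \in L \text{ and } t \in M \,\} \qquad
L^{i} = \underbrace{L L \cdots L}_{i\ \text{times}}, \quad L^{0} = \{\epsilon\}
```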
Kleene star
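The Kleene star (closure) of a language L, written L*, is the set of strings obtained by concatenating zero or more strings of L:

```latex
L^{*} = \bigcup_{i = 0}^{\infty} L^{i}
```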
Kleene plus
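The Kleene plus (positive closure) is the same union starting at one, so ε belongs to L+ only if it already belongs to L:

```latex
L^{+} = \bigcup_{i = 1}^{\infty} L^{i} = L\,L^{*}
```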
Example of Kleene closure
• Example of Kleene star applied to set of strings:
– {"ab","c"}* = {ε, "ab", "c", "abab", "abc", "cab", "cc", "ababab", "ababc",
"abcab", "abcc", "cabab", "cabc", "ccab", "ccc", ...}.
• Example of Kleene star applied to set of characters:
– {"a", "b", "c"}* = { ε, "a", "b", "c", "aa", "ab", "ac", "ba", "bb", "bc", "ca",
"cb", "cc", "aaa", "aab", ...}.
• Example of Kleene star applied to the empty set:
– ∅* = {ε}.
• Example of Kleene plus applied to the empty set:
– ∅+ = ∅ ∅* = { } = ∅
Example 3.3
• Let L be the set of letters {A, B, . . . , Z, a, b, . . . , z} and let D
be the set of digits {0, 1, . . . , 9}. We may think of L and D in
two, essentially equivalent, ways. One way is that L and D are,
respectively, the alphabets of uppercase and lowercase letters
and of digits. The second way is that L and D are languages, all
of whose strings happen to be of length one. Here are some
other languages that can be constructed from languages L and
D.
Example 3.3 - Solution
• L ∪ D is the set of letters and digits - strictly speaking the
language with 62 strings of length one, each of which strings is
either one letter or one digit.
• LD is the set of 520 strings of length two, each consisting of
one letter followed by one digit.
• L^4 is the set of all 4-letter strings.
• L* is the set of all strings of letters, including ε, the empty
string.
• L(L ∪ D)* is the set of all strings of letters and digits beginning
with a letter.
Regular Expressions
• A regular expression is a formula for representing a (complex)
language in terms of “elementary” languages combined using
the three operations union, concatenation and Kleene closure.
• Refer to the book:
Title: Compiler Construction: Principles and Practice
Author(s)/Editor(s): Kenneth C. Louden
Publisher: PWS Publishing Company
ISBN: 0-534-93972-4
Pages: 43-52 and 73-83
Thompson Construction
• Discussed in class
• Refer to the book:
Title: Compiler Construction: Principles and Practice
Author(s)/Editor(s): Kenneth C. Louden
Publisher: PWS Publishing Company
ISBN: 0-534-93972-4
Pages: 43-52 and 73-83
Subset Construction
• DIY
• Refer to the book:
Title: Compiler Construction: Principles and Practice
Author(s)/Editor(s): Kenneth C. Louden
Publisher: PWS Publishing Company
ISBN: 0-534-93972-4
Pages: 43-52 and 73-83
Introduction to FLEX
• FLEX (Fast LEXical analyzer generator) is a tool for generating
scanners. Instead of writing a scanner from scratch, you only
need to identify the vocabulary of a certain language (e.g.
Simple), write a specification of patterns using regular
expressions (e.g. DIGIT [0-9]), and FLEX will construct a scanner
for you.
Environment Setting with FLEX
Flex regular expressions
• s string s literally
• \c character c literally, where c would normally be a lex
operator
• [s] character class
• ^ indicates beginning of line
• [^s] characters not in character class
• [s-t] range of characters
• s? s occurs zero or one time
• . any character except newline
Flex regular expressions
• s+ one or more occurrences of s
• r|s r or s
• (s) grouping
• $ end of line
• s/r s iff followed by r (not recommended) (r is *NOT*
consumed)
• s{m,n} m through n occurrences of s
Examples of regular expressions in flex
• a* zero or more a’s
• .* zero or more of any character except newline
• .+ one or more characters
• [a-z] a lowercase letter
• [a-zA-Z] any alphabetic letter
• [^a-zA-Z] any non-alphabetic character
• a.b a followed by any character followed by b
• rs|tu rs or tu
Examples of regular expressions in flex
• a(b|c)d abd or acd
• ^start the literal characters start at the beginning of a line
• END$ the characters END followed by an end-of-line.
A flex Input File
• flex input files are structured as follows:
%{
Declarations
%}
Definitions
%%
Rules
%%
User subroutines
A flex Input File
• The optional Declarations and User subroutines sections are used
for ordinary C code that you want copied verbatim to the generated
C file. Declarations are copied to the top of the file, user
subroutines to the bottom. The optional Definitions section is
where you specify options for the scanner and can set up definitions
to give names to regular expressions as a simple substitution
mechanism that allows for more readable entries in the Rules section
that follows. The required Rules section is where you specify the
patterns that identify your tokens and the action to perform upon
recognizing them.
flex Global Variables
• The token-grabbing function yylex takes no arguments and returns
an integer. Often more information is needed about the token just
read than that one integer code. The usual way information about
the token is communicated back to the caller is by having the
scanner set the contents of a global variable which can be read by
the caller. After counseling you for years that globals are
absolute evil, we reluctantly sanction their limited use here,
because our tools require we use them.
flex Global Variables - yytext
• yytext is a null-terminated string containing the text of the
lexeme just recognized as a token. This global variable is declared
and managed in the lex.yy.c file. Do not modify its contents. The
buffer is overwritten with each subsequent token, so you must make
your own copy of a lexeme you need to store more permanently.
flex Global Variables - yyleng
• yyleng is an integer holding the length of the lexeme stored in
yytext. This global variable is declared and managed in the
lex.yy.c file. Do not modify its contents.
flex Global Variables - yylval
• yylval is the global variable used to store attributes about the
token, e.g. for an integer lexeme it might store the value, for a
string literal, the pointer to its characters, and so on. This
variable is declared to be of type YYSTYPE, and is usually a union
of all the various fields needed for different token types. If you
are using a parser generator (such as yacc or bison), it will define
this type for you; otherwise, you must provide the definition
yourself. Your scanner actions should appropriately set the
contents of the variable for each token.
flex Global Variables - yylloc
• yylloc is the global variable that is used to store the location
(line and column) of the token. This variable is declared to be of
type YYLTYPE. Again, the parser generator can provide this or it
may be your responsibility. Your scanner actions should
appropriately set the contents of the variable for each token.
Example 1 – ex1.lex

/* either indent or use %{ %} */
%{
int num_lines = 0;
int num_chars = 0;
int num_words = 0;
#define yywrap() 1
%}
%%
\n          {++num_lines; ++num_chars;}
[^ \t\n]+   {++num_words; num_chars += yyleng;}
.           ++num_chars;
%%
int main(int argc, char **argv)
{
    yylex();
    printf("# of lines = %d, # of chars = %d, # of words = %d\n",
           num_lines, num_chars, num_words);
    return 0;
}
Compiling .lex file
• On the command line, execute the following commands:
flex ex1.lex
• This will generate lex.yy.c file
gcc lex.yy.c –o ex1
• Then to execute your scanner execute
ex1.exe
Example 2

%{
#define yywrap() 1
%}
digits   [0-9]
ltr      [a-zA-Z]
alphanum [a-zA-Z0-9]
%%
(-|\+)*{digits}+   printf("found number: '%s'\n", yytext);
'.'                printf("found character: {%s}\n", yytext);
.                  { /* absorb others */ }
%%
int main(int argc, char **argv)
{
    yylex();
    return 0;
}
Example 3

%{
#define yywrap() 1
%}
%%
[0-9]+   printf("?");
%%
int main()
{
    yylex();
    return 0;
}
Reading Assignment
• 3.1.1 Lexical Analysis Versus Parsing
• 3.1.4 Lexical Errors
• 3.3.5 Extensions of Regular Expressions
Assignment # 1
• Due Date: 6th July, 2017
THANK YOU