0% found this document useful (0 votes)
67 views21 pages

Lexical Analysis: Programming Languages Translators

The document discusses lexical analysis in programming language translators. Lexical analysis involves breaking the source code text into tokens through processes like recognizing keywords, identifiers, numbers, operators, and punctuation. It creates a stream of tokens from the character input by grouping characters into meaningful units like identifiers, numbers, strings, and punctuation symbols. The lexical analyzer represents each unique token with a numeric code to simplify later parsing.

Uploaded by

Anwar Mohamed
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
67 views21 pages

Lexical Analysis: Programming Languages Translators

The document discusses lexical analysis in programming language translators. Lexical analysis involves breaking the source code text into tokens through processes like recognizing keywords, identifiers, numbers, operators, and punctuation. It creates a stream of tokens from the character input by grouping characters into meaningful units like identifiers, numbers, strings, and punctuation symbols. The lexical analyzer represents each unique token with a numeric code to simplify later parsing.

Uploaded by

Anwar Mohamed
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 21

Programming Languages Translators

Lexical Analysis
Tasks of a Scanner

recognizes the keywords of the language


these are the reserved words that have a special meaning
in the language, such as the word class in Java
recognizes special characters, such as ( and ), or
groups of special characters, such as := and ==
recognizes identifiers, integers, reals, decimals,
strings, etc
ignores whitespaces (tabs, blanks, etc) and
comments
recognizes and processes special directives (such as
the #include "file" directive in C) and macros
Lexical Analysis
A scanner groups input characters into tokens
input: x = x * (acc+123)
token value
identifier x
equal =
identifier x
star*
left-paren (
identifier acc
plus +
integer 123
right-paren )
Tokens are typically represented by numbers
Lexical Analysis

Lexical analyzer splits it into tokens


Token = sequence of characters (symbolic
name) representing a single terminal symbol
Identifiers: myVariable
Literals: 123 5.67 true
Keywords: char sizeof
Operators: + - * /
Punctuation: ; , } {
Discards whitespace and comments
Examples of Tokens in C
Tokens Lexemes
identifier Age, grade,Temp, zone, q1
number 3.1416, -498127,987.76412097
string A cat sat on a mat., 90183654
open parentheses (
close parentheses )
Semicolon ;
reserved word if IF, if, If, iF
Examples of Tokens in C

Lexical analyzer usually represents each token


by a unique integer code
+ { return(PLUS); } // PLUS = 401
- { return(MINUS); } // MINUS = 402
* { return(MULT); } // MULT = 403
/ { return(DIV); } // DIV = 404
Some tokens require regular expressions
[a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier
[1-9][0-9]* { return(DECIMALINT); }
0[0-7]* { return(OCTALINT); }
(0x|0X)[0-9a-fA-F]+ { return(HEXINT); }
slide 6
Example

Token Informal description Sample lexemes


if Characters i, f if
else Characters e, l, s, e else
comparison < or > or <= or >= or == or != <=, !=

id Letter followed by letter and digits pi, score, D2


number Any numeric constant 3.14159, 0, 6.02e23
literal Anything but sorrounded by core dumped

printf(total = %d\n, score);


Redefining Identifiers can be
dangerous

program confusing;
const true = false;
begin
if (a<b) = true then
f(a)
else
Whitespace

Whitespace is any space, tab, end-of-line


character (or characters), or character
sequence inside a comment
No token may contain embedded whitespace
(unless it is a character or string literal)
Example:
>= one token
> = two tokens
Reserved Keywords in C

auto, break, case, char, const, continue,


default, do, double, else, enum, extern, float,
for, goto, if, int, long, register, return, short,
signed, sizeof, static, struct, switch, typedef,
union, unsigned, void, volatile, wchar_t, while
C++ added a bunch: bool, catch, class,
dynamic_cast, inline, private, protected,
public, static_cast, template, this, virtual and
others
Each keyword is mapped to its own token
slide 10
Lexical Analysis

The process of converting a character stream into a


corresponding sequence of meaningful symbols
(called tokens or lexemes) is called tokenizing, lexing
or lexical analysis. A program that performs this
process is called a tokenizer, lexer, or scanner.
In Scheme, we tokenize (set! x (+ x 1)) as
( set! x ( + x 1 ) )
Similarly, in Java, we tokenize
System.out.println("Hello World!"); as
System . out . println ( "Hello
World!" ) ;
Parsing Process
Call the scanner to get tokens

Build a parse tree from the stream of tokens


A parse tree shows the syntactic structure of the
source program.

Add information about identifiers in the symbol


table
Report error, when found, and recover from the
error
12
Parsing
Parsing is a process that constructs a syntactic
structure (i.e. parse tree) from the stream of
tokens.
We already learn how to describe the syntactic
structure of a language using (context-free)
grammar.
So, a parser only need to do this?

Stream of tokens
Parser Parse tree
Context-free grammar
Sentinels

E = M eof * C * * 2 eof eof

Switch (*forward++) {
case eof:
if (forward is at end of first buffer) {
reload second buffer;
forward = beginning of second buffer;
}
else if {forward is at end of second buffer) {
reload first buffer;\
forward = beginning of first buffer;
}
else /* eof within a buffer marks the end of input */
terminate lexical analysis;
break;
cases for the other characters;
}
Transition diagrams

Transition diagram for relop


Transition diagrams (cont.)

Transition diagram for reserved words and


identifiers
Transition diagrams (cont.)

Transition diagram for unsigned numbers


Recognition

state = 0;
while ( (c = next_char() ) != EOF ) {
switch (state) {
case 0: if ( c == a ) state = 1;
break;
case 1: if ( c == b ) state = 2;
break;
case 2: if ( c == c ) state = 3;
break;
case 3: if ( c == a ) state = 1;
else { ungetchar(); return (TRUE); }
break;
default:
error();
}
}
if ( state == 3 ) return (TRUE) else return (FALSE);
Finite Automata for the Lexical Tokens

a- z a- z
i f 0-9
2 0-9
1 2 3 1 2
1
0-9

IF ID NUM

0-9 0-9
0-9
. 1 - 2 - 3
\n
4
a- z
1 2 3 1 2
any but \n
. blank, etc.
5 blank, etc.
4 0-9 5 0-9

REAL White space error


(and comment starting with - -)

(Appel, pp. 21)


LEXICAL ANALYSIS

Lexical Errors
Deleting an extraneous character
Inserting a missing character
Replacing an incorrect character by a correct
character
Transposing two adjacent characters(such as ,
fi=>if)
Pre-scanning
Tokens / Patterns / Regular Expressions
Lexical Analysis - searches for matches of lexeme to pattern
Lexical Analyzer returns:<actual lexeme, symbolic identifier of token>

For Example: Token Symbolic ID


if 1
then 2
else 3
>,>=,<, 4
Set of all regular := 5
expressions plus
id 6
symbolic ids plus
analyzer define required int 7
functionality. real 8

algs algs
REs --- NFA --- DFA (program for simulation)

You might also like