Lexical Analysis (Scanner)

Uploaded by

Hadeer Anwar
Copyright
© All Rights Reserved

Lexical Analysis (scanner)

[Figure: Source Program (character stream) → Lexical analyzer → Tokens]

▪ The lexical analyzer is also known as a scanner.
[Figure: source program → lexical analyzer → (token & attributes) → syntax analyzer. The syntax analyzer asks the lexical analyzer to "get next token"; the symbol table is shared by both.]

The purpose of lexical analyzers:

▪ Lexical analysis takes a stream of input characters and decodes
it into higher-level tokens that a syntax analyzer (parser) can
understand.
▪ The lexical analyzer breaks the source text into a series of tokens
and discards whitespace and comments along the way.
Example source statements:
count = 1
position = initial + rate * 60

• The lexical analyzer reads the stream of characters making
up the source program and groups the characters into
meaningful sequences called lexemes.
• For each lexeme, the lexical analyzer produces as output a
token of the form (token-attribute pair):
(Token-type, attribute-value)
• In other words, each lexical unit is treated as a pair made up of
two parts: (Token-type, attribute-value).
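The (Token-type, attribute-value) pair can be sketched as a small Java class. This is only an illustration: the class name `TokenPair` and its field names are assumptions, not part of the slides.

```java
/** A token represented as a (token-type, attribute-value) pair. Names are illustrative. */
final class TokenPair {
    final String type;       // e.g. "id", "assign", "int"
    final String attribute;  // e.g. a symbol-table index or a literal value

    TokenPair(String type, String attribute) {
        this.type = type;
        this.attribute = attribute;
    }

    @Override
    public String toString() {
        return "(" + type + "," + attribute + ")";
    }

    public static void main(String[] args) {
        // The statement "count = 1" written out by hand as three pairs.
        TokenPair[] tokens = {
            new TokenPair("id", "1"),       // 1 = symbol-table entry for count
            new TokenPair("assign", "="),
            new TokenPair("int", "1")       // 1 = the literal value
        };
        StringBuilder line = new StringBuilder();
        for (TokenPair t : tokens) {
            line.append(t).append(' ');
        }
        System.out.println(line.toString().trim());  // (id,1) (assign,=) (int,1)
    }
}
```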
TOKENS, PATTERNS, AND LEXEMES:
In the English language, the words of a sentence such as "I learn compiler design" are classified as:
noun, verb, adjective, …
In a programming language, the lexemes are classified as:
Identifier, Integer, Keyword, Whitespace, …
Example: The program statement
count = 1

Sequence of token-attribute pairs produced by the lexical analyzer:

count = 1 → (id,1) (assign,=) (int,1)
Example: The program statement
position = initial + rate * 60

o position is a lexeme that would be mapped into a token (id, 1),
where id is an abstract symbol standing for identifier
and 1 points to the symbol-table entry for position.
o The assignment symbol = is a lexeme that is mapped into the token (=)
o initial is a lexeme that is mapped into the token (id, 2)
o + is a lexeme that is mapped into the token (+)
o rate is a lexeme that is mapped into the token (id, 3)
o * is a lexeme that is mapped into the token (*)
o 60 is a lexeme that is mapped into the token (60)
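The mapping above can be reproduced with a very small hand-written scanner. The sketch below is a simplification under stated assumptions: identifiers are letters, digits, and underscores, numbers are unsigned integers, and every other non-blank character is a one-character token. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PositionScanner {
    /** Scans src and records each new identifier in symbols (name -> entry number). */
    static List<String> scan(String src, Map<String, Integer> symbols) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }  // whitespace only separates lexemes
            int start = i;
            if (Character.isLetter(c) || c == '_') {
                while (i < src.length()
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) i++;
                String name = src.substring(start, i);
                Integer entry = symbols.get(name);
                if (entry == null) {                 // first occurrence: add a symbol-table entry
                    entry = symbols.size() + 1;
                    symbols.put(name, entry);
                }
                tokens.add("(id, " + entry + ")");
            } else if (Character.isDigit(c)) {
                while (i < src.length() && Character.isDigit(src.charAt(i))) i++;
                tokens.add("(" + src.substring(start, i) + ")");
            } else {
                tokens.add("(" + c + ")");           // single-character operator
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Map<String, Integer> symbols = new LinkedHashMap<>();
        List<String> tokens = scan("position = initial + rate * 60", symbols);
        System.out.println(String.join(" ", tokens));
        // (id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
        System.out.println(symbols);
        // {position=1, initial=2, rate=3}
    }
}
```

Using an insertion-ordered map means the entry numbers follow the order in which identifiers first appear, which is what the example assumes.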
Example: The program statement

y := 31 + 28 * x

The lexical analyzer passes the following tokens to the parser:

<id, "y"> <assign, > <num, 31> <+, > <num, 28> <*, > <id, "x">

In each pair, the first component is the token and the second is the token value (token attribute).
▪ In this phase, the compiler builds a table of all identifiers in
the source program, called the symbol table.
Symbol Table Manager:

The symbol table is an important data structure
created and maintained by compilers in order to
store all identifiers used in the source program,
together with information about the occurrence of
various entities such as:
variable names,
function names,
objects,
classes,
interfaces, etc.
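A minimal sketch of such a table in Java follows. It assumes that each entry stores only a running index, the name, and a kind such as "variable" or "function". Real compilers store much more (type, scope, location), and all names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SymbolTableSketch {
    /** One row of the table; the fields are illustrative assumptions. */
    static final class Entry {
        final int index;
        final String name;
        final String kind;

        Entry(int index, String name, String kind) {
            this.index = index;
            this.name = name;
            this.kind = kind;
        }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();

    /** Returns the existing entry for name, or creates one with the next index. */
    Entry insert(String name, String kind) {
        Entry e = entries.get(name);
        if (e == null) {
            e = new Entry(entries.size() + 1, name, kind);
            entries.put(name, e);
        }
        return e;
    }

    /** Returns the entry for name, or null if the name was never inserted. */
    Entry lookup(String name) {
        return entries.get(name);
    }

    public static void main(String[] args) {
        SymbolTableSketch table = new SymbolTableSketch();
        table.insert("position", "variable");
        table.insert("rate", "variable");
        table.insert("main", "function");
        System.out.println(table.lookup("rate").index);      // 2
        System.out.println(table.lookup("missing") == null); // true
    }
}
```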
The symbol table for C++ Code

Example:
[Figure: example C++ code and its symbol table]
Lexical Tokens

Function of lexical analyzer:

▪ Read source code (a stream of characters)
▪ Return a stream of tokens, recognizing words such as:

1) Keywords :- while, if, for, …
2) Identifiers :- declared by the programmer
3) Operators :- +, -, *, /, ==, =, …
4) Numeric constants :- numbers such as 12, 35.5, .9E-23, etc.
5) Character constants :- single characters or strings of characters enclosed in quotes.
6) Special characters :- characters used as delimiters such as . ( ) , ; :
7) Comments :- recognized by the scanner but not passed on to subsequent phases.
Examples of words are:

(1) keywords - while, if, else, for, ...
• These are words which may have a particular predefined
meaning to the compiler.
• Reserved words are keywords which are not available to
the programmer for use as identifiers.
• In most programming languages, such as Java and C, all
keywords are reserved.
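Because reserved words look exactly like identifiers, one common approach is to scan a word first and then check it against a table of reserved words. The sketch below assumes the word has already been recognized as letter-like; the keyword set is a small, deliberately incomplete sample, and the names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WordClassifier {
    // A small sample of reserved words; a real scanner would list all of them.
    static final Set<String> RESERVED =
        new HashSet<>(Arrays.asList("while", "if", "else", "for", "int", "return"));

    /** Classifies a word that already matched the identifier pattern. */
    static String classify(String word) {
        return RESERVED.contains(word) ? "keyword" : "identifier";
    }

    public static void main(String[] args) {
        for (String w : new String[] {"while", "value", "if", "count"}) {
            System.out.println(w + " -> " + classify(w));
        }
    }
}
```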

(2) identifiers
• words that the programmer constructs to attach a
name to a construct. Identifiers may be used to identify
variables, classes, constants, functions, etc.

(3) operators
• symbols used for arithmetic, character, or
logical operations, such as +, -, =, !=, etc.

(4) numeric constants
• numbers such as 124, 12.35, 0.09E-23, etc.
• Numeric constants may be stored in a table.

(5) character constants
• single characters or strings of characters enclosed
in quotes.

(6) special characters
• characters used as delimiters such as . , ( ) { } ;
These are generally single-character words.
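The numeric constants in item (4) follow a pattern that can be written as a regular expression. The pattern below is one possible sketch for unsigned decimal constants with an optional fraction and exponent. It is an assumption for illustration and does not cover hexadecimal literals, suffixes, or signs.

```java
import java.util.regex.Pattern;

final class NumericConstants {
    // digits with an optional fraction, or a fraction alone, then an optional exponent
    static final Pattern NUMBER =
        Pattern.compile("(\\d+\\.?\\d*|\\.\\d+)([eE][+-]?\\d+)?");

    /** True if the whole lexeme is a numeric constant under the pattern above. */
    static boolean isNumericConstant(String lexeme) {
        return NUMBER.matcher(lexeme).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"124", "12.35", "0.09E-23", ".9E-23", "12a", "E5"}) {
            System.out.println(s + " -> " + isNumericConstant(s));
        }
    }
}
```

The first four inputs are accepted; `12a` and `E5` are rejected because the pattern must match the entire lexeme and must start with a digit or a decimal point.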

(7) comments
• Though comments must be detected in the lexical
analysis phase, they are not put out as tokens to the
next phase of compilation.

(8) white space
• In most languages, spaces and tabs serve only as
delimiters; they are otherwise ignored by the compiler
and are not put out as tokens.

(9) new line
• In languages with free format, newline characters
should also be ignored.
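Items (7) to (9) can be sketched as a loop that detects comments, whitespace, and newlines but never emits them. The example assumes C-style `//` and `/* */` comments and simply collects the remaining runs of non-blank characters. It does not split operators from operands and does not protect comment markers inside string literals. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipBlanksAndComments {
    /** Returns the non-blank chunks of src that lie outside comments. */
    static List<String> lexemes(String src) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (c == '/' && i + 1 < n && src.charAt(i + 1) == '/') {
                flush(cur, out);
                while (i < n && src.charAt(i) != '\n') i++;   // line comment: skip to end of line
            } else if (c == '/' && i + 1 < n && src.charAt(i + 1) == '*') {
                flush(cur, out);
                int end = src.indexOf("*/", i + 2);
                i = (end < 0) ? n : end + 2;                  // unterminated: skip to end of input
            } else if (Character.isWhitespace(c)) {
                flush(cur, out);                              // spaces, tabs, newlines only delimit
                i++;
            } else {
                cur.append(c);
                i++;
            }
        }
        flush(cur, out);
        return out;
    }

    private static void flush(StringBuilder cur, List<String> out) {
        if (cur.length() > 0) {
            out.add(cur.toString());
            cur.setLength(0);
        }
    }

    public static void main(String[] args) {
        String src = "x = 1; // set x\n/* block */ y\t= 2;";
        System.out.println(lexemes(src));  // [x, =, 1;, y, =, 2;]
    }
}
```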
An example of source input:

• In the C language, the variable declaration line

int value = 100;

contains the tokens:

int (keyword)
value (identifier)
= (assignment operator)
100 (integer)
; (symbol (end of statement))
An example of source input:

average = (sum/count)

average identifier
= assignment operator
( open parenthesis
sum identifier
/ division operator
count identifier
) close parenthesis
An example of source input:

How many tokens are there in this C code?
printf ("i = %d, &i = %x", i, &i) ;

output (one token per line):
printf
(
"i = %d, &i = %x"
,
i
,
&
i
)
;

Answer: 10 tokens. The string literal counts as a single token.
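The count can be checked with a small tokenizer. The sketch below is only an approximation of C lexing: it treats a double-quoted string as one token, groups letters, digits, and underscores into words, and emits every other non-blank character as a one-character token (so multi-character operators such as == would be split). The names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class CTokenCounter {
    static List<String> tokenize(String src) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        int n = src.length();
        while (i < n) {
            char c = src.charAt(i);
            if (Character.isWhitespace(c)) { i++; continue; }
            int start = i;
            if (c == '"') {
                i++;                                        // opening quote
                while (i < n && src.charAt(i) != '"') {
                    if (src.charAt(i) == '\\') i++;         // skip the escaped character
                    i++;
                }
                i = Math.min(i + 1, n);                     // closing quote, if present
            } else if (Character.isLetterOrDigit(c) || c == '_') {
                while (i < n
                        && (Character.isLetterOrDigit(src.charAt(i)) || src.charAt(i) == '_')) i++;
            } else {
                i++;                                        // single-character token
            }
            tokens.add(src.substring(start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("printf (\"i = %d, &i = %x\", i, &i) ;");
        for (String t : tokens) {
            System.out.println(t);
        }
        System.out.println("Total tokens : " + tokens.size());  // Total tokens : 10
    }
}
```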
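Count number of tokens

package tokencount;

import java.util.StringTokenizer;

public class tokeniz {
    public static void main(String[] args) {
        // StringTokenizer splits on whitespace by default,
        // so each word counts as one token.
        StringTokenizer st = new StringTokenizer("Aslamu alikum all");
        System.out.println("Total tokens : " + st.countTokens());
    }
}

output:
Total tokens : 3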
Extract tokens:

StringTokenizer st = new StringTokenizer("this is a test");
// Print each whitespace-separated token on its own line.
while (st.hasMoreTokens()) {
    System.out.println(st.nextToken());
}

output:
this
is
a
test
The following example illustrates how the String.split
method can be used to break up a string into its basic
tokens:

// "\\s" is a regular expression matching one whitespace character.
String[] result = "this is a test".split("\\s");
for (int x = 0; x < result.length; x++)
    System.out.println(result[x]);

output:
this
is
a
test
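One detail of the split example is worth checking: the pattern "\\s" matches exactly one whitespace character, so two spaces in a row produce an empty string between them. The sketch below compares it with "\\s+", which treats a run of whitespace as a single delimiter. The class name is hypothetical.

```java
import java.util.Arrays;

final class SplitDemo {
    /** Splits on each single whitespace character. */
    static String[] splitSingle(String s) {
        return s.split("\\s");
    }

    /** Splits on runs of whitespace, after trimming the ends. */
    static String[] splitRuns(String s) {
        return s.trim().split("\\s+");
    }

    public static void main(String[] args) {
        String text = "this  is a test";  // note the two spaces after "this"
        System.out.println(Arrays.toString(splitSingle(text)));  // [this, , is, a, test]
        System.out.println(Arrays.toString(splitRuns(text)));    // [this, is, a, test]
    }
}
```

The Java documentation describes StringTokenizer as a legacy class and recommends split or java.util.regex for new code.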
