
Lexical Analysis: 2/24/2018 John Roberts

The document discusses lexical analysis and the lexer project code. It describes how lexical analysis works by reading a stream of characters and generating a stream of tokens. It then provides an overview of the lexer project code, including how the tokens are defined in a tokens file and how the Symbol and Token classes are used. It explains how the lexer initializes known tokens and then lexes the user's program to generate symbols for identifiers and integers not already in the symbol table.


1

8. Lexical Analysis

2/24/2018

John Roberts

2
Overview
• Lexical Analysis

• Assignment Two

• Project Code Overview

• Lexing

3
Lexical Analysis

• Read a stream of characters that make up the source program, and create a stream of tokens by combining the characters appropriately

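• Tokens are sometimes also loosely referred to as lexical units, or lexemes (strictly, the lexeme is the matched character sequence and the token is its classification)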

• Example: the characters 't', 'h', 'e', 'n' will be combined to build the then token

• Example: the characters '1', '2', '4', '7' will be combined to form an integer token with a value of 1247
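
The two examples above can be sketched in a few lines of Java. This is only an illustration of grouping characters into lexemes, not the project's lexer; the class name CharGroupingSketch and the WORD/INT/CHAR labels are made up for this sketch.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: groups a character stream into word and integer lexemes.
class CharGroupingSketch {
    static List<String> group(String source) {
        List<String> lexemes = new ArrayList<>();
        int i = 0;
        while (i < source.length()) {
            char c = source.charAt(i);
            if (Character.isWhitespace(c)) {
                i++; // whitespace separates tokens but produces none
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < source.length() && Character.isDigit(source.charAt(i))) {
                    i++;
                }
                // Note: parseInt throws on digit runs too large for an int.
                int value = Integer.parseInt(source.substring(start, i));
                lexemes.add("INT(" + value + ")");
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < source.length() && Character.isLetterOrDigit(source.charAt(i))) {
                    i++;
                }
                lexemes.add("WORD(" + source.substring(start, i) + ")");
            } else {
                lexemes.add("CHAR(" + c + ")"); // anything else: one character at a time
                i++;
            }
        }
        return lexemes;
    }

    public static void main(String[] args) {
        System.out.println(group("then 1247")); // prints [WORD(then), INT(1247)]
    }
}
```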
6
The Lexer

• We will be working with the lexer package

• Recall that the lexer's responsibility is to generate Tokens


7
Token Categories (generating tokens)

Category        Tokens
Reserved Words  program int boolean if then else while return
Identifiers     <the same as Java identifiers>
Integers        <a sequence of digits>
Operators       = == != < <= + - * / | &
Separators      { } ( ) ,
Comments        // until end of line
Whitespace      <spaces> <newlines> and other Java whitespace characters

We’ll see how we use this shortly


8
tokens file

• The tokens are defined in a tokens file

• Each line in the file will have two strings:

• The symbolic constant we will use in the compiler for the token

• The actual token

    Program program
    Int int
    BOOLean boolean
    If if
    Then then
    Else else
    While while
    Function function
    Return return
    Identifier <id>
    INTeger <int>
    LeftBrace {
    RightBrace }
    LeftParen (
    RightParen )
    Comma ,
    Assign =
    Equal ==
    NotEqual !=
    Less <
    LessEqual <=
    Plus +
    Minus -
    Or |
    And &
    Multiply *
    Divide /
    Comment //
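
Reading the two-strings-per-line format might look roughly like the sketch below. It is not the code in TokenSetup.java; the class TokensFileSketch and its parse method are assumptions made for illustration, and the file contents are passed in as a String to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: parses "SymbolicConstant actualToken" lines.
class TokensFileSketch {
    // Returns symbolic constant -> actual token, in file order.
    static Map<String, String> parse(String contents) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String line : contents.split("\\R")) { // \R matches any line break
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] parts = trimmed.split("\\s+", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected two strings: " + line);
            }
            table.put(parts[0], parts[1]);
        }
        return table;
    }

    public static void main(String[] args) {
        String sample = "Program program\nLeftBrace {\nEqual ==\nComment //\n";
        for (Map.Entry<String, String> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " stands for " + e.getValue());
        }
    }
}
```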

9
Token Setup

• TokenSetup.java will read the tokens file and automatically generate the files Tokens.java and TokenType.java

• The Tokens enum is actually a class - you can add methods, instance fields, and a constructor that can only be used to construct the enumerated values

• Values are accessed as Tokens.If, etc.
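
To make the "an enum is actually a class" point concrete, here is a minimal sketch. The generated Tokens.java may well be a plain list of constants; the enum name TokensSketch, the lexeme field, and the isReservedWord method are assumptions added only to show fields, a constructor, and methods on an enum.

```java
// Hypothetical stand-in for the generated Tokens enum; fields and methods are assumptions.
enum TokensSketch {
    Program("program"),
    If("if"),
    Assign("="),
    Equal("==");

    // Instance field: each enumerated value carries its own lexeme.
    private final String lexeme;

    // Enum constructors are implicitly private, so they can only
    // be used to construct the enumerated values listed above.
    TokensSketch(String lexeme) {
        this.lexeme = lexeme;
    }

    String lexeme() {
        return lexeme;
    }

    // In this sketch, a token counts as a reserved word if it starts with a letter.
    boolean isReservedWord() {
        return Character.isLetter(lexeme.charAt(0));
    }

    public static void main(String[] args) {
        for (TokensSketch t : values()) {
            System.out.println(t + " -> " + t.lexeme()
                    + (t.isReservedWord() ? " (reserved word)" : ""));
        }
    }
}
```

Values are then accessed as TokensSketch.If, in the same way as Tokens.If.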


10
TokenSetup.java

• Examine code to ensure we understand how it works

• Execute TokenSetup and inspect Tokens.java and TokenType.java

11
SourceReader.java

• Examine code to ensure we understand how it works

• Note that we will be updating this file to generate better output (which we’ll see in a minute when we run Lexer)

12
Token.java

• Each Token contains four pieces of information

• String of Token found in source

• TokenType

• Starting column from source file

• Ending column

• The first two items are grouped as a Symbol
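
A minimal sketch of that layout, assuming 1-based, inclusive column numbers (as the example output later in the deck suggests). The names TokenSketch, SymbolPair, and Kind are hypothetical and do not come from the project's Token.java.

```java
// Hypothetical sketch of the Token/Symbol grouping; not the project's actual classes.
final class TokenSketch {
    enum Kind { Program, LeftBrace, Int, Identifier }

    // The first two pieces of information: the source string and its token type.
    static final class SymbolPair {
        final String text;
        final Kind kind;

        SymbolPair(String text, Kind kind) {
            this.text = text;
            this.kind = kind;
        }
    }

    final SymbolPair symbol;
    final int startColumn; // assumed 1-based column of the first character
    final int endColumn;   // assumed 1-based column of the last character, inclusive

    TokenSketch(SymbolPair symbol, int startColumn, int endColumn) {
        if (endColumn < startColumn) {
            throw new IllegalArgumentException("endColumn precedes startColumn");
        }
        this.symbol = symbol;
        this.startColumn = startColumn;
        this.endColumn = endColumn;
    }

    int length() {
        return endColumn - startColumn + 1;
    }

    @Override
    public String toString() {
        return "Token( Symbol( \"" + symbol.text + "\", " + symbol.kind + " ), "
                + startColumn + ", " + endColumn + " )";
    }

    public static void main(String[] args) {
        TokenSketch t = new TokenSketch(new SymbolPair("program", Kind.Program), 1, 7);
        System.out.println(t + " spans " + t.length() + " columns");
    }
}
```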


13
Symbol.java (note: we’ve seen this hash pattern before…)

• String from the source, and TokenType

• All Strings (corresponding to tokens) found in the source program will be placed into the hash table in the Symbol class (the Symbol table)

• Before we begin, we place all Tokens in the Symbol hash table

• Each String will (should) be inserted exactly once

14
Symbol.java example

    program { int j int k
     j = j + k
    }

    Token( Symbol( "program", Tokens.Program ), 1, 7 )
    Token( Symbol( "{", Tokens.LeftBrace ), 9, 9 )
    Token( Symbol( "int", Tokens.Int ), 11, 13 )
    Token( Symbol( "j", Tokens.Identifier ), 15, 15 )
    Token( Symbol( "int", Tokens.Int ), 17, 19 )
    Token( Symbol( "k", Tokens.Identifier ), 21, 21 )
    Token( Symbol( "j", Tokens.Identifier ), 2, 2 )
    Token( Symbol( "=", Tokens.Assign ), 4, 4 )
    Token( Symbol( "j", Tokens.Identifier ), 6, 6 )
    Token( Symbol( "+", Tokens.Plus ), 8, 8 )
    Token( Symbol( "k", Tokens.Identifier ), 10, 10 )
    Token( Symbol( "}", Tokens.RightBrace ), 1, 1 )

• Symbol( String s, Tokens kind ) - insert s into the hash table with value given by kind; if the entry is already in the table, then just return the entry

15
Symbol.java example

• Note that we repeated a Symbol three times: Symbol( "j", Tokens.Identifier )

• For efficiency, we only want to create one instance of each Symbol, so we use the hash table to check whether the Symbol has already been created. If so, re-use it; if not, create a new instance.

• Logic encapsulated in Symbol class
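
The check-then-reuse logic can be sketched as a static factory over a hash map. This is an illustration under assumptions, not the contents of Symbol.java: the class InternedSymbolSketch, its Kind enum, and tableSize are hypothetical names, and only the factory name symbol mirrors the Symbol.symbol call used in the deck.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a symbol class that hands out one instance per string.
final class InternedSymbolSketch {
    enum Kind { Program, Int, Identifier }

    // The symbol table: source string -> the single symbol instance for it.
    private static final Map<String, InternedSymbolSketch> symbols = new HashMap<>();

    private final String text;
    private final Kind kind;

    // Private constructor: callers must go through symbol(...).
    private InternedSymbolSketch(String text, Kind kind) {
        this.text = text;
        this.kind = kind;
    }

    // Returns the existing entry for text if there is one; otherwise creates,
    // stores, and returns a new one. The kind of an existing entry is kept.
    static InternedSymbolSketch symbol(String text, Kind kind) {
        InternedSymbolSketch existing = symbols.get(text);
        if (existing != null) {
            return existing;
        }
        InternedSymbolSketch created = new InternedSymbolSketch(text, kind);
        symbols.put(text, created);
        return created;
    }

    static int tableSize() {
        return symbols.size();
    }

    Kind kind() {
        return kind;
    }

    @Override
    public String toString() {
        return "Symbol( \"" + text + "\", " + kind + " )";
    }

    public static void main(String[] args) {
        InternedSymbolSketch first = symbol("j", Kind.Identifier);
        InternedSymbolSketch second = symbol("j", Kind.Identifier);
        System.out.println(first == second); // prints true: the same instance is re-used
    }
}
```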


17
Performing Lexical Analysis

• Prior to processing the user’s program, we’ll create Symbol instances for all reserved words, operators, etc. so we can find them later (see TokenType.java)

• Once the lexer starts processing the user’s program, the only new symbols that will be created (added to the hash map) will be identifiers and numbers - all other symbols would have been created before

18
Initializing

• Insert all tokens into the symbol table, a HashMap<String, Symbol>

• The tokens HashMap in TokenType holds all of the known Token/Symbol pairs, e.g.

    tokens.put( Tokens.Program, Symbol.symbol( "program", Tokens.Program ) );

• Each of these is stored in the symbol table as it is generated (see the implementation of Symbol.symbol)

• At this point, Symbol.symbols.get( "program" ) yields Symbol( "program", Tokens.Program )
19
Lexing

• Scan the program line by line (character by character), and insert symbols not already in the symbols table (identifiers and ints)

• If we look up an identifier in the symbols:

• Reserved word (e.g. program) - found and Symbol( "program", Tokens.Program ) returned

• User id not already in symbols - we don’t find it, so we put a new entry and return the new Symbol

• User id already in symbols - return the entry
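
The three identifier cases above, together with the pre-loading of reserved words, can be sketched as follows. To stay short, this sketch maps each string directly to a kind rather than to a Symbol object; IdentifierLookupSketch, Kind, lookup, and size are hypothetical names and are not taken from the project code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the identifier lookup cases; maps strings straight to kinds.
class IdentifierLookupSketch {
    enum Kind { Program, Int, If, Identifier }

    private final Map<String, Kind> symbols = new HashMap<>();

    // Initialization: reserved words go into the table before any lexing happens.
    IdentifierLookupSketch() {
        symbols.put("program", Kind.Program);
        symbols.put("int", Kind.Int);
        symbols.put("if", Kind.If);
    }

    Kind lookup(String word) {
        Kind known = symbols.get(word);
        if (known != null) {
            // Case 1: reserved word, or case 3: user id seen earlier.
            return known;
        }
        // Case 2: user id not yet in the table, so put a new entry and return it.
        symbols.put(word, Kind.Identifier);
        return Kind.Identifier;
    }

    int size() {
        return symbols.size();
    }

    public static void main(String[] args) {
        IdentifierLookupSketch table = new IdentifierLookupSketch();
        System.out.println(table.lookup("program")); // prints Program
        System.out.println(table.lookup("j"));       // prints Identifier
        System.out.println(table.size());            // prints 4
    }
}
```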

20
Lexing

• If we look up other tokens in symbols

• Numbers - put new entry, if not already there

• Not found - don’t do anything

• e.g. = vs. == vs. !=, / vs. // - these are either one- or two-character tokens

• e.g. x =abc + y - we can key on the character =, and save the a for the start of the next token (the abc identifier)
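
One way to handle the one- versus two-character decision is to peek at the next character and consume it only if the pair forms a known token. The sketch below works under that assumption and is not the project's Lexer; OperatorLookaheadSketch and lex are hypothetical names, and the operator and separator sets are taken from the token categories table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: one character of lookahead for one- vs two-character tokens.
class OperatorLookaheadSketch {
    private static final List<String> TWO_CHAR = Arrays.asList("==", "!=", "<=", "//");
    private static final String ONE_CHAR = "=<+-*/|&{}(),";

    static List<String> lex(String line) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < line.length()) {
            char c = line.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isJavaIdentifierStart(c)) {
                int start = i;
                while (i < line.length() && Character.isJavaIdentifierPart(line.charAt(i))) {
                    i++;
                }
                out.add(line.substring(start, i));
            } else if (Character.isDigit(c)) {
                int start = i;
                while (i < line.length() && Character.isDigit(line.charAt(i))) {
                    i++;
                }
                out.add(line.substring(start, i));
            } else if (i + 1 < line.length() && TWO_CHAR.contains(line.substring(i, i + 2))) {
                String two = line.substring(i, i + 2);
                if (two.equals("//")) {
                    break; // comment: ignore the rest of the line
                }
                out.add(two);
                i += 2;
            } else if (ONE_CHAR.indexOf(c) >= 0) {
                // The next character is left alone; it starts the next token.
                out.add(String.valueOf(c));
                i++;
            } else {
                throw new IllegalArgumentException(
                        "unexpected character '" + c + "' at column " + (i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("x =abc + y"));     // prints [x, =, abc, +, y]
        System.out.println(lex("a == b // note")); // prints [a, ==, b]
    }
}
```

In the first call, the = is emitted on its own because =a is not a two-character token, and the a is kept as the start of the abc identifier.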
