1 Scanning: COMP 3512 Assignment 2

This document describes an assignment to implement a simple scanner that takes source code as input and outputs the tokens with their types and locations. The scanner handles 6 types of tokens - strings, integers, identifiers, keywords, operators, and separators. Keywords and operators are specified in a configuration file. The program is run with a configuration file and outputs each token on a line with its type and location information.

Uploaded by

andyGILL
Copyright
© Attribution Non-Commercial (BY-NC)

COMP 3512

Assignment 2
When a program is compiled, typically the first step is to group the characters in the program text into
tokens. This process is called scanning or lexing. In this assignment, we’ll write a simple scanner.

1 Scanning
We’ll use C as an example language. When the following program is scanned:
int main() { /* hello world program */
  printf("hello, world\n");
  return 0;
}
the scanner groups the characters into the following tokens:
int
main
(
)
{
printf
(
"hello, world\n"
)
;
return
0
;
}
Note that comments are skipped. The scanner may also tag each token with information about its location
(its line & character numbers) & its type. (The location information may be useful for diagnostic messages.)
For example, int is a keyword & it starts at line 1, character 1, main is an identifier & it starts at line 1,
character 5, etc.

2 Types of Tokens
For simplicity, we’ll only handle the following 6 types of tokens (with examples from C):
1. string constants, e.g., "hello"
2. integer constants, e.g., 123
3. identifiers, e.g., sum, n
4. keywords, e.g., int, if
5. operators, e.g., + - ++
6. separators, e.g., [ ] ;
The language we are going to scan is C-like except that some things can be configured:
• String constants: these are enclosed in double quotes just as in C

• Integer constants: they are the same as those in C
• Identifiers: same as those in C. An identifier is a sequence of letters, digits & underscores & must
not begin with a digit. Some valid examples are: n2, _, is_valid. An invalid identifier is 2upper.
• Keywords: these are identifiers reserved by the language. To make the scanner more flexible, instead
of hard-coding the keywords, they are specified in a configuration file. (See §3.)
• Separators: we’ll use the following as separators
( ) [ ] { } , ;
Note that this is a subset of those in C.
• Operators: as the number of operators may be quite large, they are specified in the configuration file.
(See §3.)
• Comments: they are not tokens & are skipped. We allow the same 2 styles of comments as C++:
– comments that start with /* (not within a string constant) & terminated by the next */
– comments that start with // (not within a string constant) & continue to the end of the line
Note that for simplicity, we only deal with integer constants. We don’t handle floating-point constants at
all.
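The 2 comment styles listed above can be skipped with a small helper. This is only a sketch under stated assumptions: the name skipComment is hypothetical, the caller is assumed to invoke it only outside string constants, & an unterminated /* comment is assumed to swallow the rest of the input (the assignment does not say what should happen in that case).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: if a comment starts at text[pos], return the index just
// past it; otherwise return pos unchanged. Requires pos <= text.size().
inline std::size_t skipComment(const std::string& text, std::size_t pos) {
    if (text.compare(pos, 2, "/*") == 0) {
        std::size_t end = text.find("*/", pos + 2);
        // Assumption: an unterminated comment runs to the end of the input.
        return end == std::string::npos ? text.size() : end + 2;
    }
    if (text.compare(pos, 2, "//") == 0) {
        std::size_t end = text.find('\n', pos + 2);
        // The newline is left for the caller so that line counting still sees it.
        return end == std::string::npos ? text.size() : end;
    }
    return pos;
}
```

A scanner that tracks locations would still need to walk over the skipped characters to keep its line & character numbers up to date.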

3 Configuration File
The configuration file basically lists the keywords & the operators. Hence it has 2 kinds of sections – keyword
sections & operator sections. Note that there can be multiple keyword & operator sections.
• A keyword section lists keywords. It starts with the word KEYWORDS: (colon included). All words that
follow are regarded as keywords until the start of another section or until the end of file.
• An operator section lists operators. It starts with the word OPERATORS: (colon included) and, similar to
a keyword section, lasts until the start of another section or until the end of file.
The following is an example configuration file:
KEYWORDS:
int static const
OPERATORS:
+ - * / %
+= -= *= /= %=
++ --
KEYWORDS:
if else
while
for
Technically, everything can be on one line. The above example uses a more readable format.
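Since everything can be on one line, the file can be read word by word, with the 2 section headers switching the current list. The following is a rough sketch only: Config, parseConfig & parseConfigText are hypothetical names, words that appear before any section header are assumed to be ignored, & no validation of keywords or operators is done here.

```cpp
#include <istream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical container for the 2 word lists in a configuration file.
struct Config {
    std::set<std::string> keywords;
    std::set<std::string> operators;
};

// Read whitespace-separated words; KEYWORDS: & OPERATORS: switch sections.
// Assumption: words before the first section header are ignored.
inline Config parseConfig(std::istream& in) {
    Config config;
    std::set<std::string>* current = nullptr;
    std::string word;
    while (in >> word) {
        if (word == "KEYWORDS:") {
            current = &config.keywords;
        } else if (word == "OPERATORS:") {
            current = &config.operators;
        } else if (current != nullptr) {
            current->insert(word);
        }
    }
    return config;
}

// Convenience wrapper for parsing configuration text held in a string.
inline Config parseConfigText(const std::string& text) {
    std::istringstream in(text);
    return parseConfig(in);
}
```

Applied to the example file above, this yields 7 keywords & 12 operators, regardless of how the words are split across lines.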
Note that there are restrictions on keywords & operators. Keywords must satisfy the requirements for
identifiers (they are basically “reserved identifiers”), i.e., each is a sequence of letters, digits & underscores
& must not begin with a digit. An invalid keyword should be rejected & a warning message indicating the
invalid keyword should be printed (to standard error).
An operator cannot contain whitespaces or characters that are separators or that can be used in an
identifier. For example, a+, .=, +2 are not valid operators. As with most languages, control & other
non-printable characters are not allowed anywhere in the program text & hence can’t be used in operators.

4 Additional Information
Tokens are typically delimited by whitespaces, but that is not always the case. This is evident from the line

int main(){

In the above, the 2 separators ( and ) are not surrounded by whitespaces.


Consider the line:

n+++m;

Assuming that only + and ++ are valid operators, how should the line be tokenized?
We have the following possibilities:

n + + + m ;
n ++ + m ;
n + ++ m ;

Just as in C, our scanner is “greedy” — it will try to “consume” as many characters as possible & still come
up with a valid token. This means that the scanner will come up with the second possibility. Similarly,

n+++++m;

will be tokenized as:

n ++ ++ + m ;

Note that although it can be tokenized, it is not a valid C statement (for other reasons).
As another example, +2 will be tokenized as: + 2
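The greedy rule can be sketched as a longest-match lookup against the configured operators. This is one possible approach rather than a prescribed one: longestOperatorAt & splitOperatorRun are hypothetical names, & stopping at the first unmatched character is a simplification (the real program would report an error instead).

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: return the longest operator in `operators` that starts
// at text[pos], or an empty string if none does. Requires pos <= text.size().
inline std::string longestOperatorAt(const std::string& text, std::size_t pos,
                                     const std::set<std::string>& operators) {
    std::string best;
    for (const std::string& op : operators) {
        if (op.size() > best.size() && text.compare(pos, op.size(), op) == 0) {
            best = op;
        }
    }
    return best;
}

// Split a run of operator characters into tokens, always taking the longest match.
inline std::vector<std::string> splitOperatorRun(const std::string& run,
                                                 const std::set<std::string>& operators) {
    std::vector<std::string> tokens;
    std::size_t pos = 0;
    while (pos < run.size()) {
        std::string op = longestOperatorAt(run, pos, operators);
        if (op.empty()) break;  // simplification: a real scanner reports an error here
        tokens.push_back(op);
        pos += op.size();
    }
    return tokens;
}
```

With only + and ++ configured, the run +++ splits into ++ followed by +, & +++++ splits into ++, ++ & +, matching the examples above.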
You’ll need to implement a Token class & a Scanner class that has a getToken method that returns the
next token.
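As a starting point, a token could carry its type, its text & its location. The sketch below uses a plain struct for brevity; the member names & the TokenType enumeration are hypothetical, while the printed type names are the ones that appear in the sample output in §5.

```cpp
#include <string>

// The 6 token types handled by the scanner.
enum class TokenType { String, Int, Identifier, Keyword, Operator, Separator };

// Hypothetical layout for a token: what it is, its text & where it starts.
struct Token {
    TokenType type;
    std::string text;
    int line;    // counted from 1
    int column;  // counted from 1
};

// Map a token type to the word printed in the output.
inline const char* typeName(TokenType type) {
    switch (type) {
        case TokenType::String:     return "STRING";
        case TokenType::Int:        return "INT";
        case TokenType::Identifier: return "IDENTIFIER";
        case TokenType::Keyword:    return "KEYWORD";
        case TokenType::Operator:   return "OPERATOR";
        case TokenType::Separator:  return "SEPARATOR";
    }
    return "UNKNOWN";  // not reached for valid enumerators
}
```

A getToken method would then return one such Token per call, leaving the formatting of the output to the caller.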

5 The Program
The program must be invoked with the name of a configuration file as a command-line argument. It reads
the program text from standard input & outputs the tokens together with location & type information.
For the hello world program in §1 & with the sample configuration file in §3, the output is:

int KEYWORD (1,1)
main IDENTIFIER (1,5)
( SEPARATOR (1,9)
) SEPARATOR (1,10)
{ SEPARATOR (1,12)
printf IDENTIFIER (2,3)
( SEPARATOR (2,9)
hello, world\n STRING (2,11)
) SEPARATOR (2,26)
; SEPARATOR (2,27)
return KEYWORD (3,3)
0 INT (3,10)
; SEPARATOR (3,11)
} SEPARATOR (4,1)

Note that the string token shown above is different from the one in §1 — the double quotes around the string
have been stripped. This is because the token is already tagged with the information that it is a string.
Hence the double quotes are really not necessary.

The 2 numbers separated by a comma within brackets are the line number & character number. Both
are counted from 1. The 3 parts of each line are separated by tabs. (In the above output, a tab has a width
of 8 characters.)
The above output doesn’t show an operator. For an operator, the word OPERATOR would be printed.
Note that if an invalid character or token is encountered, the program should print an error message that
includes the character or token & its location before exiting.
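Because the output is compared with expected output, it may help to build each line in one place. The following sketch assembles the 3 tab-separated parts described above; formatToken is a hypothetical name, & the caller is assumed to add the newline.

```cpp
#include <sstream>
#include <string>

// Build one output line (without the trailing newline):
// token text, tab, type word, tab, "(line,column)".
inline std::string formatToken(const std::string& text, const std::string& type,
                               int line, int column) {
    std::ostringstream out;
    out << text << '\t' << type << "\t(" << line << ',' << column << ')';
    return out.str();
}
```

For example, formatToken("int", "KEYWORD", 1, 1) produces the first line of the sample output, with real tab characters between the 3 parts.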

6 Additional Requirements
Do not use external variables. (This includes global variables.)
You’ll need to implement any class or function that you use that is not in the standard C/C++ library.
We’ll be comparing the output of your program with expected output. Make sure your output adheres
to the specification. Sample input, configuration & output files may be provided.

7 Submission & Grading


This assignment is due at noon, Wednesday, March 26, 2008. You’ll need to submit your assignment via
subversion. Further information will be provided.
If your program does not compile, you may receive zero for the assignment. Otherwise, the grade
breakdown is approximately as follows:

Code clarity                            10%
Handling configuration file             10%
Tokens (excluding type & location)      40%
Type information                        20%
Location information                    20%
