1 Scanning: COMP 3512 Assignment 2

This document describes an assignment to implement a simple scanner that takes source code as input and outputs the tokens with their types and locations. The scanner handles 6 types of tokens - strings, integers, identifiers, keywords, operators, and separators. Keywords and operators are specified in a configuration file. The program is run with a configuration file and outputs each token on a line with its type and location information.

Uploaded by

andyGILL

COMP 3512

Assignment 2
When a program is compiled, typically the first step is to group the characters in the program text into
tokens. This process is called scanning or lexing. In this assignment, we’ll write a simple scanner.

1 Scanning
We’ll use C as an example language. When the following program is scanned:
int main() { /* hello world program */
  printf("hello, world\n");
  return 0;
}
the scanner groups the characters into the following tokens:
int
main
(
)
{
printf
(
"hello, world\n"
)
;
return
0
;
}
Note that comments are skipped. The scanner may also tag each token with information about its location
(its line & character numbers) & its type. (The location information may be useful for diagnostic messages.)
For example, int is a keyword & it starts at line 1, character 1, main is an identifier & it starts at line 1,
character 5, etc.

2 Types of Tokens
For simplicity, we’ll only handle the following 6 types of tokens (with examples from C):
1. string constants, e.g., "hello"
2. integer constants, e.g., 123
3. identifiers, e.g., sum, n
4. keywords, e.g., int, if
5. operators, e.g., + - ++
6. separators, e.g., [ ] ;
The language we are going to scan is C-like except that some things can be configured:
• String constants: these are enclosed in double quotes just as in C
• Integer constants: they are the same as those in C
• Identifiers: same as those in C. An identifier is a sequence of letters, digits & underscores & must
not begin with a digit. Some valid examples are n2, _ & is_valid. An invalid identifier is 2upper.
• Keywords: these are identifiers reserved by the language. To make the scanner more flexible, instead
of hard-coding the keywords, they are specified in a configuration file. (See §3.)
• Separators: we’ll use the following as separators
( ) [ ] { } , ;
Note that this is a subset of those in C.
• Operators: as the number of operators may be quite large, they are specified in the configuration file.
(See §3.)
• Comments: they are not tokens & are skipped. We allow the same 2 styles of comments as C++:
– comments that start with /* (not within a string constant) & terminated by the next */
– comments that start with // (not within a string constant) & continue to the end of the line
Note that for simplicity, we only deal with integer constants. We don’t handle floating-point constants at
all.
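
The comment rules above can be sketched as a function that is called between tokens. This is only an illustration: the name skipBlanksAndComments is hypothetical, and treating an unterminated /* comment as running to the end of the input is an assumption (the assignment does not say what should happen in that case). Because the function is only called between tokens, it never starts inside a string constant.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Return the index of the first character at or after `pos` that is neither
// whitespace nor part of a comment. Returns text.size() if nothing remains.
// Intended to be called between tokens, so `pos` is never inside a string
// constant. Assumption: an unterminated /* comment swallows the rest of
// the input; a real scanner might report an error instead.
inline std::size_t skipBlanksAndComments(const std::string& text, std::size_t pos) {
    const std::size_t n = text.size();
    while (pos < n) {
        if (std::isspace(static_cast<unsigned char>(text[pos]))) {
            ++pos;
        } else if (text.compare(pos, 2, "//") == 0) {
            // Line comment: skip up to the newline, which is then consumed
            // as whitespace on the next iteration.
            while (pos < n && text[pos] != '\n') ++pos;
        } else if (text.compare(pos, 2, "/*") == 0) {
            // Block comment: skip past the next "*/", if there is one.
            const std::size_t end = text.find("*/", pos + 2);
            pos = (end == std::string::npos) ? n : end + 2;
        } else {
            break;  // start of a token
        }
    }
    return pos;
}
```

A scanner that reports locations would still have to count the skipped characters (newlines in particular), so this index-based version only shows the skipping logic.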

3 Configuration File
The configuration file basically lists the keywords & the operators. Hence it has 2 kinds of sections – keyword
sections & operator sections. Note that there can be multiple keyword & operator sections.
• A keyword section lists keywords. It is started by the word KEYWORDS: (with the trailing colon). All
words that follow are regarded as keywords until the start of another section or until the end of file.
• An operator section lists operators. It is started by the word OPERATORS: and, similar to a keyword
section, lasts until the start of another section or until the end of file.
The following is an example configuration file:
KEYWORDS:
int static const
OPERATORS:
+ - * / %
+= -= *= /= %=
++ --
KEYWORDS:
if else
while
for
Technically, everything can be on one line. The above example uses a more readable format.
Note that there are restrictions on keywords & operators. Keywords must satisfy the requirements for
identifiers (they are basically “reserved identifiers”), i.e., each is a sequence of letters, digits & underscores
& must not begin with a digit. An invalid keyword should be rejected & a warning message indicating the
invalid keyword should be printed (to standard error).
An operator cannot contain whitespace or characters that are separators or that can be used in an
identifier. For example, a+, .=, +2 are not valid operators. As with most languages, control & other
non-printable characters are not allowed anywhere in the program text & hence can’t be used in operators.
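
One possible way to read such a file is sketched below. It is not the required design: Config, isIdentifierChar, isValidKeyword, isValidOperator and parseConfig are hypothetical names, the code assumes C++11, and the wording of the warnings is made up. The text above only says what to do with an invalid keyword; skipping an invalid operator with a similar warning, and warning about words that appear before any section header, are assumptions.

```cpp
#include <cctype>
#include <iostream>
#include <set>
#include <sstream>
#include <string>

// Hypothetical container for the contents of a configuration file.
struct Config {
    std::set<std::string> keywords;
    std::set<std::string> operators;
};

// True if `c` may appear in an identifier (letter, digit or underscore).
inline bool isIdentifierChar(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) || c == '_';
}

// A keyword must satisfy the rules for identifiers.
inline bool isValidKeyword(const std::string& word) {
    if (word.empty() || std::isdigit(static_cast<unsigned char>(word[0]))) return false;
    for (char c : word) {
        if (!isIdentifierChar(c)) return false;
    }
    return true;
}

// An operator may not contain separators, identifier characters,
// whitespace or non-printable characters.
inline bool isValidOperator(const std::string& word) {
    const std::string separators = "()[]{},;";
    if (word.empty()) return false;
    for (char c : word) {
        if (isIdentifierChar(c) || separators.find(c) != std::string::npos ||
            !std::isgraph(static_cast<unsigned char>(c))) {
            return false;
        }
    }
    return true;
}

// Read whitespace-separated words from `in`, switching sections whenever
// KEYWORDS: or OPERATORS: is seen. Invalid entries are skipped with a
// warning written to `warn` (assumed behaviour for operators).
inline Config parseConfig(std::istream& in, std::ostream& warn) {
    enum class Section { None, Keywords, Operators };
    Section section = Section::None;
    Config config;
    std::string word;
    while (in >> word) {
        if (word == "KEYWORDS:") {
            section = Section::Keywords;
        } else if (word == "OPERATORS:") {
            section = Section::Operators;
        } else if (section == Section::Keywords) {
            if (isValidKeyword(word)) config.keywords.insert(word);
            else warn << "warning: invalid keyword: " << word << '\n';
        } else if (section == Section::Operators) {
            if (isValidOperator(word)) config.operators.insert(word);
            else warn << "warning: invalid operator: " << word << '\n';
        } else {
            warn << "warning: word outside any section: " << word << '\n';
        }
    }
    return config;
}
```

In the real program, `in` would be a std::ifstream opened on the file named on the command line & `warn` would be std::cerr. A std::istringstream is convenient for trying the function out on a literal string. Since the words are read with operator>>, it makes no difference whether the file is on one line or many.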

4 Additional Information
Tokens are typically delimited by whitespace, but that is not always the case. This is evident from the line

int main(){

In the above, the 2 separators ( and ) are not surrounded by whitespace.

Consider the line:

n+++m;

Assuming that only + and ++ are valid operators, how should the line be tokenized?
We have the following possibilities:

n + + + m ;
n ++ + m ;
n + ++ m ;

Just as in C, our scanner is “greedy” — it will try to “consume” as many characters as possible & still come
up with a valid token. This means that the scanner will come up with the second possibility. Similarly,

n+++++m;

will be tokenized as:

n ++ ++ + m ;

Note that although it can be tokenized, it is not a valid C statement (for other reasons).
As another example, +2 will be tokenized as: + 2
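
The greedy rule for operators can be sketched as "pick the longest configured operator that starts at the current position". The function names longestOperatorAt & splitOperatorRun are hypothetical, and storing the operators in a std::set is just one convenient choice. This is an illustration under those assumptions, not the required implementation.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Return the longest operator from `operators` that begins at text[pos],
// or an empty string if no operator starts there.
inline std::string longestOperatorAt(const std::string& text, std::size_t pos,
                                     const std::set<std::string>& operators) {
    std::string best;
    if (pos >= text.size()) return best;
    for (const std::string& op : operators) {
        if (op.size() > best.size() && text.compare(pos, op.size(), op) == 0) {
            best = op;
        }
    }
    return best;
}

// Split a run of operator characters greedily, e.g. "+++++" -> ++ ++ +.
// Stops at the first position where no operator matches.
inline std::vector<std::string> splitOperatorRun(const std::string& run,
                                                 const std::set<std::string>& operators) {
    std::vector<std::string> pieces;
    std::size_t pos = 0;
    while (pos < run.size()) {
        const std::string op = longestOperatorAt(run, pos, operators);
        if (op.empty()) break;
        pieces.push_back(op);
        pos += op.size();
    }
    return pieces;
}
```

With only + and ++ configured, longestOperatorAt("n+++m;", 1, ops) yields ++ & the next call at position 3 yields +, which is the second possibility listed above.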
You’ll need to implement a Token class & a Scanner class that has a getToken method that returns the
next token.

5 The Program
The program must be invoked with the name of a configuration file as a command-line argument. It reads
the program text from standard input & outputs the tokens together with location & type information.
For the hello world program in §1 & with the sample configuration file in §3, the output is:

int KEYWORD (1,1)
main IDENTIFIER (1,5)
( SEPARATOR (1,9)
) SEPARATOR (1,10)
{ SEPARATOR (1,12)
printf IDENTIFIER (2,3)
( SEPARATOR (2,9)
hello, world\n STRING (2,11)
) SEPARATOR (2,26)
; SEPARATOR (2,27)
return KEYWORD (3,3)
0 INT (3,10)
; SEPARATOR (3,11)
} SEPARATOR (4,1)

Note that the string token shown above is different from the one in §1 — the double quotes around the string
have been stripped. This is because the token is already tagged with the information that it is a string.
Hence the double quotes are really not necessary.

The 2 numbers separated by a comma within parentheses are the line number & character number. Both
are counted from 1. The 3 parts of each line are separated by tabs. (In the above output, a tab has a width
of 8 characters.)
The above output doesn’t show an operator. For an operator, the word OPERATOR would be printed.
Note that if an invalid character or token is encountered, the program should print an error message that
includes the character or token & its location before exiting.
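
To make the output format concrete, here is a minimal sketch of a token record & a function that formats it as one tab-separated line. The names TokenType, Token, typeName & formatToken are hypothetical, and the assignment asks for a Token class whose design is left to you; a plain struct is used here only to keep the example short. The code assumes C++11 (for enum class & std::to_string).

```cpp
#include <string>

// The 6 token types handled by the scanner.
enum class TokenType { String, Int, Identifier, Keyword, Operator, Separator };

// Hypothetical token record: the text plus its type & 1-based location.
struct Token {
    std::string text;
    TokenType type;
    int line;
    int column;
};

// Map a token type to the word printed in the output.
inline const char* typeName(TokenType type) {
    switch (type) {
        case TokenType::String:     return "STRING";
        case TokenType::Int:        return "INT";
        case TokenType::Identifier: return "IDENTIFIER";
        case TokenType::Keyword:    return "KEYWORD";
        case TokenType::Operator:   return "OPERATOR";
        case TokenType::Separator:  return "SEPARATOR";
    }
    return "UNKNOWN";  // not reached for valid enum values
}

// Build one output line: text, type & (line,character), separated by tabs.
inline std::string formatToken(const Token& t) {
    return t.text + '\t' + typeName(t.type) + '\t' + '(' +
           std::to_string(t.line) + ',' + std::to_string(t.column) + ')';
}
```

For example, a Token holding int, TokenType::Keyword, 1, 1 is formatted as int, KEYWORD & (1,1) with a single tab between the parts. The main program would write each such line to standard output followed by a newline.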

6 Additional Requirements
Do not use external variables. (This includes global variables.)
You’ll need to implement any class or function that you use that is not in the standard C/C++ library.
We’ll be comparing the output of your program with expected output. Make sure your output adheres
to the specification. Sample input, configuration & output files may be provided.

7 Submission & Grading

This assignment is due at noon, Wednesday, March 26, 2008. You’ll need to submit your assignment via
subversion. Further information will be provided.
If your program does not compile, you may receive zero for the assignment. Otherwise, the grade
breakdown is approximately as follows:

Code clarity                          10%
Handling configuration file           10%
Tokens (excluding type & location)    40%
Type information                      20%
Location information                  20%
